Practical performance tuning as a never-ending journey to widen knowledge

The golden age of programmers who could fit the whole universes of legendary games into the tiny RAM of the first gaming consoles passed a few decades ago. Now your favourite browser can easily swallow gigabytes of memory to render a single web page with myriads of annoying ads that ad blockers are trying to defeat. The relative abundance of computing power has given the programming community the privilege of not knowing what is happening under the hood of their beloved frameworks and of focusing more on the business domain. A convoluted web of multi-level abstractions, advanced garbage collection, ready-to-plug libraries polished by industry leaders, infrastructure “à la carte” in the clouds, a diversity of languages with various flavours of syntactic sugar: everything is tuned towards the holy aim of decreasing time to market. Need more power? Vertical scaling is in many cases the cheapest and fastest way to squeeze out more operations per second. But what if we still need more?

This is by no means a comprehensive guide or a blueprint for performance tuning, just a couple of thoughts and ideas to inspire thinking outside the box and to broaden your horizons, which from my perspective are the most crucial skills needed to tackle performance issues.

Let’s talk performance and constraints!

For a warm-up, let’s start with a simple and artificial task that I recently read in Programming Pearls. Apart from brilliant pragmatic ideas for software development, this book contains a number of exercises for fun and self-education, and one of them is a great illustration of several important aspects that we should take into account when we talk about performance.

Problem statement: imagine that you have a file where every line contains an integer. The numbers are not sorted. We know for sure that the file contains all numbers in the range [INT_MIN, INT_MAX] except exactly one. Our task is to find that missing number.

  • INPUT: file with INT_MAX - INT_MIN numbers (every value of the range except one)
  • OUTPUT: int, missing number

Sounds simple, right? Just sort and play with binary search.

  • Runtime complexity: O(N log N) sorting + O(log N) binary search
  • Space: O(N)

Let’s say we rely on C++ and use the good old(?) x86 architecture, where each int is 4 bytes and the file holds a bit above 4 billion unique numbers: 4,294,967,295. Suppose we know in advance all the black magic behind memory allocation on our system and can do it properly, without going too crazy. Reading all the numbers into memory becomes costly: the numbers alone, without any overhead, require about 16 GiB (roughly 17.2 GB) of RAM. This looks a bit discouraging.

Okay, we are very experienced developers and know about out-of-core algorithms. Merge sort, for example, can help if we are ready for several iterations of loading a chunk of records, sorting it and saving it into a temporary file. But what to do next? Hmm, we can merge the chunks later into a single file that contains all the sorted numbers. We know the whole range exactly, so we can iterate over the sorted file and compare each line number with the actual number (with the necessary adjustment for the negative integers in the first half of the entries). Let’s say we can afford 5 GB of RAM: in this case we need 4 passes to sort the numbers in chunks, then we can merge the chunks in linear time and after that read the whole file sequentially. In theory it is still

O(N Log N) for sorting + O(N) for merging + O(N) for sequential search.

But if we talk about real performance, our big-O asymptotics will be heavily smashed by the reality of multiple heavy I/O operations. Due to memory constraints we most likely do not have spare RAM disks available. For sure, we know those numbers that every programmer should know. We are also aware why OSes are not so fast when working with block devices; two obvious parts of this equation are the actual filesystem and the chosen device type. Let’s assume we use a modern operating system where buffered I/O is available behind the fwrite / fstream interfaces.

Would it be even better to use binary search with fseek, or to jump through an mmapped file? But those expect an offset in bytes, and we have a line number. Probably we can figure out a proper formula to derive the offset from the line number and additionally check whether the previous symbol is a line break? Or, even better, use a binary format for the intermediate files with a fixed record size of 4 bytes? Should we stick with more tailored methods like interpolation search, since our keys are numbers?
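With fixed-size binary records the offset formula becomes trivial: record i starts at byte i * 4. Below is a minimal sketch of a binary search over such a sorted file, assuming native-endian 32-bit records; the function names readRecord and findMissingSorted are made up for this illustration, and a std::stringstream stands in for the real file.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <sstream>  // std::stringstream works as an in-memory stand-in for the file

// Reads the record at position `index` from a stream of fixed-size,
// native-endian 32-bit integers: the byte offset is simply index * 4.
std::int32_t readRecord(std::istream& in, std::uint64_t index) {
    std::int32_t value = 0;
    in.clear();  // reset a possible EOF flag left by a previous read
    in.seekg(static_cast<std::streamoff>(index * sizeof(value)));
    in.read(reinterpret_cast<char*>(&value), sizeof(value));
    return value;
}

// Binary search in a sorted stream that holds every value of [lo, hi]
// except one. Record i equals lo + i before the gap and lo + i + 1 after it.
std::int64_t findMissingSorted(std::istream& in, std::int64_t lo, std::int64_t hi) {
    assert(lo <= hi);
    std::uint64_t left = 0;
    std::uint64_t right = static_cast<std::uint64_t>(hi - lo);  // number of records
    while (left < right) {
        const std::uint64_t mid = left + (right - left) / 2;
        if (readRecord(in, mid) == lo + static_cast<std::int64_t>(mid)) {
            left = mid + 1;  // everything up to mid is in place, the gap is to the right
        } else {
            right = mid;     // the gap is at mid or to the left
        }
    }
    return lo + static_cast<std::int64_t>(left);
}
```

Each probe is one seek plus one 4-byte read, so the search itself touches only about 32 records for the full 32-bit range; the sorting that has to happen first remains the expensive part.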

What if we do not have 5 GB, and our hard limit is around 1 MB? Sorting and merging chunks becomes crazy slow. But do we actually need to sort the full array? What if we partition our data using additional files as buckets, i.e. an entry smaller than the current pivot goes to the left file, otherwise to the right one? At the next iteration we work only with the smaller file, the one that holds the gap. For the pivot we choose the median element of the current range and do not add it to any file, to deal with an odd total number of elements. Still a noticeable amount of I/O though: a huge factor that breaks all asymptotics with the harsh reality of long operations.
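A minimal in-memory sketch of this bucket idea follows. It assumes two std::vector buckets standing in for the temporary left and right files, and the name findMissingByPartition is hypothetical. A real implementation would stream from and to disk so that only a small buffer lives in RAM.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Finds the single value of [lo, hi] absent from `numbers`.
// The `left` and `right` vectors stand in for the temporary bucket files.
std::int64_t findMissingByPartition(std::vector<std::int64_t> numbers,
                                    std::int64_t lo, std::int64_t hi) {
    assert(lo <= hi);
    while (lo < hi) {
        const std::int64_t pivot = lo + (hi - lo) / 2;  // median of the current range
        std::vector<std::int64_t> left, right;
        bool pivotSeen = false;
        for (std::int64_t v : numbers) {
            if (v == pivot) pivotSeen = true;  // the pivot goes to neither bucket
            else if (v < pivot) left.push_back(v);
            else right.push_back(v);
        }
        if (!pivotSeen) return pivot;
        // A complete left bucket holds exactly pivot - lo values.
        if (static_cast<std::int64_t>(left.size()) < pivot - lo) {
            numbers = std::move(left);   // the gap is in the lower half
            hi = pivot - 1;
        } else {
            numbers = std::move(right);  // the gap is in the upper half
            lo = pivot + 1;
        }
    }
    return lo;  // the range shrank to a single value: the missing one
}
```

Every round halves the data, so the total amount read is roughly N + N/2 + N/4 + …, i.e. about 2N entries, but each of those entries still goes through slow I/O.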

Let’s start thinking all over again: how can we define those numbers without enumerating them all?

Let’s recap:

  • we are dealing with a huge number of distinct integers in an unsorted sequence
  • we do not need to preserve the original order
  • we do not have any a priori knowledge about the expected permutation. If we define distance as the absolute difference between a line’s number and the integer residing on that line, is there any particular distribution of distances, or is the whole sorted array just shifted a bit?

On the other hand, there are only 11 distinct symbols, 10 digits and an optional sign, that form our alphabet for representing words, i.e. numbers. It can be defined as some sort of regular grammar. We also have boundaries on the possible values, which make the corresponding regex a bit less elegant; moreover, it doesn’t help us identify the missing entry.

Once again, we can’t fit the array in memory, can we? Let’s rephrase: we can’t fit all the numbers in memory using the language’s built-in data types. What about a more compact representation? Protobuf uses variable-length encoding, which can decrease the size of small integers, i.e. we do not need the whole 4 bytes for something that fits in a single byte; not too helpful in our case. Should we check lossless compression algorithms? A naive RLE-based approach will be more memory hungry if we use std::string, which may have compiler-specific memory overhead; it is not as severe as in JVM-based languages, especially pre-1.8, but still noticeable. Given our fixed range of numbers, the percentage of entries with 3 or more repeated digits is less than 10%, not high enough to justify the string overhead. What about more advanced methods? Deflate may be a good general-purpose algorithm, but we can try something tailored specifically for integers! This implementation, for example, promises to decrease the requirements from 4 bytes to as little as 5 bits per number (let’s forget for a moment that it requires most numbers in the array to be small). Even if it worked for arbitrary integers, it would still require above 2.68 GB plus the overhead to compress and decompress. Additionally, compression is usually not stream friendly: since we can’t provide the complete array as input, we have to read the data in chunks and feed the buffer into the compression routine in batches, which in turn makes compression less efficient. In the end it would not be easy to iterate through the compressed array, and random access by index would not be available. It does not seem very practical in our case.

If we recall the low-level representation of integers, there are 4 bytes per number with special care for the sign. What if we use a similar notation to depict which numbers are present? Imagine a long string of bits where the i-th bit is set if number i is present in the input array: we can read number by number from the file, set the corresponding bits and later scan the bit string to find the position of the unset bit. This can severely relax our memory requirements: we would need (approximately, and implementation dependent) 4 * ((string length + 31) / 32) bytes, i.e. 500+ MB (512 MiB for 2^32 bits). It is still big; moreover, if we try to use std::bitset based on hello-world-like examples, we will most likely end up with a segfault on this “innocent” line, even if we have an abundance of RAM:

std::bitset <100000000> w;

Why? Generally, the memory of a computer is split between kernel and user space, and the memory layout of a particular process residing in user space depends on the type of executable. Within the process’s memory, a dedicated portion is allocated during startup for the stack, which keeps the return addresses of functions (the call stack during program execution) and other stuff like local variables, arguments, etc. The size of this area is usually restricted by the OS through the limit on the process’s stack size. Okay, we will allocate it on the heap. What about the asymptotic assessment? In theory it is O(N) runtime complexity and still O(N) space, but in practice we were able to decrease the hidden coefficient enough to significantly reduce the actual size.
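A minimal sketch of the heap-allocated variant, assuming a std::vector of 64-bit words as the bit string (std::vector keeps its storage on the heap). The function name and the hi + 1 sentinel are choices made for this illustration only, and the input vector stands in for numbers streamed from the file.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Finds the single value of [lo, hi] absent from `numbers`, one bit per value.
// The bits live in a std::vector, i.e. on the heap, so the stack limit does not apply.
std::int64_t findMissingWithBitmap(const std::vector<std::int64_t>& numbers,
                                   std::int64_t lo, std::int64_t hi) {
    assert(lo <= hi);
    const std::uint64_t count = static_cast<std::uint64_t>(hi - lo) + 1;
    std::vector<std::uint64_t> bits((count + 63) / 64, 0);  // count / 8 bytes, rounded up

    for (std::int64_t v : numbers) {
        const std::uint64_t i = static_cast<std::uint64_t>(v - lo);  // lo maps to bit 0
        bits[i / 64] |= std::uint64_t{1} << (i % 64);
    }
    for (std::uint64_t i = 0; i < count; ++i) {
        if ((bits[i / 64] & (std::uint64_t{1} << (i % 64))) == 0) {
            return lo + static_cast<std::int64_t>(i);
        }
    }
    return hi + 1;  // sentinel chosen for this sketch: nothing is missing
}
```

For the full 32-bit range the vector holds 2^26 words of 8 bytes each, which is the 512 MiB estimated above.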

But even with runtime complexity it is not as straightforward as you may think. Let’s forget for a moment about the overhead of reading data from the file and converting strings to integers; suppose we have a decent workstation that can fit everything into RAM without hiccups, and start from there. What we are doing is actually a bitmap sort, which indeed has linear runtime complexity. But what about actual performance in comparison to comparison-based algorithms? It seems that the memory hierarchy can still hit us in terms of real performance: memory access patterns that lead to cache misses and do not benefit from branch prediction shrink the advantage that theoretical complexity promises in actual execution time.

All great, but 500 MB is still too much for us. Hm, what if we had all the numbers within our boundaries? Then we could simply use two numbers to describe the range: [INT_MIN, INT_MAX]. And if exactly one number is missing, we need just one more variable to reflect it: [range_start, range_end], missing. Now, what about finding that missing number? What if we sum all the numbers within the range and subtract from it the actual sum of the numbers from the file? Runtime complexity is linear: one pass over the range plus the summation of all numbers from the file, with just two auxiliary variables to store the two sums, i.e. finally O(1) space, constant! But, yeah, those variables… Which type should they be? If the entries in the file happen to be in the order [INT_MAX, INT_MAX-1, …] and we try to sum the first two, what will we get? In this particular case relying on long long int with a width of 64 bits should be sufficient to avoid overflow. But what will we do if we have to deal with bigger numbers? In that case we can either stick with compiler-specific extensions with higher capacity or include in our tiny project libraries whose types meet our requirements.
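The sum approach can be sketched as follows, assuming 32-bit inputs accumulated in 64-bit variables. The function name is hypothetical and the vector stands in for values streamed from the file one by one.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Finds the single value of [lo, hi] absent from `numbers`
// as (sum of the whole range) - (sum of the values actually present).
std::int64_t findMissingBySum(const std::vector<std::int32_t>& numbers,
                              std::int32_t lo, std::int32_t hi) {
    assert(lo <= hi);
    std::int64_t expected = 0;
    // The loop variable is 64-bit so that v <= hi stays safe when hi == INT32_MAX.
    for (std::int64_t v = lo; v <= hi; ++v) expected += v;  // one pass over the range
    std::int64_t actual = 0;
    for (std::int32_t v : numbers) actual += v;             // one pass over the data
    return expected - actual;
}
```

Any partial sum of distinct 32-bit values stays below 2^62 in magnitude, which is why a 64-bit accumulator is enough here, while INT_MAX + (INT_MAX - 1) already overflows a 32-bit one.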

Alternatively, what about some bit-twiddling hacks for the greater good? If we XOR a number with itself we get 0, i.e. they cancel each other. If we XOR all the numbers in the range [-10, 10] and then XOR the result with all the numbers in this exact range except a chosen one, we get exactly our missing number! Literally just one pass over the range plus O(N) to read all the numbers from the file, and O(1) memory: we do not even need two variables, one is enough! XOR should be even faster than summation, and there is no need to care about overflow and the related complexity, is there?
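A minimal sketch of the XOR variant with a single accumulator; as before, the function name is made up and the vector stands in for the stream of numbers from the file.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Finds the single value of [lo, hi] absent from `numbers`.
// Every value that is present gets XORed twice and cancels out;
// only the missing value survives in the accumulator.
std::int32_t findMissingByXor(const std::vector<std::int32_t>& numbers,
                              std::int32_t lo, std::int32_t hi) {
    assert(lo <= hi);
    std::int32_t acc = 0;
    // 64-bit loop variable so that v <= hi stays safe when hi == INT32_MAX.
    for (std::int64_t v = lo; v <= hi; ++v) acc ^= static_cast<std::int32_t>(v);
    for (std::int32_t v : numbers) acc ^= v;
    return acc;
}
```

For the [-10, 10] example from the text, feeding in every value except 7 returns 7, and the accumulator never needs more than the 32 bits of the input type.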

Her Majesty Math suggests that this particular kind of sequence has a handy closed form for the sum: in our case it is even simpler, as we go from the negative to the positive extreme, so every positive value cancels with its negation and the sum is just INT_MIN (-2,147,483,648). So we can skip the N operations for precomputing the sum completely!
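Spelled out for a 32-bit two's complement int, where INT_MAX = 2^31 - 1 and INT_MIN = -2^31, the cancellation looks like this (S_file is a symbol introduced here for the sum of the numbers in the file):

```latex
\sum_{k=\mathrm{INT\_MIN}}^{\mathrm{INT\_MAX}} k
  = \mathrm{INT\_MIN} + 0 + \sum_{k=1}^{\mathrm{INT\_MAX}} \bigl(k + (-k)\bigr)
  = \mathrm{INT\_MIN} = -2^{31},
\qquad
\text{missing} = -2^{31} - S_{\text{file}}
```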

Happiness, prosperity, work done!

Suddenly a very enthusiastic sales manager appears next to our monitor and cheerfully shares great news: he was able to sell our wonderful software, but the client asked for one tiny change, to allow at most two missing numbers. His peer mentions that he is also close to making a deal, but it requires a generalised solution for k missing numbers. As we already started with a naive equation in the sum approach, we can dive into some textbooks to find out that this is a classical case for a system of equations.
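For exactly two missing numbers a and b, one possible system is a + b = s and a² + b² = q, where s and q are the differences between the expected and the actual sums and sums of squares. Then (a - b)² = 2q - s², which gives both unknowns. The sketch below assumes a small range: for the full 32-bit range the sum of squares no longer fits into 64 bits and would need a wider type. The function name is hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <utility>
#include <vector>

// Finds the two values of [lo, hi] absent from `numbers`, smaller one first.
// Solves a + b = s, a^2 + b^2 = q. Only safe while the sums fit into 64 bits.
std::pair<std::int64_t, std::int64_t>
findTwoMissing(const std::vector<std::int64_t>& numbers,
               std::int64_t lo, std::int64_t hi) {
    assert(lo < hi);
    std::int64_t s = 0;  // becomes a + b
    std::int64_t q = 0;  // becomes a^2 + b^2
    for (std::int64_t v = lo; v <= hi; ++v) { s += v; q += v * v; }
    for (std::int64_t v : numbers) { s -= v; q -= v * v; }

    const std::int64_t d = 2 * q - s * s;  // equals (a - b)^2
    assert(d >= 0);
    const std::int64_t diff = std::llround(std::sqrt(static_cast<double>(d)));
    return {(s - diff) / 2, (s + diff) / 2};
}
```

For the range [-10, 10] with -3 and 7 removed: s = 4, q = 58, d = 100, so the pair is (4 - 10) / 2 = -3 and (4 + 10) / 2 = 7. For k missing numbers the same idea needs k independent equations, for example sums of the first k powers.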

In the example above we do not just fight some abstract performance limitation: we try to decrease the overall duration of our task, aka wall time, by tackling the memory constraints that prevent us from using a brute-force approach. From this warm-up you can get a brief idea of the diversity of related topics (far from complete) and the number of factors that may affect the performance of a complex system.

Story time

Now let me share a few war stories. All of them come from different domains and tech stacks, but a common theme unites them: during initial efforts to speed things up, people tend to pull the wrong levers.

 
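// Rejection sampling over the full int range: correct, but it may spin
// for a very long time when [min, max] is narrow.
int getRandomIntInRange(int min, int max) {
    int i;
    do {
        i = random.nextInt(); // random is a java.util.Random instance
    } while (i < min || i > max); // retry until the value lands inside [min, max]
    return i;
}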

Ages ago I was involved in the development of a system that monitored various metrics of devices in private networks, where the starting point for inventory was the address of a single router. Initially it supported only SPARC Solaris and was built around libthread; later it was re-targeted for a wider audience of POSIX-compatible Linux distributions (both x86 and x86_64). There were a lot of technicalities involving various locking primitives (non-blocking read-write locks, spin locks, condition variables and barriers) as well as handling management requests through the sigaction interface. Part of the functionality was to poll various metrics and service information from newly discovered devices using the SNMP protocol. This particular component was the main target of performance concerns. The initial vision was to develop a dedicated thread pool that maintained a task queue with details of devices, their metadata and the corresponding metrics to gather. Four months of development later we ended up with a complex system that performed around 7% faster than the original, much simpler version. So finally full-scale profiling was initiated. Apart from a high number of interrupts and context switches, I stumbled across a tiny method that assembled SNMP requests. It used exclusively plain SNMP get, issuing a new request for every OID, i.e. every parameter to retrieve. There are 3 versions of the SNMP protocol, and each of them brings new features. What was handy for us: the 2nd version of SNMP introduced bulk get requests, which allow retrieving information about several OIDs within a single inquiry. The module was extended to first check whether the device supports SNMP v2 and, if yes, use bulk requests, otherwise fall back on simple SNMP get. It sped up processing up to several times (depending on the number of metrics to be retrieved).

    void doSomethingVeryUseful() {
        /* prepared statement binding parameters */

        ResultSet rs = dbConnection.execute(statement);
        Bla bla = null;
        for (Row r : rs.all()) { // fetches every row of the result set...
            bla = new Bla(r);
            break; // ...only to use the first one
        }

        /* some processing of the first entry, bla */
    }

During yet another fearless journey in the startup world I was trying to speed up some computer vision pipelines triggered on every new frame. I am not sure that at that time I was familiar with the formal concept of Amdahl’s law, but intuitively it was clear: no matter how many GPUs you throw at your task, a single consolidation point will leave all your compute resources idle in case of strong branch divergence. A lot of effort was put into GPU kernel tuning: reviewing various novel methods from ICRA and other conferences, digging into the internals of the GPU threading model (warp vs wavefront, memory coalescing, shared memory bank conflicts) and GPU computing patterns. The harsh truth was that most delays occurred due to multiple copies from host memory to device (GPU) memory and back, instead of defining a complete pipeline (a stacked sequence of kernels executed on the GPU), minimising data transfers, and using pinned memory and DMA transfers to speed up data movement.

List<Something> getTopRows(int limit) {
    /* go into the db and run select all */
    List<Something> list = rs.all();
    while (list.size() > limit) { // trims in memory after loading every row
        list.remove(list.size() - 1);
    }

    return list;
}

An over-excited young performance tuner may think about some esoteric algorithms or data structures, but in many cases the sad truth is that code quality is just not so high: redundant abstractions, an abundance of dead code hanging in RAM, reliance on anti-patterns for a particular language or technology, or simply bringing prototype code into production.

Another cautionary tale happened at yet another company, which tasked me with speeding up a method that backed one of the core REST endpoints and prevented smooth loading of the main web page and mobile apps. The initial suspicion was that it happened due to a lack of proper indexes in Postgres, but by then I was not so naive, so I started with profiling. From my observation, every request led to a burst of reads from disk, followed by high CPU usage. The initial idea was to simply offload it into Redis using a cache-aside strategy with a forced warm-up during startup. Diving into the code revealed what was actually happening:

  • we read some text data with attached labels and weights from the db
  • we pre-processed it and built a feature vector for every piece of text
  • the whole ML model was rebuilt from that intermediate data representation
  • the payload of the request was pre-processed and forwarded to that model
  • the model assigned priorities to a list of some goods
  • the backend returned the top k entries from that list to the client, and the result set had to be maintained for server-side paging

Further discussion with stakeholders led to the conclusion that the model could actually be rebuilt once a week, i.e. it is fine if we lose some precision because we didn’t take into account entries updated in the last few days. So we ended up with an additional serving layer for the model residing in Redis, which was rebuilt once a week in off-hours.

Based on an analysis of user behaviour, the majority don’t click through too many pages, so the first 7 pages of the result set were cached as well, using TLRU as the cache eviction policy. When a user wants to see results on page 8 or further, we issue a new search request.

yaroslav [12:37 PM]
Dmitry, the situation makes me a little nervous. 
Have any guess about revision in which tests were passed well?
None master, none "some test fixes".
It's just insane.

Some of my favourite tasks are related to slow databases. Cross joins, db or table locks, a poor choice of partition keys, building aggregates for ancient immutable data every time we run reports, flirting with ORMs that tend to increase code complexity, using redundant precision for cases when only three distinct values are possible: there is a rich arsenal of methods to kill the performance of any db, whether it is a classical RDBMS or NoSQL, commercially supported or open source.

One company complained that a particular report generation ran into timeouts, was very slow and seemed to even affect the overall performance of the db cluster. The common belief was that the database had to be tuned for better performance and that, if required, shiny new servers with the corresponding effort to migrate all the data could be considered as an option. While profiling network utilisation at the client’s host, I witnessed suspiciously high network I/O as well as jumps in heap usage. Browsing the code, I found out that we tried to retrieve data points using seconds as the key, but within the time dimension in the db all data was stored in 30-minute buckets as keys, and every bucket contained around 100k rows. When several items fell into the same interval, we just issued independent queries, each asking a single node to return hundreds of thousands of entries. After retrieving the rows, we filtered them programmatically and threw away the majority of the data. A schema adjustment (and data migration) brought peace and prosperity to that module.
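To make the mismatch concrete, here is a small sketch of how second-level keys map onto 30-minute buckets. It is not the fix used in the story (that was a schema change), only an illustration of a client-side mitigation: group the requested timestamps by bucket so that each bucket is fetched once. The names kBucketSeconds and groupByBucket are made up, and timestamps are assumed to be non-negative seconds.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

constexpr std::int64_t kBucketSeconds = 30 * 60;  // 30-minute buckets, as in the story

// Groups requested timestamps (in seconds) by the key of the bucket they fall into,
// so that every bucket is queried once no matter how many data points it covers.
std::map<std::int64_t, std::vector<std::int64_t>>
groupByBucket(const std::vector<std::int64_t>& timestamps) {
    std::map<std::int64_t, std::vector<std::int64_t>> buckets;
    for (std::int64_t ts : timestamps) {
        assert(ts >= 0);  // assumption of this sketch: no negative timestamps
        buckets[ts - ts % kBucketSeconds].push_back(ts);
    }
    return buckets;
}
```

Five requested points that share a bucket then cost one read of roughly 100k rows instead of five, although most of those rows are still discarded, which is why changing the schema was the real cure.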

But another component of the system required attention: I was told that the write path seemed to have reached peak performance as well. I looked at the code: at first glance, consumer threads read data from Kafka, did various data enrichment activities and synchronously wrote entries one by one (i.e. no bulk/batch writes). At some point, no matter how many more consumer threads were added, the rate of appended rows stayed more or less the same. However, the db didn’t show any signs of overload. Probably the driver includes client-side load balancing in order to back off the rate of operations when the cluster is struggling? Maybe the host where the consumers were running couldn’t withstand a bigger load? Nope, it turned out that, apart from its actual duties, every worker also gathered some statistics, and from time to time they were aggregated from all threads for further reporting in a “let’s-lock-everything” fashion.
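The story does not say how this was fixed, so the following is only one possible alternative to the “lock everything” aggregation: give every worker its own atomic counter and let the reporter read them without stopping anyone. The class and method names are hypothetical, and the original system may well have been JVM based; the sketch uses C++ only to stay consistent with the earlier examples.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One counter per worker thread; workers never touch each other's counters.
struct WorkerStats {
    std::atomic<std::uint64_t> rowsWritten{0};
};

class StatsRegistry {
public:
    explicit StatsRegistry(std::size_t workers) : stats_(workers) {}

    // Called by worker `id` on its hot path: no lock, no waiting on other threads.
    void recordWrite(std::size_t id) {
        assert(id < stats_.size());
        stats_[id].rowsWritten.fetch_add(1, std::memory_order_relaxed);
    }

    // Called by the reporter: reads every counter without pausing the workers.
    // The total is a slightly stale snapshot, which is fine for reporting.
    std::uint64_t totalRowsWritten() const {
        std::uint64_t total = 0;
        for (const WorkerStats& s : stats_) {
            total += s.rowsWritten.load(std::memory_order_relaxed);
        }
        return total;
    }

private:
    std::vector<WorkerStats> stats_;
};
```

The trade-off is that the reporter no longer sees a perfectly consistent cut across all threads, in exchange for workers that never block on statistics.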

[2000-00-00 00:00:00,000] WARN Session 0x15ab9f8e3e00000 for server 1.1.1.1/1.1.1.1:2181,
 unexpected error, closing socket connection and attempting reconnect
(org.apache.zookeeper.ClientCnxn) java.lang.OutOfMemoryError: Metaspace
Java HotSpot(TM) 64-Bit Server VM warning: Exception java.lang.OutOfMemoryError occurred
 dispatching signal SIGINT to handler- the VM may need to be forcibly terminated
[2000-00-00 00:00:00,000] INFO Partition [some_topic_name_20,14] on broker 0: No checkpointed
highwatermark is found for partition [some_topic_name_20,14] (kafka.cluster.Partition)
Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x00007f8a2c580000, 
262144, 0) failed; error='Cannot allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 262144 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /opt/kafka_2.11-0.9.0.1/hs_err_pid68543.log

And the last one: recently I’ve been playing with ETLs for Spark that were under suspicion of slowdown due to shuffling, and I stumbled upon a small method for date formatting, later wrapped as a user-defined function (UDF). The ETL and the UDF were written in Python, while Spark itself is written in Scala, i.e. JVM based. UDFs by themselves are not considered to be performant beasts, but in this particular case, in order to apply it, every row had to be copied from the Java process to Python with all the related overhead of serialisation and deserialisation.

// Loosely translated from Russian; no translation can fully reflect the pain
// TL;DR: one of my peers found a byte-by-byte comparison of images
yaroslav [12:05 AM]
Damn it all to hell!
yaroslav [12:05 AM]
Do you know how your genius compares PNGs?
yaroslav [12:06 AM]
He compares the sizes. Not for equality, but as a ratio within 1 ± 0.025. And then he compares the BYTE-BY-BYTE difference of the content relative to the length, within 1 ± 0.25.
yaroslav [12:07 AM]
If you meet him, squeeze his throat for me.

Common observations: people usually don’t invest much time into learning and understanding the technologies they try to use, and testing usually goes only through the happy path. There are various examples of choosing the right tool for the job or of being influenced by hype without thorough stress testing (not to mention the good old motto of not updating anything in production without absolute need). Or is it just a matter of using the best language?

My philosophy is to absorb more details about how things actually work throughout the whole stack and, in order to deal with dead ends, read even more 🙂

There are several absolutely brilliant publications related to various aspects of performance tuning:

 

A few more war stories for inspiration on how the idea of speeding things up evolves:

2 comments

  1. I’m generally a big fan of decimal GB (note that Gb means gigabit, GB means gigabyte) but in this case base-2 is much simpler. Instead of 17.18 GB of RAM you can just say 16.0 GiB of RAM. It’s much tidier because you are dealing with an exact power of two.

      • admin on March 5, 2020 at 12:53 am
        Author

      Didn’t know about such difference related to notation, thank you!
      Will update during next iteration of editing 🙂
