Practical performance tuning as a never-ending journey to widen your knowledge

The golden age of programmers who could fit whole universes of legendary games into the tiny RAM of the first gaming consoles passed a few decades ago. Now your favourite browser can easily swallow gigabytes of memory to render a single web page with myriads of annoying ads that ad-blockers are trying to defeat. The relative abundance of computing power grants the programmer community the privilege of not knowing what is happening under the hood of their beloved frameworks and of focusing more on the business domain. A convoluted web of multi-hierarchical abstractions, advanced garbage collection, ready-to-plug libraries polished by industry leaders, infrastructure “à la carte” in the clouds, a diversity of languages with various flavours of syntactic sugar – everything is tuned towards the holy aim of decreasing time to market. Need more power? Vertical scaling is in many cases the cheapest and fastest way to squeeze out more operations per second. But what if we still need more?

By no means is this a comprehensive guide or a blueprint for performance tuning – just a couple of thoughts and ideas to inspire thinking outside the box and broaden your horizons, which are, from my perspective, the most crucial skills needed to tackle performance issues.

Let's talk performance and constraints!

As a warm-up, let's start with a simple and artificial task that I recently read in Programming Pearls. Apart from brilliant pragmatic ideas for software development, this book contains a number of exercises for fun and self-education, and one of them can serve as a great illustration of several important aspects that we should take into account when we talk about performance.

Problem statement: imagine you have a file where every line contains an integer. The numbers are not sorted. We know for sure that the file contains all numbers in the range [INT_MIN, INT_MAX] except exactly one, and our task is to find that missing number.

  • INPUT: a file with INT_MAX - INT_MIN numbers (the whole range with one value absent)
  • OUTPUT: int, the missing number

Sounds simple, right? Just sort and play with binary search.

  • Runtime complexity: O(N log N) for sorting + O(log N) for binary search
  • Space: O(N)

Let's say we rely on C++ and use the good(?) old x86 architecture, where each int has a size of 4 bytes and the total number of unique numbers is a bit above 4 billion – 4,294,967,295. Suppose we know in advance all the black magic happening behind memory allocation on our system and can do it properly, without going too crazy. If we want to read all the numbers into memory, it becomes costly – just for the numbers alone, without any overhead, it will require over 16 GB of RAM. This looks a bit discouraging.

Orkay, we are very experienced developers and know about out-of-core algorithms – merge sort, for example, can help if we are ready for several iterations of loading a chunk of records, sorting them and saving them into temporary files. But what to do next? Hmm, we can merge them later into a single file containing all the sorted numbers. We know the whole range exactly, so we can iterate over the file with sorted numbers and compare the line number with the actual number (with the necessary adjustment for negative integers in the first half of the entries). Let's say we can afford 5 GB of RAM; in that case we need 4 passes to sort the numbers in chunks, we can merge them in linear time and after that sequentially read the whole file. In theory it is still

O(N log N) for sorting + O(N) for merging + O(N) for the sequential search.

But if we talk about real performance, our big-O asymptotics will be heavily smashed by the reality of multiple heavy I/O operations. Due to memory constraints we most likely do not have spare RAM disks available. For sure, we know those numbers that every programmer should know. We are also aware of why OSes are not so fast when working with block devices – a few obvious parts of this equation are the actual filesystem and the chosen device type. Let's assume we use a modern operating system where buffered I/O is available behind the fwrite/fstream interfaces.

Would it be even better to use binary search with fseek, or to jump through an mmap-ed file? But those expect an offset in bytes, while we have a line number. Probably we can figure out a proper formula to derive the offset from the line number, additionally checking whether the previous symbol is a line break? Or, even better, save the intermediate files in a binary format with a fixed record size of 4 bytes? Should we stick with more tailored methods like interpolation search, since our keys are numbers?

What if we do not have 5 GB? And our hard limit is around 1 MB? Sorting and merging chunks becomes crazy slow. But do we actually need to sort the full array? What if we partition our data using additional files as buckets – i.e. if an entry is less than the current pivot, add it to the left file, otherwise to the right one – and at the next iteration work only with the smaller file? For the pivot we choose the median element of the current range and do not add it to any file, to deal with an odd total number of elements. Still a noticeable amount of I/O though – a huge factor that breaks all asymptotics against the harsh reality of long operations.
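
To make the bucket idea concrete, here is a minimal sketch (in Python rather than the C++ assumed above, purely for brevity): split the current file around a pivot into two temporary files and recurse into the half that is short of numbers. Pivot handling is simplified compared with the median-based variant described above.

import os
import tempfile

def find_missing_by_partitioning(path, lo=-2**31, hi=2**31 - 1):
    # Narrow [lo, hi] until a single candidate remains, re-splitting the file each round.
    while lo < hi:
        pivot = (lo + hi) // 2
        left = tempfile.NamedTemporaryFile('w+', delete=False)
        right = tempfile.NamedTemporaryFile('w+', delete=False)
        left_count = 0
        with open(path) as src:
            for line in src:
                if int(line) <= pivot:
                    left.write(line)
                    left_count += 1
                else:
                    right.write(line)
        left.close()
        right.close()
        expected_left = pivot - lo + 1          # size of [lo, pivot] if nothing were missing
        if left_count < expected_left:          # the gap is on the left side
            path, hi = left.name, pivot
            os.unlink(right.name)
        else:                                   # otherwise it hides on the right
            path, lo = right.name, pivot + 1
            os.unlink(left.name)
    return lo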

Let's start thinking all over again – how can we identify that number without enumerating them all?

Let's recap:

  • we are dealing with a huge number of distinct integers in a non-sorted sequence
  • we do not need to preserve the original order
  • we do not have any a priori knowledge about the expected permutation. If we define distance as the absolute value of the difference between a line's number and the integer residing in that line, is there any particular distribution of distances, or is the whole sorted array just shifted a bit?

On the other hand, there are only 11 distinct symbols – 10 digits and an optional sign – that form the alphabet for representing our words, the numbers. It can be defined as some sort of regular grammar. We also have boundaries on the possible values, which makes the definition of the corresponding regex a bit less elegant; moreover, it doesn't help us to identify the missing entry.

Once again, we can't fit the array in memory, can we? Let's re-phrase: we can't fit all the numbers in memory using the language's built-in datatypes. What about a more compact representation? Protobuf uses variable-length encoding, which can decrease the size of small integers – i.e. we do not need the whole 4 bytes for something that fits in a single byte – not too helpful in our case. Should we check lossless compression algorithms? A naive RLE-based approach will be more memory-hungry if we use std::string, which may have compiler-specific memory overhead; it is not as severe as in JVM-based languages, especially pre-1.8, but still noticeable. Given our fixed range of numbers, the percentage of entries with 3 or more repeated digits is less than 10% – not high enough to justify the string overhead. What about more advanced methods? Deflate may be a good general-purpose algorithm, but we can try something tailored specifically for integers! This implementation, for example, promises to cut requirements from 4 bytes down to 5 bits per number (let's forget for a moment that it requires most numbers in the array to be small). Even if it worked for arbitrary integers, it would still require above 2.68 GB plus the overhead to compress/decompress. Additionally, compression is usually not stream-friendly – i.e. we can't provide the complete array as input and have to read data in chunks, feeding the content of a buffer into the compression routine in batches, which in turn makes the compression less efficient. In the end it would not be easy to iterate through the compressed array, and random access by index would not be available. Seems not very practical in our case.

If we recap the low-level representation of integers, there are 4 bytes per number, with special care for the sign. What if we use a similar notation to depict which numbers are present? Imagine a long string of bits where the i-th bit is set if number i is present in the input array – so we can read number after number from the file, set the corresponding bits and later scan our bit string to find the position of the unset bit. This can severely relax our memory requirements – we would need (approximately, and implementation-dependent) 4 * ((string length + 31) / 32) bytes ~ 500+ MB. It is still big; moreover, if we try to use std::bitset based on hello-world-like examples, we will most likely end up with a segfault from this “innocent” line, even if we have an abundance of RAM:

std::bitset<100000000> w;

Why? Generally, the memory of a computer is split between kernel and user space, and the memory layout of a particular process residing in user space depends on the type of executable. Within the process's memory, a dedicated portion is allocated during startup for the stack, to keep return addresses of functions (within the call stack, during program execution) and other stuff like local variables, arguments, etc. The size of this area is usually restricted by the OS – the limit on the process's stack size. Okay, so we will allocate it on the heap. What about the asymptotic assessment? In theory it is still O(N) runtime complexity and O(N) space, but in practice we were able to shrink the hidden coefficient enough to significantly reduce the actual size.
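
A minimal sketch of that heap-allocated bitmap (Python for brevity, with a bytearray standing in for the C++ bitset; the full int range needs 2^32 bits ≈ 512 MB):

def find_missing_bitmap(path, lo=-2**31, hi=2**31 - 1):
    size = hi - lo + 1                       # 2**32 possible values
    bits = bytearray(size // 8)              # ~512 MB, allocated on the heap
    with open(path) as src:
        for line in src:
            offset = int(line) - lo          # shift into [0, size)
            bits[offset >> 3] |= 1 << (offset & 7)
    for offset in range(size):               # the single unset bit marks the answer
        if not bits[offset >> 3] & (1 << (offset & 7)):
            return lo + offset
    return None                              # nothing is missing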

But even our runtime complexity is not as straightforward as you may think. Let's forget for a moment about the overhead of reading data from the file and converting strings to integers; suppose we have a decent workstation that fits everything into RAM without hiccups and start from there. What we are doing is actually a bitmap sort, which indeed has linear runtime complexity. But what about actual performance compared with comparison-based algorithms? The memory hierarchy can still hit us in terms of real performance, due to memory access patterns that cause cache misses and fail to utilise branch prediction, closing the gap between theoretical complexity and actual execution time.

All great, but 500 MB is still too much for us. Hmm, what if all the numbers within our boundaries were present? Then we could use just two numbers to describe that range: [INT_MIN, INT_MAX]. And if exactly one number is missing, we need just one more variable to reflect it: [range_start, range_end], missing. Now, what about finding that missing number? What if we sum all numbers within the range and subtract from it the actual sum of the numbers from the file? Runtime complexity is linear – one pass over the range plus summation of all numbers from the file – and we need just two auxiliary variables to store the results of the two sums, i.e. finally O(1) space – constant! But, yeah, those variables… Which type should they be? If the entries in the file happen to be in this order – [INT_MAX, INT_MAX-1, … , ] – and we try to sum the first two, what will we get? In this particular case, relying on long long int with a width of 64 bits should be sufficient to avoid overflow. But what will we do if we have to deal with bigger numbers? In that case we can either stick with compiler-specific extensions of higher capacity or include into our tiny project libraries providing types that meet our requirements.

Alternatively, what about utilising some bit-twiddling hacks for the greater good? If we XOR a number with itself, we get 0 – i.e. they cancel each other out. If we have a sequence of numbers in the range [-10, 10], XOR all of them first, and after that XOR the result with every number in that exact range except the chosen one – we get exactly our missing number as the result! Literally just one pass over the range plus O(N) to read all numbers from the file, and O(1) memory – we do not even need two variables, one is enough! XOR should be even faster than addition, with no need to worry about overflow and its related complexity, shouldn't it?
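
A sketch of the XOR trick (Python again; note that Python's integers never overflow anyway, so here XOR is about simplicity rather than safety):

def find_missing_xor(path, lo=-2**31, hi=2**31 - 1):
    acc = 0
    with open(path) as src:
        for line in src:
            acc ^= int(line)                 # every present value lands in the accumulator once
    for value in range(lo, hi + 1):
        acc ^= value                         # ...and gets cancelled here, except the missing one
    return acc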

Her Majesty Math suggests that this particular kind of sequence has a handy equation for computing its sum; in our case it is even simpler, since we operate from the negative to the positive extremum: every positive x cancels its -x, zero contributes nothing, and only INT_MIN (-2,147,483,648) is left without a counterpart – so we can even save the N operations for precomputing the range sum completely!

Happiness, prosperity, work done!

Suddenly a very enthusiastic sales manager appears next to our monitor and cheerfully shares great news – he was able to sell our wonderful software, but the client asked for one tiny change: at most two numbers may be missing. His peer mentions that he is also close to making a deal, but it requires a generalised solution for k missing numbers. As we already started with a naive sum-based equation, we can dive into some textbooks to find out that this is a classical case for a system of equations.
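
One possible (hedged) sketch of that system of equations for k = 2 uses the sums of values and of squares of the missing pair; Python's arbitrary-precision integers conveniently sidestep the overflow discussion above:

import math

def find_two_missing(numbers, lo, hi):
    # s1 = a + b, s2 = a^2 + b^2, where a and b are the two missing values.
    s1 = sum(range(lo, hi + 1)) - sum(numbers)
    s2 = sum(x * x for x in range(lo, hi + 1)) - sum(x * x for x in numbers)
    prod = (s1 * s1 - s2) // 2               # a * b, since (a + b)^2 = a^2 + b^2 + 2ab
    disc = math.isqrt(s1 * s1 - 4 * prod)    # |a - b|
    return (s1 - disc) // 2, (s1 + disc) // 2

print(find_two_missing([x for x in range(-10, 11) if x not in (-7, 3)], -10, 10))  # (-7, 3)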

In the example above we were not just fighting some abstract performance limitation: we were trying to decrease the overall duration of our task – aka wall time – by tackling the memory constraints that prevented us from using the brute-force approach. From this warm-up you can briefly glimpse the diversity of related topics (far from complete) and the number of factors that may affect the performance of a complex system.

Story time

Now let me share a few war stories. All of them come from different domains and tech stacks, but the common theme that unites them is that during the initial efforts to speed things up, people tend to pull the wrong levers.

 
// Rejection sampling over the whole int range – may spin for ages when [min, max] is narrow.
int getRandomIntInRange(int min, int max) {
    int i;
    do {
        i = random.nextInt();          // 'random' is a java.util.Random instance
    } while (i < min || i > max);      // the original had '&&', which never retried at all
    return i;
}

Ages ago I was involved in the development of a system that monitored various metrics of devices in private networks, where the starting point for inventory was the address of a single router. Initially it supported only SPARC Solaris and was built around libthread; later it was re-targeted at the wider audience of POSIX-compatible Linux distributions (both x86 and x86_64). There were a lot of technicalities involved in maintaining various locking primitives – non-blocking read-write locks, spin locks, condition variables and barriers – as well as handling management requests through the sigaction interface. Part of the functionality was to poll various metrics and service information from newly discovered devices using the SNMP protocol. This particular component was the main target of various performance concerns. The initial vision was to develop a dedicated thread pool maintaining a task queue with details of devices, their metadata and the corresponding metrics to gather. Four months of development later we ended up with a complex system that performed around 7% faster than the original, much simpler version. So, finally, full-scale profiling was initiated. Apart from a high number of interrupts and context switches, I stumbled across a tiny method that assembled SNMP requests. It was using exclusively plain SNMP GET, issuing a new request for every OID – parameter to retrieve. There are 3 versions of the SNMP protocol, and each of them brings new features. What was handy for us: the 2nd version of SNMP introduced bulk GET requests, allowing information about several OIDs to be retrieved within a single inquiry. The module was extended to first check whether the device supports SNMP v2 and, if yes, utilise bulk requests, otherwise fall back to simple SNMP GET. It cut processing time by a factor of several (depending on the number of metrics to be retrieved).

    void doSomethingVeryUseful() {
        /* prepared statement binding parameters */

        ResultSet rs = dbConnection.execute(statement);
        Bla bla = null;
        for (Row r : rs.all()) { // fetches the whole result set from the db...
            bla = new Bla(r);
            break;               // ...only to keep the first row (no LIMIT 1 in the query)
        }

        /* some processing of the first entry of bla */
    }

During yet another fearless journey into the startup world I was trying to speed up some computer vision pipelines triggered on every new frame. I am not sure that at that time I was familiar with the formal concept of Amdahl's law, but intuitively it was clear: no matter how many GPUs you throw at your task, a single consolidation point will leave all your compute resources idle in case of strong branch divergence. A lot of effort was put into GPU kernel tuning – reviewing various novel methods from ICRA and other conferences, digging into GPU threading model internals (warp vs wavefront, memory coalescing, shared memory bank conflicts) and GPU computing patterns. The harsh truth was that most delays occurred due to multiple copies from host memory to device (GPU) memory and back, instead of defining a complete pipeline (a stacked sequence of kernels to be executed on the GPU) to minimise data transfers, as well as using pinned memory and DMA transfers to speed up data movement.

List<Something> getTopRows(int limit) {
    /* go to the db and run select all */
    List<Something> list = rs.all();     // fetches every row...
    while (list.size() > limit) {        // ...then throws most of them away in a loop
        list.remove(list.size() - 1);    // instead of a LIMIT clause in the query itself
    }

    return list;
}

An over-excited young performance tuner may think about esoteric algorithms or data structures, but in many cases the sad truth is simply that code quality is not high enough – redundant abstractions, an abundance of dead code hanging in RAM, reliance on anti-patterns for a particular language or technology, or just prototype code brought into production.

Another cautionary tale happened at yet another company, which tasked me with speeding up a method backing one of the core REST endpoints, preventing smooth loading of the main web page and mobile apps. The initial suspicion was a lack of proper indexes in Postgres, but I was already not so naive, so I started with profiling. From my observations, every request led to a burst of reads from disk, followed by high CPU usage. The initial idea was to simply offload it into Redis using a cache-aside strategy with a forced warm-up during startup. Diving into the code to find out what was actually happening:

  • we read some text data with attached labels and weights from the db
  • we pre-process it and build a feature vector for every piece of text
  • the whole ML model is re-built based on this intermediate data representation
  • the payload of the request is pre-processed and forwarded to that model
  • the model assigns priorities to a list of some goods
  • the backend returns the top k entries from that list to the client, and the result set has to be maintained for server-side paging

Further discussion with stakeholders led to the conclusion that the model could actually be re-built once a week, i.e. it is fine if we lose some precision by not taking into account entries updated in the last few days. So we ended up with an additional serving layer for the model residing in Redis, rebuilt once a week during off-hours.

Based on analysis of user behaviour – the majority doesn't click through too many pages – the first 7 pages of the result set were cached as well, using TLRU as the cache eviction policy. For situations when a user wants to see results at page 8 or further, we issue a new search request.
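
A minimal cache-aside sketch of that serving idea (hypothetical key names and a made-up rank_goods() model call; assumes the redis-py client):

import json
import redis

cache = redis.Redis(host="localhost", port=6379)
PAGE_TTL = 60 * 60                                  # keep cached pages for an hour

def get_page(query, page, page_size=20):
    key = f"results:{query}:{page}"
    cached = cache.get(key)
    if cached is not None:                          # cache hit: no db reads, no model
        return json.loads(cached)
    ranked = rank_goods(query)                      # hypothetical call into the weekly-rebuilt model
    rows = ranked[page * page_size:(page + 1) * page_size]
    if page < 7:                                    # only the first few pages are worth caching
        cache.setex(key, PAGE_TTL, json.dumps(rows))
    return rows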

yaroslav [12:37 PM]
Dmitry, the situation makes me a little nervous. 
Have any guess about revision in which tests were passed well?
None master, none "some test fixes".
It's just insane.

One of my favourite kinds of task is related to slow databases. Cross-joins, db or table locks, a poor choice of partition keys, building aggregates over ancient immutable data every time we run reports, flirting with ORMs that tend to increase code complexity, using redundant precision when only three distinct values are possible – there is a rich arsenal of methods to kill the performance of any db, whether it is a classical RDBMS or NoSQL, open source or with commercial support.

One company complained that timeouts were observed during a particular report generation; it was very slow and seemed to even affect the overall performance of the db cluster. There was a common belief that the database had to be tuned for better performance and, if required, shiny new servers with the corresponding effort of migrating all the data could be considered as an option. While profiling network utilisation at the client's host, I witnessed suspiciously high network I/O as well as jumps in heap usage. Browsing the code, I found out that we tried to retrieve data points using seconds as the key, while within the time dimension in the db all data was stored under 30-minute buckets as keys, every bucket containing around 100k rows. If several items fell into the same interval, we just issued independent queries, asking a single node to return hundreds of thousands of entries. When we retrieved the rows, we filtered them programmatically, throwing away the majority of the data. A schema adjustment (and data migration) brought peace and prosperity to that module.

But another component of the system required attention – I was told that on the write path we seemed to have reached peak performance as well. I looked at the code: at first glance, consumer threads read data from Kafka, do various data enrichment activities and synchronously write entries one by one (i.e. no bulk/batch writes). At some point, no matter how many more consumer threads were added, the rate of appended rows stayed more or less the same. However, the db didn't show any signs of overload. Perhaps the driver includes client-side load balancing that backs off the rate of operations when the cluster is struggling? Maybe the host where the consumers are running can't withstand a bigger load? Nope: it turned out that, apart from its actual duties, every worker also gathered some statistics, which from time to time were aggregated from all threads for further reporting in a “let's-lock-everything” fashion.
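
A hedged sketch of the fix for that last one: instead of a single global lock taken on every statistics update, each worker keeps its own counters and the reporter merges them once in a while (approximate, but contention-free on the hot path):

import threading
from collections import Counter

class PerThreadStats:
    def __init__(self):
        self._local = threading.local()
        self._counters = []
        self._registry_lock = threading.Lock()       # taken once per thread, not per update

    def incr(self, name, value=1):
        if not hasattr(self._local, "counter"):
            self._local.counter = Counter()
            with self._registry_lock:
                self._counters.append(self._local.counter)
        self._local.counter[name] += value           # no shared lock on the hot path

    def report(self):
        merged = Counter()
        with self._registry_lock:
            for counter in self._counters:
                merged.update(counter)               # occasional merge for reporting
        return merged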

[2000-00-00 00:00:00,000] WARN Session 0x15ab9f8e3e00000 for server 1.1.1.1/1.1.1.1:2181,
 unexpected error, closing socket connection and attempting reconnect
(org.apache.zookeeper.ClientCnxn) java.lang.OutOfMemoryError: Metaspace
Java HotSpot(TM) 64-Bit Server VM warning: Exception java.lang.OutOfMemoryError occurred
 dispatching signal SIGINT to handler- the VM may need to be forcibly terminated
[2000-00-00 00:00:00,000] INFO Partition [some_topic_name_20,14] on broker 0: No checkpointed
highwatermark is found for partition [some_topic_name_20,14] (kafka.cluster.Partition)
Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x00007f8a2c580000, 
262144, 0) failed; error='Cannot allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 262144 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /opt/kafka_2.11-0.9.0.1/hs_err_pid68543.log

And the last one: recently I was playing with ETLs for Spark that were suspected of slowing down due to shuffling, and I stumbled upon a small method for date formatting, later wrapped as a user-defined function – a UDF. The ETL and the UDF were written in Python, while Spark is in Scala, i.e. JVM-based. UDFs by themselves are not considered performance beasts, but in this particular case, in order to apply it, every row had to be copied from the Java process to Python, with all the related overhead of serialisation and deserialisation.
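
A hedged sketch of the fix (column names and the format are made up for illustration): a Python UDF forces every row across the JVM/Python boundary, while the built-in date_format stays inside the JVM.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, date_format, udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-vs-builtin").getOrCreate()
df = (spark.createDataFrame([("2017-03-01 12:34:56",)], ["created_at"])
           .withColumn("created_at", col("created_at").cast("timestamp")))

# Slow path: every row is shipped to a Python worker and back just to format a date.
format_month = udf(lambda ts: ts.strftime("%Y-%m") if ts else None, StringType())
slow = df.withColumn("month", format_month(col("created_at")))

# Fast path: the same result expressed with a built-in function that never leaves the JVM.
fast = df.withColumn("month", date_format(col("created_at"), "yyyy-MM"))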

// A rough translation can hardly reflect the pain
// TLDR; one of my peers found out that images were being compared byte by byte
yaroslav [12:05 AM]
His mother! His karma! His god! His soul!
yaroslav [12:05 AM]
Do you know how that genius of yours compares PNGs?
yaroslav [12:06 AM]
He compares the size. Not for equality, but as a ratio within 1 +- 0.025. And then he compares the BYTE-BY-BYTE difference of the content, divided by the length, against 1 +- 0.25.
yaroslav [12:07 AM]
When you see him, shake his throat for me.

Common observations: people usually didn't invest much time into learning and understanding the technologies they were trying to use, and during testing they usually exercised only the happy path. There are plenty of examples of the “right tool for the job” being chosen under the influence of hype, without thorough stress testing (not to mention the good old motto of not updating anything in production without absolute need). Or is it just a matter of using the best language?

My philosophy is to absorb more details of how things actually work throughout the whole stack and, in order to deal with dead ends, to read even more 🙂

There are several absolutely brilliant publications related to various aspects of performance tuning:

 

A few more war stories for inspiration on how ideas for speeding things up evolve:

Distributed systems – key concepts

Databases – how do they work under the hood? Key takeaways from the brilliant book “Designing Data-Intensive Applications” – to quickly recap core concepts.

DB engine classifications

  • Type of load: OLTP (transaction processing) vs OLAP (data warehousing and analytics)
  • Relational vs NoSQL, document vs columnar, graph vs triple-store (semantic facts storage)
  • Even within the NoSQL camp you can find wide-column dbs – descendants of Bigtable (DynamoDB, Cassandra, HBase) – vs pure key-value stores like Riak and Redis.
  • Classical RDBMS vs MPP (massively parallel processing) engines – usually relational as well, but built to benefit from parallel processing – Impala, ClickHouse, Presto
  • Data access patterns: database / API call (REST or RPC) / asynchronous message passing via a message queue or actor framework
  • Schema-on-read vs schema-on-write, which usually brings up topics of forward and backward schema compatibility – i.e. schemas do not really need to match completely.

How data are stored

There are two key approaches for data storage organisation:

  • Based on a Log-Structured Merge tree (LSM) engine. Generally beneficial for range queries and friendly to sequential writes, which may offer high throughput. Its foundation is the SSTable (sorted string table) – a log-structured storage segment of key-value pairs where entries are sorted by key. Fresh data is kept in memory (hence named memtable), organised in some kind of balanced search tree (usually red-black or AVL), and from time to time it is flushed to disk into actual SSTable files, which will later be merged during compaction. Compaction is a dedicated process that merges SSTables by combining new values for the same keys (using a last-update-wins technique) and marking removed entries with tombstones. Various strategies are available: time-based vs size-tiered vs levelled compaction (less disk space but big write amplification). A detailed overview of the implementation of the LSM-based storage engine in Tarantool DB is here (in Russian). A toy sketch of this write path follows right after this list.
  • In B-tree based storage engines, segments have a fixed size – usually the page size – and the engine reads and (over)writes one page at a time (even if just one field was changed), vs the append-only approach of the LSM-tree. The WAL – write-ahead log (redo log) – is append-only and contains every operation on the B-tree, to protect against failures during write operations. Latches – lightweight locks – are used to deal with concurrent operations.
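
A toy sketch of that LSM write path (no WAL, no compaction – just a memtable flushed into sorted runs that are read newest-first):

import bisect

class TinyLSM:
    def __init__(self, flush_threshold=4):
        self.memtable = {}                    # in-memory writes (a real engine keeps them sorted)
        self.sstables = []                    # flushed sorted runs, oldest first
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self.sstables.append(sorted(self.memtable.items()))   # flush as an SSTable-like run
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:              # the freshest value wins
            return self.memtable[key]
        for run in reversed(self.sstables):   # then newer runs before older ones
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None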

Generally, B-trees are fast for reads and LSM-trees are fast for writes but, as usual, it depends.

Write amplification – a single logical write turns into multiple actual write operations under the hood of the db engine.

An index can either map key -> reference, where the reference points to the row stored in a heap file (handy in the presence of multiple secondary indexes – all of them refer to the same reference), or be a clustered index, where the key points directly to the row without an additional level of lookups. A compromise between them is the covering index, which contains not the complete row but some of its columns. There are special kinds of indexes for dedicated data, e.g. the R-tree for multidimensional indexing.

Data modelling

The de-facto classical approach for data warehouses (a centralised data store fed from multiple sources):

  • Star schema (dimensional modelling): fact tables (events) vs dimensions (SCD – slowly changing dimensions – who, where, when, etc.).
  • Snowflake ~ star, but dimensions are subdivided into sub-dimensions.

Another interesting point of view: the system of record (facts, ground truth), usually normalised, vs derived data (transformed and enriched, non-normalised).

Normal forms – the cornerstones of data organisation in an RDBMS.

Fact tables are wide (100+ columns), but a query usually requires just a small subset of all columns. In the case of a document database, the whole document is read from disk to get just a few attributes; OLTP is row-oriented, i.e. it reads and parses the whole row – 100+ columns. Hence, columnar data storage is beneficial for analytical kinds of usage. Additional advantages are compression (the simplest approach would be to store just the ids of the unique values within a column) and vectorised processing (applying predicates to several entries using a single CPU instruction). If the entries within a column were sorted, retrieval would also be faster, but a second or third sort order will not help when we need to return complete rows. NOTE: a column-family based db engine is not the same as a column-oriented one (Cassandra/HBase)!

How data are spread across the nodes:

A replica is a node storing a full copy of the database. It may be a leader (master or primary), which accepts writes and may accept reads but usually offloads read load to followers (slaves), which accept the change stream (replication log) from the master and serve reads from clients. Failover is the process of handling the failure of the leader. There are several common approaches to organising a db:

  • leaderless
  • single leader
  • multi leader (examples are multiple datacenters / clients with offline operation / collaborative editing). The common issue is how to resolve write conflicts.

Replication of changes between nodes is done through the replication log, which can be either:

  • statement-based replication (the caveats are operations with side effects – now(), rand(), sequence numbers, triggers)
  • physical replication – coupled with storage engine internals ~ shipping the changes recorded in the write-ahead log (WAL)
  • logical (row-based) replication

Replication may be synchronous / asynchronous / semi-synchronous. Async operation weakens the durability (data persistence guarantee) of writes but increases speed. Different topologies exist – circular, all-to-all, star. An alternative is to rely on an erasure coding scheme that allows lost data to be recovered with lower storage overhead than full replication.

For cases when the data is too big for a single node, partition-based data distribution is used – each node is responsible for a portion of the data and the dataset is spread across nodes, with no single node keeping everything. Each node may be the leader for some partitions and a follower for others. Db engines are quite inventive about how to name a partition:

  • shard – Elasticsearch / MongoDB
  • tablet – Bigtable / Snowflake
  • vnode – Cassandra
  • region – HBase
  • vbucket – Couchbase

 

The most naive forms of partitioning are:

  • key-range based, when entries are routed to a particular node based on routing settings defined by key ranges (0–1000 to the 1st node, 1001–2000 to another one) – the key distribution will most likely not be uniform
  • hash based (DHT – Distributed Hash Table), where the target node is determined by the formula hash_function(key) mod number_of_nodes = node_number – rebalancing becomes a nightmare.

If the load is skewed, partitions that contain hot keys (responsible for a disproportionately huge volume of records or requests) will experience high load and become hot spots during processing. A common example is time-based partitions.

Consistent hashing extends the idea of using a hash for routing requests by utilising a number of token ranges (hash values) per node. The number of such ranges can be fixed or dynamic – if the size exceeds some limit, the data is split further and transferred to the corresponding nodes (HBase, Mongo).
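
A minimal consistent-hashing sketch of those token ranges (assumptions: md5 as the hash function and a handful of tokens per node):

import bisect
import hashlib

class HashRing:
    def __init__(self, nodes, tokens_per_node=8):
        self.ring = []                                       # sorted (token, node) pairs
        for node in nodes:
            for i in range(tokens_per_node):
                self.ring.append((self._hash(f"{node}:{i}"), node))
        self.ring.sort()

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def node_for(self, key):
        i = bisect.bisect(self.ring, (self._hash(key),))     # first token at or after the key's hash
        return self.ring[i % len(self.ring)][1]              # wrap around the ring

ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("user:42"))          # adding node-d later only remaps a fraction of the keys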

Secondary indexes in such a setting become more complex; they can be partitioned either:

  • by document – every node contains indexes only for the data in its local partitions, so a query needs to scatter / gather / rank
  • by key (term) – a dedicated index contains references to all partitions containing the term; the downside is that it has to be sharded as well

 

How requests are routed from the client to the partition that contains the data:

  • a routing tier – often relying on ZooKeeper (Kafka / HBase)
  • any node may be contacted – in this case it becomes the coordinator node, forwarding the request to the node with the data; knowledge about partitions is acquired through a gossip protocol.

Nightmare of data consistency

When two or more nodes accept writes, it is an open question how to handle writes for the same key:

  • Conflict-free replicated datatypes (two-way merge functions)
  • Mergeable persistent data structures (three-way merge functions)
  • Operational transformation

One of the key approaches to conflict resolution in a multi-replica setup without a leader is the version vector. It contains a version number per key, per replica, so it can distinguish whether two operations are concurrent or one causally depends on the other. Lamport timestamps are used for total ordering, but with them you can't say whether two operations are really concurrent or causally dependent; their advantage is that they are more compact.
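
A small sketch of comparing version vectors (each one a dict of replica -> counter), which is enough to tell “happened after” apart from “concurrent”:

def compare(vv_a, vv_b):
    replicas = set(vv_a) | set(vv_b)
    a_ahead = any(vv_a.get(r, 0) > vv_b.get(r, 0) for r in replicas)
    b_ahead = any(vv_b.get(r, 0) > vv_a.get(r, 0) for r in replicas)
    if a_ahead and b_ahead:
        return "concurrent"                  # siblings: keep both values and merge later
    if a_ahead:
        return "a happened after b"
    if b_ahead:
        return "b happened after a"
    return "equal"

print(compare({"r1": 2, "r2": 1}, {"r1": 1, "r2": 1}))   # a happened after b
print(compare({"r1": 2, "r2": 0}, {"r1": 1, "r2": 3}))   # concurrent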

There are several tricks to improve data consistency:

  • Read repair – during a regular read operation, read from multiple nodes and update stale values
  • Anti-entropy – a background process that checks for inconsistencies between nodes
  • Sloppy quorums. A quorum is when an operation's acknowledgement requires votes from several nodes – usually an absolute majority; a sloppy quorum is when non-designated nodes accept requests.
  • Hinted handoff – landing writes on a coordinator node when the target node is not responding, to increase write availability

But even if we talk about a good old single-node database, various issues may happen:

  • dirty read – one transaction reads uncommitted writes of another transaction
  • dirty write – one transaction overwrites (uncommitted) data from another transaction
  • non-repeatable reads – values for the same rows differ within the same transaction, sometimes called read skew. Reading old data before a commit may be harmful for backups and analytics.
  • phantom reads – during a single transaction, new rows are added or removed for the same query; put another way, a write in one transaction changes the result of a read in another

A transaction – a set of reads and/or writes executed as a single operation: if it succeeds, all changes are committed; if it fails, it is aborted/rolled back. It is the key component of OLTP engines that offer the ACID guarantees:

  • Atomicity – about handling halfway changes: transactional commit or rollback – all or nothing.
  • Consistency – data invariants always hold, i.e. the db is in a good state before and after any transaction; depending on the application, it may rely on the db's isolation and atomicity
  • Isolation – handling simultaneous access to the same record – pretending that no one else is concurrently accessing the database (various forms exist). Note that this is different from single-object write operations aka compare-and-set – the so-called “lightweight” transactions in distributed systems.
  • Durability – persisting the state of committed transactions in case of system failure

 

Dbs provide concurrency control – different transaction isolation levels (read uncommitted, read committed, repeatable read, serializable) – to satisfy the following conditions:

  • Reading your own writes – read-after-write consistency (for example for cross-device access)
  • Monotonic reads – reads from different replicas (a possible solution is to always read from the same one)
  • Consistent prefix reads – if a sequence of writes happens in a certain order, anyone reading those writes will see them in the same order

Read committed – a read operation sees only committed values. The default transaction isolation level for many dbs – Postgres, Oracle. Implemented via row-level locks for writes plus remembering values before and after the transaction, returning the old value for any read until the new value is actually committed.

Snapshot isolation – each transaction reads from a consistent snapshot of the db taken at the start of the transaction; sometimes called repeatable read.

MVCC – multi-version concurrency control – maintains the state of the db at different points in time. Sometimes it can be used as an alternative to locking. Key principles: readers never block writers and vice versa, while a set of old committed versions is maintained for every transaction in progress. Updates are usually implemented as delete + insert with a transaction id for reference. A db index in this setup may refer to all versions of a value, or copy-on-write may be used (e.g. for B-trees).

One of the anomalies related to simultaneous writes – read-modify-write cycles – is lost updates (incrementing a counter or modifying a list). A possible workaround is an atomic write, but dbs usually don't support it for all data types. An alternative is to rely on explicit locks in application code. Many db engines provide lost-update detection, which requires the transaction to be retried. For non-transactional dbs, an atomic compare-and-set operation can be used to prevent lost updates. For replicated data, conflict resolution techniques are needed: last-write-wins is common but error-prone; commutative atomic operations are good.

For write skew – when the same objects are read by multiple transactions and some of them are updated by different transactions – the best option is to rely on the serializable isolation level: transactions may actually be executed in parallel, but the result is the same as for serial execution. Examples where it actually matters: games (moving different figures to the same cell), booking systems that prevent double booking, or double spending from an account. The common pattern for write skew is: select + logic + write.

Options are:

  • rely on serial order (as in Redis), sometimes through stored procedures.
  • partition per CPU core => cross-partition coordination becomes very slow.
  • data fits in memory + anti-caching (abort the transaction, load the data into memory, retry)
  • Two-phase locking (2PL) – readers and writers block each other (the opposite of snapshot isolation), with shared and exclusive locks. Predicate locks over whole objects are not present in modern dbs; index-range locking is an approximation of predicate locks.
  • optimistic concurrency control techniques – serializable snapshot isolation: try to detect conflicts and abort transactions only if conflicts are present. One approach is to materialise the conflict – create the rows beforehand and lock them.

For distributed databases, in order to handle atomic commits and offer consistency guarantees for situations like consistent snapshots or foreign key references across partitions (when a transaction fails on some nodes but succeeds on others), the two-phase commit protocol (2PC) is commonly used:

  • save the data at the nodes
  • send prepare requests to the participating nodes
  • send commit if all participants responded with yes at the 1st stage

It involves a coordinator or transaction manager – if it fails before issuing the prepare/commit(?) requests, it is not really clear what the involved nodes have to do. XA – eXtended Architecture – is a standard for implementing 2PC across heterogeneous technologies ~ a C API for coordinating with the transaction coordinator.

Total order broadcast (atomic broadcast) – reliable delivery of messages (if a msg reaches one node, it will be delivered to all) + totally ordered delivery (msgs are delivered to every node in the same order) – even if nodes or the network are faulty ~ an asynchronous append-only log.

Transaction isolation levels are safety guarantees and are not the same as distributed consistency (coordinating the state of replicas in the face of delays and faults) – the latter is not about a set of operations within a transaction, but about the status of an individual object replicated among several nodes. Linearizability – a recency guarantee, the so-called atomic / strong / immediate consistency – can be treated as the impression(!) of a single copy of the data with atomic operations. Even a strict quorum cannot guarantee it. Read repair can help for reads and writes, but not for compare-and-set / increment-and-get, as those require consensus among the data nodes.

Fundamental characteristics of consensus:

  • an epoch (sequence) number attached to every request, to filter out stale data
  • choosing a leader
  • voting on the leader's proposals.

The key difference in comparison with 2PC: in 2PC the coordinator is not elected and all nodes have to agree, while for consensus a majority of votes is sufficient. Another feature of a good consensus algorithm is the ability to add new nodes – dynamic membership extension. Consensus services for writes: ZooKeeper, etcd. Consensus algorithms: Paxos, Raft, ZAB.

Alternatives to ACID

BASE (Basically Available, Soft state, Eventual consistency) – favours availability over consistency, a weakening of ACID.

Another common concept is the CAP theorem – a system is either consistent or available when (network) partitioned (a network failure between nodes). Consistency here really means linearizability – an update of an object by key in a distributed system is atomic, i.e. all reads after a write return the new value; partition tolerance actually means tolerating network partitions, i.e. network instability between nodes / racks / datacenters.

Hadoop 

A map-reduce cluster can be thought of through the analogy of multiple Unix nodes with simple command-line tools: awk, sed, grep, sort, uniq, xargs. Jobs usually do not modify their input – i.e. they don't have any side effects except producing output to the distributed filesystem. A mapper is a function called once for every record; it can generate many outputs, which will be sorted by the engine. A reducer takes all records for the same key and iterates over the collection of values for each key. A file on HDFS is considered a separate partition that can be processed by an independent mapper task.

The idea is to run compute near the data: mapper code (usually just jar files) is copied to the nodes holding the data. By partitioning by key we guarantee that all values for a key end up at the same reducer. Reducers connect to mappers and download the result of the map stage – sorted key-values for their partition – this is called the shuffle. Success is when all mappers and reducers finish, i.e. execution time depends on the slowest reducer. It provides an all-or-nothing guarantee – output from failed tasks is discarded. It is not like MPP, where the focus is on parallel execution of an analytical query on a cluster of machines. Map-reduce plus a distributed filesystem can be thought of as a kind of generic OS that can run arbitrary programs, for example: query engine – Hive, OLTP – HBase, MPP – Impala.

A dataflow engine (Spark, Tez, Flink) can be used to run workflows from Hive or Pig. A DAG – directed acyclic graph – represents the flow of data between operators. For graph processing there is the Pregel model (the bulk synchronous parallel model of computation) – a vertex has state and at every “map” call it gets message(s) (fault-tolerant and durable) from other vertexes (for example, along an edge).

Joins:

  • Map-side joins. Hive requires specifying hot keys explicitly to use a map-side join
    • partitioned hash join (bucketed map join) – both join inputs have the same number of partitions, so a hash join can be applied to each partition
    • map-side merge join – when both dataframes are not only partitioned but also sorted by the same key
  • Reducer-side joins:
    • skewed join or sharded join – hot keys go to multiple reducers (the other join input is replicated to those reducers).
    • broadcast hash join – copy the small table to all reducers

The Hive metastore contains info about the encoding format, the names of the directories where data is stored, the number of partitions and the keys by which it is partitioned and sorted. A sort-merge join may use a secondary sort (user-birth -> [activity list]).

Data streaming

Event stream: producers (publishers / senders) and consumers (subscribers / recipients). Events are grouped into topics or streams. Publish/subscribe model – when the messaging system is used to notify consumers about new events. If producers are faster than consumers: dropping / backpressure / buffering. Direct messaging vs message brokers (queues). Multiple consumers can be used for load balancing – i.e. one msg goes to some consumer – vs fan-out, when one msg is delivered to all. It may be important to wait for acknowledgements from consumers before a msg is deleted from the queue. Msg reordering may happen due to network retries.

JMS/AMQP-based (RabbitMQ) vs log-based (Kafka) brokers. A topic is a group of partitions ~ files (possibly on different machines). An offset is a sequence number within a partition – msgs within a partition are totally ordered; there is no ordering guarantee across different partitions. A consumer group keeps offsets per partition to mark what it has read so far. Scalability is achieved by assigning a whole partition to a consumer node, i.e. at most we will have as many consumers as partitions. Slow msg processing = msgs pile up in the partitions.
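
A toy model (not Kafka itself) of that log-based layout: append-only partitions, per-message offsets and per-group committed offsets:

class TinyLog:
    def __init__(self, partitions=3):
        self.partitions = [[] for _ in range(partitions)]
        self.committed = {}                                    # (group, partition) -> next offset

    def produce(self, key, message):
        p = hash(key) % len(self.partitions)                   # same key -> same partition -> ordered
        self.partitions[p].append(message)
        return p, len(self.partitions[p]) - 1                  # partition and offset of the message

    def poll(self, group, partition, max_messages=10):
        start = self.committed.get((group, partition), 0)
        batch = self.partitions[partition][start:start + max_messages]
        self.committed[(group, partition)] = start + len(batch)   # commit after handing the batch out
        return batch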

Changelog or change data capture (CDC) – observing all changes to a db and extracting them in a form in which they can be replicated to other systems. Event sourcing is a similar concept to CDC, but not at the level of db operations – instead these are app-specific events. The difference is history of modification vs current state: CDC keeps the complete row for a key; with event sourcing a later event does not override a prior one and does not contain a mechanism for state update. Command -> validation(?) -> event ~ fact.

Command query responsibility segregation (CQRS) – separating the form in which data is written from how it will be read ~ different read views.

Complex event processing (CEP) – the roles of query and data are reversed: queries are stored to find patterns, while the data is transient. An example is the percolator feature of Elasticsearch. Stream analytics uses probabilistic algorithms – Bloom filters for set membership, HyperLogLog for cardinality estimation.

Within stream processing, the topic of estimating an event's time becomes not so straightforward: time of the action + time of sending to the server + time of processing by the server.

Various time-based windows are possible:

  • tumbling – for example, falling into a fixed wall-clock 1-min interval
  • hopping – aggregated overlapping tumbling windows
  • sliding
  • session

Joins on streams require some state. If the ordering of events is not determined, the join becomes nondeterministic – cf. slowly changing dimensions (SCD) – which is addressed by adding ids for the particular version of the join; the downside is breaking log compaction.

  • stream-stream
  • stream-table – CDC to sync a point-in-time cache
  • table-table ~ materialised view maintenance

Microbatching and checkpointing fulfil exactly-once semantics only IF there are no side effects; otherwise you need idempotence (running many times has the same effect as running once, i.e. setting a value vs incrementing it) or distributed transactions.

Control of load:

  • client-side backpressure or flow control
  • exponential backoff to fight overload
  • phi accrual failure detection (Akka, Cassandra), TCP retransmission – measure response times and their variability (jitter) and adjust timeouts according to the observed response time distribution
  • quality of service (QoS – prioritisation and scheduling of packets) + admission control (rate limiting senders)

Byzantine fault tolerance

When a node tries to trick the system – a Byzantine fault; reaching consensus in this environment – the Byzantine generals problem – nodes malfunction and do not obey the protocol. A partially synchronous model with crash-recovery faults is the common model for distributed systems. Safety (nothing bad happens) and liveness (something good eventually happens). Limping-but-not-dead nodes. Fencing token – an auto-incrementing sequence issued during lock acquisition – prevents inadvertently acting nodes; it works in environments where nodes are honest.

Chronicles of one crypto-arbitrage bot


Some time ago a friend of mine, innocent in terms of exposure to the bloody details of executing mid-term IT projects, sold me yet another brilliant idea of how to effectively decrease the amount of my leisure time – by diving (again) into the horror-land of startups.

It was an exciting journey with ups, downs and a sort of Challenger Deep that reinforced some of my beliefs regarding general principles of software development, made me treat other “rules” in a more complimentary way, and definitely allowed me to grasp a few interesting tricks.

With this article I want to highlight the curious constraints we had to overcome and the real-life uncertainties shaping my architectural approach, and to illustrate how the project evolved with the growth of the code base and changing requirements.

As the years go by, I tend to think that there are two main approaches to starting your own startup:

  • you try to foresee all possible outcomes, address all edge cases, get horrified by the volume of work and end up with “Oh, hell, fuck it, no!”
  • you start with “Fuck it!” and jump in with some awkward prototype written during lunch to see whether this shit works at all.

 

Bright Idea

It was early 2017: the term ‘crypto’ was starting to appear in mass media, and video cards were out of stock almost everywhere. ICOs promising various kinds of digital revolutions were spreading like the bubonic plague in ancient times, raising crazy money in exchange for vaguely compiled PDFs. A gold rush of the modern century, as it is.

There was a bunch of crypto-exchanges offering alt-coin trading: different subsets of trading pairs, complex fee rules, different volumes of digital “commodities”. It was a distributed and (almost) non-regulated market where the price of the same coin could differ between exchanges.

The idea was pretty simple:


Deposit money at the supported exchanges, monitor the price difference for some coin and, as soon as it exceeds a profitability threshold, sell at one exchange and buy at another – just to maximise the absolute volume of all coins across all our accounts' wallets. The approximate opportunity window (based on manual trade execution) sometimes reached up to 15 minutes – i.e. it was possible to send the missing coins from another wallet to the exchange and trigger the necessary order.

All of this still sounded pretty easy, so we agreed on an investigation stage: collect price fluctuations throughout the day and analyse them to better understand the prospects of the project.

At that time I couldn't tell an ask from a bid or an order from a trade, and had no idea what a candle or a ticker was, or how the price is regulated at all.

Very brief intro to terminology:

An order is when you register with the exchange your desire to buy or sell a particular cryptocurrency. If you want to buy, it is called a bid; to sell – an ask.
When someone else decides to place an order for the same coin and the prices match, trade(s) will be executed – i.e. coins travel from wallet to wallet, the exchange charges its fee, the order's volume is updated to reflect the executed trade, and when it reaches zero the order is fully filled.
Volume (amount) is exactly how much you want to trade, but price, depending on the exchange, can encapsulate several strategies, the most common of which is to define an exact price.
It has one issue though – the price moves based on the current state of the order book – i.e. if no one wants to sell something at the price you set, you can't buy anything.
In practice it means that if you set the price according to the current top order, then between you clicking submit and the moment the exchange notices it, someone else may purchase everything, and the next lot in the order book may have another price.
That means there is no matching order – your order may hang and, depending on exchange rules and price movement, may expire in a couple of weeks.
To overcome this inconvenience, another common option is to use the ‘market’ price – when you definitely want to participate in the trade at the best possible real cost (if the order matching engine is implemented properly).
Ticker – a summary that highlights trade changes over a fixed time period: highest ask, lowest bid for a coin pair, and other details that vary from exchange to exchange.
Candle – usually more verbose information – open-close-high-low prices for a time period and volume-related info.

First blood

So, yeah, returning to the topic of getting the data from exchanges.
I started looking at the public APIs of the chosen exchanges. They could be a great illustration of Noah's Ark – a lot of very different beasts. In the world of classical exchanges the FIX protocol is a common way to get updates, but even now it is barely supported in the crypto world. Web sockets were complex to use and not available at the majority of exchanges. Orkay – it meant we were working through REST. Get data from tickers and save it to look at later. A piece of cake!

Initial tech stack and reasoning behind it:

  • I did not expect to have crazy data, neither in terms of volume nor in terms of operations.
  • Database? But maybe this thing will not fly? CSV files are still very convenient to deal with! (yeah, right)
  • Language – speed of implementation was what mattered the most – so no exotic I-want-to-try-something-new, no complex power of C++ or Scala, no verbosity of Java – I just stuck with Python. 2.7. Because, umm, well, it was installed by default and it has everything that 3 has? (yeah, right)

That initial code – several Python modules and a single ugly IPython notebook – was probably not even committed, but looking at the tickers I couldn't believe my eyes: this was too good to be true. Either there was some very fundamental error in our reasoning (or in the code), or we should abandon all other life activities and dive into implementing the remaining functionality asap.

We will be rich soon!

After a discussion, the decision was made to proceed iteratively.

During the first stage I would create a prototype application that would:

  • collect data for further analysis – ARIMA combined with the most advanced tricks from technical analysis should give us an ideal prediction (yeah, right!)
  • trigger a notification if the observed difference is noticeable enough to execute manual trades.

Technicalities:

  • Notifications. It is strange, but not all people find it amusing to dig through log files or read emails. On the other hand, Telegram has a free bot API and mobile and desktop clients with a nice UI to simplify analysis.
  • Every exchange chooses its own unique naming convention for coins. All right, we will create an intermediate naming layer and exchange-specific conversion routines.
  • Database. Yeah. Probably it is time to have one, as the collected CSVs started exceeding a GB in size after a couple of days of running. What about constraints?
    • A hundred writes per second seemed to be a very safe upper bound (yeah, right).
    • Reads? Only when analysing data (yeah, right).
    • I do know about normal forms, so with a proper approach the unavoidable schema changes should be cheap.
    • I did not yet know which data dimensions would be the most important for us – so I wanted the flexibility to tweak data tuples along every dimension (column).
    • The implementation language might change in the future (yeah, right) – so a wide choice of _mature_ client libraries is definitely an advantage.
    • I had worked with Postgres, which seemed to satisfy all of the above and leaves a lot of room to grow – that's another reason to use a known technology.
    • Just to keep in mind that things may still change, some simple abstraction layer that allows switching the database is always good.

When the plan is ready it is very easy to type the code – no matter that it is after intense working or family hours – pick an issue and execute. But… yeah, it would be too easy without obstacles:

First obstacles

There is a famous quote, «Data is the modern oil», but exchanges are not really keen to share their data – and if you need granular, timed snapshots of market state, the only way to acquire them is to collect them yourself (or buy them). An order book may have dozens of thousands of bids and asks. If you request it every second, the majority of the data will be duplicated and its retrieval will be just an annoying overhead for the exchanges. If every second you want to get info about hundreds of coins, it can be considered a kind of DDoSing. There are a lot of policies for controlling endpoint abuse: a weighted cost for every API invocation with a client balance renewable every quantum of time; shared request quotas – per IP or API key; global hard limits where, as soon as a client exceeds the allowed number of connections/requests, the exchange throws 429, 503 or even disrupts all communication with the offender. It was very interesting to learn such a diversity of load-balancing approaches in practice – troubleshooting data retrieval processes by playing with timeouts between invocations and spreading applications among several VPSes.
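
A hedged sketch (not the actual bot code) of the resulting throttling dance – retry a rate-limited REST call with exponential backoff plus jitter, treating 429/503 as a signal to slow down:

import random
import time

import requests

def fetch_ticker(url, max_retries=6, base_delay=1.0):
    for attempt in range(max_retries):
        response = requests.get(url, timeout=10)
        if response.status_code not in (429, 503):
            response.raise_for_status()                      # other errors are real errors
            return response.json()
        delay = base_delay * (2 ** attempt) + random.uniform(0, 1)   # exponential backoff + jitter
        time.sleep(delay)
    raise RuntimeError(f"still rate limited after {max_retries} attempts: {url}")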

All logging-related activities are under constant threat of becoming too noisy and getting ignored. And it is always very tricky to choose a balanced strategy that prevents the recipient from creating a special rule for incoming mail that automatically moves everything to the trash. As I mentioned earlier, the opportunity window sometimes used to reach 15 minutes, we analysed prices every second, and an arbitrage event may occur for multiple currencies. As a result, our Telegram channel was bombarded with identical messages, making it unusable (and Telegram didn't hesitate to ban us for a cool-down period, so we had to add some back-off mechanism as well). In order to make navigation through the alert history in the Telegram channel simpler, we introduced a naming convention with the pair of exchange and currency names as tags. Additionally, we had to go through a lot of iterations regarding alert attributes: severity, depending on the size of the possible profit; and time – our local time? (but we were in different timezones), the exchange's time? (but due to their geographic distribution it also differs), or UTC from the server where we run the process. And, yeah, this Telegram channel was muted. For the same reason – the huge amount of notifications – we discarded the idea of adding simple controls for manually confirming order placements.

We decided to store details about observed arbitrage events as well; history retrieval was ready, with some precautions for timeout handling and restarting. The next question was where to actually store the data. If you are a lean startup, your free EC2 micro instance has expired, and you do not wish to invest time in investigating alternative options from competitors, your option is «in-house» infrastructure. I.e. just run `docker pull postgres`, map a volume for persistency and, to speed things up a bit (just in case – so-called premature optimisation(C)), disable network virtualisation. Now, how to access it? The classical setup is the following: you have a router from your internet provider with a dynamically allocated public IP address. Depending on your luck and location (country of residence), the IP address will be periodically changed by the ISP. Behind the router – in your home private network – IP addresses are usually assigned by DHCP, i.e. you restart the host where the database server is running and it may get another private IP address. In addition, it is your home network – i.e. a NAS with private photos or a laptop with work projects. There are a lot of services offering dynamic DNS – you have to run a special daemon on your server that updates the IP address bound to the chosen domain name – I have chosen https://www.dynu.com. The second simple step is to configure your workstation to have a statically assigned IP. Many young hackers' bundles offer a convenient interface over nmap to scan for open ports and check whether by any chance you allow SSH access with a login and password. In order to have at least some protection layer, port forwarding was set up on a non-standard port, SSH was configured to allow access by key file only, and good old iptables lets you ban IP addresses after failed authorisation attempts. With all these precautions, you can initiate an SSH tunnel with port forwarding so that the remote postgres process is bound to your localhost:1234.

Real programmers test in production

A reactive asynchronous program dealing with the real world is not so easy to test. And the main reason is the infinite number of possible combinations of weird situations: a blackout in your home office, unnoticed maintenance of an exchange after a recent hacking, issues at your internet provider – you name it, I saw them all!

The first launch with real money went the following way:

  1. we ran multiple processes
  2. the telegram channel burst with endless messages containing the details of placed orders
  3. I saw that telegram started to exponentially increase the timeouts for forwarding messages from the queue to the channel – i.e. my code was doing something and it was not visible to us (oh, shiiiit)
  4. my partner was trying to keep up with the message flow – but it was impossible, as for every message he had to check two different exchanges, filtering by currency.
  5. We need to stop it! Now! Side note: let's add a new feature request to the backlog – we need some mechanism to stop all of them at once. At that time I just ran pkill on everything with python.

After 6 months of development, the first 20 minutes of running revealed various issues ranging from annoying to critical. In order to investigate what was happening and find the root causes, we had to dive into logs and the history of orders and trades – the situation from the bots' perspective was quite different from what the exchanges thought.

Exchange performance

Kraken – it just doesn't work. I still have an old script that issues a thousand order placements and computes the success ratio. At that time it was 30-something out of the whole thousand. At least it was easy to verify and reproduce.

Maintenance of up to date balance state

You have to know how much money you have – it defines the scale at which you can trade. In the crypto-exchange world there are primary coins – BTC, USDT, ETH – used as base currencies for trading; and secondary coins – myriads of alt-coins traded against the base ones. If there are many processes using the same wallet, any of them may change the remaining volume of base currencies. So the issue of maintaining an up-to-date balance state was much trickier. Initially, every arbitrage process synced it independently with every exchange before computing the details of an order to place, and in case of any network issues during balance retrieval it relied on the previously retrieved state.

Some exchanges distinguish total and available balance by introducing a frozen balance, i.e. the portion of coins allocated to registered orders that have not yet been executed. When you deal with the response you have to pick the proper field to be sure how far you can go with trades. The sad thing is that not all exchanges actually have it (Kraken). Anyway, I completely missed this bit during implementation – so we had several cases of «Not enough balance» errors during order placement while the paired order at another exchange was successfully placed and executed – i.e. a direct loss.

Another issue related to balance – bursts of requests from the same IP. The balance API is user-specific and considered a private API with authorisation, i.e. more sensitive to load. When the price of some coin fluctuated, we experienced situations where many processes requested it within the same time window and the response came back delayed or even timed out. When exchanges banned the IP address completely it was actually fine, as no trade was possible anyway – even the order book could not be retrieved. Throwing timeouts, on the other hand, was catastrophic, as the bot had to wait for the timeout period (i.e. someone else could place this order) or rely on an outdated balance state, failing to place the order on the first exchange because of insufficient balance.

As we had already started experimenting with multi-node deployment, the solution to the issues above was to move balance retrieval into a dedicated process that updates the balance state in a redis cache every second. Every arbitrage bot may access it and check the last update time; if it has become too old – immediately shut itself down. After order placement a process forcefully updates the balance in the cache. This approach was optimistic as well – even after execution it takes some (tiny, but possibly crucial) time to reflect the changes – just a humble attempt to minimise the window of uncertainty.
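A rough sketch of that idea (key names, staleness threshold and serialisation are illustrative assumptions, not the original code): one process publishes the balance to redis every second, the bots read it and stop if the snapshot is stale.

import json
import time
import redis

r = redis.StrictRedis(host="localhost", port=6379, db=0)
MAX_STALENESS_SEC = 5  # assumption: tolerate at most 5 seconds of lag

def publish_balance(exchange, balance):
    payload = {"ts": time.time(), "balance": balance}
    r.set("balance:{}".format(exchange), json.dumps(payload))

def read_balance_or_die(exchange):
    raw = r.get("balance:{}".format(exchange))
    if raw is None:
        raise SystemExit("no balance snapshot for {}".format(exchange))
    payload = json.loads(raw)
    if time.time() - payload["ts"] > MAX_STALENESS_SEC:
        # the snapshot is too old -- safer to stop trading than to rely on it
        raise SystemExit("balance snapshot for {} is stale".format(exchange))
    return payload["balance"]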

Order book analysis

The initial approach to analysing the current state of the market was: take the top bid and ask, compute the price difference, compare it and issue orders, then take a look at the second pair of bid and ask. But at the same time other players may affect the order book – so whatever we are dealing with is already out of sync, and we saw situations where we tried to bet on already missing bids. So we decided to process each order book snapshot only once. The second issue – sometimes the price difference was there, but the volume wasn't enough to make a profit. It varies from coin to coin, so we had to pre-compute and maintain a volume cap, dependent on the most recent coin price and the exchange fee, to prevent placing orders that lead to a loss, and use for analysis not only the first bid/ask but the median of the top N of them, to account for the possible processing delay and the activity of other players.
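An illustrative sketch of the «median of top N» check (fee, threshold and depth values are made up): instead of trusting only the best bid and ask, take the median of the first N levels and require the spread to cover fees on both legs.

def median(values):
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2.0

def spread_is_worth_it(bids, asks, fee_rate, top_n=5, min_profit=0.001):
    # bids: (price, volume) on the exchange where we sell, best first
    # asks: (price, volume) on the exchange where we buy, best first
    sell_price = median([price for price, _ in bids[:top_n]])
    buy_price = median([price for price, _ in asks[:top_n]])
    gross = (sell_price - buy_price) / buy_price
    net = gross - 2 * fee_rate          # we pay the fee on both legs
    return net > min_profit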

Reverse coins movement

The initial idea was just to maximise the total balance across all exchanges, but in that case it is possible (and it happened to us as well) to use up all coins on one exchange and simply not be able to trade there anymore. So we decided to add another mechanism to re-distribute the balance evenly across all exchanges, using the same logic as for arbitrage but with a lower threshold – i.e. even if the profit ends up being zero, we still meet our goal. This logic was triggered for a particular currency only if the observed imbalance on a pair of exchanges exceeded a configured threshold.

Busy days

At the peak we had 5 EC2 instances, each running up to 80 processes; history data was stored in Postgres on RDS (bills of up to $600 for this one? Hell no, let's migrate back to a self-hosted solution!). The deals channel in telegram beeped every day. It doesn't mean we didn't have issues, but the situation was stable enough to check it every evening for confirmation that everything was fine, or to re-prioritise the backlog.

However, the first days of live trading brought a few obstacles:

  • Logs tend to consume all disk space (yeah, it was very inefficient text-file logging), so processes tended to die on file writes (a simple logrotate configuration with cron solved it).
  • Deployment also became an issue – manually starting hundreds of processes is a bit overwhelming. Another thing is that the processes are not daemonized (yeah, I know), in order to have some kind of heartbeat in the console as opposed to searching logs by PID and time; so, to avoid any issues with SIGHUP, they have to be deployed under screen sessions. First we create a new screen named after the pair of exchanges; inside it we have dedicated consoles named after trading pairs to simplify troubleshooting. The same goes for the other supporting services: history retrieval, balance updating, telegram notifiers, etc. So I drafted a few scripts and middleware to automate it within a single server, plus a bash alias to stop processes without shutting the screen down.
  • Almost every exchange requires a nonce as part of the payload of trading-related requests – an incrementing sequence of integers. I experimented with various approaches, but we ended up relying on redis to share it across the many processes using the same API key (see the sketch right after this list).
  • Re-balancing was initially run as independent processes and sometimes they competed with the direct arbitrage processes, so the decision was made to run them sequentially within the same process.
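One possible way to share such a nonce between processes that use the same API key – redis INCR is atomic, so no two processes ever get the same value (the key name is an assumption; in practice you may want to seed the counter with a millisecond timestamp first):

import redis

r = redis.StrictRedis()

def next_nonce(api_key):
    return r.incr("nonce:{}".format(api_key))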

Conquer new heights

  • When we started adding more coins and new exchanges, error handling became more and more important. On every request we may get a timeout or an error in response (and it may actually mean that the order placement failed, but it also may have succeeded!). Errors may also be returned as part of the payload – i.e. the HTTP response was 200, but the payload said error something. Additionally, what the bot was doing so far was just order placement – i.e. no guarantee that the trades were executed, so manual work was required to review open orders and process them. Our solution to both of these issues was to introduce two priority queues backed by redis that keep track of all placed orders sorted by time (see the sketch after this list). Every queue may be read by one or multiple consumer processes that check the status of an order with the exchange and process it properly.
  • From time to time speed became an issue – we noticed that we cannot always issue an order in time. Python 2.7 doesn't have support for async, so I added gevent and request pools – not so convenient to deal with, and it doesn't completely eliminate the issue, as we still operate on top of a snapshot of the order book.
  • The architecture of the module is probably a bit verbose in places, but it proved flexible enough to fuse in new exchanges with their API quirks and to add new currencies.
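A hedged sketch of that order-tracking queue (names and message format are illustrative; redis-py 3.x zadd signature): placed orders go into a sorted set with the placement timestamp as the score, and a consumer pops the oldest one to check its status with the exchange.

import json
import time
import redis

r = redis.StrictRedis()
QUEUE = "orders:pending"

def track_order(order_id, exchange, pair):
    entry = json.dumps({"id": order_id, "exchange": exchange, "pair": pair})
    r.zadd(QUEUE, {entry: time.time()})   # score = placement time

def pop_oldest_order():
    items = r.zrange(QUEUE, 0, 0)         # the earliest placed order
    if not items:
        return None
    r.zrem(QUEUE, items[0])
    return json.loads(items[0])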

Problem of floating point arithmetic in python.

Rounding is a known pain in many programming languages (https://en.wikipedia.org/wiki/Round-off_error) and python is not an exception. Apart from situations where we sell more or less than expected, there were other cases when it affected us. Some exchanges are very strict when you submit a volume value with more precision than expected – if the API expects up to 6 digits but receives a number with 7 digits after the decimal point, the whole order is rejected (and the paired order at another exchange may be executed, as it may have more relaxed rules). These rules are specific to the exchange and vary from currency to currency – i.e. you can have 0.00001 of BTC but not of some obscure coin. And, guess what, these rules may change – so you have to monitor for any 'precision' errors on order placement. Another annoying thing is the implicit fallback to scientific notation in the string representation of small numbers. I do understand the meaning of 1e-6, but the usual API is not so smart. To overcome all these issues it wasn't enough just to switch to the Decimal class – a whole new layer of conversions had to be implemented to cut the redundant part of the string representation based on exchange-specific rules.
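For illustration, a sketch of such a conversion (the per-pair precision value is an assumption): quantize with Decimal and format with 'f' so the string never degrades into scientific notation.

from decimal import Decimal, ROUND_DOWN

def format_amount(amount, precision):
    quantum = Decimal(1).scaleb(-precision)              # precision=6 -> Decimal('0.000001')
    value = Decimal(str(amount)).quantize(quantum, rounding=ROUND_DOWN)
    return format(value, 'f')                            # plain decimal string, never '1e-06'

format_amount(0.12345678, 6)   # '0.123456' -- rounded down so we never overshoot the balance
format_amount(1e-07, 6)        # '0.000000'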

Brightest prospects

Many development activities happens outside core module:

As the size of the database grew, we started thinking about incorporating news feeds and various forum rumours to check whether they correlate with price fluctuations. Inspired by this aim, an independent project was started to create web crawlers for twitter accounts, telegram groups, medium posts, reddit threads and a few more specific resources, to get all information within some date-time range. Every new source of information brings a new surprise – for example, in order to work with telegram you have to build a native extension to use the java wrapper through JNI. In order to efficiently get tweets with comments for curated accounts, we had to maintain a set of keys and rely on a key-rotation mechanism as soon as the API started throwing timeouts. An important aspect here is time handling. The world is big and complex and has many time zones, which may show up in a web page or response in all their cheerful diversity, so we had to adjust everything to a common denominator – UTC – as we had done for market data.

Another direction was to simplify bot management & deployment to decrease operational toil. As we had already started using redis for sharing common information (the nonce) and maintaining alert messages and the order queues, the next step was to take this mechanism to the next level – to build a sort of message bus for inter-process communication. Other important functions are: automating deployment of services to a new node in a slightly more advanced way than just cloning the whole repo, updating configuration for a particular server, and shutting down all processes at all registered nodes at once. In order to flexibly manage EC2 instances newly added to the system, we added the ability to deploy agents – special daemons that control the arbitrage and supplementary processes on a node and listen for commands from the UI of the command centre. From a monitoring perspective it is much handier to review the state of processes by skimming through the colours of trading pairs per server, reflecting heartbeat delays, than by ssh'ing into the screen of every server to watch for heartbeat messages in the console. Many challenges were overcome on this path as well: flask & dash integration, dynamic UI creation (I am not the biggest fan of frontend) with the ability to forward input from newly created forms into redis.

The fall of Rome

And suddenly something changed – in one single day we were no longer able to place orders in time for a few currencies. Either someone is playing with scalping (which, in theory, exchanges try to prevent), or someone was able to prepare an efficient ML model to predict price movements within second-long intervals, or someone is simply faster. Even now it is not the case for all currencies on all exchanges – an individual or institution would have to pull in a lot of money to cover everything – but it affects profitability, so the direct arbitrage approach probably will not bring you too much money. The last attempt I made was to switch trading to a web-socket basis. The idea is very simple – you pull a snapshot of the order book, subscribe for updates, and then maintain the actual state locally. Subscription means an event loop in dedicated threads. Asynchronous code with threads is not the ideal use case for python 2.7, so I devoted a good amount of time to adding an option to show the top bids and asks from both order books in real time for debugging purposes. But several trial sessions with this approach didn't reveal many opportunities for profit, and showed a general instability of the subscription itself, which may lead to re-syncing and is potentially error-prone.
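A very simplified sketch of such a local order book (the update format is an assumption – every exchange has its own): start from a snapshot, then apply incremental updates where a volume of zero removes a price level.

class LocalOrderBook(object):
    def __init__(self, snapshot):
        self.bids = dict(snapshot["bids"])   # price -> volume
        self.asks = dict(snapshot["asks"])

    def apply_update(self, side, price, volume):
        book = self.bids if side == "bid" else self.asks
        if volume == 0:
            book.pop(price, None)            # level disappeared from the book
        else:
            book[price] = volume

    def best_bid(self):
        return max(self.bids) if self.bids else None

    def best_ask(self):
        return min(self.asks) if self.asks else None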

Lessons learned

  1. Exchanges are just another piece of software with all the related issues of big complex systems – reliability, consistency, performance. So you have to deal with them with all possible precautions: stale order books, delays in balance updates and order placement, error management.
  2. A VPS is not real hardware – you still share resources with other tenants. The first time I saw the steal-time metric in htop!
  3. Python for mid-size projects – great speed of development, rich ecosystem: google it and use it. Be careful, as you never know how that guy implemented the function you need inside the library – if you care about performance, try to minimise external dependencies as much as you can.
  4. The implementation of some built-in algorithms may not be what you expect – check set intersection or insort.
  5. It is useful to use some weird default values: -1, 100500 – when you stumble across them in logs, you are (almost) 100% sure that something went wrong.
  6. If not sure – better just shut the process down.
  7. Shell programming – an ancient mortal art that can save a lot of time on supplementary tasks, for example analysing logs.
  8. Bugs related to the lack of types – stupid and annoying errors that take a long time to troubleshoot.
  9. Code indentation – when a block of code is moved inside or outside of a loop or condition by mistake – a unique «feature» of python that also requires thorough investigation and is not so easy to find.
  10. Python and dates – timezone handling is a never-ending source of surprises.
  11. Python and floats – see the section on floating point above.
  12. The cost of a function call may be high – inlining may lead to code duplication but may be faster.

On interview preparation: nuances of python

From my perspective, mastering a particular language is still a secondary skill for a programmer.
For sure it will affect the architecture of the software and may add advantages for some use cases of handling data flows, but overall, in the long run, a decent system design coupled with carefully chosen algorithms is much more important.

Nonetheless, within this article I want to highlight several topics that usually distinguish a python veteran from a novice. Part of these questions I was asked during recent interviews; some of them I have asked myself to distinguish people who call themselves python experts from those who really are, based on long exciting hours of debugging sessions on real products.

Why do I want to start with python and not serious languages like c++ or scala?
Currently it is the de-facto standard second language accompanying a more performant one, and
in many domains it is considered a primary language because of its low entry barrier,
rich ecosystem and variety of libraries.

Interpreters:

  • CPython – the default one
  • PyPy – if you need a somewhat faster python (a JIT compiler with a faster version of the garbage collector)
  • Jython / IronPython – in case you need integration with the Java or .NET world
  • Cython – you need more speed, numpy, pandas and scikit cannot fulfil what you need,
    but you are still afraid to accept that c++ may be your best friend.

Dependencies management:

The standard practice is to isolate dependencies within a dedicated sandbox – a virtual environment with all required packages and the python interpreter itself installed in such a way that it doesn't conflict with system-wide installed libraries.

virtualenv is mainly used for this, but virtualenvwrapper may be handier. In case it is necessary to maintain several python versions simultaneously, pyenv can help.

requirements.txt should contain the list of all dependencies.
Using --index-url you can specify your own private package server instead of the default one (pypi), including private repos with access through ssh, or just a local file path:

file:///path/to/package#egg=my_package

Distribution and Installation:

  • setup.py – during automatic package installation requirements.txt is not analysed at all – what really matters is the setup.py file used by setuptools. It's worth being aware of the install_requires directive, which should contain the list of all dependencies (a minimal example follows this list).
  • wheel (or egg) as common distribution formats.
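A minimal setup.py sketch (the package name and dependency list are placeholders):

from setuptools import setup, find_packages

setup(
    name="my_package",
    version="0.1.0",
    packages=find_packages(),
    install_requires=[
        "requests>=2.0",
        "redis",
    ],
)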

Popular language’s caveats:

Mutable vs immutable datatypes

You can't change the int or string value of some variable – you have to replace it with a new value. But you can add an element to a list or remove a key from a dict. Custom objects are usually mutable – you can go crazy altering their fields or adding new methods.

Why does it matter?

Mutating default arguments – a perfect source of hard-to-find bugs:

def some_method(b=[]):     # WRONG, use b=None instead
    b.append(1)
    print(b)
    return sum(b)

for x in xrange(3):
    some_method()

Arguments are passed into functions by reference, i.e. not as a copy of the object.
If an argument is immutable and you try to modify it inside the function, a new local instance is created – this procedure is called rebinding; if it is mutable – you can end up modifying the originally submitted value.

How can we check whether two variables actually reference the same object?

x = 100500
y = 100500

id(x) == id(y) # False, values are the same but objects are different

The id function returns the identity of an object (sort of the virtual address of the variable, if you are used to C terminology).

z = x
id(z) == id(x) # True – z references the same object as x
z = 127
id(z) == id(x) # False, as we re-assigned z to another value

Note however:

x = 250
y = 250

id(x) == id(y) # will be True 
"""because of optimisation of interpreter for small numbers:
https://docs.python.org/2/c-api/int.html"""

x = "STR1"
y = "STR1"

id(x) == id(y) # may be True as well
"""
interpreter's implementation specific
for immutable types, operations that compute new values may 
actually return a reference to any existing object with 
the same type and value:
https://docs.python.org/2/reference/datamodel.html#objects-values-and-types
"""

The is operator does exactly this – it compares the ids of two objects, i.e. whether they both reference exactly the same memory cell.

With these brief ideas about values and references, we can tackle the equality operation:

class A():
    def __init__(self, a):
        self.a = a

x1 = A(1)
x2 = A(1)

# Both will be False
x1 == x2
x1 is x2 # equal to id(x1) == id(x2)

By default, the == operator invokes the __eq__ method, whose default implementation compares the ids of the objects.

Truthiness vs “is None”

class A:
    def __init__(self, a):
        self.a = a

class B(object):
    def __nonzero__(self):        # python 2 name; in python 3 define __bool__
        return False
    __bool__ = __nonzero__

l = [float('nan'), 0, 13, 0.0001, "", "False", [], A(None), B(), False]

for entry in l:
    if entry:
        print("{} - True".format(entry))
    else:
        print("{} - False".format(entry))

is None is nothing more than a straightforward check of whether the variable references the same object as None. But what checks are performed during a truthiness check of some_object? Under the hood the interpreter just evaluates the object – it checks for an implementation of __bool__ (__nonzero__ in python 2) or __len__; if neither is present, the object is treated as True.

Hashing in python

A common question to confuse people – why is hash defined by default for immutable objects only?
The reason becomes very obvious if you think about the example below:

class A():
    def __init__(self, a):
        self.a = a

    def __hash__(self):
        return hash(self.a)

    def __eq__(self, other):
        print("__eq__ invoked!")
        return self.__class__ == other.__class__ and self.a == other.a

container = set()
c = A(1)
d = A(10)
container.add(c)
c in container # True
d in container # False
c.a = 123
c in container # False

I.e. after we change the value of the object's attribute, we can't really check whether it is in the container. That makes sense! But what if we do this:

d.a = 1
A(1) in container # False, and call of eq method
d in container # False, and call of eq method

I.e. such checks always invoke __eq__ when hashes are equal, in order to deal with possible hash collisions.

What it means from a practical point of view:

  • if object1 equals object2, their hashes must also be the same
  • in theory it is possible to use the identity of an object for hash computation and compare values in case of hash collisions

What will happen if we try to compare lists?

a = [1, 2, 3]
b = [3, 2, 1]
c = [1, 2, 3]

a == b # False
a == c # True

Equality is checked for every pair of elements with the same index – so order matters!

Copy – shallow and deep

What if we add this to our equation?

d = a
a == d # True
print("{} = {}".format(id(a), id(d)))

I.e. d and a reference the same address.
Because of that, if we add an element to a, d will also have this element.

It means that when we need two independent instances of the same list, we have to copy it –
with copy for a flat list, or deepcopy when the elements themselves are mutable. The same is true for every mutable type (or collection).
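For example:

import copy

a = [[1, 2], [3, 4]]
b = copy.copy(a)        # shallow: new outer list, same inner lists
c = copy.deepcopy(a)    # deep: inner lists are copied as well

a[0].append(99)
print(b[0])   # [1, 2, 99] -- the shallow copy shares the nested list
print(c[0])   # [1, 2]     -- the deep copy is fully independent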

Python's sort-of name mangling:

Python offers an interesting approach to preventing name collisions in case of inheritance.
Knowing this trick can save you time while working with someone else's code.

class B():
    __attr1 = None

u = B()
u.__attr1         # AttributeError: B instance has no attribute '__attr1'
u._B__attr1       # None – reachable under the mangled name
u._B__attr1 = 123 # all right then

Slicing tricks:

Reversing string, pythonic way:

a = "Acapulco"
a[::-1]

And familiarity with slice notation

array[start:end:step]

may be beneficial by providing very compact code:

g = [x for x in xrange(10)] # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
g[::2] # Even only [0, 2, 4, 6, 8]
g[-3:] # Last three elements [7, 8, 9]

Init a list with repeated values:

b = [-13] * 10

But don't fall into the trap of using:

b2d = [[-13] * 10] * 10  # WRONG! will create 10 references to the same list

this is a valid way of doing it:

b2d = [[-13] * 10 for _ in range(10)]

How to work with resources using with notation:

with open("in.txt", "r") as input, open("out.txt", "w") as output:
    output.write(input.read())

Benefits – implicit handling of resource release in case of any exception.

And yeah, nothing stops you from adding similar functionality to your own objects –
just add __enter__ and __exit__ methods to them, as in the sketch below.
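A tiny illustration (the class name is made up):

import time

class Timer(object):
    def __enter__(self):
        self.start = time.time()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        print("took {:.3f}s".format(time.time() - self.start))
        return False   # do not suppress exceptions

with Timer():
    sum(xrange(10 ** 6))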

Talking about exceptions:

Keep in mind:

  • the possibility to rely on the exception inheritance hierarchy
  • using else, which is executed when no exception is raised within the try block
  • finally is similar to what you may have seen in other languages – it will be executed no matter what.
import random


class MySimpleException(Exception):
    """Base class"""


class MyMediumException(MySimpleException):
    pass


class MyCriticalException(MySimpleException):
    pass


class MyCustomException(Exception):
    pass


def some_method():
    res = random.randint(0, 10)

    if res == 0:
        return res

    if 1 <= res < 3:
        raise MySimpleException("MySimpleException")
    elif 3 <= res < 6:
        raise MyMediumException("MyMediumException")
    elif 6 <= res < 9:
        raise MyCriticalException("MyCriticalException")

    raise MyCustomException("Bad luck today!")


def exception_handling_example():
    try:
        some_method()
    except MySimpleException as e:
        print("Any exception within the class hierarchy: " + str(e))
    except Exception as e:
        print("Any other exception: " + str(e))
    else:
        print("No exception at all")
    finally:
        print("This will be executed no matter what")


for x in xrange(10):
    exception_handling_example()

Iterators, generators & yield:

x = range(1000000) # Python 2.7.*
import sys
sys.getsizeof(x) # 8000072

def my_generator(n):
    num = 0
    while num < n:
        yield num
        num += 1

x = my_generator(1000000)
sys.getsizeof(x) # 80

range will create a list with all elements in memory.
my_generator(1000000) will create a special object – a generator – that can be used in a loop;
in our example it will lazily produce values from 0 up to 1000000 – 1:

next(x) # 0
next(x) # 1
next(x) # 2

Many are familiar with the syntax of list comprehensions but may find this one a bit confusing:

a = (u * 2 for u in g)

but it is a comprehension as well – a generator expression, it just returns a generator.

This, however, is not an iterator:

x = xrange(1000000) # type 'xrange'
next(x)             # TypeError: xrange object is not an iterator
x = iter(x)
next(x)             # 0

An iterator is an object that implements both the __iter__ and __next__ (next in python 2) methods.

Generators are just a particular case of coroutines:

  • within a method we have a list of instructions,
  • when we hit return, every subsequent call of this method re-executes all those instructions from the beginning,
  • when we hit yield, the current state of the method (the last executed instruction and the values of local variables) is saved, and a subsequent call continues from that statement.

Few words about import in python:

First of all a bit of terminology:

  • module – just a *.py file (or a C extension – *.pyd or *.so; built-ins are usually C extensions as well)
  • package – a set of modules under some folder (for python 2.7 it must contain an __init__.py file)

When we write:

import X                # plain import: access members as X.member
import X.Y.Z as xyz     # just a shortcut to use xyz.method(args)
from . import Y         # bring into the current scope a method, variable or complete module Y from the current package
from X import *         # bring into the current scope EVERYTHING from module X, including its sub-imports

  1. Import searches for the module in the sys.modules cache, which contains previously imported modules.
  2. If it is not found there – it searches among the built-in modules.
  3. And as the last hope – it searches by filename in every directory mentioned in sys.path, in order of appearance (if two paths contain different versions of the same module, the first one found wins).

When the module to be loaded is finally found, it is parsed, the interpreter sets a few variables before execution – for example __name__ – and finally executes it; hence the famous check for main:

# executed when we imported this module from other code:
# import x.y.z.name_of_this_script, 
# __name__ = "x.y.z.name_of_this_script"
print("module loaded")

if __name__ == '__main__':
    # executed when we run "python ./x/y/z/name_of_this_script.py"
    print("module executed as main")

Parsing may be slow, so the interpreter first checks for a pre-cached version – a *.pyc file (validating the module's last-modified time against the timestamp of the cached version).

Importing changes global state and therefore performs some locking to ensure that nothing
gets screwed up – even if the module to be imported is already in the sys.modules cache.

Usually two types of import are distinguished:

from a.b.c.d import Z   # absolute – can be a bit verbose
from . import Z         # relative – add Z from the current package

This is just syntactic sugar for more compact code:

import X.Y.Z as xyz
xyz.some_method()

from X.Y import Z
Z.some_method()

The module will still be loaded fully – not just Z, but X, all imports of X, then Y with all its imports, and finally Z (with all its imports as well).

Circular imports

Imagine following project structure:

project
    module1
        __init__.py
        script1.py – contains line from module2.script2 import method2
    module2
        __init__.py
        script2.py - contains line from module1.script1 import method1

python -m module1.script1

When you try to run script1, python will try to resolve all imports first:

  • ok, we need to import module2.script2 – let's load all imports for it,
  • within that script we have to import module1.script1 – let's load all imports for it,
  • within that script we have to import module2.script2 – let's load all imports for it,

In most cases it is a matter of proper design, but nonetheless changing the import to

import module2.script2

will break the cycle: the module object is already registered in sys.modules, and its attributes are looked up later, at call time, when loading has finished.

Another common approach to fighting import issues is deferred importing:

# imports
# some method

def very_important_method():
    from another_module import smth
    # code

Lambdas

f = lambda x: x.some_attribute
type(f) # type "function"

A lambda is limited to a single expression – it cannot contain statements, and the value of that expression is what the lambda returns. Examples of usage: map, filter, sorted.
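For example (python 2 shown, where map and filter return lists):

pairs = [("BTC", 3), ("ETH", 1), ("XRP", 2)]

sorted(pairs, key=lambda p: p[1])      # [('ETH', 1), ('XRP', 2), ('BTC', 3)]
filter(lambda p: p[1] > 1, pairs)      # [('BTC', 3), ('XRP', 2)]
map(lambda p: p[0].lower(), pairs)     # ['btc', 'eth', 'xrp']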

Memory optimisation

Python is not the most efficient when we talk about performance or RAM consumption:

import sys

class A(object):
    pass

a = A()

sys.getsizeof(A) # 904 0_o
A.__dict__
dict_proxy({'__dict__': <attribute '__dict__' of 'A' objects>, '__module__': '__main__', '__weakref__': <attribute '__weakref__' of 'A' objects>, '__doc__': None})

sys.getsizeof(a) # 64
sys.getsizeof(a.__dict__)   # 280
# but we have freedom to add object and method of object's instance
a.new_int_attribute = 123

Main approach is to explicitly define attributes that should be available for object using __slots__ specifier:

class B(object):
    __slots__ = ['a']
    def __init__(self, a):
        self.a = a

sys.getsizeof(B) # 944
B.__dict__
dict_proxy({'a': <member 'a' of 'B' objects>, '__module__': '__main__', '__slots__': ['a'], '__init__': <function __init__ at 0x109a287d0>, '__doc__': None})

b = B(1)
sys.getsizeof(b)  # 56
sys.getsizeof(b.__dict__) # AttributeError: 'B' object has no attribute '__dict__'

b.new_attr = 23 # AttributeError: 'B' object has no attribute 'new_attr'

Note that apart from __dict__, another attribute will be missing from objects whose class defines __slots__: __weakref__ – the object containing ALL weak references to the current object.

Additionally it is possible to use something like namedlist – to create objects that do not participate in garbage collection – a full comparison of various approaches to saving a bit of memory is discussed here. Usually for profiling it is sufficient to use a dedicated library like profilehooks.

But nothing is stopping you from diving into the byte code as well:

def a():
    print("a")
print(a.__code__)    # code object a at 0x109b2bc30
import dis
dis.dis(a)
  2           0 LOAD_CONST               1 ('a')
              3 PRINT_ITEM
              4 PRINT_NEWLINE
              5 LOAD_CONST               0 (None)
              8 RETURN_VALUE

compile("print('Hello world')", '', 'single')   # code object <module> at 0x10db313b0
exec compile("print('Hello world')", '', 'single')   # Hello world

Python is a language with automatic garbage collection. It has two key components: reference counting and generational garbage collection. When del is invoked or a variable goes out of scope, nothing is actually deleted right away – first the reference count is decremented; when it reaches zero, the memory occupied by the object becomes available to the interpreter again. A programmer can disable the generational GC completely; also note that in python 2 a class that defines a __del__ method effectively opts its instances out of cyclic collection (cycles containing them end up in gc.garbage):

import gc
gc.disable()   # switch the generational GC off completely
# OR
class A(object):
    # in python 2, instances of a class with __del__ that form
    # reference cycles will not be collected by the cyclic GC
    def __del__(self):
        pass   # cleanup details

The gc module allows you to trigger garbage collection manually or to hunt down possible leaks. It is implementation-specific, but generally memory will not be returned to the OS.

Threading

Python bytecode execution is effectively single-threaded – the famous GIL allows only one thread to execute python code at a time. So it may not be the perfect choice for compute-bound applications. But for I/O-bound load, threads in python are good enough, especially together with gevent in python 2.7 or asyncio in python 3. The alternative is the multiprocessing module. Or some more suitable language!
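A rough illustration of the I/O-bound case (the URLs are placeholders): a thread pool overlaps the network waits, because the GIL is released during I/O.

from multiprocessing.dummy import Pool as ThreadPool   # thread-based pool
import requests

urls = ["https://api.exchange-a.example/ticker",
        "https://api.exchange-b.example/ticker"]

pool = ThreadPool(4)
responses = pool.map(requests.get, urls)
pool.close()
pool.join()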

Function definitions:

Positional and named arguments – that should be simple:

def method(a, b, c):
    print("a:{a}b:{b}c:{c}".format(a=a, b=b, c=c))

method(b=1,c=5, a="whatever")   # named: will print a:whateverb:1c:5
method(1,5,"whatever")          # positional: will print a:1b:5c:whatever

*args and **kwargs are used to accept a variable number of arguments: *args collects the positional ones, **kwargs the named ones.

def method(*args, **kwargs):
    for _ in args:
        print(_)

    for k, v in kwargs.items():
        print("{} = {}".format(k, v))


method(1,2,3, a="a", b="b", c="c")
1
2
3
a = a
c = c
b = b

Other

Few useful built-in modules to be aware of and not re-invent the wheel:

  • argparse, configparser
  • itertools – permutations, combinations, product, imap
  • collections – defaultdict, Counter, OrderedDict, namedtuple
  • re – regular expressions
  • os, sys, random

Not so standard but handy:

  • requests – to talk to the web via POST & GET
  • scapy – for playing with tcp packets
  • scrapy / beautiful soup – to crawl the web
  • numpy / pandas – for memory-optimised collections
  • nose – if the standard unittest is not enough
  • dash – visualisation

P.S.: Which version to use, 2 vs 3?

As of now – the middle of 2019 – it is more of a historical remark; python 3 brings:

  • chained exceptions in stack traces
  • an "unorderable types" error for comparisons like 1 > '2'
  • yield from generator_method()
  • faulthandler – tracing fatal signals (except SIGKILL)
  • more handy modules: ipaddress, lru_cache, enum, pathlib, dataclasses
  • unicode strings by default
  • type hints & annotations
  • iterable unpacking
  • dict items order guaranteed (3.7+)
  • speed

On interview preparation: algorithms and data structures recap

With this article I want to go a little bit further than the widespread memos “X most common data structures for interviews”. I am not only interested in information about asymptotic complexity. The real value is understanding whether a particular algorithm or data structure can be beneficial in a particular case: real-life tasks are rarely given with hints like “Use directed graphs, Luke!”, or with the comparison function you should implement so that your favourite sort algorithm works according to the upper-bound complexity from textbooks. On this page you can find not just a theoretical high-level summary, but also a few facts that I consider important, practical, or am just afraid to forget.

Sorting

A few considerations worth noting before speculating about the “best” sorting algorithm:

  • How much data are we dealing with?
    • Dozens of numbers or billions?
  • What is the nature of the data – i.e. what is the cost of comparing elements?
    • Integers / strings / complex objects? Comparison may even involve disk seeks, in which case we should aim for algorithms that minimise comparisons, or even stick with non-comparison-based algorithms.
  • What is the cost of moving elements?
    • Should we aim for algorithms with a minimum number of swaps / data copies?
  • What is the current state of the data? Is it partially sorted?
    • With partially sorted data some algorithms hit their upper bounds, while the adaptive class of algorithms handles such input efficiently.
  • Does the data allow random access?
    • For example: linked list vs array. Classical quicksort on a linked list will always be inefficient.
  • Does the data fit in memory?
    • Otherwise we need modifications of the algorithms that support external sorting.
  • Are we dealing with a memory-constrained environment?
    • Can we allow auxiliary space, or do we have to stick with in-place algorithms?
  • Are duplicate values possible in the array to be sorted?
    • Do we need stable sorting – i.e. should duplicates keep their initial order from the input array?

Comparison-based algorithms – i.e. we compare whole values of the input array (as opposed to individual bits of those values). The lower bound on worst-case complexity of any comparison-based sorting algorithm is N log N.

Comparison-based algorithms that are out of our league – not used in practice, but sometimes asked about during interviews:

Bubble sort – best O(N), average O(N^2), worst O(N^2); auxiliary space O(1).
In-place. Stable.

Selection sort – best O(N^2), average O(N^2), worst O(N^2); auxiliary space O(1).
At the i-th iteration find the index MIN of the smallest remaining entry and swap the i-th element with MIN.
N exchanges. (Stable for lists.)
In-place. Non-stable.

Practical comparison-based algorithms:

Quicksort – best O(N log N), average O(N log N), worst O(N^2); auxiliary space O(log N) (probabilistic guarantee).
Pick a pivot, partition the array so that all elements less than the pivot go to the left subarray and the bigger ones to the right. Apply recursively.
The pivot may be chosen randomly, but there are better strategies, like median of medians.
In-place. Non-stable.

3-way quicksort – best O(K * N), worst O((2^K) * N); auxiliary space O(log N).
Used in case of duplicate keys. Pick a pivot element and partition elements into three sets: less than, equal to or greater than the pivot. There are variants with two pivots.
In-place. Non-stable.

Mergesort – best, average and worst O(N log N); auxiliary space O(N) (NOTE: O(1) for linked lists).
Divide the array into two sub-arrays, sort them recursively and merge in sorted order.
In the simplest version – until sub-arrays reach size 1; in a more optimised one – until they reach a threshold size (10-20 elements) where another sorting algorithm can be applied (insertion sort, for example).
Easy to modify for parallel and external sorting (i.e. data does not fit in memory).
Fewer comparisons than quicksort, more element transfers.
Not in-place. Stable.

Heapsort – best, average and worst O(N log N); auxiliary space O(1).
Construct a max-heap on top of the array (a complete binary tree encoded in the array indexes). After constructing, repeatedly remove the root from the heap until it is empty.
In-place. Non-stable.

Insertion sort – best O(N), average O(N^2), worst O(N^2); auxiliary space O(1).
At the i-th iteration swap the i-th element with each larger entry to its left.
Useful for partially sorted or small arrays – fewer than 10 elements.
Adaptive. In-place. Stable.

Shell sort – best O(N), worst O(N^(3/2)) to O(N * (log N)^2); auxiliary space O(1).
Like insertion sort, but relies on a D-order sort, where D is the gap between sorted entries. All estimations depend on the chosen gap sequence.
Used in embedded devices due to its non-recursive nature – no need to support a deep and memory-hungry call stack.
Adaptive. In-place. Non-stable.

Timsort – best O(N), average and worst O(N log N); auxiliary space O(N).
Based on merge sort and insertion sort: finds consecutive runs of ordered elements and merges them to decrease the processing needed for the final result.
Used in Java and Python as the default sorting algorithm.
Adaptive. Not in-place. Stable.

Introsort – best O(N), average and worst O(N log N); auxiliary space O(N).
Hybrid of quicksort and heapsort. Used in C++ as the default implementation of sorting.
Adaptive. Not in-place. Non-stable.

Treesort – best, average and worst O(N log N); auxiliary space O(N).
Build a binary search tree from the elements and traverse it in-order.
There are variations with self-balancing binary search trees or Splay trees (adaptive). Estimations assume balanced trees.
Not in-place. Stable.

Cubesort – worst O(N log N); auxiliary space O(N).
Parallel sorting algorithm.
Not in-place. Stable.

Non-comparison-based sorting algorithms. For these algorithms we do not compare whole keys (values) but use individual bits of the values (for example characters of strings or digits of numbers) – this allows us to achieve linear O(N) time complexity. The price for it is the necessity to tune the algorithm for every new type.

Radix sort – best, average and worst O(N * K); auxiliary space O(N + K).
Examines individual digits of keys; K – the max number of digits in the array's elements.
Applicable to integer numbers, strings and floats in special formatting.
Worth using when N >> K and K is fixed – the famous question about sorting a million 32-bit numbers.
MSD vs LSD – most significant vs least significant digit first.
In-place. Stable.

Bucket sort – best O(N + K), average O(N + K), worst O(N^2); auxiliary space O(N * K).
Partition the array into a number of buckets; each bucket is sorted either by recursive application of bucket sort or using another algorithm.
Can be used as an external sorting algorithm. Asymptotics are given for the case K ~ N.
Applicable when the data is evenly distributed over a range.

Counting sort – best, average and worst O(N + K); auxiliary space O(N + K).
Count the number of occurrences of each key and use these counts to determine the positions of elements.
Asymptotics are given for the case K ~ N. Applicable when N ~ K.
Stable.

N-th biggest\smallest element

just variations of partial sort:

  • QuickSelect – worst case O(N^2), best and average O(N) – works like quicksort with pivot-based partitioning, but recurses into one side only (see the sketch after this list)
  • Median of medians – improves the worst case of quickselect to O(N) – partition the input into groups of several (5) elements, find the median within every group, and take the median of the N/5 medians as the pivot
  • Introselect – backs nth_element in the C++ STL – starts as classical quickselect and switches to the median-of-medians method when the recursion gets too deep while the sizes of the sub-partitions observed so far are not halving
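A compact quickselect sketch (random pivot, zero-based k; written with list comprehensions for clarity rather than in-place partitioning):

import random

def quickselect(arr, k):
    arr = list(arr)
    while True:
        pivot = arr[random.randrange(len(arr))]
        less = [x for x in arr if x < pivot]
        equal = [x for x in arr if x == pivot]
        if k < len(less):
            arr = less
        elif k < len(less) + len(equal):
            return pivot
        else:
            k -= len(less) + len(equal)
            arr = [x for x in arr if x > pivot]

quickselect([7, 1, 5, 3, 9], 2)   # 5 -- the third smallest element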

Hashing

Family of hashing functions: H(x) = ((a * x + b) mod p) mod N, with a, b < p, a != 0, and p a prime number – used to reduce the number of collisions (how well it works depends on the distribution of your keys).
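Written out in code (a, b, p and the table size are illustrative values):

P = 2 ** 61 - 1      # a large prime

def make_hash(a, b, n_buckets):
    def h(x):
        return ((a * x + b) % P) % n_buckets
    return h

h = make_hash(a=911, b=31, n_buckets=1024)
h(100500)   # deterministic bucket index in [0, 1023]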

Two main approaches: separate chaining vs open addressing (linear probing, possibly with double hashing).

  • Separate chaining – an array of lists; in the very worst case O(N) to retrieve a record!
  • Open addressing – the table itself is used as a circular buffer: in case of a collision try to insert the record into the next unoccupied cell – which leads to poor performance with a badly chosen hash function. Load factor – how it performs when 25/50/80% of the cells are occupied. Used for the hashtable implementation in python.

Separate chaining advantages:

  • Easy implementation of delete method
  • Performance degrades gracefully
  • Clustering is less sensitive to poorly designed hash functions

Linear probing advantages:

  • Less wasted space (pointers)
  • Better cache performance

There are various use cases & requirements for hashing that affect characteristics of good hash function:

  • Quickly insert and return records mapping a key to a value – cheap computation of the hash and a low number of collisions
  • Find similar or close entries – here we need to maximise collisions for keys that meet some criteria – locality-sensitive hashing (for example GeoHash), with a rolling hash to speed up computations; another famous example: the Rabin-Karp string search algorithm.
  • Make it difficult to find the key corresponding to a particular hash value – a one-way function – cryptography-related hashing: fixed-size output for any input, no collisions, and at the same time it is almost impossible (time- and computation-wise) to find an input that produces a particular hash
  • A cornerstone of more advanced data structures: bloom filter, merkle tree, distributed hash table.
  • An interesting overview of the details of hash implementations in various languages can be found here

Heaps

The root always contains the element that satisfies the heap property: min-heap or max-heap. Elements are not sorted, only partially ordered. A heap can be constructed in O(N) using Floyd's algorithm. Usually used as the underlying structure for priority queues. The leftist heap is an example of a purely functional data structure.

For every heap implementation below: find-max / extract-max / increase-key / insert / delete / merge.

Linked list (unsorted) – O(N) / O(N) / O(1) / O(1) / O(1) / O(1).

Array (sorted) – O(1) / O(1) / O(N) / O(N) / O(1) / O(N + M).

Binary heap – O(1) / O(log N) / O(log N) / O(log N) / O(log N) / O(N + M); heapify: O(N).
Array representation of an ordered complete binary tree – a parent's key is no smaller than its children's keys.

Binomial heap – O(log N) for all of the above operations.
A binomial heap of order K has a root whose children are roots of binomial trees of orders K-1, K-2, …, 2, 1, 0 – in this order. Each binomial tree is a heap that obeys the minimum-heap property: the key of a node is greater than or equal to the key of its parent. There can be only one or zero binomial trees for each order, including zero order.
Supports quick merging of two heaps.

Fibonacci heap – O(1) / O(log N) / O(1) / O(1) / O(log N) / O(1).
A collection of trees; used for priority queues. The degree of a node (the number of its children) is kept low: every node has degree at most O(log N), and the size of a subtree rooted in a node of degree K is at least F(K) + 2, where F(K) is the K-th Fibonacci number.
The amortised constant factor affects practical usage. High memory consumption.

Tree

Full tree – every node except the leaves has non-null children. Complete tree – a full tree where all nodes at the last level lean left.

Tree traversal:

  • Pre-order – via dfs
  • In-order – via dfs
  • Post-order – via dfs
  • Level-order – via bfs

Search data structures

  • Size of the elements (read: the cost of access / copy / compare operations)
  • Volume of data: can it fit in RAM?
  • Usage patterns:
    • evenly mixed reads & writes, or a skew towards particular operations?
    • access a single key or a range?
    • access a value by the whole key, or find all values where even a part of the key matches?
For every structure below: average / worst time complexity of indexing, search, insertion and deletion, plus auxiliary space.

Array – indexing O(1), search / insertion / deletion O(N), both average and worst case; space O(N).
Cache friendly, low memory overhead. Pre-allocated vs dynamic.

Linked list – indexing and search O(N), insertion and deletion O(1), both average and worst case; space O(N).
Size of data vs size of pointer(s). One-directional vs bi-directional vs circular. Sequential access vs random. Internal or external storage (i.e. the node contains only a reference to the actual data).
Foundation of stack, queue and deque.

Skip list – average O(log N) for all operations, worst O(N); space O(N log N).
Allows fast search using an additional hierarchy of references: next, the one after next, every 4th, …; the i-th level should contain 2^i elements, but probabilistic skips are used – i.e. at the i-th level the next reference may point to the 2^i-th element or further.
An alternative to hash tables and balanced trees – the theoretical worst case is worse, but in practice it is usually faster – it depends on the nature of the data and the cost of the comparison operator.
Foundation of various lock-free implementations of dictionaries and priority queues. Used in Redis, LevelDB and Lucene.

Hash table (symbol table, associative array) – search / insertion / deletion O(1) average, O(N) worst; space O(N).
Not the best choice for a small number of keys in comparison with direct mapping of key -> array index. In practice a particular operation can be slow because of array resizing, and the cache-unfriendly nature may lead to poor performance even compared with an O(N) brute-force lookup over a plain array.
C++: unordered_map, multimap. Python: dict. Java: HashMap.

Binary search tree – average O(log N) for all operations, worst O(N); space O(N).
Complete if it is perfectly balanced, except for the bottom level. In the worst case (depending on insertion order) its height can be N. Construction is analogous to quicksort. Application: 1D range search.

Cartesian tree – search / insertion / deletion O(log N) average, O(N) worst; space O(N).
Heap-ordered, while a symmetric (in-order) traversal of the tree returns the original sequence. Used for RMQ (range minimum query) kinds of tasks.

B-tree – O(log N) for all operations, both average and worst case; space O(N).
Not just two children any more => the height of the tree is lower => lookups are faster. Various modifications exist.
B+-tree – actual data is stored in the leaves, intermediate nodes contain copies of keys. A core data structure for relational databases, as it allows minimising disk seeks – implemented in Berkeley DB.

2-3 tree – O(C * lg N) for all operations, both average and worst case.
Maintains symmetric order and perfect balance.

Red-black tree – O(log N) for all operations, both average and worst case; space O(N).
Balance is preserved by painting nodes in red/black colours so that some predefined properties are satisfied; re-arranging and re-colouring can be performed efficiently.
For example: no node has two red links connected to it; red links lean left; every path from the root to a null link has the same number of black links – perfect black balance.
C++: set and map. Java: TreeMap and TreeSet (and in some situations HashMap falls back on them as well!).

Splay tree – O(log N) (amortised) for all operations; space O(N).
A self-adjusting binary search tree with the additional property that recently accessed elements are quick to access again.
Applications: caches, garbage collectors, memory allocators.

AVL tree – O(log N) for all operations, both average and worst case; space O(N).
Balance invariant: the heights of the two child subtrees of any node differ by at most one. AVL trees are better than red-black trees for lookup-intensive applications.

Kd-tree – search / insertion / deletion O(log N) average, O(N) worst; space O(N).
Space-partitioning binary tree. Used for range search and nearest-neighbour search. Construction takes O(N log N). Not suitable for high-dimensional spaces – better to use when N >> 2^k.
Octree for 3D space.

More advanced structures worth being aware of:

  • Trie (Prefix Tree) and its variations, mainly Radix Tree for lookup intensive applications. Worst case – O(M), where M is the length of string to search. DAFSA – can be a great alternative for cases with limited memory.
  • Generalized Suffix Tree – search efficiently with mismatches, applications for various string related tasks: longest common substrings, longest repeated substrings, longest palindromic substring,
  • Fenwick Tree (Binary indexed tree)– common use case: modify element of array and return result of some inverse operations for range of elements (for example partial sum but not max). Simpler and faster in comparison with segment trees. O(N) build time.
  • Segment Tree – classical approach for RMQ kind of tasks, operation not necessary to be inverse. Double memory requirement in comparison to fenwick tree. Constant factor affect speed in practice, especially in cases of multiple dimensions.
  • Implicit Treap – sequence of pairs, heap ordered by 1st entry, BST ordered by the 2nd. If we use index of element in array as first entry – than we can
    • insert\remove element at any position
    • move subset of elements in any position
    • computation of some function for elements within some interval [i, j] – sum, min – including range updates using lazy propagations
    • foundation for Rope implementation – to have similar asymptotic for very long strings
    • and all of the above in O(Log N) – with huge constant factor though
  • vEB Tree – ordered dictionary or associative array much faster (exponentially) than BST, memory overhead
  • Heavy-Light Decomposition: split tree on multiple not-crossing paths. Edge from A to B is considered heavy if number of vertexes reachable from B is at least half as big as number of vertexes reachable from A. Heavy-Light Decomposition – union of light vertexes from bottom of tree to the root. Such decomposition allows us to serve queries like “Get some metric on the path between A and B” – for example sum or max. Not needed in case we do not have update queries. Usually used with segment tree and LCA.

Union Find (Disjoint set)

Approaches: colouring; trees on top of arrays using ranks / randomisation / path compression, merging either randomly or by tree rank. Additional information can be stored with the leaders of the sets.

Applications: offline LCA, offline RMQ, dynamic connectivity, variable dependencies during program analysis, variations of the percolation problem – connections in a social network, fluid or electricity flow.

Graph

Representation: adjacency matrix vs adjacency list. Generally the adjacency list is preferable for sparse graphs, but there is a matrix variation based on an associative array where a pair of vertexes is used as the key.

Possible variations:

  • Complete
  • Directed
  • Weighted
  • Planar
  • Bipartite 

Cut – a set of vertexes whose removal makes the graph disconnected. The size of the minimum cut – connectivity or K-connectivity.

Graph related tasks

  • Find cycles in a graph – union find (complexity – the inverse Ackermann function of the number of vertexes) or DFS – proportional to the number of vertexes
  • Determine whether a graph is bipartite – dfs
  • Eulerian path – a route visiting every edge exactly once (exists if the graph is connected and every vertex has even degree) – such a path can be found in linear time O(|E|)
  • Hamiltonian path – a route visiting every vertex exactly once (as in the travelling salesman problem)
  • Comparison of graphs – graph isomorphism (it is still an open question whether it is in P or NP-complete)
  • Graph representation such that there are no edge crossings except at vertexes (upward planar graph) – there are special cases where such checks can be done in O(N), but in general it is an NP-hard problem.
  • Minimum Spanning Tree (MST) – applications are network design (for example avoiding packet cycles in Ethernet) or clustering with a custom distance function.
    • Prim – repeatedly add the cheapest edge connecting the current tree to a new vertex
    • Kruskal – gradually merge trees from all vertexes, using the cheapest edges for connection

 

For directed graphs:

  • Topological sorting – tasks with ordering constraints, scheduling (DFS) – builds a schedule when there is no cycle – redraw the graph so that all paths go from bottom to top – the reverse DFS postorder (collected in an additional stack) is the result (see the sketch after this list)
  • Path finding – Dijkstra; if weights can be negative – Bellman-Ford or its optimised version, the Shortest Path Faster Algorithm (SPFA)
  • Shortest paths from multiple sources – dense graph with negative weights – Floyd-Warshall; sparse graph – Johnson’s algorithm or Dijkstra from every source
  • Search for cycles – DFS – deadlock detection in concurrent systems, or a prerequisite check for topological sort
  • Maximum flow – Ford–Fulkerson algorithm; bipartite matching problem, baseball elimination
  • Dynamic connectivity
  • 2-SAT – satisfy a system of constraints over boolean variables – scheduling, medicine testing. Can be represented as a directed implication graph; the formula is satisfiable iff no variable ends up in the same strongly connected component as its negation.
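A sketch of the reverse-DFS-postorder approach from the first bullet, with cycle detection added (the recursive lambda via std::function and the function name are purely illustrative):

#include <algorithm>
#include <functional>
#include <vector>

// Topological order of a DAG: append a vertex to the output only after all of its
// descendants have been processed, then reverse that postorder.
// Returns an empty vector if a cycle is detected (no valid ordering exists).
std::vector<int> topo_sort(const std::vector<std::vector<int>>& adj) {
    int n = (int)adj.size();
    std::vector<int> color(n, 0), order;           // 0 = unvisited, 1 = in progress, 2 = done
    bool has_cycle = false;
    std::function<void(int)> dfs = [&](int v) {    // recursion depth is O(V); use an explicit
        color[v] = 1;                              // stack for very deep graphs
        for (int to : adj[v]) {
            if (color[to] == 0) dfs(to);
            else if (color[to] == 1) has_cycle = true;   // back edge => cycle
        }
        color[v] = 2;
        order.push_back(v);                        // postorder
    };
    for (int v = 0; v < n; ++v)
        if (color[v] == 0) dfs(v);
    if (has_cycle) return {};
    std::reverse(order.begin(), order.end());      // reverse postorder = topological order
    return order;
}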

Substring search

  • Knuth-Morris-Pratt (KMP) – based on a DFA (deterministic finite automaton) that is pre-computed for the pattern. O(R + N) time, where R is the length of the pattern and N the length of the text; O(R·|Σ|) space for the DFA (O(R) for the failure-function variant).
  • Boyer-Moore – scans from the end of the pattern and on a mismatch skips up to R characters – the worst case is as bad as brute force, O(R·N), but on average it is considered the most efficient algorithm – roughly O(N / R). O(R) space.
  • Rabin-Karp – compute a rolling hash for every substring of length R; modular hashing with some prime number works well. Horner’s method computes the initial hash in O(R), after which each shift of the window is updated in O(1). A hash match can still be a collision, so there are two variations: verify the match character by character (Las Vegas) or trust the hash (Monte Carlo). Extends naturally to 2D search or to multiple patterns via a pre-computed table of hashes. O(1) space.
  • Aho-Corasick – for the case of multiple patterns searched in a separate text. Build a trie of all the words and extend it into a finite state machine that never backtracks, using suffix links for failed matches. Used in grep for the -F flag.
  • Z-function: for a string A, Z[i] is the number of characters matched when comparing the prefix of A with the suffix of A starting at position i. For substring search: concatenate the pattern, a unique delimiter and the text, and report the positions where the Z-value equals the pattern length (see the sketch below).
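A minimal sketch of the Z-function and of the concatenation trick for substring search described in the last bullet (the '\x01' delimiter is an assumption – any character absent from both strings works):

#include <algorithm>
#include <string>
#include <vector>

// z[i] = length of the longest common prefix of s and the suffix s[i..],
// computed in O(|s|) by reusing the current rightmost match window [l, r).
std::vector<int> z_function(const std::string& s) {
    int n = (int)s.size();
    std::vector<int> z(n, 0);
    for (int i = 1, l = 0, r = 0; i < n; ++i) {
        if (i < r) z[i] = std::min(r - i, z[i - l]);            // reuse previously matched prefix
        while (i + z[i] < n && s[z[i]] == s[i + z[i]]) ++z[i];  // extend the match naively
        if (i + z[i] > r) { l = i; r = i + z[i]; }              // push the window to the right
    }
    return z;
}

// All occurrences of pattern in text: positions where the Z-value of
// pattern + delimiter + text equals the pattern length.
std::vector<int> find_all(const std::string& text, const std::string& pattern) {
    std::string s = pattern + '\x01' + text;
    std::vector<int> z = z_function(s), result;
    int m = (int)pattern.size();
    for (int i = m + 1; i < (int)s.size(); ++i)
        if (z[i] == m) result.push_back(i - m - 1);             // index in the original text
    return result;
}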

Playgrounds for preparations

Interviewing for a programmer position in 2019 – what to prepare for

Autumn and winter 2018-2019 turned out to be unusually busy for me: roughly from October through February I took part in a good hundred interviews for Senior-Principal-Lead level positions. Four of them were from Russia (I was honestly curious how things are here these days), the rest were Europe and the UK. For various reasons I did not want to go further afield, although I did talk to a couple of companies from sunny Australia. In this note I want to share the observations I have accumulated, for those who are thinking about changing jobs and are primarily targeting foreign companies. First a bit of dry statistics, then personal conclusions.

First contact: in about 70 percent of cases company representatives or recruiters found me via my linkedin profile. I reached out to the rest myself – mostly responding to vacancies I found interesting on stackoveflow jobs\hacker news who is hiring\angel.co.

Interview stages

Getting acquainted

An optional chat with the company’s HR person or a recruiter. The goal is to:

  • check that you can string a couple of words together in English (it helps to prepare a short monologue on the topic “the last 5 years of my career”)
  • clarify the scope of the position – whether you only need to write code, help others write it, or, before all of that, first negotiate with the clients what exactly should be built
  • discuss the desired location – there were cases when the stated country suddenly changed to another one, for example for visa reasons or because the most hardcore developers historically settled there
  • make sure you do not have any particular quirks when it comes to communication.
  • in some cases – establish the compensation level, so that there is no mutual disappointment at the end over time wasted for nothing. This is convenient when the company has a strictly fixed budget it cannot exceed, while anything lower is not particularly interesting to you. After the first time I was handed an offer about 40% below what I was looking for, I always raised this question with lesser-known companies before any technical interviews.
  • discuss relocation – what about visas (especially relevant for the States right now), whether they pay for housing at first – and how long that “first” lasts, medical insurance for family members, whether a container is provided in case you want to move with your beloved couch and beer mug collection, whether they help with the accompanying paperwork and with finding housing.

Tech screening – first blood

Top-tier companies – the Googles, Facebooks, Amazons and their kind – do not discuss filthy lucre at the first stage, but they do screen you meticulously on several topics (depending on the position; for a programmer it is always algorithms plus the subject area – the specifics of your favourite language, maths, machine learning, networking, Linux internals and so on) – to make sure you are worth unleashing the engineers on. There are also small ambitious companies that try to arrange an exam on the acronyms of your domain.

The vast majority, however, prefer to rely on automated tests at https://hackerrank.com, https://www.codility.com/ . Usually these are 1-3 problems that you have to solve (write the code, run it and make sure all the tests pass) with a timer inexorably ticking. The problems range from purely algorithmic ones (easy\medium from https://leetcode.com/) to “build us a simple blockchain” or a photo cataloguer. The correctness of the solution is checked by pre-made tests on the platform itself. How will your code behave if a long comes in instead of an int? What if a single negative number sneaks into a sequence of positive ones? The tests may all be visible to you (i.e. you run it, the system shows which test failed, you download the data to check), partially visible (i.e. you see that 99 out of 100 tests failed, but why and what is inside them is left to your imagination), or not shown at all (if you think everything looks good – press the “I am done” button). Timers differ: Amazon gave two hours for two long, tedious regex problems, Yelp 20 minutes for one. There may also be multiple-choice questions, as at Optiver, or even a hands-on session on a live Linux VM, as at Booking.

Some offer a form with questions – along the lines of “how can we use technology X”, or “a program in language Y runs slowly – describe how you would speed it up”. This kills two birds with one stone: it checks basic knowledge of the fundamentals (the questions usually go a level deeper than superficial googling) and your talent for getting your point across briefly and to the point in business English.

Online assessment – the online interview

If you managed to cope with the tremor of racing against the clock, the second stage is either online coding or a so-called take-home problem and its subsequent discussion.

In the first case you are in for several sessions where you will be asked to code something algorithmic and to discuss variations of the solution, with additional conditions thrown in on the fly to assess how extensible your code is. Most companies do not care about a specific language and are interested first of all in a correct algorithm, on the theory that if the person is good, the company’s stack can always be learned. For example problems – in addition to the already mentioned https://hackerrank.com and https://leetcode.com, you can also look at https://careercup.com. The difficulty and number of problems always depend on two factors: the position and your seniority – you may well get one harder problem instead of two or three easy ones.

The take-home problem is already something more or less close to reality – as a rule it uses the stack you are supposed to work with. Automated tests, auto-generated documentation and all the best practices applicable to the language, the build system and the docs – the ones that exist so comfortably in developers’ wildest dreams – are welcome here. The code is usually placed in a private repository, so that your competitors do not repeat your mistakes but make their own, and so that employers do not have to puzzle over the eternal dilemma: how to evaluate a candidate in near-production conditions without making them burn more than a couple of evenings on it. Then a date is set and the results are discussed – whether you solved the right problem at all (I once misread the task in a hurry), whether you actually solved it (again, I had the rather annoying experience of testing superficially, not on all data sets, and missing a fat fatal bug), how ready the code is to be deployed to a production server – many requests, lots of data, poor data quality – and in parallel the details of the language\framework\style are touched upon.

Onsite interview aka face to face

If you succeed, the last line of defence remains. Simpler companies limit this stage to a conversation with your future manager and, once again, HR – behavioral interview, cultural fit – a talent shortage does not mean anyone wants to violate the sacred “No asshole!” rule. Some sample questions:

  • Describe a situation where you screwed up – what lessons did you learn from it
  • What strengths and weaknesses do you think your previous boss\colleagues\subordinates would name
  • What will you do if you feel your relationship with someone on the team is strained
  • What will you do if you disagree with some technical decision? What if it has already been approved?

Those whose finances allow it may invite you to the office to chat on informal topics with representatives of different teams and to see with your own eyes what a typical working day looks like – how people interact with each other. This can introduce some delay because of the need to get a visa – especially if you are not based in Moscow.

Hard mode: on

In addition to all of the above: if the company is big, and its name makes recruiters snap to attention and lets you add an ex-SomeBigName prefix to your title – this is where the fun really begins. You will spend a whole day in the company’s office. You will be talking to fellow techies – representatives of several adjacent teams. Several sessions of solving the already familiar algorithmic problems – to confirm the level demonstrated earlier, but now on a whiteboard, a sheet of paper or a Chromebook, surrounded by bored interviewers. A Q&A session (with the questions aimed at you) on topics specific to the position. Specialised interactive sections – (here you must constantly ask leading questions to pin down the constraints of the problem) – design the architecture of some application, troubleshoot a bug in production. It will be an intense, but not the easiest, day of your career.

After that it gets less interesting – salary negotiation, discussing relocation terms and endless paperwork.

On the funny side

Because of mixed-up time zones I had to answer questions over the phone while jogging with the dog, between rounds in the sauna, and from the car in the first parking lot I could find.

In the UAE everyone applying for a government job is, among other things, tested via the TalentQ platform – my first attempt at passing it made me deeply respect every civil servant of that country. A similar test – the Criteria Cognitive Aptitude Test (CCAT) – is requested, for example, by Crossover before any technical tests. The questions there fall into three groups: logical – find the odd squiggle in a list, insert the missing petroglyph into a row; semi-mathematical – from a table full of numbers compute the arithmetic mean or the percentage difference of something versus something; and, after reading a couple of paragraphs of text, pick out the main idea or someone’s point of view – in the simple cases find an antonym or a synonym to some fancy word.

One of the take-home tasks required pulling some public data, and despite the triviality of the task the server stubbornly spat out timeouts. I imagined a cunning anti-scraping defence and enthusiastically set about randomising the user agent, deployed the service locally (luckily it was open source), checked CORS and cookies. The apotheosis was analysing a tcp dump in wireshark and a belated traceroute to the site. It turned out the site was hosted on an IP from banned ranges and all my browsers had VPN plugins.

Sometimes a rejection came after the résumé review, and a couple of weeks later a recruiter from the same company invited me to have a chat.

As of the publication date, the longest pause between applying and getting a response is three months. The winners in this category requested a whole set of documents (ASAP!11) in order to schedule an interview.

Positioning myself as an architect sometimes misled recruiters, and I was offered positions related to yet another construction project of the century in some Persian Gulf countries.

Some conclusions and observations

https://www.timeanddate.com/worldclock/converter.php is your friend and helper when the biggest headache is figuring out what time it is for whom.

The number of companies offering remote work at rates independent of your location is growing. The tasks are finally starting to differ from “the in-house staff do not want to deal with dreary legacy” towards genuinely cutting-edge work – designing data pipelines for financial platforms, SRE\DevOps for seriously high-load systems, ML for price prediction and recommendations in various industries.

Planning the timeline – for interviews and for considering offers – lets you set tangible deadlines both for yourself and for the companies. It instils discipline.

It is better to arrange the interview schedule so that you first warm up with companies of little interest to you, and only then, at peak form, move on to the dream companies (but before you burn out from the endless “given two integer arrays …”).

There will be many calls and all sorts of fuss from recruiters – they will disappear, suddenly call at the crack of dawn or late at night, and you may well mix up their names and the companies they represent. Interviewing is also work, so help them however you can, understand and forgive. A good recruiter can help you negotiate the package offered by the employer and secure comfortable relocation terms, or offer you something interesting a year later when you have completely forgotten about them.

Russian companies (of those I talked to) are much more fixated on a specific stack, except perhaps Yandex. The level of the questions is quite comparable to the global one. Considering taxes, and even without considering the cost of living in Moscow and St. Petersburg, you can earn more than in Europe, even without managerial stripes.

Oddly enough, you do have to prepare. And there are no miracles here – take Cormen, watch Sedgewick’s lectures, and write code in vim or a plain notepad, in the language you plan to interview in. For philosophical conversations The Architecture of Open Source Applications goes down well; for high load – DDIA; for SRE – Site Reliability Engineering; and my favourite book with the turtle on the cover – Robachevsky’s The Unix Operating System.

P.S. And just to grumble a bit

Modern IDEs dumb developers down – otherwise I cannot explain the appearance of a host of earnest, fundamental guides on how to write programs in a text editor (without autocomplete, syntax highlighting, hints, or the ability to compile and run the program so that it prints the line number of the error).

I still remember the times when questions were often deliberately flawed, testing knowledge of the C++ standards, undefined behaviour, and things that depend on the compiler implementation. Now, to my surprise, a number of questions serve to find out what you will do if there is no internet or Stack Overflow is down, yet something still has to be done about the code.

Sometimes formalism wins – when the person asking a general question wants to hear their one favourite answer, or tests your memory with terminology questions (many people would be very surprised to learn that they are using some of the well-known design patterns or certain database normal forms). Memorable questions of this kind: the CAP theorem, what each letter in SOLID stands for, or listing those famous twelve factors.

Whether the ability to solve the rabbits problem, or to reduce the Penguin-Avia problem to finding a minimum spanning tree within half an hour of an interview, correlates with how the candidate will actually work – I do not know. Whether knowing about the existence of the implicit treap will help you in your job depends heavily on what you will be working on. Will algorithmic preparation broaden your mind – undoubtedly!

Interviews with sane companies (those that bother to read your résumé and give feedback without being prompted) are a great way to keep yourself in shape. 🙂

[docker][cassandra] Reaching mixed load – 750,000 op\sec

The cart goes nowhere because the swan wants to fly in the air, the pike wants to swim underwater and the crawfish wants to crawl backward.

Cassandra performance tuning – the challenge
Cassandra is one of the workhorses of the modern high-load landscape – a NoSQL database that can be plugged into Spark for distributed computing, into Titan for playing with graph data representation and even into Elastic as a search backend.
And if you really care about pure write performance – this is the de facto choice in the world of open-source solutions: production proven, with a big community that has already generated plenty of outdated, controversial answers at SO & mailing lists.

 

No single point of failure, scalability, high availability, retention periods, … – but those marketing claims hide a few principal caveats. Actually, cassandra has only a single drawback(*) – it will not reach its limits with default settings on your hardware. A lonely, single-node configuration is not a use case for cassandra; it shines in a multi-node clustered setup.

If you really want to see full utilization of endless cores and crazy amounts of RAM, you have to use some virtualisation technology to manage the hardware resources.

Let me start with some conclusions and recommendations, based on two months of extensive testing and on observation of trouble tickets after this approach was migrated to production. With those considerations in mind I managed to configure it to tolerate 750k mixed operations per second. The load was generated for more than 8 hours to check pressure tolerance and emulate peaks, as a mix of async inserts without future processing, synced inserts, and read requests.

Frankly speaking, I am sure it is still far from its limit.

Bear in mind that I am talking about Cassandra 2.1.*.

About disks

  1. Use SSD disks as mapped volumes to the docker container. Single container = single dedicated disk.
    It is possible to use multiple disks per container, but it will lead to a 5-15% slowdown.
  2. If you use an SSD disk you can map all cassandra directories to it (saved_caches, data, commitlog) and adjust cassandra.yaml with higher throughput values, in particular: compaction_throughput_mb_per_sec, trickle_fsync
  3. It really depends on data distribution and your data model, but be ready for disk utilization to vary from one node to another by up to 20%
  4. Docker should be configured NOT to use the host’s root partition. Don’t be mean: allocate a dedicated drive for logs and choose a proper storage driver – docker-lvm.
  5. In practice, the cluster starts struggling when any of the nodes runs out of space. Surprisingly, in my experiments it was stable even with 3% free, but in real life it is better to configure your monitoring system to alert at 15-20%.
  6. Choose compaction and compression strategies wisely when you design your db
  7. Be careful with column naming – it will be stored with every god damn row!
  8. Do sizing when you think about the number of nodes (and disks).

About cpus:

  1. More CPUs per node is NOT always good. Stick with 8 cores per node.
    I’ve been experimenting with a single fat supernode per physical server – 48 cores, 4×12, 6×8.
    Six nodes with 8 CPU cores each outperformed all the others in 6 kinds of stress load scenarios.
  2. If you play with the core count you have to adjust a few settings in cassandra.yaml to reflect that number: concurrent_compactors, concurrent_reads, concurrent_writes.
  3. Cassandra in most cases ends up being CPU-bound; don’t forget to leave 8-16 cores for the host system, and allocate CPUs exclusively for the containers using --cpuset-cpus

About RAM:

  1. cassandra-env.sh has a built-in calculation of free memory for adjusting JVM settings, based on analysing the output of the free command. Of course that does not work for a docker-based setup. Bear this in mind and tweak your startup scripts to substitute proper values there.
  2. Disable swap within docker using --memory-swappiness=0
  3. Effectiveness of memory usage depends on the CPU count, on how effectively multithreaded compaction is implemented in Cassandra, and on the reader\writer\compactor settings in your cassandra.yaml, i.e. you can have hundreds of gigabytes of RAM and still end up with OOM. But even with 8 GB of RAM per node you can already see benefits. More RAM means more memtables, a bigger key cache, and more effective OS-level file caching. I would recommend 24 GB of RAM per node.
  4. Disable huge pages at the host system or at least tune your JVM settings:
 echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag

About network

  1. It is mandatory to use the network stack of the host OS via the docker flag --net=host
  2. Most likely the network will not be the bottleneck for your load, so you can stick with virtual interfaces on top of a single real one.

Testing:

  • 3 physical servers: each with 72 cores and 400 GB RAM
  • Cassandra 2.1.15
  • Docker: 1.10
  • Host OS: CentOS 7.5
  • Guest OS: CentOS 7.5
  • Java 8 from Oracle with JNA

Cassandra 3.* is a completely different story – in my opinion mainly because of the storage engine change, but here is a huge list.

DB overview:

  • A dozen keyspaces, each with up to 20(?) tables.
  • A few indexes – better yet, just do not use indexes; design the schema properly
  • Replication factor = 3, gossip configured through a property file
  • Each physical server represents a dedicated rack within a single datacenter.
  • Row cache was disabled in cassandra.yaml, i.e. the first priority was to focus on a write-oriented workload

Tools:

  1. DataStax stress tool with an artificial table – very interesting, but useless; using your own schema is very important
  2. DataStax stress tool + your own table definition – nice, gives hints of production performance. But you are still testing a single table – usually that is not the case in real life.
  3. A self-written in-house stress tool that generates data according to our data model in a randomized fashion + a set of dedicated servers for the ddos, with the ability to switch between async inserts (just do not use batches) with and without acknowledgment.
    Once again: no batch inserts, as they should not be used in production.
  4. Probably you can adapt the Yahoo! Cloud Serving Benchmark. I haven’t played with it.

 

That’s it folks, all the crap below is my working notes and bookmarks.

How to get C++11 on CentOS 7 for stress tool compilation:

Install a recent version of the compiler on CentOS 7: devtoolset-4
Update gcc from version 4.8 on CentOS 6: https://gist.github.com/stephenturner/e3bc5cfacc2dc67eca8b

scl enable devtoolset-4 bash

RAM & swap:

How to clean the OS buffer cache:

echo 3 > /proc/sys/vm/drop_caches
Check system-wide swappiness settings:
more /proc/sys/vm/swappiness

Docker related:

If you are not brave enough to play with DC\OS or OpenStack, you may find docker-compose useful for manipulating a homogeneous set of containers.

Installation:

sudo rpm -iUvh http://dl.fedoraproject.org/pub/epel/7/x86_64/e/epel-release-7-5.noarch.rpm
rpm -qa | grep docker

If you fucked up the partition settings:

wipefs -a /dev/sda1

Docker and mapped volumes: http://container-solutions.com/understanding-volumes-docker/

If docker info | grep loopback shows you something – you have already screwed up the storage driver configuration.

How to check journal what is happening:

journalctl -u docker.service --since "2017-01-01 00:00:00"

Full flags described here.

Useful commands to check docker images:

docker inspect
lvdisplay
lsblk
dmsetup info /dev/dm-13

Cassandra:

How to check heap memory consumption per node:

nodetool cfstats | grep 'off heap memory used' | awk 'NR > 3 {sum += $NF } END { print sum }'

How to check which neighbours are visible from a node:

nodetool -h ring

How to find processes that use swap:

for file in /proc/*/status ; do awk '/VmSwap|Name/{printf $2 " " $3}END{ print ""}' $file; done | awk '$2 {print $1 FS $2}'

Check how much disk space is used, from the Cassandra perspective:

nodetool cfstats | grep 'Space used (total)' | awk '{s+=$NF} END{print s}'

Determine disk usage from the OS point of view:

du -ch /var/lib/cassandra/data/

Cassandra health check:

 ssh nodetool status

Network:

How to open port at centos 7:

firewall-cmd --zone=public --add-port 9160/tcp --permanent
firewall-cmd --zone=public --add-port 9042/tcp --permanent
firewall-cmd --zone=public --add-port 7200/tcp --permanent
firewall-cmd --zone=public --add-port 7000/tcp --permanent

Open ports for spark, master:

firewall-cmd --zone=public --add-port 7077/tcp --permanent
firewall-cmd --zone=public --add-port 8081/tcp --permanent

Apply changes in ip tables:

firewall-cmd --reload

or do this, useful in case the network manager behaves badly:

systemctl restart network.service
systemctl status network.service

And as a bonus –

How to migrate data from old cluster to bright new one

  1. sstableloader
  2. cassandra snapshots
  3. For a tiny dataset, to get a cql file with inserts – cassandradump

The first two approaches represent the standard way of data migration.
The limitation of the first one is speed and the necessity to stop the old node.
The limitation of the second is the necessity to manually deal with the token ring on a per-node basis.

If life was really cruel to you, you can play with data folders per node.

NOTE: if you can replicate the exact same setup – in terms of assigned IPs – it will be enough to just copy cassandra.yaml from the old nodes to the new ones, and use the exact same mapped folders within docker as on the old cluster.

If not – you can still do it by copying the data folders following the steps below, but it is better to just use sstableloader.

  1. In order to do it you have to run the following command on every node to drain the node from the cluster and flush all data to the filesystem:
nodetool drain

NOTE: this is an unofficial, not recommended way to deal with data migration.
NOTE 1: it requires you to have a similar number of nodes in both clusters
NOTE 2: there is no need for the same datacenter\rack config\IP addresses

2. Deploy the docker-based setup according to the HW configuration. The total number of nodes should be equal to the total number of nodes in the old cluster. On the new cluster deploy the exact schema that was deployed on the old cluster.

3. Stop the new cluster.
Within every node’s data folder on the OLD cluster you will have the following folders:
system
system_traces

NOTE: do not touch system* tables.

4. Under the folder /your/cassandra/data-folder/your-keyspace
you should have a set of folders corresponding to the tables of that keyspace, under which the data is stored.

5. You have to copy the content of this folder (*.db, *.sha1, *.txt) for every node of the OLD cluster into the corresponding folder of the NEW cluster node. The UUIDs in the folder names WILL be different.
I.e. old cluster, node 1 to new cluster, node 2:
data copy example
scp /old/cluster/cassandra-folder/data/your-keyspace/your-table-e31522b0e2d511e6967a67ec03b4d2b5/*.* user@:ip/new/cluster/cassandra-folder/data/your-keyspace/your-table-c56f4dd0e61011e6af481f6740589611/

6. The migrated node of the OLD cluster must be stopped, OR you have to run `nodetool drain` on the node being processed, so that all data ends up in the sstables ~ data folders.

Performance monitoring:

  • general system overview: atop or htop
  • Be sure that you understand memory reporting.
  • JMX based monitoring: jconsole
  • Jconsole connection string: service:jmx:rmi:///jndi/rmi://:/jmxrmi
  • dstat network & disk io
  • strace – shows every system call; slows things down; can attach to a running process.
  • netstat -tunapl, lsof -i -P – network\ports per process
  • docker stats – reports cpu\mem\io for container
  • perf + perf-map-agent for java monitoring:
    for example cache misses, more details there:
perf stat -e L1-dcache-load-misses

Articles that I find useful:

 

(*) cassandra has only a single drawback – it has no idea about your data model, whether you configured your data schema correctly, or what your load patterns are. That is why you have to dive into the wonderland of controversial recommendations in blog posts like this one instead of thoroughly reading the documentation first.

 

P. S. do not believe anyone, measure!

 

 

How to stop being a junior – 7 hints of programmer productivity

0) If you don’t know something – don’t be afraid to ask.

Especially if you have already checked the first page of google search results and are pretty sure that no one has asked that question at stackoverflow.
Reinventing the wheel and breaking the stalemate can be a good exercise for your home projects, but in a production environment it is better to check the idea with your mentor before diving into implementation details.

1) Don’t be afraid to show that you have no idea.

real programming – do not leave open questions, ever

Just add a note for yourself to figure it out later.
If you do not understand how something works – obviously this is a gap in your knowledge.
And you can’t just skip it – you are a software engineer – it is your obligation to be aware of what is happening under the hood.

And yes, sometimes you have to do it on your own time.

Professional growth is very simple:

  • step 1 – find something that you don’t know,
  • step 2 – start investigating, discover a bunch of additional mysteries, and repeat step 1

2) Split the problem into several simple questions and address them one by one.

Don’t try to troubleshoot a huge multi-component system with real data to reproduce the problem.
Forget for a minute about the overall complexity of your enormous project and analyze suspicious functions one by one, independently of each other.
Use an online fiddle for your language to check the obscure part of the language with fake data returned by a mock api.

3) Stop wasting your gorgeous time.

If you find yourself googling the same commands over and over again – start making notes.
Create a txt file with the most useful commands and update it whenever you catch yourself googling again.
Search for a ready-to-use cheat sheet or even put one on your desktop as wallpaper.

4) Do not stop investing time in reading proper books. Ever.

Pressure, deadlines, laziness.
This is not for the company, the boss, or for bragging about.
This is your main and primary investment in the future – your knowledge is your treasure.
30 minutes of reading every day is not too much,
but in the long run you will notice that you have become more capable and able to tackle problems that previously looked hard to solve.

from junior to senior – bite by bite

5) Properly digest advice, articles and opinions.

There will always be people who are picky about your code.
There will always be deadlines where you have to make compromises.
There will always be people who haven’t seen the big picture or are just too stubborn to like the ideas of others.
Be wise and pragmatic:
first and foremost, you earn money for getting your job done.
Focus on that target; along the way try to do it concisely and efficiently, but at the same time meet your own deadlines.
New technologies, new languages, new paradigms and new patterns – give them a try only when this shit is actually working.

6) Do not ever, ever, ever do work that you do not like or are not interested in.

You will do it mediocrely at best.
Your work should be your passion and your pride, not an amount of hours behind the desk exchanged for a paycheck.
If you are not happy in the morning before the workday – you have to change something. Urgently.
Excitement, challenge and satisfaction should be your main motivation for every day.
Money and career opportunities always follow the three guys above 🙂

db mix – postgres, sqlite, cassandra, aerospike & redis

SQLite

How to delete all tables from sqlite:

SELECT 'DROP TABLE ' || name || ';' FROM sqlite_master WHERE type = 'table';

Find duplicates by field “field_name”:

SELECT field_name, COUNT(field_name) AS cnt FROM some_table GROUP BY field_name HAVING cnt > 1;

Find records changed in the last 5 days:

SELECT * FROM some_table WHERE created_at >= datetime('now', '-5 days');

Get table definitions:

pragma table_info(mass_connections);

Export select query to csv:

.mode csv
.output result_of_query.csv
select * from my_table;
.output stdout

Import data from csv into fresh new table:

.mode csv
.import /path/to/your/all_data.csv new_table

Postgres

How to show all tables with sizes within database

SELECT schema_name, relname, pg_size_pretty(table_size) AS size, table_size FROM ( 
SELECT pg_catalog.pg_namespace.nspname AS schema_name, relname, pg_relation_size(pg_catalog.pg_class.oid) AS table_size 
FROM pg_catalog.pg_class 
JOIN pg_catalog.pg_namespace ON relnamespace = pg_catalog.pg_namespace.oid) t 
WHERE schema_name NOT LIKE 'pg_%' ORDER BY table_size DESC;

Show average amount of records per table

SELECT schemaname, relname, n_live_tup FROM pg_stat_user_tables ORDER BY n_live_tup DESC;

How to create data-only dump:

pg_dump -U your_pg_user -h pg_ip_address -p pg_port -a --column-inserts db_name > file_name.sql
pg_dump -U your_pg_user -h pg_ip_address -p pg_port -a --column-inserts --table=table_name db_name > file_name.sql

Useful pg_dump flags:

  • -C include the CREATE DATABASE statement
  • -s dump schema only
  • -a dump data only (no schema)
  • -D dump data as INSERT commands with column names (to simplify uploading data from PG into another db engine)

How to restore data:

psql dbname < infile.sql

PG stop/start:

$(PG_HOME)/bin/pg_ctl -D /data stop -m immediate
$(PG_HOME)/bin/pg_ctl start -D /data -l logfile

Aerospike:

Get settings:

asinfo -v 'get-config:'

Set particular settings:

asinfo -v 'set-config:context=service;batch-max-requests=10000000'
asinfo -v 'set-config:context=network;timeout=10000000'
asinfo -v 'set-config:context=service;batch-index-threads=100'

How to register LUA-script:

set LUA_USERPATH '.'
register module 'your_script_name.lua'

more at http://www.aerospike.com/docs/guide/aggregation.html

How to build secondary index based on bin

CREATE INDEX <index_name>_idx ON <namespace>.<set_name> (bin_name) NUMERIC

bin_name ~ field name

How to delete all records within set:
https://github.com/aerospike/delete-set

Redis:

How to register lua script:

redis-cli script load "$(cat /YOUR/PATH/script_name.lua)"

Cassandra

How to save results of query to the file:

cqlsh -e"select * from table_name where some_txt_attr='very tricky string';" > cassandra_file_query_result.txt

How to check node health:

nodetool status | awk '/^(U|D)(N|L|J|M)/{print $2}'

How to check latency and partition-size histograms for a particular table:

nodetool -h cassandra_ip&nbsp;cfhistograms some_keyspace some_table

How to check the dropped tasks count (at the bottom) at particular node:

watch -n 1 -d "nodetool tpstats"

How to do a “backup” of cassandra:

nodetool snapshot

It will generate a snapshot of the data under the snapshots folder of each table inside /<your path from yaml file>/data/.

How to do a “restore” from a snapshot:

  • stop cassandra
  • delete the content of every keyspace table at /<your path from yaml file>/data/
  • copy the data from the snapshot to the respective keyspace folder
  • restart the server

Kafka

How to show groups:
For old consumers:

./bin/kafka-consumer-groups.sh --list --zookeeper 1.1.1.1:2181

For new consumers:

./bin/kafka-consumer-groups.sh --list --new-consumer --bootstrap-server 1.1.1.1:9092

How to show how many messages:

./bin/kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list 1.1.1.1:9092 --topic <topic_name> --time -1 --offsets 1

How to show how many messages in ALL partitions:

./bin/kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list 1.1.1.1:9092 --topic <topic_name> --time -1 --offsets 1 | awk -F ":" '{sum += $3} END {print sum}'

How to show topics:

./bin/kafka-topics.sh --zookeeper 1.1.1.1:2181 --list

How to show consumer’s lag:
For old consumers:

./bin/kafka-consumer-groups.sh --describe --zookeeper 1.1.1.1:2181 --group <group_name> | column -s , -t

For new consumers:

./bin/kafka-consumer-groups.sh --describe --new-consumer --bootstrap-server 1.1.1.1:9092 --group <group_name> | column -s, -t

How to run test consumer:

./bin/kafka-console-consumer.sh --from-beginning --new-consumer --bootstrap-server 1.1.1.1:9092 --topic <topic_name>

RabbitMQ

How to create exchange and queue:

rabbitmqadmin declare exchange --vhost=/ name=test123 type=direct
rabbitmqadmin declare queue --vhost=/ name=test_queue.123 durable=true
rabbitmqadmin --vhost="/" declare binding source="test123" destination_type="queue" destination="test_queue.123" routing_key="test_routing_key"

HDFS

Updating file access control:

hdfs dfs -setfacl -m user:user_name:rwx /Some/Path
hadoop fs -chown user_name:group_name /Some/Path

Updating file ownership recursively:

hdfs dfs -chown -R user_name:group_name /Some/Path

In case of the following error:
>> java.io.IOException: Cannot obtain block length for LocatedBlock
Root cause: files that were not closed correctly. How to fix them:

hdfs fsck /Some/Path -files -openforwrite | grep OPENFORWRITE
hdfs debug recoverLease -path /Some/Path/corrupt.json -retries 5

7 sins of blatant ignorance and how to avoid them

…You have been producing software for some time. In most cases it even works.
Other developers tend to care about your opinion.
Damn, you even wear some fancy title like senior\principal\architect.
And here it is – suddenly you are offered a really posh title – CTO…

This is the moment when the real troubles start.

I made mistakes myself. I fought with others to prevent them from repeating my past vicious moves.
And I wish the list below had appeared every time I was about to make a game-changing decision.

 

A lot of challenges await a new CTO. How can they be avoided?

1. Claim-to-be-universal tools suck. Always.

Do not make assumptions based on the bright hello-world example at the promo web site.
In your case it will turn out that some tricky functionality you need is not supported by this mega-framework by design.

2. Be suspicious of any black-box-like solution that promises anything and everything at no cost.

Never ever assume that you are aware of all the deep technical details.

In production, during an important demo or pitch – impossible cases tend to jump out regardless of what probability theory claims.
Listen to your team – they are your experts who (should) know about the tiny nuances of the setup, implementations and limitations.

3. Start simple. Focus on creating a _working_ prototype fast.

Forget about speed, scalability, clusters, gpu-computing or “best practices” declared by yet another guru.
In 99.99999 percent of cases you do not need load balancing or an advanced caching strategy.
You will iterate if it becomes necessary.

4. Trendy stuff sucks.

Do not waste your time fighting the bugs of another pre-alpha release of some promising-looking tool.

A new database engine \ a language fresh from the research lab \ a trendy paradigm – all of these should be off your list of considerations.

You need to get stuff done. That’s it.
Good old bullet-proof solutions are your choice.

Especially if you and the team have experience delivering some product with them.
Otherwise, a year later you will realize that your repositories contain complicated workarounds and dirty hacks in desperate efforts to build the initial DSL.

5. Listen to your team.

Measure & prototype. Be open-minded about their approaches to a solution.

Do not abandon an idea only because you are not its author. Encourage them to think out of the box (even if it means contradicting your opinion).

6. Books are your friends.

Inspire your team to learn new things and to grow professionally. Make it your habit to read every day – in the long run it will make a huge difference.
Short articles from HN do not help you build a foundation of knowledge – they are just the tip of the iceberg.
You can never be sure that you know enough – treat with suspicion any “undisputed” statements (and those who dare to make them).

7. Take it easy.

The world of IT and software development is hilariously small.
You do not know who will interview you next time.
Or who will be contacted for an additional reference.

Everyone makes mistakes. Not everyone learns from them.

Avoid any illusions about the QA department – in most cases the software will not work as you expected in the first version.

Positive vibes during stern fuck-ups are what makes our profession truly awesome and memorable.
Humor is your best way to deal with burnout, pressure and a broken coffee machine.

Sprinkle the usual working day of your team with a few bits of laughter, to remind everyone that programming is fun! 😀