On interview preparation: algorithms and data structures recap

With this article I want to go a little bit further than widespread memos “X most common data structures for interviews”. I am not only interested in information about asymptotic complexity. The real value – understanding whether particular algorithms or data structure can be beneficial in particular case: real life tasks rarely given with hints like “Use directed graphs, Luke!” or which comparison function should you implement to have your favorite sort algorithm work accordingly to upper bound complexity from textbooks. On this page you can find not just theoretical high level summary but also few facts that I am considering important, practical or just afraid to forget.


Few considerations worth to notice before speculating about “Best” sorting algorithms:

  • How much data we are dealing with?
    • Dozens numbers or billions?
  • What is nature of data – i.e. what is cost of element comparison?
    • Integer\strings\complex objects? For example it may involve disks seeks and we should aim for algorithms that minimize comparison or even stick with non-comparison based algorithms?
  • What is the cost of elements transfer?
    • Should we aim to algorithms with minimum number of swaps/data copying
  • What is current state of data? Does it partially sorted?
    • In case of partially sorted data some algorithms may touch they upper bounds but adaptive class of algorithms can deal with this input in efficient manner
  • Are data allow random access?
    • For example: linked list vs arrays. Classical quicksort on linked list always be inefficient.
  • Are data fit to the memory?
    • Otherwise we need modifications of algorithms that support external sorting.
  • Are we are dealing with memory constrained environment?
    • Can allow auxiliary space or have to stick with in-place algorithms?
  • Is there possible duplicative values in array to be sorted?
    • Do we need stable sorting – i.e. should they keep their initial order from input array?

Comparative based algorithms – i.e. we compare the whole values of input array (as opposite to distinct bits of those values).  Any comparative based sorting algorithms lower bound worst case complexity – N log (N).

Comparative based algorithms – out of our league – not used in reality, but sometimes can be asked during interview

Algorithm Time complexity Space complexity Comments
Best Average Worst Auxiliary
Bubble sort O ( N ) O ( N ^ 2 ) O ( N ^ 2 ) O ( 1 )


Select sort O ( N ^ 2 ) O ( N ^ 2 ) O ( N ^ 2 ) O ( 1 ) At i-th iteration find index MIN of the smallest remaining entry and swap it i-th element with MIN.
N Exchange. (Stable for list.)


Practical comparative based algorithms

Algorithm Time complexity Space complexity Comments
Best Average Worst Auxiliary
Quicksort O (N Log N) O (N Log N) O ( N ^ 2) O ( Log N ) O( Log N ) probabilistic guarantee.
Pick a pivot, partition array in order to group all elems less than pivot in the left subarray, and those which are bigger – to the right.Recursively apply.
Pivot may be chosen randomly but there are better strategies like median of medians.
3-Way Quicksort O (K * N) ? O ( (2 ^ k) * N) O ( Log N ) Used in case of duplicate keys.
Pick a pivot element and partitions elements into three sets: less, equal or greater than pivot.
There are variants with two pivots.

Mergesort O (N Log N) O (N Log N) O (N Log N) O ( N )


NOTE: For linked lists:
O ( 1 )

Divide array into two sub-arrays and sort them recursively and merge in sorted order.
In simplest version – till sub-array have size 1, in more optimised – till it reach size of threshold (10-20 elements) and another sorting algorithms can by applied (insertion sort for example).
Easy modification for parallel and external sorting (aka not-fit in memory).
Less comparison than quicksort.
More element transfer than quicksort.
Not In-place.

Heapsort O (N Log N) O (N Log N) O (N Log N) O ( 1 ) Construct a max heap mapping array’s indexes to binary search tree.
After constructing recursively delete root from tree until it is not empty.

Insertion sort O ( N ) O ( N ^ 2) O ( N ^ 2 ) O ( 1 ) At i-th iteration swap i-th element with each larger entry to its left.
Useful for partially sorted array or small arrays – less than 10 elements.

Shell sort O ( N ) ? O (N ^ (3 / 2))

O(N * (Log N) ^ 2)
O ( 1 ) Like insertion sort but we rely on D-order sort, where D – gap between sorted entries.
All estimations depend on choosing gap sequence.
Used in embedded devices due to non-recursive nature – no need for support deep and memory hungry call stack).

Timsort O ( N ) O (N Log N) O (N Log N) O ( N ) Based on Merge sort and Insert sort:
tries to find a consecutive sequence of ordered elements and merge them to decrease necessary processing to final results.
Used in Java and Python as default sorting algorithm.

Not In-place.
Introsort O ( N ) O (N Log N) O (N Log N) O ( N ) Hybrid between quicksort & heap sort.
Used in c++ as default implementation of sorting.

Not In-place.
Treesort O (N Log N) O (N Log N) O (N Log N) O ( N ) Build binary search tree from elements and traverse it in-order.

Variations with self balancing binary search tree or Splay tree (Adaptive).
Estimations based on usage of balanced trees.

Not In-place.
Cubesort ? ? O (N Log N) O ( N ) Parallel sorting algorithm

Not In-place.

Noncomparison based sorting algorithms. For such kind of algorithms we do not compare the whole keys (values) but using individual bits of values (for example characters for strings or digits for numbers) – it allows to achieve linear time O(N) complexity. Price for it – necessity to tune it for every new type.

Algorithm Time complexity Space complexity Comments
Best Average Worst Auxiliary
Radix sort O ( N * K ) O ( N * K ) O ( N * K ) O ( N + K ) Examines individual bits of keys.
K – max number of digits in array’s elements.
Applicable for integer numbers, string and float numbers in special formatting.
Worth to use in situation when N >> K and K is fixed – famous question about sorting of million of 32-bit numbers.
MSD vs LSD – most significant vs least significant digits.
Bucket sort O ( N + K ) O ( N + K ) O ( N ^ 2 ) O ( N * K ) Partitioning an array into a number of buckets.
Each bucket is sorted either by recursive application of bucket sort or using other algorithms.
Can be used as external sorting algorithms.
Asymptotic given for cases when K ~ N.
Applicable in case data evenly distributed over a range.
Counting sort O ( N + K ) O ( N + K ) O ( N + K ) O ( N + K ) Count number of keys related to item, use this count to determine position of elements.
Asymptotic given for cases when K ~ N.
Applicable in case N ~ K.

N-th biggest\smallest element

just a variations of partial sort:

  • QuickSelect – worst case O(n^2) best case and average – O(n) – as quicksort with pivot based partitioning but move in one direction only
  • Median of medians – improve worst case of quickselect to O(n) – partition input to group of several(5) elements, find median within every group, and return median of n/5 medians
  • Introselect – back up nth_element at C++ STL – start as classical quick select and switch on median of median method in case depth of recursion is too deep and at the same time sizes of sub-partitions observed so far not leading to halving.


Family of hashing functions: H(X) = ((a * x + b) mod p) mod N) a,b < p, a!= 0, p – prime number used to reduce number of collisions (it depend on your keys distribution).

Two main approaches: separate chaining vs open addressing (linear probing) with double hashing.

  • Separate chaining – array of lists, at the very worst case – O(N) to retrieve record!
  • Open addressing – circular buffer – i.e. in case of collisions try to insert record to next un-occupied cell – which lead to poor performance in case of bad choice of hashing function. Load factor – how it will perform if 25\50\80 % of cells of array will be occupied. Used for hashtable implementation in python.

Separate chaining advantages:

  • Easy implementation of delete method
  • Performance degrades gracefully
  • Clustering is less sensitive to poorly designed hash functions

Linear probing advantages:

  • Less wasted space (pointers)
  • Better cache performance

There are various use cases & requirements for hashing that affect characteristics of good hash function:

  • Quickly insert\return records with mapping Key to Value – cheap cost of computing hash and low number of collisions
  • Find similar or close entries – need to maximise collisions if keys meet some criteria – locality sensitive caching (for example GeoHash) with rolling hash to speed up computations, another famous example: Rabin Karp string search algorithm.
  • Make it difficult to find key corresponding to particular hash value – one way function – Cryptography related hashing – fixed size output for any input, no collisions and at the same time almost impossible (time and computation consuming wise) to find input that produce particular hash
  • Cornerstone of more advanced data structures: bloom filter, merkle tree, distributed hash table.


Root always contains element that satisfy heap property: min-heap or max-heap. Elements not sorted but partially ordered.  Heaps may be constructed in O(N) using Floyd algorithm. Usually used as underneath structure for priority queues. Leftist heap as example of pure functional data structure.

Heap’s underneath datastructure Time complexity of operations Comments
Find Max Extract Max Increase Key Insert Delete Merge
Linked List (unsorted) O ( N ) O ( N ) O ( 1 ) O ( 1 ) O ( 1 ) O ( 1 )  
Array (sorted) O ( 1 ) O ( 1 ) O ( N ) O ( N ) O ( 1 ) O ( N + M )  
Binary Heap O ( 1 ) O (Log N) O (Log N) O (Log N) O (Log N) O (N + M) Heapify: O(N)


Binary Heap (array representation) – ordered complete binary tree – parent’s key no smaller than children’s key. Complete binary tree.
Binomial Heap O (Log N) O (Log N) O (Log N) O (Log N) O (Log N) O (Log N) A binomial heap of order K has a root whose children are roots of binomial trees of orders K-1, K-2, … , 2, 1, 0 – in this order.
Each binomial tree is a heap obeys the minimum-heap property: the key of a node is greater than or equal to the key of its parent.
There can only be either one or zero binomial trees for each order, including zero order.

Support quick merging of two heaps.
Fibonacci Heap O ( 1 ) O (Log N) O ( 1 ) O ( 1 ) O (Log N) O ( 1 ) Collections of trees.
Used for priority queue.
Degree of nodes (degree means the number of children) are kept low;
every node has a degree at most O(Log N) and the size of a subtree rooted in a node of degree K is at least F(K) + 2, where F(K) is K-th Fibonacci number.

Amortised constant factor affect practical usage. High memory consumption.


Full tree – every node except leafs have non-null children. Complete tree – full tree that have all nodes at the last level are lean left.

Tree traversal:

  • Pre-order – via dfs
  • In-order – via dfs
  • Post-order – via dfs
  • Level-order – via bfs

Search data structures

  • Size of element (read as cost of access\copy\compare operations)
  • Volume of data: can it fit to RAM?
  • Usage patterns:
    • evenly mixed read & writes or skew for particular operations?
    • access single key or range?
    • access to value using the whole key or interested to find all values where even parts of key matches?
Data structure Average time complexity Worst time complexity Auxiliary Space Comments
Indexing Searching Insertion Deletion Indexing Searching Insertion Deletion
Arrays O ( 1 ) O ( N ) O ( N ) O ( N ) O ( 1 ) O ( N ) O ( N ) O ( N ) O ( N ) Cache friendly, low memory overhead.
Pre-allocated vs dynamic.
Linked list O ( N ) O ( N ) O ( 1 ) O ( 1 ) O ( N ) O ( N ) O ( 1 ) O ( 1 ) O ( N ) Size of data vs size of pointer(s).
One directional vs bi-directional vs circular.
Sequential access vs random.
Internal or external storage (i.e. node contains only reference to actual data)
Foundation of stack, queue and dequeue.
Skip list O ( Log N ) O ( Log N ) O ( Log N ) O ( Log N ) O ( N ) O ( N ) O ( N ) O ( N ) O ( N Log N) Allow fast search using additional references hierarchy: next, the next after next, every 4th, i-th level – should contain 2 ^ i elements, but it used probabilistics skips – i.e. at i-th level, next reference may point to 2^i element or further.
Alternative for hash table and balanced tree – theoretical worst case performance can be worse, but in practice usually faster – depend on nature of data and cost of comparison operator.
Foundation of various lock-free implementations for dictionaries and priority queues. Used in Redis, LevelDb and Lucene.
Hash table (symbol table, associative array) ? O ( 1 ) O ( 1 ) O ( 1 ) ? O ( N ) O ( N ) O ( N ) O ( N ) Not best in case of small number of keys in comparison to direct mapping key -> array index. In practice, particular operation can be slow because of array resizing and cache un-friendly nature may lead to poor performance in comparison even with O(N) brute force lookup of simple arrays. C++: unordered_map, multimap
Python: dict
Java: HashMap
Binary Search Tree O ( Log N ) O ( Log N ) O ( Log N ) O ( Log N ) O ( N ) O ( N ) O ( N ) O ( N ) O ( N ) Complete if it perfectly balanced, except for a bottom level. In worst case (depend on insertion order) it can be N-height. Constructing as quick sort. Application of problem: 1D-range search.
Cartesian Tree ? O ( Log N ) O ( Log N ) O ( Log N ) ? O ( N ) O ( N ) O ( N ) O ( N ) It is heap-ordered and symmetric (in-order) traversal of the tree returns the original sequence. Used for RMQ (range minimum query) kind of tasks.
B-Tree O ( Log N ) O ( Log N ) O ( Log N ) O ( Log N ) O ( Log N ) O ( Log N ) O ( Log N ) O ( Log N ) O ( N )

Not just two child anymore => heigh of tree lower => lookup faster.

Various modifications.

B+-tree – actual data stored within leafs, intermediate nodes contains copy of keys.
Core data structure for relational databases as it allow to minimise disk seeks –
implemented in Berkeley DB.

2-3 Tree O (C * Lg(N)) O (C * Lg(N)) O (C * Lg(N)) O (C * Lg(N)) O (C * Lg(N)) O (C * Lg(N)) O (C * Lg(N)) O (C * Lg(N)) ? Maintain symmetric order and perfect balance.
Red-Black Tree O ( Log N ) O ( Log N ) O ( Log N ) O ( Log N ) O ( Log N ) O ( Log N ) O ( Log N ) O ( Log N ) O( N )

Balance is preserved via painting different nodes in Red\Black colours to satisfy some predefined properties. Re-arranging and re-colouring can be performed efficiently.

For example: no node has 2 red links connected to it. Red links lean left. Every path from root to null link has the same number of black links – perfect black balance.

C++: set and map.

Java: TreeMap and TreeSet (and in some situation HashMap fallback on their usage as well!)

Splay Tree O ( Log N ) O ( Log N ) O ( Log N ) O ( Log N ) O ( Log N ) O ( Log N ) O ( Log N ) O ( Log N ) O( N )

Splay Tree is a self-adjusting binary search tree with additional property that recently accessed elements are quick to access again.

Applications: cache, garbage collectors, memory allocators.

AVL Tree O ( Log N ) O ( Log N ) O ( Log N ) O ( Log N ) O ( Log N ) O ( Log N ) O ( Log N ) O ( Log N ) O( N ) Balance in AVL Tree – the heights of two child subtrees of any node differ by at most one. AVL Tree better for look-up intensive applications than Red-Black Trees
Kd Tree ? O ( Log N ) O ( Log N ) O ( Log N ) O(N Log N) O ( N ) O ( N ) O ( N ) O( N )

Space partition Binary Tree. Used for range search and nearest neighbour search. Constructing N Log N. Not suitable for high dimensional space – better to use when N >> 2^k.

Octree for 3D space.

More advanced structures worth to be aware of:

  • Trie (Prefix Tree) and its variations, mainly Radix Tree for lookup intensive applications. Worst case – O(M), where M is the length of string to search. DAFSA – can be a great alternative for cases with limited memory.
  • Generalized Suffix Tree – search efficiently with mismatches, applications for various string related tasks: longest common substrings, longest repeated substrings, longest palindromic substring,
  • Fenwick Tree (Binary indexed tree)– common use case: modify element of array and return result of some inverse operations for range of elements (for example partial sum but not max). Simpler and faster in comparison with segment trees. O(N) build time.
  • Segment Tree – classical approach for RMQ kind of tasks, operation not necessary to be inverse. Double memory requirement in comparison to fenwick tree. Constant factor affect speed in practice, especially in cases of multiple dimensions.
  • Implicit Treap – sequence of pairs, heap ordered by 1st entry, BST ordered by the 2nd. If we use index of element in array as first entry – than we can
    • insert\remove element at any position
    • move subset of elements in any position
    • computation of some function for elements within some interval [i, j] – sum, min – including range updates using lazy propagations
    • foundation for Rope implementation – to have similar asymptotic for very long strings
    • and all of the above in O(Log N) – with huge constant factor though
  • vEB Tree – ordered dictionary or associative array much faster (exponentially) than BST, memory overhead
  • Heavy-Light Decomposition: split tree on multiple not-crossing paths. Edge from A to B is considered heavy if number of vertexes reachable from B is at least half as big as number of vertexes reachable from A. Heavy-Light Decomposition – union of light vertexes from bottom of tree to the root. Such decomposition allows us to serve queries like “Get some metric on the path between A and B” – for example sum or max. Not needed in case we do not have update queries. Usually used with segment tree and LCA.

Union Find (Disjoint set)

Approaches: colouring, trees on top of arrays using ranks\randomisation\path compression & merging either randomly or using ranks of trees. Additional information can be stored with leaders of set.

Applications: offline LCA, offline RMQ, dynamic connectivity variables dependencies during program analysis, variations of percolation problem – connections in social network, fluids or electricity flow.


Representation: incidence matrix vs adjacency list. Generally adjacency list is preferable for sparse graph, but there are variation of incidence matrix based on associative array where pair of vertexes used as a key.

Possible variations:

  • Complete
  • Directed
  • Weighted
  • Planar
  • Bipartite 

Cut – set of vertexes from graph, if they will be removed – graph become disconnected. Size of min cut – connectivity or K-connectivity.

Graph related tasks

  • Find cycles in graph – union find (complexity – Akkerman function from number of vertexes) or DFS – proportional to number of vertexes
  • Determine whether graph is bipartite – dfs
  • Eulerian path – route visiting every edge exactly once (exist if all vertexes has even degree and it is connected) – search of such path can be done in linear time O(|E|)
  • Hamiltonian path – route including every vertex exactly once (as in Travelling salesman problem)
  • Comparison of graph – graph’s isomorphism (still open question N or NP)
  • Graph representation in such way that there are no edge crossing, except in vertexes (upward planar graph)  – there are some special cases where such checks can be done in O(N) but generally it is NP-hard problem. 
  • Minimum Spanning Tree (MST) – application is design of network (For example to avoid cycle of packets in Ethernet) or clustering, with custom distance function.
    • Prim – adding edge to current vertex with lowest weight
    • Kruskal – gradually merge trees from all vertexes using edge with lowest weight for connection


For directed graphs:

  • Topological sorting – tasks with ordering limitations, schedule related tasks (DFS) – create schedule in case we do not have cycle – re-draw graph in such way that all path follow from bottom to top – reverse dfs postorder (additional stack) = result
  • path finding – Dijkstra, if weights are negative – Bellman-Ford or its optimised version – Shortest Path Faster Algorithm (SPFA)
  • Shortest path for multiple sources – dense graph with negative weights – Floyd-Warshall or not dense – Johnson’s algorithms or Dijkstra 
  • Search for cycle – DFS – detection of deadlock in concurrent systems or topological sort
  • Maximum flowFord–Fulkerson algorithm , Bipartite matching problem, baseball elimination
  • Dynamic connectivity
  • 2-SAT – satisfy system of constraints of boolean variables – scheduling, medicine testing. Can be represented as directed implication graph. Checking whether two vertexes are within the same connected component allow to answer question whether we can satisfy all conditions.

Substring search

  • Knuth-Morris-Pratt (KMP) – based on DFA (deterministic finite state automaton) that have to be pre-computed for pattern. O(R + N) where R – length of pattern and N – length of text. O(RN) space.
  • Boyer-Moore – scan from the end of pattern and in case of mismatch skip up to R characters – worst case is as brute force O (R * N) but on average considered the most efficient algorithms – O (N / R).  O(R) space.
  • Rabin-Karp – compute sliding hash for every substring of length R for hashing can use modular hashing with some prime number. For efficiency sake Horner’s method can be utilised to re-use computation from previous steps – O(R). Hash match – still can be a collisions so there are two variations in case of hash matches. 2D search or search for multiple patterns using pre-computed table of hashes. O(1) space.
  • Aho-Corasick – in case we are dealing with multiple patterns and text where we search them are different. Build trie for all words, extend it to special finite state machine without backtracking – for failed patterns using suffix links. Used in grep – -F flag.
  • Z-functions: Given string A, consider array of all possible suffixes: A[len(A)], A[len(A) – 1], … , A[1]. Compare symbols from the start of string and every suffix – number of matched symbols – represent particular component of Z-function. For sub-string search: concatenate pattern, unique delimiter and original string and check if values of Z-function equal to length of pattern.

Собеседование на программиста в 2019 – к чему готовиться

Осень-зима 2018-2019 года выдались у меня на редкость насыщенными: ориентировочно, с октября по февраль я поучаствовал в доброй сотне собеседований на позиции уровня Senior-Principal-Lead. Четыре из них были из России (ну мне честно было интересно как тут у нас сейчас), остальные Европа и Англия. Куда-то дальше по разным причинам не хотелось – хотя и пообщался с парой компанией из солнечной Австралии. В этой заметке хочется поделиться накопленными наблюдениями для тех кто задумывается о смене работы, в первую очередь, ориентируясь на иностранные фирмы. Сначала немного сухой статистики, а потом личные выводы.

Первый контакт: процентов 70 представители компаний или рекрутеры находили по профилю в linkedin. С остальными связывался сам – преимущественно откликаясь на любопытные мне вакансии со stackoveflow jobs\hacker news who is hiring\angel.co.

Этапы собеседований


Опциональная беседа с ичаром компании или рекрутером. Цель определить что вы:

  • можете связать пару слов по английски (здесь будет полезно подготовить кратенький монолог на тему “последние 5 лет моей карьеры”)
  • прояснить круг обязанностей по позиции – надобно только писать, помогать писать другим или перед всем этим сначала еще обговорить с клиентами что собственно делаем
  • обсудить желаемую локацию – были случаи когда указаная страна внезапно менялась на другую, например по причинам визы или так как самые hardcore разработчики исторически осели там
  • удостовериться что вы не страдаете особыми закидонами в плане общения.
  • в ряде случаев – определить уровень компенсации дабы не было в конце взаимного разочарования из-за бездарно потраченного времени. Это бывает удобно когда у компании строго определенный бюджет и выйти за него они не могут, а вам меньше тоже не особо интересно. После того как первый раз мне выкатили офер процентов на 40 ниже того что я ищу – в не очень именитые фирмы этот вопрос поднимал всегда до каких либо технических собеседований.
  • обсудить релокацию – что по визам (особенно сейчас актуально для Штатов), оплачивают ли жилье на первое время – и сколько его этого первого времени, мед-страховка для членов семьи, предоставляется ли контейнер на случай если вы желаете переехать с любимым диваном и коллекцией пивных кружек, помогают ли с сопутствующей бумажной волокитой и поиском жилья.

Tech screening – первая кровь

Компании высшего эшелона – гуглы, фейсбуки, амазоны и подобные им, на первом этапе презренные деньги не обсуждают, однако въедливо скринят по нескольким тематикам (в зависимости от позиции, если программист – то обязательно алгоритмы + предметная область – специфика вашего любимого языка, математика, машинное обучение, сети, потроха линуксов и т.п.) – дабы удостовериться что стоит на вас спускать инженеров. Встречаются и небольшие амбициозные компании которые норовят устроить экзамен на знание акронимов из вашей предметной области.

Подавляющие же большинство предпочитают полагаться на автоматизированные тесты на https://hackerrank.com, https://www.codility.com/ . Обычно это 1-3 задачки которые надо решить (написать код, запустить и удостоверится что проходят все тесты) в условиях неотвратимо цокающего таймера. Задачки варьируются уровнем от чисто алгоритмических (легкие\средние с https://leetcode.com/), до а вот запилите нам простейший блокчейн или каталогизатор фоток. Корректность решения проверяется заготовленными тестами на самой платформе. А как ваш код будет работать если вместо int придет long? А что если в последовательность положительных чисел вклинится одно отрицательное? Тесты вам могут быть доступны все (т. е. вы запустили – система показала какой тест завален – вы скачали данные для проверки), доступны частично (т. е. вы видите что из 100 тестов, 99 failed а почему и что там может быть – додумывайте сами), а могут быть и не показаны вовсе (считаете что все красиво – жмакайте кнопку “я сделалъ”). Таймеры у всех разные – у Amazon было два часа на две нудные длинные задачи с регэкспами, у Yelp 20 минут на одну. Могут быть и вопросы с вариантами ответов как у Optiver или даже практикум на живой виртуалке с линуксом как у Booking.

Некоторые предлагают заполнить форму с вопросами – аля как мы можем использовать технологию X, или программа на языке Y работает медленно – опишите как вы будете ускорять ее. Здесь убивают сразу двух зайцев – проверяют на базовое знание матчасти (вопросы обычно классом поглубже чем поверхностное гугление) и ваш талант кратко и по делу донести свои мысли деловым английским.

Online assessment – онлайн собеседование

Если удалось справится с тремором по поводу гонок со временем – то второй этап либо онлайн кодинг, либо так называемый take home (задачку на дом) и его последующее обсуждение.

В первом случае вас ожидает несколько сессий где вас попросят написать что-то на алгоритмы и обсудить вариации решения, подкидывая на ходу дополнительные условия, чтобы оценить насколько расширяемо вы пишете код. Большинство не парится на тему конкретного языка и заинтересовано в первую очередь в верном алгоритме, исходя из теории был бы человек хороший – а принятый в компании стек можно и освоить. Примеры задач – в добавок к упомянутым https://hackerrank.com, https://leetcode.com, еще можно посмотреть на https://careercup.com. Сложность и количество задач всегда зависит от двух факторов: позиция и ваша синьорность – вполне могут дать одну посложнее вместо пары-тройки легких.

Задачка на дом представляет собой уже что-то более-менее приближенное к реальности – как правило это уже тот стек на котором вам предстоит ударно трудиться. Здесь приветствуются авто-тесты, авто-документация и все best practices применимые к языку системе сборки и документации которые так уютно существуют в самых смелых мечтах разработчиков. Код обычно размещается в приватном репозитории чтобы ваши конкуренты не повторяли ваши ошибки, а делали свои, а работодатели не ломали голову над очередной дилеммой: как бы и кандидата проверить в около-боевых условиях, и, при этом, ему не надо убивать на это более двух вечеров. Далее назначается дата и обсуждаются результаты – то ли вы вообще решали (а у меня был случай что я в попыхах не верно интерпретировал условие), решили ли (опять же был весьма досадный опыт когда поверхностно проверил не на всех наборах данных и пропустил фатальную жирную багу) насколько код готов быть развернут на боевом сервере – много обращений, много данных, низкое качество данных – в параллели заодно затрагиваются детали языка\фреймворка\стиля.

Onsite interview aka face to face

В случае успеха остается последний рубеж. Компании попроще на этом этапе ограничиваются общением с непосредственным руководителем и снова HR – behavioral interview, cultural fit – дефицит кадров еще не значит, что кто-то хочет попрать священное правило “No asshole!“. Некоторые примеры вопросов:

  • Опишите ситуацию когда вы облажались – какие уроки вы извлекли из этого
  • Как вы считаете какие ваши сильные и слабые стороны назвал ваш предыдущий начальник\коллеги\подчиненные
  • Что вы будете делать если почувствуете что у вас напряженные отношения с кем то в команде
  • Что вы будете делать если не согласны с каким-то техническим решением? Что если оно уже утверждено?

Те же у кого финансы позволяют – могут пригласить вас в офис чтобы вы пообщались на отвлеченные темы с представителями из разных команд, воочию посмотрели как проходит обычный рабочий день – как происходит взаимодействие между людьми. Это может внести некоторую заминку из-за необходимости получения визы – особенно в случае если вы не из Москвы.

Hard mode: on

Вдобавок к выше перечисленному если компания большая, а имя ее заставляет рекрутеров делать стойку и дает вам возможность добавлять себе в титул приставку ex-Имеряк’ер – тогда тут-то и начинается самое интересное. Целый день вы проведете в офисе компании. С вами будут говорить такие же технари – представители нескольких смежных отделов. Несколько секций с решением уже знакомых алгоритмических задачек – подтвердить продемонстрированный ранее уровень, но уже в окружении скучающих интервьюров на доске, листе бумаге или хромбуке. Сессия вопросов и ответов (с вас) по специфичным для позиции темам. Специализированные интерактивные секции – (тут вы должны постоянно задавать наводящие вопросы определяя ограничения задач) – спроектировать архитектуру некоторого приложения, затраблешутить багу в проде. Это будет насыщенный, но не самый легкий день в вашей практике.

Далее уже не так интересно – торг по зарплате, обсуждение условий релокации и бесконечная бумажная волокита.

Из забавного

Из-за перепутанных часовых поясов пришлось отвечать на вопросы по телефону во время пробежки с псом, между заходами в парилку в бане и на первой попавшейся парковке из машины.

В ОАЭ всех претендующих на работу на госслужбу, кроме всего прочего, тестируют с помощью платформы TalentQ – первая попытка прохождения заставила меня крепко зауважать всех госслужащих этой страны. Похожий тест – Criteria Cognitive Aptitude Test (CCAT) – просят пройти, например, в Crossover до любого рода технических тестов. Вопросы там разбиты на три подгруппы: логические – найти лишнюю крякозябру в списке, подставить недостающий петроглиф в строку; около математические – на основе таблицы с кучей цифр подсчитать арифметическое среднее или разницу в процентах чего-нибудь с чем нибудь; прочитав пару абзацев текста выбрать основную мысль или чью-то точку зрения – в простых случаях найти антоним или синоним к какому-нибудь заковыристому слову.

В одном из заданий на дом требовалось вытаскивать некоторые публичные данные и, не смотря на тривиальность задачи, сервер упорно плевался таймаутами. Я вообразил хитрую защиту от пауков и с вдохновением принялся рандомизировать агента, развернув сервис в локали, благо он был заопенсурсшен, проверял CORS и куки. Апофеозом был анализ tcp дампа в wireshark‘e и запоздалая проверка traceroute к сайту. Выяснилось что сайт хостился на айпишнике из забаненных диапазонов, а во всех браузерах были vpn плагины.

Иногда по результатам рассмотрения резюме присылали отказ, а спустя пару недель рекрутер той же самой компании зазывал пообщаться.

На дату публикации – самая длительная пауза между откликом и ответом составляет три месяца. Победители в этой номинации запросили целый набор документов (ASAP!11) для того чтобы запланировать интервью.

Самопозиционирование как architect – иногда вводило рекрутеров в заблуждение и предлагались позиции связанные с очередной стройкой века в некоторых странах Персидского залива.

Некоторые выводы и наблюдения

https://www.timeanddate.com/worldclock/converter.php – ваш друг и помощник – когда самая главная головная боль – понять когда у кого сколько времени

Растет число компаний предлагающих удаленную работу по рейтам не зависящим от вашей локации. Задачи наконец-то начинают отличаться от “in-house сотрудники не хотят возиться с унылым легаси” до вполне cutting edge – дизайнить дата пайплайны для финансовых платформ, SRE\DevOps для вполне серьезных хайлоадов, ML – для предсказания цен и рекомендаций в разных отраслях.

Планирование сроков – собеседований и рассмотрения предложений – позволяет задать и себе и компаниям осязаемые дедлайны. Это дисциплинирует.

График собеседований лучше составить так, чтобы сперва для разминки вы пообщались с мало-интересными вам компаниями, а уже потом, на пике формы, переходили к компаниям мечты (но еще не успели выгореть от бесконечных “есть два целочисленных массива … “).

Звонков и вообще всяческой суеты от рекрутеров будет много – они будут пропадать, внезапно звонить спозаранку или ночью, вы вполне можете путать их имена и компании которые они представляют. Собеседование это тоже работа и тут вы, чем сможете, должны им помогать, понять и простить. Хороший рекрутер может помочь вам с согласованием предлагаемого работодателем пакета и выбить комфортные условия релокации, или предложить что-нибудь интересное через год когда вы уже напрочь о нем забыли.

Российские компании (из тех с кем я общался) куда более зациклены на конкретном стеке кроме, пожалуй, Яндекса. Уровень вопросов вполне сопоставим с общемировым. С учетом налогов и даже без учета стоимости жизни в Москве и Питера можно получать больше чем в Европах, даже без менеджерских лычек.

Как ни странно – надо готовится. И тут чудес не бывает – брать Кормэна, смотреть лекции Седжвика и в виме простом блокноте писать на том языке на котором планируется проходить собеседование. Для философских разговоров хорошо заходят Architecture of open source applications про хайлоад – DDIA, для SRE – Sire Reliability Engineering и моя любимая книжка с черепашкой – Операционная система Unix Робачевский.

P.S. И просто побрюзжать

Современные IDE отупляют разработчиков – иначе я не могу объяснить причину появления сонма фундаментальных серьезных руководств как писать программы в текстовых редакторах (без авто-комплита, подсветки, подсказок и возможности скомпилировать или запустить программу чтобы высветился номер строчки с ошибкой).

Я еще помню времена, когда вопросы частенько задавались заведомо не правильные, на знание стандартов C++, undefined behaviour, depend on compiler implementation. Сейчас, к моему удивлению, ряд вопросов служат тому, чтобы выяснить что вы будете делать если нет интернета или стековерфлоу лежит, а с кодом надо что-то решать.

Иногда формализм побеждает – когда спрашивающий общий вопрос хочет услышать свой любимый ответ или занимается проверкой вашей памяти, задавая вопросы на терминологию (многие бы сильно удивились узнав что они используют какие-то из известных паттернов проектирования или определенные нормальные формы в БД). Из запомнившихся вопросов подобного плана – CAP теорема, что значит каждая буква в SOLID или перечислить те самые двенадцать факторов.

Коррелирует ли как-то способность решить задачу про кроликов или свести проблему Пингвин-Авиа к поиску минимального остовного дерева за полчаса интервью с тем как кандидат будет работать – не знаю. Поможет ли вам знание о существовании неявного декартого дерева в работе – сильно зависит от того над чем вы будете работать. Расширит ли алгоритмическая подготовка ваше сознание – несомненно!

Собеседования с вменяемыми компаниями (не ленящихся прочитать резюме, без подсказок дающих обратную связь) – прекрасный способ держать себя в тонусе. 🙂

[docker][cassandra] Reaching mixed load – 750,000 op\sec

The cart goes nowhere because the swan wants to fly in the air, the pike wants to swim underwater and the crawfish wants to crawl backward.

Cassandra performance tuning - challengeCassandra is one of powerhorse of modern high load landscape – nosql database that can be pluged to Spark for distributed computing, Titan for playing with graph data representation and even to Elastic as a search backend.
And if you really care about pure write performance – this is de-facto choice in world of open source solutions: production proof, with big community that already generated many outdated controversial answers at SO & mailing lists.


No single point of failure, scalability, high availability, retention periods, … , but those marketing claims hide few principal caveats… Actually, cassandra has only single drawback(*) – it will not reach its limits with default settings on your hardware. Lonely, single node configuration – it is not use case for cassandra, it will shine in multinoded clustered setup.

If you really want to see full utilization of endless cores and crazy amount of RAM, you have to use some virtualisation technology to manage hardware resource.

Let me start first with some conclusions and recommendations, based on extensive two monthes testing and observartion of trouble tickets after migration of this approach to production. With those considerations in mind I was managed to configure it such way to tolerate 750 k mixed operations per seconds. It was generated for more than 8 hours to check pressure tolerance and emulate peak loads. .It was mixed execution of async inserts, without future processing and synced inserts as well as read requests.

Frankly speaking, I am sure it is still far from its limit.

Bear in mind that I am talking about Cassandra 2.1.*.

About disks

  1. Use ssd disks as mapped volume to docker container. Single container = single dedicated disk.
    It is possible to use multiple disks per containers, but it will lead to 5-15 % of slowdown.
  2. If you use ssd disk you can map all casandra directories to it (saved_cache, data, commit_logs) and adjust casandra.yaml with higher values of throughput, in particularly: compaction_throughput_mb_per_sec, trickle_fsync
  3. It is really depends on data distribution and your data model, but be ready that disk utilization will vary from one node to another up to 20%
  4. Docker should be configured to NOT use host’s root partitions. Don’t be mean and allocate single drive for logs and choose proper storage driver – docker-lvm.
  5. In practice, cluster start strugling when any of nodes come out of space. Surprisingly, in my experiments it was stable even with 3% free, but in real life better to configure your monitoring system to give alert at 15-20%.
  6. Choose compaction and compresion strategies wisely when you design your db
  7. Be careful with column naming – it wil be added for every god damn row!
  8. Do sizing when you think about number of nodes (and disks).

About cpus:

  1. More cpu per node is NOT always good. Stick with 8 cores per node.
    I’ve experimenting with single fat supernode per physical server = 48 cores, 4×12, 6×8.
    6 node with 8 cpu cores outperform all others in 6 kind of stress load scenarious.
  2. If you play with core number you have to adjust few settings at cassandra.yaml to reflect that number: concurent_compactors, concurent_reads, concurent_writes.
  3. Cassandra in most cases endup to be cpu-bound, don’t forget to left for host system 8-16 cores, and allocate cpu exclusivly for containers using –cpuset-cpus

About RAM:

  1. cassandra-env.sh have builtin calculation of free memory to adjust jvm settings using analysing results of command free. Ofcourse it is not for docker based setup. Bear this in mind and tweak your startup scripts to substitue values there.
  2. Disable swap within docker using –memory-swappiness=0
  3. Effectiveness of memory usage depend on cpu amount, how effective multithreaded compaction is implemented at Cassandra and what settings for reader\writer\compactors you have at your cassandra.yaml, i.e. you can have hundreds of RAM but endup in OOM. But even with 8 Gb of RAM per node you already can see benefits. More RAM – mean more memtables, bigger key cache, and more effective OS-based file caching. I would recommend have 24 Gb RAM per node.
  4. Disable huge page at host system or at least tune your jvm settings:
 echo never &amp;amp;gt; /sys/kernel/mm/transparent_hugepage/enabled
echo never &amp;amp;gt; /sys/kernel/mm/transparent_hugepage/defrag 

About network

  1. Mandatory to use network stack of host Os using flag at docker –net=host
  2. Most likely network should not be bottleneck for your load, so you can stick with virtual interfaces on top of single real one.


  • 3 physical server: each have 72 cores, 400gb ram
  • Cassandra 2.1.15
  • Docker: 1.10
  • Host Os: Centos 7.5
  • Guest Os: Centos 7.5
  • java 8 from oracle with jna

Cassandra 3.* this is competely another story – in my opinion, mainly, because of storage engine changing, but here is a huge list.

DB overview:

  • Dozen keyspaces, each have up to 20(?) tables.
  • Few indexes – just do not use indexes, design schema properly
  • Data replication = 3, gossip through file
  • Each physical server represent dedicated rack within single datacenter.
  • Row cache were disabled at cassandra.yaml i.e. first priority was to focur on write oriented workload


  1. datastax stresstool, artificial table – very intresting, but useless, using your schema is very important
  2. Datastax stresstool + your own table definition – nice, give hints of production performance. But you still testing single table – usually it is not the case in real life.
  3. Self written in-house stress tool that generate data according to our data model in randomized fasion + set of dedicated servers for ddos with ability to switch between async inserts (just do not use batches) with and without acknowledgment.
    Once again: no batch inserts as they should not be used in productions.
  4. Probably, you can adapt Yahoo! Cloud Serving Benchmark. I haven’t played with it.


That’s it folks, all craps below is my working notes and bookmarks.

How to get c++11 at Centos7 for stress tool compilation:

Install recent version of compiler on centos7: devtoolset-4
update gcc version 4.8 at centos 6: https://gist.github.com/stephenturner/e3bc5cfacc2dc67eca8b

scl enable devtoolset-2 bash

RAM & swap:

How to clean buffer os cache

echo 3 &amp;amp;gt; /proc/sys/vm/drop_caches
check system wide swappiness settings:
more /proc/sys/vm/swappiness

Docker related:

If you are not brave enough to play with DC\OS or Openstack you can find docker-compose to be usefull for manipulation of homogeneous set of containers


sudo rpm -iUvh http://dl.fedoraproject.org/pub/epel/7/x86_64/e/epel-release-7-5.noarch.rpm
rpm -qa | grep docker

If you fucked up with partition settings:

wipefs -a /dev/sda1

Docker and mapped volumes: http://container-solutions.com/understanding-volumes-docker/

If docker info | grep loopback show you something – you already screw up configuration of storage driver.

How to check journal what is happening:

journalctl -u docker.service --since "2017-01-01 00:00:00"

Full flags described here.

Usefull commands for check docker images:

docker inspect
dmsetup info /dev/dm-13


How to check heap memory consumption per node:

nodetool cfstats | grep 'off heap memory used' | awk 'NR &amp;amp;gt; 3 {sum += $NF } END { print sum }'

How to check what neighbors we can see from nodes:

nodetool -h ring

How to find processes that use swap:

for file in /proc/*/status ; do awk '/VmSwap|Name/{printf $2 " " $3}END{ print ""}' $file; done | awk '$2 {print $1 FS $2}'

Check how many disk memory we used, from Cassandra perspective

nodetool cfstats | grep 'Space used (total)' | awk '{s+=$NF} END{print s}'

Determine disk usage, OS point of view:

du -ch /var/lib/cassandra/data/

cassandra check health:

 ssh nodetool status


How to open port at centos 7:

firewall-cmd --zone=public --add-port 9160/tcp --permanent
firewall-cmd --zone=public --add-port 9042/tcp --permanent
firewall-cmd --zone=public --add-port 7200/tcp --permanent
firewall-cmd --zone=public --add-port 7000/tcp --permanent

Open ports for spark, master:

firewall-cmd --zone=public --add-port 7077/tcp --permanent
firewall-cmd --zone=public --add-port 8081/tcp --permanent

Apply changes in ip tables:

firewall-cmd --reload

or do this, usefull in case network manager behave badly:

systemctl restart network.service
systemctl status network.service

And as bonus points –

How to migrate data from old cluster to bright new one

  1. sstableloader
  2. cassandra snapshots
  3. For tiny dataset to get cql file with inserts: – cassandradump

First two approaches represent standard way of data migration.
Limitation of first is speed and necessity to stop old node.
Limitation of second is necessity to manualy deal with token_ring on per node basis.

If life was really cruel to you, you can play with data folders per node.

NOTE: if you can replicate exact same setup – in terms of assigned ip, it will be enough to just copy cassandra.yaml from old nodes to new one, and use exact same mapping folder within docker as it were at old cluster.

If not – you still can do it with copying data folder follow steps below, but better just use sstableloader.

  1. In order to do it you have to run following command on every node to drain node from cluster and flush all data into filesystem:
nodetool drain</pre>

NOTE: this is unofficial, not recommended way to deal with data migration.
NOTE 1: it require you to have similar amount of nodes in both clusters
NOTE 2: no need for the same datacenter\rack cfg\ip address

2. Deploy docker based setup according to HW configuration. Total amount of nodes should be equal to total amount of nodes at old cluster. On new cluster deploy exact schema that were deployed on old cluster.

3. Stop new cluster.
within every node data folder of OLD cluster you would have following folders:

NOTE: do not touch system* tables.

4. Under folder /your/cassandra/data-folder/your-keyspace
you should have set of folders corresponding to that keyspace under which data is stored.

5. You have to copy content of this folder (*.db, *.sha1, *.txt) for every node from OLD cluster to corresponding folder of NEW node cluster in. UUID WILL be different.
I.e. old cluster, node 1 to new cluster, node 2:
data copy example
scp /old/cluster/cassandra-folder/data/your-keyspace/your-table-e31522b0e2d511e6967a67ec03b4d2b5/*.* user@:ip/new/cluster/cassandra-folder/data/your-keyspace/your-table-c56f4dd0e61011e6af481f6740589611/

6. Migrated node of OLD cluster must be stopped OR you have to use `nodetool drain` for processed node to have all data within sstables ~ data folders.

Performance monitoring:

  • general system overview: atop or htop
  • Be sure that you understand memory reporting.
  • JMX based monitoring: jconsole
  • Jconsole connection string: service:jmx:rmi:///jndi/rmi://:/jmxrmi
  • dstat network & disk io
  • strace – show every system call. slow down. can connect to running.
  • netstat -tunapl | lsof -i -P – network\ports per process
  • docker stats – reports cpu\mem\io for container
  • perf + perf-map-agent for java monitoring:
    for example cache miss, more there:
perf stat -e L1-dcache-load-misses

Articles that I find usefull:


*cassandra has only single drawback – it have no idea of your data model, whether you configure your data schema correctly, what is your load patterns. That why you have to dive in wonderland of controversial recommendations in blogposts like that one instead of thoroughly read documentations first.


P. S. do not believe anyone, measure!



How to stop being a junior – 7 hints of programmer productivity

0) If you don’t know something – don’t afraid to ask.

Especially if you already checked first page of google search and pretty sure that no one ask that question at stackoverflow.
Reinventing the wheel and breaking the stalemate can be a good exercise for your home projects, but in production environment better to check idea with your mentor before diving into implementation details.

1) Don’t afraid to show that you have no idea.

real programming - do not left open questions ever

do not left open questions ever

Just add note for yourself to figure out it later.
If you do not understand how something works – obviously this is gap in your knowledge.
And you can’t just skip it – you are software engineer – this is your obligation to be aware what is happening under the hood.

And yes, sometime you have to do it at your own time.

Professional growth is very simple:

  • step 1 – find something that you don’t know,
  • step 2 – start investigation, discover bunch of additional mysteries and repeat step 1

2) Split problem to several simple questions and address them one by one.

Don’t try to troubleshoot huge multi-component system within real data to reproduce the problem.
Forget for a minute for overall complexity of your enormous project, analyze suspicious functions one by one independently of each others.
Use online fiddle for your language to check obscure part of language with fake data returned by mock api.

3) stop wasting your gorgeous time

If you find yourself googling the same commands over and over again – start making notes.
Create txt file with most useful commands and update it whenever you find yourself googling again.
Search for ready to use cheat sheet or even put wallpaper on desktop.

4) Do not stop to invest time in reading proper books. Ever.

Pressure, deadlines, laziness.
This is not for company, boss or to bragging about.
This is your main and primary investment to the future – your knowledge is treasure.
30 minutes of reading every day – not too much,
but in long run – you will be noticed that you become more capable and be able to tackle previously hard to solve problems.

from junior to senior - bite by bite

from junior to senior – bite by bite

5) You should properly digesting advice, articles and opinions.

Always will be people who are picky about your code.
Always will be deadlines where you have to make compromise.
Always will be people who haven’t seen big picture or just too stubborn to like ideas of other.
Be wise and pragmatic:
first and foremost you earn money for get your job done.
focus on that target along the way try to do it concise and efficient but at the same time meet your own deadlines.
New technologies, new languages, new paradigms and new patterns give it a try only when this shit is working.

6) Do not ever ever ever do work that you do not like or not interested in.

You will do it mediocre at best.
Your work should be your passion, your pride and not amount of hours behind the desk exchanged for paycheck.
If you are not happy at the morning before the workday – you have to change something. Urgently.
Excitement, challenge and satisfaction should be your main motivation for every day.
Money and career opportunities always follows three guys above 🙂

db mix – postgres, sqlite, cassandra, aerospike & redis


How to delete all tables from sqlite:

SELECT 'DROP TABLE ' || name || ';' FROM sqlite_master WHERE type = 'table';

Find duplicates by field “field_name”:

SELECT field_name, COUNT(field_name) AS cnt FROM some_table GROUP BY field_name HAVING(cnt &gt; 1 ) FIXME - use cnt?

Find records changed in last 5 days:

SELECT * FROM some_table WHERE created_at &gt;= NOW() - '5 day'::INTERVAL;

Get table definitions:

pragma table_info(mass_connections);

Export select query to csv:

.mode csv
.output result_of_query.csv
select * from my_table;
.output stdout 

Import data from csv into fresh new table:

.mode csv
.import /path/to/your/all_data.csv new_table


How to show all tables with sizes within database

SELECT schema_name, relname, pg_size_pretty(table_size) AS size, table_size FROM ( 
SELECT pg_catalog.pg_namespace.nspname AS schema_name, relname, pg_relation_size(pg_catalog.pg_class.oid) AS table_size 
FROM pg_catalog.pg_class 
JOIN pg_catalog.pg_namespace ON relnamespace = pg_catalog.pg_namespace.oid) t 
WHERE schema_name NOT LIKE 'pg_%' ORDER BY table_size DESC;

Show average amount of records per table

SELECT schemaname,relname,n_live_tup FROM pg_stat_user_tables ORDER BY n_live_tup DESC;

How to create data-only dump:

pg_dump -U your_pg_user -h pg_ip_address -p pg_port -a --column-inserts db_name > file_name.sql
pg_dump -U your_pg_user -h pg_ip_address -p pg_port -a --column-inserts --table=table_name db_name > file_name.sql

Useful pg_dump flags:

  • – C adds the CREATE statements
  • – s dump schema only
  • – a dump schema & data
  • – D dump using inserts (to simplify uploading data from PG into another db engine)

How to restore data:

psql dbname < infile.sql

PG stop/start:

$(PG_HOME)/bin/pg_ctl -D /data stop -m immediate
$(PG_HOME)/bin/pg_ctl start -D /data -l logfile


Get settings:

asinfo -v 'get-config:'

Set particular settings:

asinfo -v 'set-config:context=service;batch-max-requests=10000000'
asinfo -v 'set-config:context=network;timeout=10000000'
asinfo -v 'set-config:context=service;batch-index-threads==100'

How to register LUA-script:

register module 'your_script_name.lua'

more at http://www.aerospike.com/docs/guide/aggregation.html

How to build secondary index based on bin

CREATE INDEX _idx ON . (bin_name) NUMERIC

bin_name ~ field name

How to delete all records within set:


How to register lua script:

redis-cli script load "$(cat /YOUR/PATH/script_name.lua)"


How to save results of query to the file:

cqlsh -e"select * from table_name where some_txt_attr='very tricky string';" &gt; cassandra_file_query_result.txt

How to check node health:

nodetool status | awk '/^(U|D)(N|L|J|M)/{print $2}'

How to check compression ratio for particular table:

nodetool -h cassandra_ip cfhistograms some_keyspace some_table

How to check the dropped tasks count (at the bottom) at particular node:

watch -n 1 -d "nodetool tpstats"

How to do a “backup: of  cassandra:

nodetool snapshot

It will generate snapshot of data at /<your path from yaml file>/data/snapshot.

How to do a “restore” from snapshot:

      stop cassandra


      delete content of every keyspace table at /<your path from yaml file>/data/


      copy data from snapshot to the respective keyspace folder


    restart the server

7 sins of blatant ignorance and how to avoid them

…You produce software for some time. In most cases it even works.
Other developers tend to care about your opinion.
Damn, you even wear some fancy title like senior\principal\architect.
And here it is – suddenly you were offered to wear really posh title – CTO…

This is moment when real troubles get started.

I did mistakes by myself. I fought with others to prevent them from repeating of my past vicious moves.
And I wish list below appear every time I was about to make game changing decision.


A lot of challenges are awaiting for new CTO. How they can be avoided?

A lot of challenges are awaiting for new CTO. How they can be avoided?

1. Claim-to-be-universal tools suck. Always.

Do not make assumption based on bright hello-world example at promo web site.
In your case – it would be necessary to have some tricky functionality that this mega-framework does not support by design.

2. Be suspicious to any black-box like solution that promise all and everything at no price.

Never-ever expect that you are aware of the deep technical details.

In production, during important demo or pitch – impossible cases tend to jump out regardless of claims from probabilistic theory.
Listen to your team – they are your experts who (should) know about tiny nuances about setup, implementations and limitations.

3. Start simple. Focus on creating _working_ prototype fast.

Forget about speed, scalability, cluster, gpu-computing or “best practices” declared by yet another guru.
In 99.99999 percents of cases you do not need load balancing or advanced caching strategy.
You will iterate if it necessary.

4. Trendy stuff sucks.

Do not waste your time in fighting with bugs of another pre-alpha release of some looks like promising tool.

New database engine \ fresh from research lab language \ trendy paradigm – should be out of your list of consideration.

You need get stuff done. That’s it.
Good old bullet-proof solution are your choice.

Especially if you and team have experience of delivering some product with it.
Otherwise, year later you will realize that your repositories contains complicated workarounds and dirty hacks in desperate efforts to build initial DSL.

5. Listen to your team.

Measure & prototype. Be open-minded for their approach for solution.

Do not abandon idea only because you are not author of it. Encourage them for think out of box (even if it mean be contradicted to your opinion).

6. Books are your friends.

Inspire your team to learn new things and professional growth.  Make your habit to read every day – in long run it will make a huge difference.
Short articles from HN do not help you to build foundation of knowledge – it is just a tip of iceberg.
You never can be sure that you know enough – treat with suspicious any “undisputed” statements (and those who dare to make them).

7. Take it easy.

World of IT and software development is hilariously small.
You do not know whom you will be interviewed by next time.
Who will be contacted for additional reference.

All makes mistakes. Not everyone learn from them.

Avoid any illusions from QA department – in most cases software will not work as you expected in first version.

Positive vibes during stern fuck-ups is what makes our profession truly awesome and memorable.
Humor is your best way to deal with burnout, pressure and broken coffee machine.

Sprinkle usual working day of your team with few bits of laugh to remind everyone that programming is fun! 😀

Information retrieval, Search and recommendation engine:

Natural language processing:

https://class.coursera.org/nlp/lecture – Processing of texts written in ordinal languages
https://company.yandex.com/technologies/matrixnet.xml – search algorithm by Yandex

Information retrieval:

http://nlp.stanford.edu/IR-book/html/htmledition/irbook.html – Introduction to Information Retrieval, Cambridge
http://machinelearning.wustl.edu/mlpapers/paper_files/BleiNJ03.pdf – Latent Dirichlet Allocation (LDA)

Few examples of applications for the above:


Recommender systems:

https://www.coursera.org/learn/recommender-systems/ – video lectures – 101 for Recommendation System
http://www.ibm.com/developerworks/library/os-recommender1/ – introduction to approach and algorithms
http://www.cs.bme.hu/nagyadat/Recommender_systems_handbook.pdf – “Encyclopedia” of recommender systems
http://www.slideshare.net/xamat/kdd-2014-tutorial-the-recommender-problem-revisited – overview of recommendation algorithms
http://www.machinelearning.org/proceedings/icml2007/papers/407.pdf – Restricted Boltzmann Machines for Collaborative Filtering

Ready for use recommendation engine:

https://cloud.google.com/prediction/ – Google recommendation engine
https://mahout.apache.org – Apache recommendation and general purpose machine learning framework

Elastic Search – just a few useful snippets

Install head plugin or rely on oldschool curl utility in order to test your queries:




Q: Show me example of query for complex, nested document?


{ "query": 
    { "bool":
        { "must": [
                 {"path": "document.sub_document",
                           {"must": [
                               { "match":
                                   { "document.sub_document.attribute": "PUT_YOUR_SEARCH_VALUE_HERE" }

NOTE: if what are you searching for in sub-sub-sub document – just add proper number of nested chains of “bool” “must” “nested” elements.

Q: I need full text search and aggregations (aka facets) by attribute in nested document.


{ "query": 
    { "query_string": 
        { "query": "PUT_YOUR_SEARCH_STRING_HERE" }
                {"path": "document.SUB_DOCUMENT"},
                            {"field": "document.SUB_DOCUMENT.ATTRIBUTE_NAME"}

Q: I need a full text search and aggregations by geo positions aka distance range.

NOTE: put proper values for “origin”, “field”, “ranges” fields.

{ "query":
    { "query_string":
        { "query": "PUT_YOUR_SEARCH_STRING_HERE" }
                {"path": "document.SUB_DOCUMENT"},
                            {"origin": "100500, 100500",
                             "ranges": [{"to": 1000}, {"to": 3000, "from": 1000}, {"from": 3000}]

Q: I have fields in document that contains multiple words, I want them to be be aggregated not as separate single terms, but as a whole string.

A.1 put proper mapping for such field – “multi_field” or in most recent version of elasticsearch – just “fields”.

... document mapping, ...

YOUR_FIELD: {   "type": "string",
                    { "type": "string", "index": "not_analyzed" }

... document mapping, ...

A.2 use such kind of queries for nested faceting:

        {"query": "PUT_YOUR_SEARCH_STRING_HERE"}
                {"path": "document.SUB_DOCUMENT"},
                             {"field": "document.SUB_DOCUMENT.ATTRIBUTE_NAME.raw"}

Q: I want return all documents sorted by distance and my geo_point field in nested document.

NOTE: ATTRIBUTE_NAME should be mapped as geo_point at moment of writing – it can be done only via manually created mapping.


    {"match_all": {}},
    "sort": [
                {"lat": 25,"lon": 55},
                 "order": "asc",
                 "unit": "km",
                 "distance_type": "plane"

Q: I want to return aggregation only?


{"size": 0,
    "aggs": {
        "name_of_1st_parent_aggregation": {
            "nested": {"path": "document.SUB_DOCUMENT"},
            "aggs": {
                "name_of_1st_aggregation": {
                    "terms": {
        "name_of_2nd_parent_aggregation": {
            "nested": {"path": "document.SUB_DOCUMENT_1"},
            "aggs": {
                "name_of_2nd_aggregation": {
                    "terms": {
        "name_of_3rd_parent_aggregation": {
            "nested": {"path": "document.SUB_DOCUMENT_2"},
            "aggs": {
                "name_of_3rd_aggregation": {
                            {"origin": "100500, 100500",
                             "ranges": [{"to": 1000}, {"to": 3000, "from": 1000}, {"from": 3000}]

Q: I want autocomplete?

A-0. NOTE better to use n-gramm based approach

A-1. Prefix approach for autocomplete:

    "query":{"query_string" : {
        "default_field" : "field.name",
        "query" : "start_of_phrase*"

A-2. by adding to document mapping additional suggest field:

"mappings" : {
    "document_type": {
             "suggest" : {
                        "type" : "completion",
                        "index_analyzer" :"simple",
                        "search_analyzer" :"simple",

When you add document for indexing you have to specify this additional information and use special endpoint _suggest for request suggestions:

    "suggest_name" : {
        "text" : "k",
        "completion" : {
            "field" : "suggest"

Q: I want filtering of a search result by nested attribute!

  "query": {
    "filtered": {
      "query": {
        "match_all": {}
      "filter": {
        "or": [
            "nested": {
              "path": "document.nested_attribute",
              "filter": {
                "bool": {
                  "must": [
                      "terms": {
                        "document.nested_attribute.attribute_value": [
            "nested": {
              "path": "document.nested_attribute_1",
              "filter": {
                "bool": {
                  "must": [
                      "terms": {
                        "document.nested_attribute_1.attribute_value": [
                          "some string value"

Heterogeneous vector in c++ – overview of common approaches

So, you are wondering about heterogeneous vector in c++?
Maybe even dare to dream about any suitable substitution of such non-existent container?
In another word, you need a generic-like container that can store different datatypes.

If you just need a quick answer – stick with std::vector < boost::any> approach or read this,
If you need more technical-rich overview of purely templated solution – scroll down to links part,
If you are wondering about other options and don’t mind to dive in a world of bad English grammar and details about one of my recent task – read on.

heterogeneous container in c++

So, you want a heterogeneous container in c++…

First ask yourself – do you really need a heterogeneous vector?
If your answer is – yes, I have a bad news for you – in 99.9 percent it’s just consequences of messy design.
However, I am sure you are here for the sake of that one exceptional case: for example – your task is providing intermediate peace of software for interaction with some 3rd party old-fashion engine.

In my case – I was trying to implement convenient way of operating variable length list of parameters for OpenCL kernel wrapper.

If you are not familiar with mechanism of interaction of OpenCL code with C++, there are only one problem (sarcasm!) – it is too verbose. Off course there are numerous third party wrappers – http://streamcomputing.eu/knowledge/for-developers/opencl-wrappers/ but in situation where even NVidia drivers sometimes do not support all features from those-before-the-last standard,
I am afraid to think about cases when you have to deal with additional layer of external api.
Yeah, lets think about your own implementation because development of your own bugs is very entertaining and educational. It’s time consuming as well as terrible error-prone but you can narrow desired functional for your needs and be sure that all issues are made by you.

OpenCL kernels are compiled independently of host code, so you do not have any standard approaches for checking whether arguments provided to kernel have appropriate types and whether their amount corresponds to definition of kernel. It means that:
1) In case you are too lazy to somehow analyze every OpenCL kernels you can’t check how many arguments is necessary for particular kernel
2) You can’t check whether provided arguments have proper data type without any external parser

As a consequence, in general, my wrapper should be able to deal with variable length array of arbitrary any-type parameters.

(NOTE: Yeah, I’m familiar with undocumented c++ wrappers based on variadic templates, but it force you to follow their low-level nature by falling down from level of domain-specific objects to operating in terms of POD types.)

From that brief idea, I conclude that my goal was:

vector < gpu_arg > kernel_args;

where gpu_arg is a class that can encapsulate any data types – built-in as well as a user-defined.
Who have mentioned templates? How to create vector that can hold any templated parameter? (we discuss it a bit latter)

I approach – return to the ancient times
The most straightforward way – forget about C++ and rely on encapsulation of data into void* pointers with numerous C-way casting:

struct gpu_arg
  void* data;
  size_t size;
  // numerous helper methods here
  // NOTE: you have to add some kind of type_id to deal with data in proper way

I.e. kernel parser report that there should be following parameter set:

float, int, custom_class *

and when you start adding parameters it treat them as predefined types (with or without your own additional datatypes checks).

As an advantage of such idea – we can easily use it in run-time.
In addition, this solution is error-prone and can lead to cruel punishment during any code review.

II approach – std::tuple and variadic templates
On the other hand – in many cases when you are not keen to find a perfect silver bullet you may find helpful to simplify task. In order to check argument’s types you have to preprocess kernel, so you can form an expected parameter list. If you perform this operation during compilation of host-side code, you may use acquired parameter list to simplify code-generation task.
I started investigation of possible approaches and find out that the most obvious solution would be based on std::tuple:

// this C++11 container allows creation custom container like this:
std::tuple < int, int, bool, float *, unsigned short * > parameters_set;

or encapsulate it in a class with some syntax sugar for convenience:

template < typename ...T >
class arguments_set

    std::tuple<T...> data;
    template< size_t I >
    using data_type = typename std::tuple_element<I, decltype(data)>::type;

     *      variadic-based routine for initialize every element of tuple
     * */
   	template < std::size_t I = 0, typename TT >
	init ( TT & arg )
        std::get<I>( data ) = arg;

	template < std::size_t I = 0, typename TT, typename ...Args>
	init ( TT & arg, Args ... args )
		init <I,TT> ( arg );
		init <I+1,Args...>( args ... );


    template<typename... Ts>
    arguments_set(Ts... args) {
       init <0,Ts...> ( args... );

    template < std::size_t I = 0>
    auto get ( ) -> data_type<I>
        return std::get<I>(data);


Argh templates, well, who care about compiler’s effort to parse all this fluff?
Despite of convenience of variadic templates constructor, template metaprogramming is not easy to deal with during maintenance phase (as well as during developing).
On the other hand, it provides a desired result – strong type-checking combined with ability to generate variable length list of parameters.

So, solution was:
1) run pre-processor for OpenCL kernels in order to generate proper tuple for method invocation
2) compile the whole module

It is can be a solution in situation where you are not intend to run this mechanism in run-time.
(because it mean you have to dynamically extend templated class tuple objects)
Not my case.
Idea of pre-compiled kernels (PTX) sound great, but reality is sad – mess of drivers, hardware and vendor’s ambitions lead to incompatibility of generated binaries in general case. Not usable for me :(.
(But hope springs eternal – if you are lucky enough you can play with CL_CONTEXT_OFFLINE_DEVICES_AMD, http://clusterchimps.org/ocltools.php, http://www.browndeertechnology.com/coprthr.htm )

III approach – type erasure

Ok, let’s return to my preconditions once again:
1) I need type checking – template?
2) I need it in run-time – maybe some virtual stuff?!

What if I declare interface of argument as an abstract class and inherit it as a templated child with proper data fields


// pure abstract class
// in case of necessity can be further divided on pure interface\data-fields parts

class gpu_arg_base {
   *  interface part that depend on child's template parameter = a lot of virtual functions
   *  common data fields with ordinary setters\getters methods

template < typename T>
class gpu_arg : gpu_arg_base {
   * explicit override of interface with possible overloading
  gpu_arg ( T* init_data);
    T *data; // NOTE: just a reference to data, no allocation\de-allocation!

It allow me to use them in following way:

class kernel_wrapper {
    vector <gpu_array_base*> kernel_params;
     /* some stuff here */
        // variadic template functions to deal with any number of parameters
        template < typename T>
	add_kernel_args ( T * arg )
		add_kernel_arg ( arg );

	template <typename T, typename ...Args>
	add_kernel_args( T * arg, Args ... args )
		add_kernel_arg ( arg );
		add_kernel_args ( args ... );

        // generating of proper function for particular type
        template<class T>
        void add_kernel_arg( T * host_data )
	        gpu_arg_base* new_arg = new gpu_arg<T> ( host_data );
		kernel_params.push_back( new_arg );

     /* interface part */

But be careful with this easy-looking approach – there are two main issues which can affect your mood and calmness.
First, read about differences between overriding and hiding of methods in inheritance hierarchy here or here. It is a great source of confusion, especially during investigation of fresh bug-reports.
Second, do not forget about “covariant return type” rules – http://aycchen.wordpress.com/2009/08/17/covariant-return-type-in-cpp/.
Great article about possible caveats and workaround can be found there:

After reviewing all solutions described above, you may find that you accept additional dependency in exchange for absence of disastrous side-effects of your implementation.
boost::any or boost::variant can be a proper choice.

PS. Actually, I suspect that using tuples and dark magic of template metaprogramming, you can save few ticks of processor’s time by abandoning inheritance and virtual table, but as usual during development we have to balance between concept of the perfect code and requirements of too fussy world.


Example of heterogeneous container ( deeply nested approach)

Interesting practical example of tuple usage for ORM-like engine:
Some practical aspects of using tuples:

Using variadic templates to initialize tuples or other way round:

Any class in c++:

Illustration of variadic templates usages for generating C++11 variant class:

Hands-on experience with tuples:

How to convert png pair of RGB and Depth frames into Pointcloud library PCD format

There are a lot of accessible dataset of RGB-D data:


But usually it stored in PNG format and unfortunately Pointcloud library do not provide built-in function neither for treating it as a PointCloud nor for conversion it to PCD. For my experiments I need to test few points using data with ground truth estimation, thats why I have to code small utility for conversion purpose.

Due to sudden leisure I decided to share that peace of code – probably you can find it useful.


From readme:

png2pcd_batch – simple command line utility to convert depth and rgb frames
from png format to PCL pointcloud.

There are 2 execution mode:
using file with association information i.e. containing rgb-depth png files correspondence
or just providing folders that contain depth and rgb frames ( not reccommended ).

In 1st case you should anyhow create associate file by yourself
(for further details check description of parse_freiburg function)
In 2nd case – correspondence strictly depends on file names, and you should check it twice,
to avoid situation when selected depth frame is not appropriate for rgb frame.
( add sorting to filenames vector using custom predicate )

All dataset related parameters are incapsulated in Intr structure ( intrinsics ).
There are: width, height, fx, fy, cx, cy, scale_factor.
Usually depth data is saved as unsigned short ( 16 bit ),
but in pcl::PointXYZ you have to re-scale it to float – metric measurment.

Appropriate intrinsics should be written to file cam_params.cfg otherwise
default values will be used ( which may lead to invalid output data ).

There are exist two opportunity for compiling:

using classical make:
edit WORK_DIR in Makefile to point in directory contained pcl-trunk & opencv
it produce more lightweighted version by avoiding linkage with unnecessary libraries

or using cmake:

mkdir build; cd build
cmake ..

NOTE 1: There are only two dependencies:
PCL and OpenCV.
NOTE 2: in case of builded-but-not-properly-installed OpenCV libraries you have to
manually create symlink to ipp lib:
sudo ln -s /path-to-opencv/3rdparty/ippicv/unpack/ippicv_lnx/lib/intel64/libippicv.a /usr/lib/libippicv.a

Note 3: it have built-in support for 16bit unsigned depth and 3-channel RGB data only, in case your data has another format you have to change code a bit
Note 4: do not forget to provide appropriate intrinsics for proper calculation of XYZ vertex