With this article I want to go a little bit further than widespread memos “X most common data structures for interviews”. I am not only interested in information about asymptotic complexity. The real value – understanding whether particular algorithms or data structure can be beneficial in particular case: real life tasks rarely given with hints like “Use directed graphs, Luke!” or which comparison function should you implement to have your favorite sort algorithm work accordingly to upper bound complexity from textbooks. On this page you can find not just theoretical high level summary but also few facts that I am considering important, practical or just afraid to forget.
Few considerations worth to notice before speculating about “Best” sorting algorithms:
- How much data we are dealing with?
- Dozens numbers or billions?
- What is nature of data – i.e. what is cost of element comparison?
- Integer\strings\complex objects? For example it may involve disks seeks and we should aim for algorithms that minimize comparison or even stick with non-comparison based algorithms?
- What is the cost of elements transfer?
- Should we aim to algorithms with minimum number of swaps/data copying
- What is current state of data? Does it partially sorted?
- In case of partially sorted data some algorithms may touch they upper bounds but adaptive class of algorithms can deal with this input in efficient manner
- Are data allow random access?
- For example: linked list vs arrays. Classical quicksort on linked list always be inefficient.
- Are data fit to the memory?
- Otherwise we need modifications of algorithms that support external sorting.
- Are we are dealing with memory constrained environment?
- Can allow auxiliary space or have to stick with in-place algorithms?
- Is there possible duplicative values in array to be sorted?
- Do we need stable sorting – i.e. should they keep their initial order from input array?
Comparative based algorithms – i.e. we compare the whole values of input array (as opposite to distinct bits of those values). Any comparative based sorting algorithms lower bound worst case complexity – N log (N).
Comparative based algorithms – out of our league – not used in reality, but sometimes can be asked during interview
|Algorithm||Time complexity||Space complexity||Comments|
|Bubble sort||O ( N )||O ( N ^ 2 )||O ( N ^ 2 )||O ( 1 )||
|Select sort||O ( N ^ 2 )||O ( N ^ 2 )||O ( N ^ 2 )||O ( 1 )||At i-th iteration find index MIN of the smallest remaining entry and swap it i-th element with MIN.
N Exchange. (Stable for list.)
Practical comparative based algorithms
|Algorithm||Time complexity||Space complexity||Comments|
|Quicksort||O (N Log N)||O (N Log N)||O ( N ^ 2)||O ( Log N )||O( Log N ) probabilistic guarantee.
Pick a pivot, partition array in order to group all elems less than pivot in the left subarray, and those which are bigger – to the right.Recursively apply.
Pivot may be chosen randomly but there are better strategies like median of medians.
|3-Way Quicksort||O (K * N)||?||O ( (2 ^ k) * N)||O ( Log N )||Used in case of duplicate keys.
Pick a pivot element and partitions elements into three sets: less, equal or greater than pivot.
There are variants with two pivots.
|Mergesort||O (N Log N)||O (N Log N)||O (N Log N)||O ( N )
NOTE: For linked lists:
|Divide array into two sub-arrays and sort them recursively and merge in sorted order.
In simplest version – till sub-array have size 1, in more optimised – till it reach size of threshold (10-20 elements) and another sorting algorithms can by applied (insertion sort for example).
Easy modification for parallel and external sorting (aka not-fit in memory).
Less comparison than quicksort.
More element transfer than quicksort.
|Heapsort||O (N Log N)||O (N Log N)||O (N Log N)||O ( 1 )||Construct a max heap mapping array’s indexes to binary search tree.
After constructing recursively delete root from tree until it is not empty.
|Insertion sort||O ( N )||O ( N ^ 2)||O ( N ^ 2 )||O ( 1 )||At i-th iteration swap i-th element with each larger entry to its left.
Useful for partially sorted array or small arrays – less than 10 elements.
|Shell sort||O ( N )||?||O (N ^ (3 / 2))
O(N * (Log N) ^ 2)
|O ( 1 )||Like insertion sort but we rely on D-order sort, where D – gap between sorted entries.
All estimations depend on choosing gap sequence.
Used in embedded devices due to non-recursive nature – no need for support deep and memory hungry call stack).
|Timsort||O ( N )||O (N Log N)||O (N Log N)||O ( N )||Based on Merge sort and Insert sort:
tries to find a consecutive sequence of ordered elements and merge them to decrease necessary processing to final results.
Used in Java and Python as default sorting algorithm.
|Introsort||O ( N )||O (N Log N)||O (N Log N)||O ( N )||Hybrid between quicksort & heap sort.
Used in c++ as default implementation of sorting.
|Treesort||O (N Log N)||O (N Log N)||O (N Log N)||O ( N )||Build binary search tree from elements and traverse it in-order.
Variations with self balancing binary search tree or Splay tree (Adaptive).
Estimations based on usage of balanced trees.
|Cubesort||?||?||O (N Log N)||O ( N )||Parallel sorting algorithm
Non–comparison based sorting algorithms. For such kind of algorithms we do not compare the whole keys (values) but using individual bits of values (for example characters for strings or digits for numbers) – it allows to achieve linear time O(N) complexity. Price for it – necessity to tune it for every new type.
|Algorithm||Time complexity||Space complexity||Comments|
|Radix sort||O ( N * K )||O ( N * K )||O ( N * K )||O ( N + K )||Examines individual bits of keys.
K – max number of digits in array’s elements.
Applicable for integer numbers, string and float numbers in special formatting.
Worth to use in situation when N >> K and K is fixed – famous question about sorting of million of 32-bit numbers.
MSD vs LSD – most significant vs least significant digits.
|Bucket sort||O ( N + K )||O ( N + K )||O ( N ^ 2 )||O ( N * K )||Partitioning an array into a number of buckets.
Each bucket is sorted either by recursive application of bucket sort or using other algorithms.
Can be used as external sorting algorithms.
Asymptotic given for cases when K ~ N.
Applicable in case data evenly distributed over a range.
|Counting sort||O ( N + K )||O ( N + K )||O ( N + K )||O ( N + K )||Count number of keys related to item, use this count to determine position of elements.
Asymptotic given for cases when K ~ N.
Applicable in case N ~ K.
N-th biggest\smallest element
just a variations of partial sort:
- QuickSelect – worst case O(n^2) best case and average – O(n) – as quicksort with pivot based partitioning but move in one direction only
- Median of medians – improve worst case of quickselect to O(n) – partition input to group of several(5) elements, find median within every group, and return median of n/5 medians
- Introselect – back up nth_element at C++ STL – start as classical quick select and switch on median of median method in case depth of recursion is too deep and at the same time sizes of sub-partitions observed so far not leading to halving.
Family of hashing functions: H(X) = ((a * x + b) mod p) mod N) a,b < p, a!= 0, p – prime number used to reduce number of collisions (it depend on your keys distribution).
Two main approaches: separate chaining vs open addressing (linear probing) with double hashing.
- Separate chaining – array of lists, at the very worst case – O(N) to retrieve record!
- Open addressing – circular buffer – i.e. in case of collisions try to insert record to next un-occupied cell – which lead to poor performance in case of bad choice of hashing function. Load factor – how it will perform if 25\50\80 % of cells of array will be occupied. Used for hashtable implementation in python.
Separate chaining advantages:
- Easy implementation of delete method
- Performance degrades gracefully
- Clustering is less sensitive to poorly designed hash functions
Linear probing advantages:
- Less wasted space (pointers)
- Better cache performance
There are various use cases & requirements for hashing that affect characteristics of good hash function:
- Quickly insert\return records with mapping Key to Value – cheap cost of computing hash and low number of collisions
- Find similar or close entries – need to maximise collisions if keys meet some criteria – locality sensitive caching (for example GeoHash) with rolling hash to speed up computations, another famous example: Rabin Karp string search algorithm.
- Make it difficult to find key corresponding to particular hash value – one way function – Cryptography related hashing – fixed size output for any input, no collisions and at the same time almost impossible (time and computation consuming wise) to find input that produce particular hash
- Cornerstone of more advanced data structures: bloom filter, merkle tree, distributed hash table.
- Interesting overview of details of hash implementation in various languages can be found here.
Root always contains element that satisfy heap property: min-heap or max-heap. Elements not sorted but partially ordered. Heaps may be constructed in O(N) using Floyd algorithm. Usually used as underneath structure for priority queues. Leftist heap as example of pure functional data structure.
|Heap’s underneath datastructure||Time complexity of operations||Comments|
|Find Max||Extract Max||Increase Key||Insert||Delete||Merge|
|Linked List (unsorted)||O ( N )||O ( N )||O ( 1 )||O ( 1 )||O ( 1 )||O ( 1 )|
|Array (sorted)||O ( 1 )||O ( 1 )||O ( N )||O ( N )||O ( 1 )||O ( N + M )|
|Binary Heap||O ( 1 )||O (Log N)||O (Log N)||O (Log N)||O (Log N)||O (N + M)||Heapify: O(N)
Binary Heap (array representation) – ordered complete binary tree – parent’s key no smaller than children’s key. Complete binary tree.
|Binomial Heap||O (Log N)||O (Log N)||O (Log N)||O (Log N)||O (Log N)||O (Log N)||A binomial heap of order K has a root whose children are roots of binomial trees of orders K-1, K-2, … , 2, 1, 0 – in this order.
Each binomial tree is a heap obeys the minimum-heap property: the key of a node is greater than or equal to the key of its parent.
There can only be either one or zero binomial trees for each order, including zero order.
Support quick merging of two heaps.
|Fibonacci Heap||O ( 1 )||O (Log N)||O ( 1 )||O ( 1 )||O (Log N)||O ( 1 )||Collections of trees.
Used for priority queue.
Degree of nodes (degree means the number of children) are kept low;
every node has a degree at most O(Log N) and the size of a subtree rooted in a node of degree K is at least F(K) + 2, where F(K) is K-th Fibonacci number.
Amortised constant factor affect practical usage. High memory consumption.
Full tree – every node except leafs have non-null children. Complete tree – full tree that have all nodes at the last level are lean left.
- Pre-order – via dfs
- In-order – via dfs
- Post-order – via dfs
- Level-order – via bfs
Search data structures
- Size of element (read as cost of access\copy\compare operations)
- Volume of data: can it fit to RAM?
- Usage patterns:
- evenly mixed read & writes or skew for particular operations?
- access single key or range?
- access to value using the whole key or interested to find all values where even parts of key matches?
|Data structure||Average time complexity||Worst time complexity||Auxiliary Space||Comments|
|Arrays||O ( 1 )||O ( N )||O ( N )||O ( N )||O ( 1 )||O ( N )||O ( N )||O ( N )||O ( N )||Cache friendly, low memory overhead.
Pre-allocated vs dynamic.
|Linked list||O ( N )||O ( N )||O ( 1 )||O ( 1 )||O ( N )||O ( N )||O ( 1 )||O ( 1 )||O ( N )||Size of data vs size of pointer(s).
One directional vs bi-directional vs circular.
Sequential access vs random.
Internal or external storage (i.e. node contains only reference to actual data)
Foundation of stack, queue and dequeue.
|Skip list||O ( Log N )||O ( Log N )||O ( Log N )||O ( Log N )||O ( N )||O ( N )||O ( N )||O ( N )||O ( N Log N)||Allow fast search using additional references hierarchy: next, the next after next, every 4th, i-th level – should contain 2 ^ i elements, but it used probabilistics skips – i.e. at i-th level, next reference may point to 2^i element or further.
Alternative for hash table and balanced tree – theoretical worst case performance can be worse, but in practice usually faster – depend on nature of data and cost of comparison operator.
Foundation of various lock-free implementations for dictionaries and priority queues. Used in Redis, LevelDb and Lucene.
|Hash table (symbol table, associative array)||?||O ( 1 )||O ( 1 )||O ( 1 )||?||O ( N )||O ( N )||O ( N )||O ( N )||Not best in case of small number of keys in comparison to direct mapping key -> array index. In practice, particular operation can be slow because of array resizing and cache un-friendly nature may lead to poor performance in comparison even with O(N) brute force lookup of simple arrays. C++: unordered_map, multimap
|Binary Search Tree||O ( Log N )||O ( Log N )||O ( Log N )||O ( Log N )||O ( N )||O ( N )||O ( N )||O ( N )||O ( N )||Complete if it perfectly balanced, except for a bottom level. In worst case (depend on insertion order) it can be N-height. Constructing as quick sort. Application of problem: 1D-range search.|
|Cartesian Tree||?||O ( Log N )||O ( Log N )||O ( Log N )||?||O ( N )||O ( N )||O ( N )||O ( N )||It is heap-ordered and symmetric (in-order) traversal of the tree returns the original sequence. Used for RMQ (range minimum query) kind of tasks.|
|B-Tree||O ( Log N )||O ( Log N )||O ( Log N )||O ( Log N )||O ( Log N )||O ( Log N )||O ( Log N )||O ( Log N )||O ( N )||
Not just two child anymore => heigh of tree lower => lookup faster.
|2-3 Tree||O (C * Lg(N))||O (C * Lg(N))||O (C * Lg(N))||O (C * Lg(N))||O (C * Lg(N))||O (C * Lg(N))||O (C * Lg(N))||O (C * Lg(N))||?||Maintain symmetric order and perfect balance.|
|Red-Black Tree||O ( Log N )||O ( Log N )||O ( Log N )||O ( Log N )||O ( Log N )||O ( Log N )||O ( Log N )||O ( Log N )||O( N )||
Balance is preserved via painting different nodes in Red\Black colours to satisfy some predefined properties. Re-arranging and re-colouring can be performed efficiently.
For example: no node has 2 red links connected to it. Red links lean left. Every path from root to null link has the same number of black links – perfect black balance.
C++: set and map.
Java: TreeMap and TreeSet (and in some situation HashMap fallback on their usage as well!)
|Splay Tree||O ( Log N )||O ( Log N )||O ( Log N )||O ( Log N )||O ( Log N )||O ( Log N )||O ( Log N )||O ( Log N )||O( N )||
Splay Tree is a self-adjusting binary search tree with additional property that recently accessed elements are quick to access again.
Applications: cache, garbage collectors, memory allocators.
|AVL Tree||O ( Log N )||O ( Log N )||O ( Log N )||O ( Log N )||O ( Log N )||O ( Log N )||O ( Log N )||O ( Log N )||O( N )||Balance in AVL Tree – the heights of two child subtrees of any node differ by at most one. AVL Tree better for look-up intensive applications than Red-Black Trees|
|Kd Tree||?||O ( Log N )||O ( Log N )||O ( Log N )||O(N Log N)||O ( N )||O ( N )||O ( N )||O( N )||
Space partition Binary Tree. Used for range search and nearest neighbour search. Constructing N Log N. Not suitable for high dimensional space – better to use when N >> 2^k.
Octree for 3D space.
More advanced structures worth to be aware of:
- Trie (Prefix Tree) and its variations, mainly Radix Tree for lookup intensive applications. Worst case – O(M), where M is the length of string to search. DAFSA – can be a great alternative for cases with limited memory.
- Generalized Suffix Tree – search efficiently with mismatches, applications for various string related tasks: longest common substrings, longest repeated substrings, longest palindromic substring,
- Fenwick Tree (Binary indexed tree)– common use case: modify element of array and return result of some inverse operations for range of elements (for example partial sum but not max). Simpler and faster in comparison with segment trees. O(N) build time.
- Segment Tree – classical approach for RMQ kind of tasks, operation not necessary to be inverse. Double memory requirement in comparison to fenwick tree. Constant factor affect speed in practice, especially in cases of multiple dimensions.
- Implicit Treap – sequence of pairs, heap ordered by 1st entry, BST ordered by the 2nd. If we use index of element in array as first entry – than we can
- insert\remove element at any position
- move subset of elements in any position
- computation of some function for elements within some interval [i, j] – sum, min – including range updates using lazy propagations
- foundation for Rope implementation – to have similar asymptotic for very long strings
- and all of the above in O(Log N) – with huge constant factor though
- vEB Tree – ordered dictionary or associative array much faster (exponentially) than BST, memory overhead
- Heavy-Light Decomposition: split tree on multiple not-crossing paths. Edge from A to B is considered heavy if number of vertexes reachable from B is at least half as big as number of vertexes reachable from A. Heavy-Light Decomposition – union of light vertexes from bottom of tree to the root. Such decomposition allows us to serve queries like “Get some metric on the path between A and B” – for example sum or max. Not needed in case we do not have update queries. Usually used with segment tree and LCA.
Union Find (Disjoint set)
Approaches: colouring, trees on top of arrays using ranks\randomisation\path compression & merging either randomly or using ranks of trees. Additional information can be stored with leaders of set.
Applications: offline LCA, offline RMQ, dynamic connectivity variables dependencies during program analysis, variations of percolation problem – connections in social network, fluids or electricity flow.
Representation: incidence matrix vs adjacency list. Generally adjacency list is preferable for sparse graph, but there are variation of incidence matrix based on associative array where pair of vertexes used as a key.
Cut – set of vertexes from graph, if they will be removed – graph become disconnected. Size of min cut – connectivity or K-connectivity.
Graph related tasks
- Find cycles in graph – union find (complexity – Akkerman function from number of vertexes) or DFS – proportional to number of vertexes
- Determine whether graph is bipartite – dfs
- Eulerian path – route visiting every edge exactly once (exist if all vertexes has even degree and it is connected) – search of such path can be done in linear time O(|E|)
- Hamiltonian path – route including every vertex exactly once (as in Travelling salesman problem)
- Comparison of graph – graph’s isomorphism (still open question N or NP)
- Graph representation in such way that there are no edge crossing, except in vertexes (upward planar graph) – there are some special cases where such checks can be done in O(N) but generally it is NP-hard problem.
- Minimum Spanning Tree (MST) – application is design of network (For example to avoid cycle of packets in Ethernet) or clustering, with custom distance function.
For directed graphs:
- Topological sorting – tasks with ordering limitations, schedule related tasks (DFS) – create schedule in case we do not have cycle – re-draw graph in such way that all path follow from bottom to top – reverse dfs postorder (additional stack) = result
- path finding – Dijkstra, if weights are negative – Bellman-Ford or its optimised version – Shortest Path Faster Algorithm (SPFA)
- Shortest path for multiple sources – dense graph with negative weights – Floyd-Warshall or not dense – Johnson’s algorithms or Dijkstra
- Search for cycle – DFS – detection of deadlock in concurrent systems or topological sort
- Maximum flow – Ford–Fulkerson algorithm , Bipartite matching problem, baseball elimination
- Dynamic connectivity
- 2-SAT – satisfy system of constraints of boolean variables – scheduling, medicine testing. Can be represented as directed implication graph. Checking whether two vertexes are within the same connected component allow to answer question whether we can satisfy all conditions.
- Knuth-Morris-Pratt (KMP) – based on DFA (deterministic finite state automaton) that have to be pre-computed for pattern. O(R + N) where R – length of pattern and N – length of text. O(RN) space.
- Boyer-Moore – scan from the end of pattern and in case of mismatch skip up to R characters – worst case is as brute force O (R * N) but on average considered the most efficient algorithms – O (N / R). O(R) space.
- Rabin-Karp – compute sliding hash for every substring of length R for hashing can use modular hashing with some prime number. For efficiency sake Horner’s method can be utilised to re-use computation from previous steps – O(R). Hash match – still can be a collisions so there are two variations in case of hash matches. 2D search or search for multiple patterns using pre-computed table of hashes. O(1) space.
- Aho-Corasick – in case we are dealing with multiple patterns and text where we search them are different. Build trie for all words, extend it to special finite state machine without backtracking – for failed patterns using suffix links. Used in grep – -F flag.
- Z-functions: Given string A, consider array of all possible suffixes: A[len(A)], A[len(A) – 1], … , A. Compare symbols from the start of string and every suffix – number of matched symbols – represent particular component of Z-function. For sub-string search: concatenate pattern, unique delimiter and original string and check if values of Z-function equal to length of pattern.