Architecture of a modern startup

Hype waves and pragmatic evidence vs. the need to move fast

The tech side of a startup can be very fluid and full of unknowns. What tech stack to use? Which components might be overkill for now but worth keeping an eye on for the future? How do you balance the pace of business feature development while keeping the quality bar high enough to have a maintainable code base?

Here I want to share our experience of building https://cleanbee.syzygy-ai.com/ from the ground up – how we shaped our processes based on our needs, and how those processes evolved as we extended our tech stack with new components.

Business wants to conquer the market; engineers want to try cool stuff and stretch their brains. Meanwhile, the industry produces new languages, frameworks and libraries in such quantities that you will never be able to check them all. And, usually, if you scratch the shiny surface of the NextBigThing, you will find a good old concept. Good – if you are lucky.

One of the most exciting topics to argue about is process: whether you rely on trunk-based development or prefer the more monstrous GitHub flow, whether you are a fan of mobbing or find it more efficient to spend time in PR-based code reviews.

I have experience working in an environment where artifacts were thrown at users without any standardized process, and, in case of issues, developers had a lot of fun (nope!) trying to figure out which version of the components was actually deployed.

On the other end of the spectrum is the never-ending queue to CI – after you create a PR, you get to entertain yourself for the next 30 minutes by betting on whether the CI cluster will find resources to run tests over your changes. Sometimes the platform team introduces new, exciting and certainly very useful features that break compatibility with the existing CI boilerplate, so all your checks fail at the last minute, after an hour of waiting.

I have a strong belief that, as usual, it all depends – on team maturity, the kind of software you are building and various business constraints, such as the existence of an error budget and the importance of time-to-market versus SLXs.

I think what is really important is to have some agreed process in place that everyone is aware of and follows – as well as the guts to challenge and change it if there is evidence that a better alternative exists.

Start shaping the process

What we had at the start:

  • fewer than a dozen developers – an in-house team and temporary contractors
  • a completely greenfield project – not a single line of code written yet; requirements a bit vague, but already starting to take shape
  • tech-wise – a clear need for a backend that should talk to mobile clients
  • and some simple web frontend – static pages should be enough! (nope)

We started simple – code on GitHub and a PR-based flow with a single requirement: tickets have to be sliced so they can be delivered in 1-3 days. This required some practice in story slicing, and it seems that the sense of visible, fast progress – the ability to move tickets to Done – can be a great motivational factor for getting the team on board with the idea.

Linters and static analyzers let us skip exciting discussions like how many arguments per method are too many (6!), and we gradually added auto-tests. We also tried CodeScene – they have a very promising approach of highlighting the important parts of the code (the bits that change frequently should definitely have a higher maintainability bar!) and identifying complexity by looking at the level of nesting. It is probably a bit expensive for a startup at the initial stage, but it 100% provides a decent level of hints for engineers.

On the architecture side of things, there was a temptation to dive deep into the wonderland of microservices. But looking at the horrifying diagrams of connections between them published by the big players, and at the need to trace requests across all of them, it really seemed like a suicidal approach for an early-stage team that wants to move fast.

Analysis of the requirements allowed us to identify three groups of jobs:

  • a core API with the usual CRUD-like activities
  • search and recommendations
  • temporary workloads that do something very useful on a schedule (roughly on time – casual delays are OK)

Choice of tech stack – in a situation when time is a bit limited and expectations are high, use what you know and have mastered (yeah, maybe for some it is boring technology). Hence FastAPI, REST, stateless services, Python, Redis and Postgres are our best friends (yeah, we like Go and Rust – but we need to pay our dues a bit more first!).
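To make that "boring" choice concrete, a typical read endpoint in this stack looks roughly like the sketch below – FastAPI in front of Postgres, with Redis as a cache. The Booking model and the fake table here are illustrative, not our real domain code.

# A minimal sketch of the "boring" stack: FastAPI in front of Postgres, with Redis as a cache.
import json

import redis
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()
cache = redis.Redis(host="localhost", port=6379, decode_responses=True)

_FAKE_DB = {1: {"id": 1, "status": "scheduled"}}   # stand-in for a real Postgres table


class Booking(BaseModel):
    id: int
    status: str


@app.get("/bookings/{booking_id}", response_model=Booking)
def get_booking(booking_id: int):
    key = f"booking:{booking_id}"
    cached = cache.get(key)
    if cached:
        return Booking(**json.loads(cached))
    row = _FAKE_DB.get(booking_id)                 # real code: SELECT ... FROM bookings WHERE id = ...
    if row is None:
        raise HTTPException(status_code=404, detail="booking not found")
    cache.setex(key, 60, json.dumps(row))          # keep it hot for a minute
    return Booking(**row)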

With the mobile clients the situation was a bit different – we foresaw a lot of screens with state and interactions with remote services, but not much custom, platform-specific tweaking – hence the idea of having a single code base for both iOS and Android was very appealing.

Nowadays the choice of frameworks is really wide, but again, due to some existing experience with Flutter, we decided to give it a go. Within mobile development, one of the important aspects to decide on early is state management – and here you will find a nice abundance of acronyms to be puzzled by, coming from various languages and frameworks: MVC, MVVM, VIPER, TCA, RIBs, BLoC. Our motto: start with the simplest (*) solution sufficient to support the necessary functionality. (*) Simple – well, let's put it this way – something we think we understand.

However, we definitely made one mistake after building the MVP – we decided to build on top of it instead of throwing it away. Hence, one wonderful (nope!) sunny day I was questioning my sanity: after commenting out code and cleaning all possible caches, I still didn't see my changes on the new screen. Yeah, dead code should be removed!

Start building!

After those initial formalities were settled, the next necessary thing was to be able to check client-server interactions.

An API contract is for sure a great thing, but it becomes much more obvious that something is wrong when a real server throws a “schema validation error” at you or miserably fails with an HTTP 500.

The backend services were initially split into two groups – the API monolith and Search & Recommender. The first contains more or less straightforward logic for interacting with the DB; the second contains CPU-intensive computations that might require a specific hardware configuration. Every service is its own scalability group.

As we were still thinking about the rollout strategy (and arguing about which domain to buy), the solution was simple: to minimize the struggles of mobile engineers dealing with the backend (an alien stack for them), let's pack everything into Docker.

Once we had prepared everything to be deployable locally, mobile engineers could run docker-compose commands and have everything ready (after a few painful attempts that revealed flaws in the documentation – but the real value of such exercises is to react to every “WTF!11” and improve things).

`Everything` is good, but what is the point of an API running on top of an empty DB? Manually entering the necessary data quickly leads to depression (and risks increasing the duration of development cycles). Hence we prepared a curated dataset that gets inserted into the local DB, so there is something to play with. We also started using it for auto-tests. Win-win!11 Auth becomes less problematic when defining test scenarios if you have dozens of dummy users with similar passwords!
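In practice the same curated seed can feed both the local docker-compose setup and the test suite. A rough sketch of the testing side (the file layout, table and credentials are made up for illustration):

# Sketch of how a curated seed dataset can serve both local runs and auto-tests.
import json
import pathlib

import pytest
from sqlalchemy import create_engine, text

SEED = pathlib.Path("seed/users.json")                 # curated dataset kept in the repo
engine = create_engine("postgresql://app:app@localhost:5432/app_local")


@pytest.fixture()
def seeded_users():
    users = json.loads(SEED.read_text())
    with engine.begin() as conn:
        conn.execute(text("TRUNCATE users CASCADE"))
        for u in users:
            conn.execute(
                text("INSERT INTO users (email, password_hash) VALUES (:email, :password_hash)"),
                u,
            )
    return users


def test_login_with_dummy_user(seeded_users):
    # dozens of dummy users share a known password, so auth scenarios are trivial to script
    assert any(u["email"].endswith("@example.com") for u in seeded_users)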

Trying new things and choosing 3rd-party providers

Dealing with new technology is always a bit dangerous – you and your team can't know everything (and sometimes the things you think you know can fool you, but that's another story). And still, it is often necessary to assess and investigate things that no one on the team has touched.

Payments, email, chat, SMS, notifications, analytics, etc. – every modern application is usually a bunch of business logic glued together with a number of 3rd-party providers.

Our approach to choosing whom we work with: time-capped try-to-build-with-it activities for the most promising candidates, chosen by features, supported languages and, in the case of providers, pricing.

How did we get into Terraform?

The backend, apart from the DB, also needs some object/file storage. Sooner or later we also need DNS, so our services are ready to play with the big cruel world.

The choice of cloud provider was again based purely on existing expertise within the team – we already use AWS for other projects, so we decided to stick with it. For sure, it is possible to do everything in the AWS console, but as time goes by, things become a classic big ball of mud that everyone is terrified to touch and no one remembers why this or that bit exists at all.

Okay, it seems the paradigm of infrastructure as code can be handy here.

Tooling-wise, the choice is not so big – vendor-specific options (AWS CloudFormation, Google Cloud Deployment Manager, Azure Automation) or Terraform and its rivals.

Based on our experience with Terraform… you already get the idea of how we choose things? 😀

Yeah, the initial setup will take some time (and, without control, it can easily become the same big ball of mud in TF as well 😀), but at least the infrastructure will have some documentation and visibility into WHY things are there. Another major advantage: whatever you manage through TF will be updated automatically (well, when you or CI/CD run the corresponding commands).

Secrets management

For AWS itself, given that we run everything inside AWS, we can rely on IAM and assumed roles by attaching the necessary policies to our VMs. But we need integration with 3rd-party services, as well as some way to pass secrets to our apps – for example, the password for the DB. We needed some solution for secrets management. AWS has KMS, GitHub Actions has its own secrets, and apart from those there are a bunch of other providers. So the real question is: what do YOU need from secrets management?

  • audit
  • path-based access
  • integration with Kubernetes
  • ability to issue temporary credentials
  • web UI
  • free
  • secrets versioning
  • … ?

KMS was very handy, and we managed to plug it into GitHub Actions, but the UI of Vault and the ability to use it for free (if you run it yourself) were what sealed the deal on this matter.
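Once Vault is in place, an app (or its entrypoint script) can pull what it needs at startup. A rough sketch with the hvac client, assuming a KV v2 engine – the mount point and secret path here are made up for illustration:

# Rough sketch of reading a DB password from Vault (KV v2) with the hvac client.
import os

import hvac

client = hvac.Client(
    url=os.environ["VAULT_ADDR"],        # e.g. https://vault.internal:8200
    token=os.environ["VAULT_TOKEN"],     # in CI this comes from the runner's auth, not a hardcoded token
)

secret = client.secrets.kv.v2.read_secret_version(
    mount_point="secret",
    path="backend/postgres",             # illustrative path
)
db_password = secret["data"]["data"]["password"]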

Path to Kubernetes:

And once we had a dockerized app, we started considering Kubernetes, as it offers a few goodies out of the box – the most important being the ability to spin up the necessary number of pods to meet performance demands, and the ability to define everything in a declarative fashion – so that, given a sufficient level of automation, no human being has to run kubectl apply. AWS has EKS to start with, and it can be managed via Terraform.

On the other hand, the steep learning curve (to grasp the idea that you define exactly what should be up and running) and a bit of specific tooling to play with were fair reasons to think about it twice.

Helm charts.

If we are talking Kubernetes, and we already have apps in Docker that are released on every merge to main, Helm charts become the next step in adopting the modern infra stack: we plugged in AWS ECR to keep track of every new release and publish Helm charts to a dedicated S3 bucket, which became our internal Helm chart registry.

Plugging it all together was not as straightforward as expected – the Kubernetes nodes initially couldn't connect to ECR and pull the necessary Docker images, the terraform module (aws-ssm-operator) intended to work with secrets in AWS KMS was deprecated and didn't support the recent Kubernetes API, and secrets and config maps weren't in the mood to be exposed into pods.

The first rollout of services brought happiness to the mobile folks – no need to care about instructions for local setup! For the initial week or so it was not really stable, but then – one less thing to care about.

Do you need all of it? Not necessarily.

I must admit, this mix – Kubernetes with Vault via Terraform & Helm – is probably not for everyone, and you most likely will not need it at the initial stage. A simple docker push to ECR on merge to main, plus ssh into EC2 && docker pull && docker-compose stop/start during a release from CI/CD, can work well (at least for the happy path) AND will be clear to everyone at first glance. That's exactly how we re-deploy our static websites at the moment – CI builds a new version and just copies it into the corresponding S3 bucket.

Maturing the infrastructure

AWS is nice enough to offer credits to those who are crazy enough to explore the shady paths of the startup world. Can we use them to save a few bucks on GitHub minutes and expose fewer secrets and less infra to GitHub's VMs?

How about self-hosted runners – i.e. when you open a PR, it is not GitHub's VMs but your own Kubernetes that allocates a pod to run your CI checks? Sure, it is not easy to prepare everything for iOS releases (more about that below), but Android and backend builds surely should work on good old Linux?!

We built it via dedicated k8s pods, but there is also the option to run checks on spot EC2 instances.

Observability and Co

There is a lot of marketing fluff around terms like monitoring & alerting.

In some companies those things are implemented just for the sake of bragging “We have X for that!”, while engineers are still blind to what is happening with their production when there are real issues, or the alert channels have to be muted because they contain non-actionable noise.

And I must say – here we still have a looong way to go.

The first thing you will find as soon as you search for this kind of solution is the ELK stack and a bunch of paid providers. Now, after measuring the time and effort needed to maintain our own setup, I am starting to think that a paid solution might really be worth it – if and only if you can truly delegate the burden of squeezing out the most important info about your apps and the state of your infra to an existing solution. It all depends on whether they have presets of metrics, log parsers and index mappings that you can easily adapt for your project.

For logging we currently rely on ELK. Yeah, it is more or less straightforward to set up, and most likely there are people who find Elastic's query language very convenient to use on a day-to-day basis.

Here we are still exploring options – it seems that good old kubectl logs with grep produces insights for questions like “What is the last error from app1 pods?” in a much more timely fashion, without getting lost among endless UI controls. But most probably the Kibana UI still hides the levers we should pull to add a proper ingestion pipeline and choose the corresponding mapping of the Elastic index for Filebeat.

For alerting we set up Prometheus and integrated it with Slack – again, mainly because we had experience with it beforehand.

Now, why, with all of that, do we need Azure?!

As usually happens when a product evolves, new requirements introduce new kinds of things:

  • now, apart from having something publicly visible, we need some resources available to the team only
  • to manage feature flags, access the Vault UI or struggle with Elastic to figure out the last API error.

Sure, there are paid solutions for that, or you can mix an Identity-as-a-Service provider (Azure Active Directory) for authenticating your teammates with any VPN provider (we chose OpenVPN due to their free tier), exposing the necessary services to the internal network only, so that those who should be able to can log in using their credentials. And it has one clear advantage compared with the AWS stack – it is free (for a limited number of connections).

Okay, why do we need Google Cloud!?!

So far we have mainly been discussing the backend side of things. But there is more. The thing you see first – the mobile apps! Flutter or something else, they also have to be built, linted and tested. And published somehow, somewhere, so stakeholders can immediately be in awe of the new features (and find new bugs).

For rolling out to production you need to pass through a bunch of formalities (screenshots, change log = what's new, review) that will delay your audience from enjoying those pieces of art.

I must say that the stores' APIs are not really friendly to frequent releases – when you build and sign an app, publishing can easily take 15+ minutes. The app store APIs – like every other API – may and will fail sooner or later. Yeah, and signing might be a nightmare, as it differs between platforms. And it would be REALLY nice if engineers didn't waste time on all of those things, preparing releases from their laptops.

The first (and probably the only?) thing you should consider is fastlane – initially I had some prejudice against all those new terms like gems (love that name though!) and bundles, but it really works. Yes, in order to run it from CI, some effort will be required to deal with secrets – jks for Android or match for iOS.

Towards the “dark” side

Next, you will start thinking about app distribution: TestFlight is a handy tool in the iOS world, but what about Android? We ended up using App Distribution – a solution from Firebase – mainly because it worked for us on the first try, but there are other options (some of which actually claim to work for both platforms). What is important – you can do everything from fastlane! Even as your app evolves and you start adding various extras – analytics, chats, maps, geo, … – many of them come from Google directly or from Firebase. As Firebase offers many goodies, it was a natural step to collect analytics events there and, after a bit of tweaking with IAM policies, set up export of raw events into GCS buckets to be able to play with BigQuery.

Prod vs Staging – The Great Split!

For the backend we had auto-tests right from the start, and various practices like test doubles proved quite efficient at preventing regressions, even in cases of complex business logic with integrations with external services. On the mobile side we were a bit limited due to the coexistence of code from the MVP, and auto-tests were not so helpful for complex business scenarios – like someone wanting to use our services when we can't charge their bank card.

Manual testing was very time-consuming and error-prone, especially when business logic evolved dynamically over time and the state of data in the database after recent updates became impossible from the point of view of domain rules.

Yeah, so it would be nice to be able to run e2e tests clicking through the app with data that we maintain (and are sure is valid). It would be REALLY nice if those tests didn't pollute the actual database, S3 buckets and 3rd-party providers.

We started with a single main branch and a single environment (RDS, Redis, k8s namespace and S3) that was used by the first testers and developers. We were not exposed to the public yet, but as we moved closer and closer to release, it became clear that some kind of distinction was necessary between places where we can break things and a stable environment.

In the mobile applications it was just a matter of changing the API URL at build time. On the backend, a few things had to be done to support deploy-specific configurations: on the infrastructure side, creating dedicated policies and resources; in the code, parameterizing the few bits where specific URLs were expected.
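On the code side, that parameterization mostly boils down to reading everything environment-specific from the pod's environment instead of hardcoding it. A sketch of the idea with pydantic's BaseSettings (field names and defaults are illustrative; with pydantic v2 the class lives in the separate pydantic-settings package):

# Sketch of parameterizing environment-specific bits instead of hardcoding URLs.
from pydantic import BaseSettings


class Settings(BaseSettings):
    environment: str = "staging"             # e2e | staging | prod
    database_url: str = "postgresql://app:app@localhost:5432/app"
    s3_bucket: str = "app-staging-media"
    search_service_url: str = "http://search.staging.svc.cluster.local"

    class Config:
        env_prefix = "APP_"                  # APP_DATABASE_URL, APP_S3_BUCKET, ...


settings = Settings()                        # values come from the pod's env / helm values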

Apart from that, we have several repositories, some of them independent but some dependent on each other, as in the case of shared functionality. You know what happens when you update shared functionality without immediately redeploying and testing all the dependent apps? A few days later, when you have completely forgotten about it, you make some innocent, purely cosmetic change somewhere else in a dependent repo, which leads to a redeployment that pulls the latest dependency. Surely, during the important demo right after it, you will see some stupid error caused by a compatibility gap in the single condition “that can't possibly happen” which you forgot to double-check.

  1. So the first important consideration for splitting environments: automate the rollout of all dependent applications if some base repo is updated. You may ask the team to do it, and everyone agrees – and then forgets to run the pull.
  2. Second aspect – what do we actually need to deploy? Do we need to maintain all apps in every environment, including temporary jobs responsible for sending emails or notifications? It seems some kind of flag to include or exclude a job from a deployment might be helpful.
  3. E2E and, later, probably Staging do not necessarily have to be reachable by everyone on the internet.
  4. Promoting a new release to E2E and Staging has to be automated.
  5. Promoting a new release to Prod, at least for now, is better kept controlled and manual.

Currently we have three environments, which fulfil all the requirements above:

  • E2E – the environment where integration tests run on curated data to ensure base functionality is still in place
  • Staging – where core development happens and where beta testers can try to break what we build
  • Prod – which is happy to greet new users

The Kubernetes cluster is still a single one – everything is split at the namespace level. It is a similar story with RDS, where several databases co-live in one RDS instance.

On the automation side of mobile testing, the choice is not really big. The first thing you have to decide: are you going to use a device-in-the-cloud provider or run tests yourself? Sure, you can plug a smartphone into a laptop and run tests, but wouldn't it be nice (and right!) if CI did it instead? When you start considering vendors that provide emulators and real devices to play with, you will find that the choice of testing frameworks for mobile is not really wide either – and that is the second choice you have to make (and the choice of provider might limit you here). Another important consideration is whether there are any specific hardware requirements – e.g. a GPU or NPU; we had none, hence any emulator was sufficient for us.

We identified two main options for a mobile e2e testing framework – Flutter integration tests and Appium-based pytest. Firebase Test Lab supports Flutter integration tests, although it required some tweaking to allow requests from their IP ranges (VMs with running emulators) to reach our E2E API. Appium, with its Python API, was very promising: using something like TestProject (you guys rock!) you can record all the clicks through the application per scenario – so it doesn't require specific programming knowledge (but lets you gradually learn it along the way). So far Appium is much more comprehensive in our setup in terms of scenario coverage.
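For a flavour of the Appium-based pytest side, here is a very rough sketch (written against the older Appium-Python-Client 1.x style, where Remote still accepts a capabilities dict – newer releases use typed Options objects; the element IDs and the flow are purely illustrative):

# Very rough shape of one Appium-based scenario.
import pytest
from appium import webdriver

CAPS = {
    "platformName": "Android",
    "automationName": "UiAutomator2",
    "app": "/builds/app-debug.apk",          # debug build uploaded to the device farm
}


@pytest.fixture()
def driver():
    drv = webdriver.Remote("http://localhost:4723/wd/hub", CAPS)
    yield drv
    drv.quit()


def test_user_can_open_booking_screen(driver):
    driver.find_element("accessibility id", "login_button").click()
    driver.find_element("accessibility id", "bookings_tab").click()
    assert driver.find_element("accessibility id", "bookings_list").is_displayed()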

E2E tests have one tiny (nope!) issue – the cold start of the app in an emulator is not very fast; if we add on top of that the time needed to build the app and to copy the debug build to the provider, it becomes a real bottleneck to moving fast.

So far we are experimenting with running them twice a day – but let's see how it goes.

What’s next?

Many interesting tasks are still on our todo list:

  • on the infra side – performance testing, security testing, trying out Flutter for web
  • on the development side – serving and updating ML models for the recommendation engine and for predicting cleaning duration, building a cache with feature vectors for recommendations, and a mix of optimisation problems for the matching engine: job scheduling and game theory, …

And most importantly – nothing can replace real-world usage. Many crazy things you will only be able to see once you start collecting real data about user behaviour – so we are looking forward to the upcoming launch!

Machine Learning at the edge

A few learnings from migrating a computer vision pipeline from a Python prototype on AWS to C++ on smartphones

One of the projects we have been working on is Smart Mirror – we want to build a technology that hints to the user how to look better in a photo.

It has a bunch of fascinating tasks to tackle – from the philosophical (what is “better”, and what does this “better” mean for a particular user?) to more down-to-earth issues like scale-independent facial expression comparison between frames. So we decided to split it into several stages to roll out one by one, and we started with an application that shows real-time hints on how to look similar to the best photo the user has chosen themselves.

Currently, under the hood, our computer vision pipeline has 10 neural networks in place and can chew through 10-20 frames per second, depending on the smartphone's hardware.

In this article I want to share not only what we are doing at Syzygy AI, but mainly to highlight our journey of bringing machine learning from the cloud to smartphones.

Data Science/AI is still a hot buzzword to attract attention to what you are doing, but statistics show that the majority of prototypes never reach production. Maybe something dangerous is lurking behind the curtains of the golden cage of Jupyter notebooks?

When we started exploring the idea, we started simple – let's run an MVP on a laptop and use the webcam to grab video frames to play with!

We need real-time processing – speed, speed, speed – so let's use our skills in low-level programming and do everything in C++ and maybe CUDA!

There are mature libraries you can utilize to plug in the required algorithms. And there are a lot of pre-trained models (pay attention to the license!) that can be combined relatively easily to sketch a prototype. If the thing you want to do has already been done, you just run a face detector, and after that a face landmark detector, and then …

A couple of weeks later it is time to share the results with stakeholders – an early feedback loop is important. So how can they check it out? Well, that's easy – clone the code, make sure you have cmake and the required dependencies installed…

You are asking what “clone” is? Do you have an NVIDIA GPU card? No, we can't prepare an app for the iPad yet. Ummm, yeah, let us think about something else!

We measured the frame processing time, and the biggest contribution to the duration was the inference time of the models (and most likely the IO to and from the GPU) – hence the pipeline was rewritten in Python for the sake of development speed. Cloud providers are happy to share a VM with a powerful GPU – just pay the money, it is easy! – so we no longer depend on the user's hardware. Let's prepare a simplistic frontend and expose the UI via a web page (but still work with the web camera).

It was a significant breakthrough for shaping the initial idea and, along the way, it revealed two issues:

1. Usually, ready-to-use models perform their best when the data on which you run inference is similar to what was used for training. Studio-quality portraits of white males in their 40s with beards might not be the best training set for a use case where your target audience is teenage girls taking selfies under non-uniform lighting conditions. If what you want to do is not a standard cat-or-dog classifier, you'd better be aware of what ML is really about – digging deep into papers and being able to experiment fast.

2. When your software runs in the cloud, it can bring all kinds of surprises from the palette of distributed-system challenges – from CPU steal time to client-side data buffering before you flush it to the socket. And of course, you have to set up compatible versions of drivers and libcuda*.so libraries so your Python code will work. Ah, yeah, and the hardware there might be a bit outdated in terms of compute capabilities compared with recent consumer video cards. As a result, the classic “it works on my machine” dilemma might contradict the actual user experience in the cloud.

As for the first point, the solution was obvious (not simple though!): we need data. Specific data. So let's crawl! With our own custom annotation – thanks to CVAT – you can go crazy there.

But the second point poses a fair question – what options do we have with regard to productizing our solution? Or, to put it simply – is there any alternative to a complex and expensive client-server system with meaty GPU servers?

In the end, what we want is to run dozens of neural networks with different architectures (a few examples of the tasks we are solving: facial landmark detection, head pose estimation, segmentation, lighting condition estimation) and classical image manipulations (resizing, cropping, filtering), with close-to-real-time requirements.

It was time to look around – what do we actually have inside modern smartphones? Can we offload some work to them? At least to decrease the network IO?

The first bit that grabbed our attention – a comparison of performance between Intel i5 and Snapdragon-backed laptops: tl;dr – Qualcomm was surprisingly fast.

The second – AI benchmarks for mobile hardware. At a brief glance it is challenging to understand how to treat those numbers, but the very fact that there is a benchmark for mobile was quite intriguing!

Hence we dived deeper.

Let me include an exact extract from one of our Jira tickets from 2020:

Mystery of TFLOPs: surprising power in our pockets

Snapdragon is not a single CPU but a SoC – system on chip – i.e. it contains various components.

Snapdragon 865

  • CPU: Kryo 585, up to 2.84 GHz, 64-bit, instruction set ARMv8-A. NOTE: derivative of ARM’s Cortex-A77 and Cortex-A55
  • GPU: Adreno 650, OpenCL 2.0 FP, Vulkan 1.1, OpenGL ES 3.2, DX12
  • DSP: Hexagon 698
  • RAM: LPDDR5, speed: 2750MHz

Apple:

  • 2019 A13: architecture – A64 – ARMv8.4-A, six-core CPU, 2 cores at 2.65 GHz + 4 energy-efficient cores
  • 2020 A14: architecture – A64 – ARMv8-A, six-core CPU, 2 up to 3.1 GHz + 4 up to 1.8 GHz, LPDDR4X

Huawei:

  • end of 2019: Kirin 990: 4 ARM Cortex-A76 + 4 ARM Cortex-A55 Cores, Mali-G76 M16
  • 2020: Kirin 9000: 4 ARM Cortex-A77 up to 3.13 GHz + 4 ARM Cortex-A55 Cores up to 2.05 GHz, Mali-G78 MP24

Samsung:

  • Q4 2019: Exynos 980: 2 ARM Cortex-A77 + 6 ARM Cortex-A55 Cores, LPDDR4
  • 2020: Exynos 990: 2 custom up to 2.73 GHz + 2 ARM Cortex-A77 up to 2.5 GHz + 4 ARM Cortex-A55 Cores up to 2 GHz, Mali-G77 MP11, LPDDR5X, dedicated Dual core NPU

But Cortex-A* itself is just an instruction set that is implemented by a particular CPU. GPU – that is also clear, but what the hell is a DSP/NPU/TPU/VPU?!

There is a class of neural networks – convolutional ones – that rely on heavy matrix multiplication operations and are often used for tasks related to computer vision. If a matrix is not two-dimensional but three-dimensional or higher – a cube or beyond – it is called a tensor.

As many tasks on devices are related to images, manufacturers decided to create specialized compute units ~ ASICs – to perform those operations efficiently.

Regarding the difference between a GPU and a TPU:

A TPU is very limited in terms of supported operations (and, as we learned later, actually in input size too) – its memory organization is very simple, while a GPU has a much larger set of supported commands and a memory hierarchy with caches. Modern higher-end smartphones usually have dedicated NPU units, so it is worth explicitly listing the corresponding capabilities of top-tier devices (as of 2020).

Kirin: the NPU is split into two compute parts: 1 Lite + 1 Tiny

  • Da Vinci Lite features 3D Cube Tensor Computing Engine (2048 FP16 MACs + 4096 INT8 MACs), Vector unit (1024bit INT8/FP16/FP32)
  • Da Vinci Tiny features 3D Cube Tensor Computing Engine (256 FP16 MACs + 512 INT8 MACs), Vector unit (256bit INT8/FP16/FP32)

Snapdragon 865: 15 TOPS

For comparison (*) – characteristics of accelerators used for AI tasks:

  • Google Coral: 4 TOPS
  • JETSON Xavier NX: 6 TFLOPS (FP16) & 21 TOPS (INT8)
  • JETSON Xavier AGX: 5.5-11 TFLOPS (FP16) 20-32 TOPS (INT8)

(*) Nice numbers! But what do they really mean?

In computing, the type of data you operate on (8-bit integer or 32-bit float) really matters – you might have hardware tuned to perform operations on a specific data type very fast. Or not.

Hence a bit of terminology is necessary:

  • TOPS – tera operations per second ~ throughput.
  • MAC – multiply-accumulate operation.
  • FMA (fused multiply-add) – most modern hardware architectures use this instruction for operations on tensors. FMA computes a*x + b as one operation. Roughly, GMACs = 0.5 * GFLOPs.

Bonus point – you can't directly compare those numbers across different hardware, partly because of their synthetic nature; there are other factors that drastically affect performance, related to copying data back and forth and to the capabilities of the other hardware that pre- or post-processes that data.

As in many such cases, it was a good reason to just try it – so we prepared a Raspberry Pi with a Google Coral, and the fun started.

ML frameworks for mobile – is there really a choice?

First of all, let's distinguish between two use cases:

  • inference – when you have a model and want to run it on a device
  • training – when you do not have a model (but want to have one!)

The majority of mobile ML frameworks don't support training on device.

There is a wide variety of devices out there – read: different hardware. Top-tier devices have an NPU and a GPU, while the low end sometimes can rely on the CPU only. The presence of the necessary hardware doesn't necessarily mean that the software layer can properly utilize it. I.e. it is YOUR duty to make sure the app will run, one way or another, on the user's device.

There are a number of vendor-specific libraries and SDKs that can help you accelerate specific computations on vendor-specific hardware. With one tiny caveat – they all work in the following fashion:

you create a model -> then convert it to the framework-specific format -> then you deploy on the device a framework-specific runtime that knows how to squeeze extra operations out of the supported hardware underneath.

Obviously, you usually want your app installed on as many devices as possible, instead of shipping a list of supported processors.

One possible workaround here – prepare a set of models tailored for different devices, so when the app starts you can ask the existential question “Where am I running and what do I have access to?”, pull from your model zoo whatever is the best fit, and fall back to CPU-only in case no optimized version is available.

There is an open-source solution – TVM – and proprietary ones that try to do exactly that: take your model and optimize it for different hardware.

Sometimes it is not even necessary to take that thorny route – if what you want to do via ML is semi-standard, you can probably try your luck with semi-ready solutions from the industry: ML Kit, Create ML, MediaPipe. Some of them can serve your models via an API, and you can also try to train them by submitting labelled data: Firebase ML, Lobe.ai, MakeML.

If you are keen to explore what options you have for a vendor-agnostic setup that runs completely offline, the choice is really not that big.

What is the golden rule of choosing a tech stack for an important project? Use things you are familiar with! That's how we ended up starting with TensorFlow Lite.

Dark art of model conversions

First things first – you can't just run your Keras model in TF Lite – you have to convert it to the tflite format.

And guess what? After conversion it will be another model. With different precision and recall.

Because what the converter actually does is take the computational graph of the network, try to throw away redundant operations, and replace operations with ones supported by the TFLite interpreter.
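For reference, the conversion call itself is only a few lines – the interesting (and painful) part is what it does to the graph. A minimal sketch, with an illustrative model path and the most common flags:

# Minimal shape of the Keras -> TFLite conversion step.
import tensorflow as tf

model = tf.keras.models.load_model("face_landmarks.h5")      # illustrative path

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]          # enables post-training quantization
# If the graph contains ops TFLite can't express, you can fall back to TF ops,
# at the cost of a bigger runtime:
# converter.target_spec.supported_ops = [
#     tf.lite.OpsSet.TFLITE_BUILTINS,
#     tf.lite.OpsSet.SELECT_TF_OPS,
# ]
tflite_model = converter.convert()

with open("face_landmarks.tflite", "wb") as f:
    f.write(tflite_model)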

The term “supported operations” is an ambiguous one – do you remember, somewhere above I mentioned that hardware may support an operation while the software does not? Well, the opposite may also happen (but we will get to it in a minute)!

In theory the conversion should be straightforward, but in practice, if your model is somewhat advanced, you may need to drop the first or last layers if they are not supported, or dive into model refactoring to help the converter not screw up your model's characteristics too much.

When the model is finally converted, nothing is actually finished – you now need to make sure that:

  • it can run on a real device
  • it produces an outcome more or less close to what your Keras model did

Actually, you do have the option of running the TF Lite interpreter on your laptop/desktop; however, it is not optimized for that case at all, and the inference time might contribute heavily to your procrastination – but you will be able to get a first grasp of the actual quality of inference.

Another false assumption we had was that we could test the converted model using the available single-board computers (read: Raspberry Pi) and a TPU that supports TensorFlow Lite (read: Google Coral). Initially the idea was promising – a friendly Linux system is up and running, a compatible CPU architecture – ARMv8. But the characteristics of that CPU are much lower than what a modern phone has: fewer cores, no mobile GPU, and the performance of the TPU unit seems to be overkill in comparison even with the best ordinary GPU – i.e. we couldn't reasonably assess what we can or cannot do on the real clients.

But the main thing – and the lesson we learned the hard way – is that when something runs on a Raspberry Pi with Coral, it doesn't necessarily mean it will run on smartphones.

Damn, okay, I have Python and C++ code, a TF Lite model and two devices with Android and iOS – what's next?

This bit might be surprisingly easy (*) – you just need to build the corresponding bench tool from the TensorFlow source tree:

(*) easy, if you are keen to get familiar with Bazel as the go-to tool for building a monorepo, like the flexibility of adb, and do not mind tweaking the Xcode project – for example, to exclude the simulator from targets.

I am not sure I find the iOS tool very useful, as it works as a black box – it does not allow you to specify where exactly you want to run your model. But a useful outcome of that exercise was learning that on iOS you do not run the tflite model directly – after the app starts, the model gets converted to the CoreML format, and that model is used for inference.

On the other hand, with the Android bench tool you can directly explore what will happen if you try to run your model on the NPU/GPU/CPU.

And it will reveal the sad truth:

ERROR: Attempting to use a delegate that only supports static-sized tensors with a graph that has dynamic-sized tensors.

or

ERROR: NN API returned error ANEURALNETWORKS_OP_FAILED at line 3779 while completing NNAPI compilation.
ERROR: Node number 536 (TfLiteNnapiDelegate) failed to prepare.

or else you can find something more specific in logcat:

07-17 15:44:42.210 14380 15006 I ExecutionPlan: Device qti-dsp can't do operation MAX_POOL_2D

In some cases it can be even worse – i.e. some operations are supported on the NPU, some on the GPU, and the remaining ones only on the CPU.

INFO: CoreML delegate: 41 nodes delegated out of 315 nodes, with 46 partitions.

From my past experience working with GPUs, and more recently with Spark and distributed computing, I remember one crucial thing – IO time (to transfer data to the accelerator/compute node) can cancel out the gains and drastically decrease the performance of such a computation.

I.e. it might be faster to run everything on the CPU than to run a portion of the computations on the CPU and another portion on the GPU.

Another interesting bit is the “delegate”: let's say your network has some operation – e.g. ReLU. TF Lite can try to use its own code to express it via simple math operations, or, if you are lucky (?), you can delegate those computations to an optimized library that knows how to use the processor's instructions to increase performance. So you have the NNAPI delegate for Android and the CoreML delegate for iOS, which try to choose the best hardware for your operations, or dedicated delegates for GPUs or CPUs.

To overcome conversion-compatibility hiccups there is no straight path to success – it is really a matter of trial and error: when you use a different format (SavedModel – or h5 – or pb, i.e. just a protocol buffer), the resulting model might be different (very!).

I have heard funny stories where people had to run a chain of conversions to get a compatible operation set – PyTorch -> ONNX -> CoreML – and specify a particular version of the opset, or do amazing things like rewriting an operation from a - b into a + (-b) to make the model run successfully.

In our case we had to give up on the idea of using the NPU for computations – first of all because of operation compatibility issues – for example, 2D max pooling in NNAPI works only with an input tensor of rank 4, and the data layout has to be the one that the NNAPI implementation expects.

Hence you can't prepare a single model and be sure that it will run on all devices. The effort required to tune a model for a particular NPU does not (yet?) outweigh its possible speed benefit. The GPU, though, showed itself to be quite an impressive option – in our benchmarks, on average, we observed a ~5x speedup in comparison with the CPU. Now the majority of our networks run successfully on the GPU (well, at the moment we do not have bug reports related to it!).

But when a model runs, it doesn't necessarily mean it is going to work!

All those delegates have to support the operations in your network and, with all the optimizations in place, must return similar results – otherwise the accumulated deviation in precision can lead to a wrong class label at the end – and that was exactly the case that bit us along the journey.

Which implies that we should not just run the model on a device with random noise as input to test compatibility and performance, but run it on data from the test set and compare the results.
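A minimal desktop-side sketch of that check, before moving anything to a device: push the same curated test samples through the original Keras model and the converted TFLite model and look at the worst-case deviation (paths are illustrative):

# Compare the converted TFLite model against the original Keras model on real test data.
import numpy as np
import tensorflow as tf

keras_model = tf.keras.models.load_model("face_landmarks.h5")
interpreter = tf.lite.Interpreter(model_path="face_landmarks.tflite")
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()[0]
output_details = interpreter.get_output_details()[0]

test_batch = np.load("test_images.npy").astype(np.float32)    # curated test set, not random noise

max_diff = 0.0
for sample in test_batch:
    sample = np.expand_dims(sample, axis=0)
    reference = keras_model.predict(sample, verbose=0)
    interpreter.set_tensor(input_details["index"], sample)
    interpreter.invoke()
    converted = interpreter.get_tensor(output_details["index"])
    max_diff = max(max_diff, float(np.abs(reference - converted).max()))

print(f"max deviation between Keras and TFLite outputs: {max_diff:.6f}")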

And in order to do that, a more interactive application can be quite handy.

But how do you actually run it? I.e. you have business logic in Java/Kotlin or Swift/Objective-C, or maybe Flutter/React Native, on the client – and there is Python code with the actual pipeline, which is a bit more than just “run this model”.

Our approach was to embed everything into an old(?) good(?) C++ library – i.e. a single code base, a single entry point for clients to interact with the ML functionality.

A single code base was a good idea in theory, but in reality it led to a few consequences:

  • you have to compile it for the target arch – welcome to the world of cross-compilation!
  • in order to use it on a device, you also have to link it with all the necessary dependencies – i.e. TensorFlow Lite and OpenCV and their 3rd-party dependencies!

Did I mention that in the TensorFlow repo Bazel is a first-class citizen and old(?) good(?) CMake is not so well supported, so we had to tweak TensorFlow itself a bit to make it happen (and sadly that was not everything that was necessary – we didn't get a chance to prepare the remaining bits)?

After some fun with Docker, we prepared an image with everything necessary, so it could be used in our CI for tests.

And then we started paying attention to the actual benchmark results and the model size. Usually they are quite related.

For sure you are aware of TensorRT, which can be handy for model optimization on CUDA devices; for the edge you can still utilize it to prepare a model before conversion and apply similar tweaks for model optimization:

  • Quantization
  • Model Pruning
  • Knowledge Distillation

What was a real breakthrough for us was quantization-aware training, which helped narrow the model size down from 150 MB to ~1 MB without noticeable accuracy loss.
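The rough shape of that step with the TensorFlow Model Optimization toolkit is sketched below. The model path, the dummy dataset and the hyper-parameters are placeholders, and quantize_model does not support every layer type, so a real model may need per-layer annotations:

# Sketch of quantization-aware training followed by TFLite conversion.
import tensorflow as tf
import tensorflow_model_optimization as tfmot

base_model = tf.keras.models.load_model("face_landmarks.h5")

# placeholder data just to show the flow; real training uses the curated dataset
train_ds = tf.data.Dataset.from_tensor_slices(
    (tf.random.uniform((8, 128, 128, 3)), tf.random.uniform((8, 10)))
).batch(4)

q_aware_model = tfmot.quantization.keras.quantize_model(base_model)
q_aware_model.compile(optimizer="adam", loss="mse")
q_aware_model.fit(train_ds, epochs=3)            # fine-tune with fake-quant nodes in the graph

converter = tf.lite.TFLiteConverter.from_keras_model(q_aware_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
with open("face_landmarks_int8.tflite", "wb") as f:
    f.write(converter.convert())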

While we are still arguing with the UX designers about the user flow through the app, we haven't touched the topic of on-device model protection (encryption, obfuscation) – definitely something that cloud solutions don't have to deal with. However, given everything written above – people who can figure out, from the structure of the model and the disassembled C++ code inside the mobile app package, how it should be used, should probably just come work with us?

Practical performance tuning as a never-ending journey to widen knowledge

The golden age of programmers who could fit whole universes of legendary games into the tiny RAM of the first gaming consoles passed a few decades ago. Now your favourite browser can easily swallow gigabytes of memory to render a single web page with myriads of annoying ads that ad-blockers try to defeat. The relative abundance of computing power grants the programmer community the privilege of not knowing what is happening under the hood of their beloved frameworks and of focusing more on the business domain. A convoluted web of multi-level abstractions, advanced garbage collection, ready-to-plug libraries polished by industry leaders, infrastructure “à la carte” in the clouds, a diversity of languages with various flavours of syntactic sugar – everything is tuned towards the holy aim of decreasing time to market. Need more power? Vertical scaling is in many cases the cheapest and fastest way to squeeze out more operations per second. But what if we still need more?

This is by no means a comprehensive guide or a blueprint for performance tuning – just a couple of thoughts and ideas to inspire thinking outside the box and to broaden your horizons, which is, from my perspective, the most crucial skill needed to tackle performance issues.

Let's talk performance and constraints!

As a warm-up, let's start with a simple and artificial task that I recently read about in Programming Pearls. Apart from brilliant, pragmatic ideas for software development, this book contains a number of exercises for fun and self-education; one of them can be used as a great illustration of several important aspects that we should take into account when we talk about performance.

Problem statement: imagine you have a file where every line contains an integer. The numbers are not sorted. We know for sure that the file contains every number in the range [INT_MIN, INT_MAX] except exactly one. Our task is to find that missing number.

  • INPUT: a file with 2^32 - 1 numbers – every value in [INT_MIN, INT_MAX] except one
  • OUTPUT: int, the missing number

Sounds simple, right? Just sort and play with binary search.

  • Runtime complexity: O(N log N) sorting + O(log N) binary search
  • Space: O(N)

Let's say we rely on C++ and use the good old x86 architecture, where each int has a size of 4 bytes, and the total number of unique numbers is a bit above 4 billion – 4,294,967,295. Suppose we know in advance all the black magic happening behind memory allocation on our system and can handle it properly, without going too crazy. If we want to read all the numbers into memory, it becomes costly – for the numbers alone, without any overhead, it will require over 16 GB of RAM. This looks a bit discouraging.

Okay, we are very experienced developers and know about out-of-core algorithms – merge sort, for example, can help if we are ready for several iterations of loading a chunk of records, sorting it and saving it into a temporary file. But what do we do next? Hmm, we can merge them later into a single file that contains all the sorted numbers. We know the whole range exactly, so we can iterate over the file of sorted numbers and compare each line number with the actual number on that line (with the necessary adjustments for the negative integers in the first half of the entries). Let's say we can afford 5 GB of RAM; in this case we need 4 passes to sort the numbers in chunks, we can merge them in linear time, and after that sequentially read the whole file. In theory it is still

O(N log N) for sorting + O(N) for merging + O(N) for the sequential search.

But if we talk about real performance, our big-O asymptotics will be heavily smashed by the reality of multiple heavy I/O operations. Due to the memory constraints we most likely do not have spare RAM disks available. For sure, we know those numbers that every programmer should know. We are also aware of why OSes are not so fast when working with block devices – several obvious parts of this equation are the actual filesystem and the chosen device type. Let's assume we use a modern operating system where buffered I/O is available behind the fwrite/fstream interfaces.

Would it be even better to use binary search with fseek, or to jump through an mmap-ed file? But those expect an offset in bytes, and we have a line number. Probably we can figure out a proper formula to derive the offset from the line number, additionally checking whether the previous symbol is a carriage return? Or, even better, use a binary format for the intermediate files, with a fixed size for every record – equal to 4 bytes? Should we go with more tailored methods like interpolation search, since our keys are numbers?

What if we do not have 5 GB? And our hard limit is around 1 MB? Sorting and merging chunks becomes crazy slow. But do we actually need to sort the full array? What if we partition our data using additional files as buckets – i.e. if an entry is less than the current pivot, it goes to the left file, otherwise to the right – and at the next iteration work only with the smaller file? For the pivot we choose the median element of the current range and do not add it to either file, to deal with an odd total number of elements. Still a noticeable amount of I/O though – a huge factor that smashes all asymptotics against the harsh reality of long-running operations.

Let's start thinking all over again – how can we find the missing number without enumerating them all?

Let's recap:

  • we are dealing with a huge number of distinct integers in a non-sorted sequence
  • we do not need to preserve the original order
  • we do not have any a priori knowledge about the expected permutation. If we say that the distance is the absolute value of the difference between a line's number and the value of the integer residing on that line, is there any particular distribution of distances, or is the whole sorted array just shifted a bit?

On the other hand, there are only 11 distinct symbols – 10 digits and an optional sign – that form the alphabet for representing our words, the numbers. It can be defined as some sort of regular grammar. We also have boundaries on the possible values, which makes the definition of the corresponding regex a bit less elegant; moreover, it doesn't help us identify the missing entry.

Once again, we can't fit the array in memory, can we? Let's rephrase – we can't fit all the numbers in memory using the language's built-in datatypes. What about a more compact representation? Protobuf uses variable-length encoding, which can decrease the size of small integers – i.e. we do not need the whole 4 bytes for something that fits in a single byte – not too helpful in our case. Should we check lossless compression algorithms? A naive RLE-based approach will be more memory-hungry if we use std::string, which may have compiler-specific memory overhead; it is not as severe as in JVM-based languages, especially pre-1.8, but still noticeable. Given our fixed range of numbers, the percentage of entries with 3 or more repeated digits is less than 10% – not high enough to justify the string overhead. What about more advanced methods? Deflate may be a good general-purpose algorithm, but we can try something tailored specifically for integers! This implementation, for example, promises that it can decrease requirements from 4 bytes down to 5 bits per number (let's forget for a moment that it requires most numbers within the array to be small). Even if it worked for arbitrary integers, it would still require above 2.68 GB plus the overhead of compressing and decompressing. Additionally, compression is usually not stream-friendly – i.e. we can't provide the complete array as input and have to read the data in chunks, feeding the content of a buffer into the compression routine in batches, which in turn makes the compression less efficient. In the end, it would not be easy to iterate through the compressed array, and random access by index would not be available. Seems not very practical in our case.

If we recall the low-level representation of integers – there are 4 bytes per number, with special care for the sign – what if we use a similar notation to depict which numbers are present? Imagine a long string of bits, where the i-th bit is set if number i is present in the input array – so we can read number after number from the file, set the corresponding bits, and later scan our bit string to find the position of the unset bit. This severely relaxes the memory requirements – we would need (approximately, and implementation-dependent) 4 * ((string length + 31) / 32) bytes ~ 500+ MB. It is still big; moreover, if we try to use std::bitset based on hello-world-like examples, we will most likely end up with a segfault from this “innocent” line, even if we have an abundance of RAM:

std::bitset <100000000> w;

Why? Generally, the memory of a computer is split between kernel and user space, and the structure of the memory of a particular process residing in user space depends on the type of the executable. Within the process's memory, a dedicated portion is allocated at startup for the stack, to keep the return addresses of functions (within the call stack, during program execution) and other stuff like local variables, arguments, etc. The size of this area is usually restricted by the OS – the limit on the process's stack size. Okay, so we will allocate it on the heap. What about the asymptotic assessment? In theory it is O(N) runtime complexity and still O(N) space, but in practice we were able to make the hidden coefficient small enough to significantly reduce the actual size.

But even the runtime complexity is not as straightforward as you may think. Let's forget for a moment about the overhead of reading data from the file and converting strings to integers, suppose we have a decent workstation that can fit everything into RAM without hiccups, and start from there. What we are doing is actually a bitmap sort, which indeed has linear runtime complexity. But what about the actual performance in comparison with comparison-based algorithms? It seems that the memory hierarchy can still hit us in terms of real performance, due to memory access patterns that lead to cache misses and do not utilise branch prediction, widening the gap between theoretical complexity and actual execution time.
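For illustration, here is the same bitmap idea sketched in Python (in C++ you would put the buffer on the heap, e.g. in a std::vector, for exactly the stack-size reason above):

# One bit per possible int value, allocated on the heap (~512 MB), then one scan for the hole.
INT_MIN, INT_MAX = -2**31, 2**31 - 1
SIZE = 2**32                                   # one bit per value in [INT_MIN, INT_MAX]


def find_missing(path):
    bits = bytearray(SIZE // 8)                # ~512 MB, zero-initialised
    with open(path) as f:
        for line in f:
            idx = int(line) - INT_MIN          # shift into [0, 2**32)
            bits[idx >> 3] |= 1 << (idx & 7)
    for idx in range(SIZE):
        if not (bits[idx >> 3] & (1 << (idx & 7))):
            return idx + INT_MIN
    raise ValueError("no number is missing")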

All great, but 500+ MB is still too much for us. Hmm, what if all the numbers within our boundaries were present? Then we could easily use just two numbers to describe the range: [INT_MIN, INT_MAX]. And since exactly one number is missing, we need just one more variable to reflect that: [range_start, range_end], missing. Now, what about finding that missing number? What if we sum all the numbers within the range and subtract from it the actual sum of the numbers from the file? Runtime complexity is linear – one pass over the range plus the summation of all the numbers from the file – and we need just two auxiliary variables to store the results of the two sums – i.e. finally O(1), constant space! But, yeah, those variables… What type should they be? If the entries in the file happen to be in the order [INT_MAX, INT_MAX-1, …] and we try to sum the first two, what do we get? In this particular case, relying on a long long int with a width of 64 bits is sufficient to avoid overflow. But what would we do if we had to deal with bigger numbers? In that case we could either stick with compiler-specific extensions of higher capacity or pull into our tiny project a library that has types meeting our requirements.

Alternatively, what about utilising some bit-twiddling hacks for the greater good? If we XOR a number with itself, we get 0 – i.e. the two occurrences cancel each other out. So if we XOR all the numbers from the file first, and after that XOR the result with every number in the full range, everything cancels out except exactly our missing number! Literally one pass over the range, O(N) to read all the numbers from the file, and O(1) memory – we don't even need two variables, one is enough! XOR should be even faster than addition, and there is no need to care about overflow and the related complexity, is there?

Her majesty math suggests that this particular kind of sequence has handy closed-form equations for the sum: in our case it is even simpler, as we run from the negative extremum to the positive one – the sum of the whole range is just INT_MIN (-2,147,483,648) – so we can even save N operations by not computing the range sum at all!
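Both tricks fit in a few lines; a sketch in Python (iterating over four billion values in pure Python is of course slow – the point is the idea; in C++ the sum variant needs the 64-bit accumulator discussed above):

INT_MIN, INT_MAX = -2**31, 2**31 - 1


def missing_by_sum(path):
    # the sum of the full range [INT_MIN, INT_MAX] is exactly INT_MIN (all other pairs cancel)
    total = INT_MIN
    with open(path) as f:
        for line in f:
            total -= int(line)                 # Python ints don't overflow; C++ needs a 64-bit accumulator
    return total


def missing_by_xor(path):
    acc = 0
    for value in range(INT_MIN, INT_MAX + 1):  # XOR of the full range...
        acc ^= value
    with open(path) as f:
        for line in f:                         # ...cancelled by every number present in the file
            acc ^= int(line)
    return acc                                 # what is left is the missing number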

Happiness, prosperity, work done!

Suddenly a very enthusiastic sales manager appears next to our monitor and cheerfully shares great news – he managed to sell our wonderful software, but the client asked for one tiny change: handle at most two missing numbers. His peer mentions that he is also close to making a deal, but it requires a generalised solution for k missing numbers. As we already started with a naive equation-based sum approach, we can dive into some textbooks and find out that this is a classical case for a system of equations.

In the example above we are not just fighting some abstract performance limitation: we are trying to decrease the overall duration of our task – aka wall time – by tackling the memory constraints that prevent us from using the brute-force approach. From this warm-up you can briefly glimpse the diversity of related topics (far from complete) and the number of factors that may affect the performance of a complex system.

Story time

Now let me share a few war stories. They come from different domains and tech stacks, but the common theme that unites them: during initial efforts to speed things up, people tend to pull the wrong levers.

 
int getRandomIntInRange(int min, int max) {
    int i;
    do {
        i = random.nextInt();        // random is a java.util.Random instance
    } while (i < min || i > max);    // rejection sampling over the whole int range: almost always misses
    return i;
}

Ages ago I was involved in the development of a system that monitored various metrics of devices in private networks, where the starting point for inventory was the address of a single router. Initially it supported only SPARC Solaris and was built around libthread; later it was re-targeted at the wider audience of POSIX-compatible Linux distributions (both x86 and x86_64). There were a lot of technicalities involved in maintaining various locking primitives – non-blocking read-write locks, spin locks, condition variables and barriers – as well as handling management requests through the sigaction interface. Part of the functionality was to poll various metrics and service information from newly discovered devices using the SNMP protocol. This particular component was the main target of various performance concerns. The initial vision was to develop a dedicated thread pool maintaining a task queue with device details, their metadata and the corresponding metrics to gather. Four months of development later, we ended up with a complex system that performed around 7% faster than the original, much simpler version. So, finally, full-scale profiling was initiated. Apart from a high number of interrupts and context switches, I stumbled across a tiny method that assembled SNMP requests. It was using exclusively plain SNMP GET, issuing a new request for every OID – i.e. every parameter to retrieve. There are 3 versions of the SNMP protocol, each of them bringing new features. What was handy for us: the 2nd version of SNMP introduced bulk GET requests, allowing information about several OIDs to be retrieved within a single inquiry. The module was extended to first check whether the device supports SNMP v2 and, if yes, utilise bulk requests, otherwise fall back to simple SNMP GET. That sped up processing by up to several times (depending on the number of metrics to be retrieved).
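The original code was not Python, but the shape of the fix is easy to show with a pysnmp-flavoured sketch (host, community string and OIDs are placeholders): batch several OIDs into one request instead of issuing a request per OID; SNMP v2c GETBULK takes the same idea further for tables.

from pysnmp.hlapi import (
    CommunityData, ContextData, ObjectIdentity, ObjectType,
    SnmpEngine, UdpTransportTarget, getCmd,
)

OIDS = ["1.3.6.1.2.1.1.1.0", "1.3.6.1.2.1.1.3.0", "1.3.6.1.2.1.1.5.0"]  # sysDescr, sysUpTime, sysName
engine = SnmpEngine()
community = CommunityData("public", mpModel=1)       # SNMP v2c
target = UdpTransportTarget(("192.0.2.10", 161))

# slow: one full network round trip per OID
for oid in OIDS:
    error_indication, error_status, _, var_binds = next(
        getCmd(engine, community, target, ContextData(),
               ObjectType(ObjectIdentity(oid)))
    )

# better: all OIDs packed into a single request PDU
error_indication, error_status, _, var_binds = next(
    getCmd(engine, community, target, ContextData(),
           *(ObjectType(ObjectIdentity(oid)) for oid in OIDS))
)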

    // Anti-pattern: the query has no LIMIT, so rs.all() drags the whole
    // result set over the network just to use the first row.
    void doSomethingVeryUseful() {
        /* prepared statement, binding parameters */
        ResultSet rs = dbConnection.execute(statement);
        Bla bla = null;
        for (Row r : rs.all()) { // fetches everything
            bla = new Bla(r);
            break; // ...to keep only the first row
        }
        /* some processing of the first entry of bla */
    }

During yet another fearless journey into the startup world I was trying to speed up some computer vision pipelines triggered on every new frame. I am not sure that at the time I was familiar with the formal concept of Amdahl's law, but intuitively it was clear: no matter how many GPUs you throw at your task, a single consolidation point will leave all your compute resources idle in the presence of strong branch divergence. A lot of effort was put into GPU kernel tuning – reviewing various novel methods from ICRA and other conferences, digging into GPU threading model internals (warp vs wavefront, memory coalescing, shared memory bank conflicts) and GPU computing patterns. The harsh truth was that most of the delays came from copying data from host memory to device (GPU) memory and back over and over, instead of defining a complete pipeline (a stacked sequence of kernels executed on the GPU), minimising data transfers, and using pinned memory and DMA transfers to speed up data movement.
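A rough sketch of the idea in PyTorch (not the original pipeline; it assumes a CUDA device is available): copy the frame to the GPU once, keep every stage on the device, and bring back only the final, small result.

    import torch

    pipeline = torch.nn.Sequential(
        torch.nn.Conv2d(3, 8, 3), torch.nn.ReLU(), torch.nn.Conv2d(8, 1, 3),
    ).cuda()

    frame = torch.randn(1, 3, 1080, 1920).pin_memory()   # pinned host memory -> DMA-friendly copy
    gpu_frame = frame.to('cuda', non_blocking=True)      # one host-to-device transfer

    with torch.no_grad():
        result = pipeline(gpu_frame).mean().item()       # whole pipeline stays on the GPU,
                                                         # only a scalar comes back to the host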

// Anti-pattern: "select all" with no LIMIT, then trimming the list in application
// code – the database has already shipped every row over the wire.
List<Something> getTopRows(int limit) {
    /* go to the db and run select all */
    List<Something> list = rs.all();
    while (list.size() > limit) {
        list.remove(list.size() - 1);
    }
    return list;
}

An over-excited young performance tuner may think about esoteric algorithms or data structures, but in many cases the sad truth is simply that code quality is not that high: redundant abstractions, an abundance of dead code hanging around in RAM, reliance on anti-patterns for the particular language or technology, or just prototype code pushed into production.

Another cautionary tale happened at yet another company, which tasked me with speeding up a method backing one of the core REST endpoints and preventing smooth loading of the main web page and mobile apps. The initial suspicion was a lack of proper indexes in Postgres, but I was no longer so naive, so I started with profiling. From my observations, every request led to a burst of reads from disk, followed by high CPU usage. The initial idea was to simply offload it into Redis using a cache-aside strategy with a forced warm-up during startup. Diving into the code to find out what was actually happening:

  • we read some text data with attached labels and weights from the db
  • we pre-process it and build a feature vector for every piece of text
  • the whole ML model is rebuilt from those intermediate data representations
  • the request payload is pre-processed and forwarded to that model
  • the model assigns priorities to a list of goods
  • the backend returns the top k entries from that list to the client, and the result set has to be kept around for server-side paging

Further discussion with stakeholders led to the conclusion that the model could actually be rebuilt once a week, i.e. it is fine to lose some precision by not taking into account entries updated in the last few days. So we ended up with an additional serving layer for the model residing in Redis, rebuilt once a week during off-hours.

Based on analysis of user behaviour – the majority doesn't click through too many pages – the first 7 pages of the result set were cached as well, using TLRU as the cache eviction policy. When a user wants to see results at page 8 or further, we issue a new search request.
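A minimal cache-aside sketch with redis-py, assuming hypothetical key names, TTL and an expensive run_model_and_rank helper (none of which are from the original system):

    import json
    import redis

    r = redis.Redis()

    def get_ranked_page(query, page, ttl=3600):
        key = 'ranking:%s:%d' % (query, page)
        cached = r.get(key)
        if cached is not None:                      # cache hit: skip the expensive path
            return json.loads(cached)
        result = run_model_and_rank(query, page)    # hypothetical expensive computation
        r.setex(key, ttl, json.dumps(result))       # populate the cache on miss
        return result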

yaroslav [12:37 PM]
Dmitry, the situation makes me a little nervous. 
Have any guess about revision in which tests were passed well?
None master, none "some test fixes".
It's just insane.

Some of my favorite tasks are related to slow databases. Cross-joins, db or table locks, poor choice of partition keys, rebuilding aggregates over ancient immutable data every time we run a report, flirting with ORMs that tend to increase code complexity, using redundant precision where only three distinct values are possible – there is a rich arsenal of methods to kill the performance of any db, whether it is a classical RDBMS or NoSQL, commercially supported or open source.

One company complained that during generation of a particular report timeouts were observed; it was very slow and it even seemed to affect the overall performance of the db cluster. The common belief was that the database had to be tuned for better performance and, if required, shiny new servers – with the corresponding effort of migrating all the data – could be considered. While profiling network utilisation on the client's host, I witnessed suspiciously high network I/O as well as jumps in heap usage. Browsing the code, I found out that we tried to retrieve data points using seconds as the key, while in the db's time dimension all data was stored in 30-minute buckets, every bucket containing around 100k rows. When several items fell into the same interval, we just issued independent queries, asking a single node to return hundreds of thousands of entries. After retrieving the rows we filtered them programmatically, throwing away the majority of the data. A schema adjustment (and data migration) brought peace and prosperity to that module.

But another component of the system required attention – I was told that on the write path we also seemed to have reached peak performance. I looked at the code: at first glance, consumer threads read data from Kafka, do various data-enrichment activities and synchronously write entries one by one (i.e. no bulk/batch writes). At some point, no matter how many more consumer threads were added, the rate of appended rows stayed more or less the same. However, the db didn't show any signs of overload. Maybe the driver includes client-side load balancing that backs off the rate of operations when the cluster is struggling? Maybe the host where the consumers run can't withstand a bigger load? Nope – it turned out that, apart from its actual duties, every worker also gathered some statistics, which from time to time were aggregated from all threads for further reporting in a "let's-lock-everything" fashion.
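A sketch of the kind of fix that helps here (not the original code; it leans on CPython/GIL semantics): every worker increments its own counter, and only the occasional reporter merges them, so there is no shared lock on the hot path.

    import threading
    from collections import Counter

    class Stats:
        def __init__(self):
            self._per_thread = {}                                    # thread id -> Counter

        def incr(self, name, value=1):
            counts = self._per_thread.setdefault(threading.get_ident(), Counter())
            counts[name] += value                                    # touched by one thread only

        def snapshot(self):
            total = Counter()
            for counts in list(self._per_thread.values()):           # reporter merges occasionally
                total.update(counts)
            return total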

[2000-00-00 00:00:00,000] WARN Session 0x15ab9f8e3e00000 for server 1.1.1.1/1.1.1.1:2181,
 unexpected error, closing socket connection and attempting reconnect
(org.apache.zookeeper.ClientCnxn) java.lang.OutOfMemoryError: Metaspace
Java HotSpot(TM) 64-Bit Server VM warning: Exception java.lang.OutOfMemoryError occurred
 dispatching signal SIGINT to handler- the VM may need to be forcibly terminated
[2000-00-00 00:00:00,000] INFO Partition [some_topic_name_20,14] on broker 0: No checkpointed
highwatermark is found for partition [some_topic_name_20,14] (kafka.cluster.Partition)
Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x00007f8a2c580000, 
262144, 0) failed; error='Cannot allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 262144 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /opt/kafka_2.11-0.9.0.1/hs_err_pid68543.log

And the last one: recently I have been playing with Spark ETLs that were suspected of being slow due to shuffling, and I stumbled upon a small method for date formatting, later wrapped as a user-defined function – a UDF. The ETL and the UDF were written in Python, while Spark is Scala, i.e. JVM-based. UDFs by themselves are not considered performance beasts, but in this particular case, in order to apply it, every row had to be copied from the Java process to Python with all the related (de)serialisation overhead.
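A PySpark sketch of the difference (the dataframe df and the column names are made up, not from the original job):

    from pyspark.sql import functions as F
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    # Python UDF: every row is serialised from the JVM to a Python worker and back.
    format_date_udf = udf(lambda ts: ts.strftime('%Y-%m-%d'), StringType())
    slow = df.withColumn('day', format_date_udf(F.col('event_ts')))

    # Built-in function: stays inside the JVM, no per-row Python round trip.
    fast = df.withColumn('day', F.date_format(F.col('event_ts'), 'yyyy-MM-dd'))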

// Originally in Russian – no translation can fully reflect the pain
// TLDR; one of my peers found out images were being compared byte-by-byte
yaroslav[12:05 AM]
My mum! My karma! My god! My soul!
yaroslav[12:05 AM]
Do you know how that talent of yours compares PNGs?
yaroslav[12:06 AM]
He compares the size. Not for equality, but as a ratio of 1 +- 0.025. And then he compares the BYTE-BY-BYTE difference of the content relative to the length against 1 +- 0.25.
yaroslav[12:07 AM]
If you ever meet him, shake his throat for me.

Common observation: people usually do not invest much time into learning and understanding the technologies they are trying to use, and during testing they usually only go through the happy path. There are plenty of examples of choosing the right tool for the job versus being influenced by hype without thorough stress testing (not to mention the good old motto of not updating anything in production without absolute need). Or is it just a matter of using the best language?

My philosophy is to absorb more details of how things actually work throughout the whole stack and, to deal with dead ends, read even more 🙂

There are several absolutely brilliant publications related to various aspects of performance tuning:

 

A few more war stories for inspiration on how ideas for speeding things up evolve:

Distributed systems – key concepts

Databases – how do they work under the hood? Key takeaways from the brilliant book "Designing Data-Intensive Applications" – to quickly recap the core concepts.

DB engines classifications

  • Type of load: OLTP (transaction processing) vs OLAP (data warehousing and analytics)
  • Relational vs NoSQL, document vs columnar, graph vs triple-store (semantic facts storage)
  • Even within the NoSQL camp you can find wide-column dbs – descendants of Big Table (DynamoDB, Cassandra, HBase) – vs pure key-value stores like Riak and Redis.
  • Classical RDBMS vs MPP (massive parallel processing engines) – usually relational as well but will benefit from parallel processing – Impala, Clickhouse, Presto
  • Data access patterns: database\api call – rest or RPC\asynchronous message passing using message queue or actor framework
  • Schema on read vs schema on write, which usually involve topics related to forward and backward schema compatibility – i.e. they not really need to be matched completely.

How data are stored

There are two key approaches for data storage organisation:

  • Based on a Log-Structured Merge tree (LSM) engine. Generally beneficial for range queries and friendly to sequential writes, which may offer high throughput. Its foundation is the SSTable (sorted string table) – a log-structured storage segment of key-value pairs where entries are sorted by key. Fresh writes are accumulated in memory (hence the name memtable), organised in some kind of balanced search tree (usually red-black or AVL), and from time to time flushed to disk as actual SSTable files, which are later merged during compaction. Compaction is a dedicated process that merges SSTables by combining new values for the same keys (last-update-wins) and marking removed entries with tombstones. Various strategies are available: time-based vs size-tiered vs levelled compaction (less disk space but big write amplification). A detailed overview of an LSM-based storage engine implementation in Tarantool DB – here (in Russian).
  • In B-tree based engines, segments have a fixed size – usually the page size – and the disk reads and (over)writes one page at a time (even if just one field changed), vs the append-only approach of the LSM-tree. The WAL – write-ahead log (redo log), append-only – contains every operation on the B-tree to protect against failures during write operations. Latches – lightweight locks – are used to deal with concurrent operations.

Generally, B-trees are fast for reads and LSM-trees are fast for writes but, as usual, it depends.
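A toy sketch of the LSM write path – an in-memory table is flushed to an append-only, key-sorted file once it grows too big; real engines use balanced trees, a WAL and background compaction, so this only shows the shape of the idea:

    import json

    class TinyLSM:
        def __init__(self, flush_threshold=4):
            self.memtable = {}                      # key -> value, sorted on flush
            self.flush_threshold = flush_threshold
            self.sstables = []                      # file names, newest last

        def put(self, key, value):
            self.memtable[key] = value
            if len(self.memtable) >= self.flush_threshold:
                self.flush()

        def flush(self):
            name = 'sstable-%d.json' % len(self.sstables)
            with open(name, 'w') as f:              # entries written in key order
                json.dump(sorted(self.memtable.items()), f)
            self.sstables.append(name)
            self.memtable = {}

        def get(self, key):
            if key in self.memtable:                # newest data first
                return self.memtable[key]
            for name in reversed(self.sstables):    # then newest SSTable to oldest
                with open(name) as f:
                    for k, v in json.load(f):
                        if k == key:
                            return v
            return None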

Write amplification – on single write – multiple actual write operations under the hood in db engine.

An index can be organised either as key -> reference, where the reference points to the row stored in a heap file (handy in the presence of multiple secondary indexes – all of them refer to the same reference), or as a clustered index, where the key points directly to the row without an additional level of lookups. The compromise between them is a covering index, which contains not the complete row but some of its columns. There are special kinds of indexes for dedicated data, e.g. R-trees for multidimensional indexes.

Data modelling

De-facto classical approach for data warehouses (centralised data store from multiple sources):

  • Star schema (dimensional modelling): fact tables (events) vs dimensions (SCD – slowly changing dimensions – who, where, when, etc).
  • Snowflake ~ star but dimensions are sub-divided on sub-dimensions.

Another interesting point of view: systems of record (facts, ground truth), usually normalised, vs derived data (transformed and enriched, non-normalised).

Normal forms – as corner stones of data organisation in RDBMS.

Fact tables are wide (100+ columns), but a query usually needs just a small subset of all columns. In a document database the whole document is read from disk just to get a few attributes; OLTP is row oriented – i.e. it reads the whole row – 100+ columns – and parses them. Hence, columnar data storage is beneficial for analytical kinds of usage. Additional advantages are compression (the simplest approach is to store only the ids of the unique values within a column) and vectorised processing (applying predicates to several entries with a single CPU instruction). If entries within a column are sorted, retrieval is also faster, but sorting by a second or third column will not help when we need to return complete rows. NOTE: a column-family based db engine (Cassandra/HBase) is not the same as column-oriented!

How data are spread across the nodes:

A replica is a node storing a full copy of the database. It may be a leader (master or primary) – accepting writes and possibly reads, though it usually offloads read load to followers (slaves), which receive the change stream (replication log) from the master and serve reads to clients. Failover is the process of handling the failure of a leader. There are several common approaches for organising a db:

  • leaderless
  • single leader
  • multi leader (examples are multiple datacenter\client with offline operations\collaborative editing). Common issue is how to resolve write conflicts.

Replication of changes between nodes are done through replication log, which can be either:

  • statement based replication (caveats are operations with side effects – now(), rand(), sequence numbers, triggers)
  • physical replication – coupled with storage engine internals ~ analysis of changes in the write-ahead log (WAL)
  • logical (row-based) replication

Replication may be synchronous/asynchronous/semi-synchronous. Async operation weakens the durability (data persistence guarantee) of writes but increases speed. Different topologies exist – circular, all-to-all, star. An alternative is to rely on some erasure coding scheme, which allows lost data to be recovered with lower storage overhead than full replication.

For cases when data is too big for a single node, partition-based data distribution is used – each node is responsible for a portion of the data and the dataset is spread across nodes, with no single node keeping everything. Each node may be leader for some partitions and follower for others. Db engines are quite enthusiastic about how to properly name a partition:

  • shard – elastic\mongo
  • tablet – big table\snowflake
  • vnode – cassandra
  • region – HBase
  • vbucket – couchbase

 

The most naive forms of partitioning are:

  • key-range based, when entries are routed to a particular node based on routing settings defined by key ranges (0-1000 on the 1st node, 1001-2000 on another) – the key distribution will most likely not be uniform
  • hash based (DHT – Distributed Hash Table), where the target node is determined by the formula hash_function(key) mod number_of_nodes = node_number – rebalancing becomes a nightmare.

If load is skewed – partitions that contains hot keys (responsible for disproportionately huge volume of records or requests) will experience high load and become hot spot during processing. Common example is time-based partitions.

Consistent hashing extends the idea of using a hash for routing requests by utilising a number of token ranges (hash values) per node. The number of such ranges can be fixed or dynamic – if the size exceeds some limit, the data is split further and transferred to the corresponding nodes (HBase, Mongo).
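A compact sketch of a hash ring: each node owns several tokens, and a key goes to the node owning the first token clockwise from the key's hash, so adding a node only moves the keys that land on its tokens.

    import bisect
    import hashlib

    def _hash(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    class HashRing:
        def __init__(self, nodes, tokens_per_node=8):
            self.ring = sorted(
                (_hash('%s#%d' % (node, i)), node)
                for node in nodes for i in range(tokens_per_node)
            )
            self.tokens = [t for t, _ in self.ring]

        def node_for(self, key):
            idx = bisect.bisect(self.tokens, _hash(key)) % len(self.ring)
            return self.ring[idx][1]

    ring = HashRing(['node-a', 'node-b', 'node-c'])
    print(ring.node_for('user:42'))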

Secondary indexes in such a setting become more complex; they can be partitioned either:

  • by document – every node contains indexes for the data in its local partitions only, so a query needs a scatter/gather/rank step
  • by key (term) – a dedicated index contains references to all partitions containing the term; the downside is that it has to be sharded as well

 

How requests are routed from client to partition that contains data:

  • routing tier – often rely on Zookeeper (Kafka\HBase)
  • any node may be contacted – in this case it becomes the coordinator node and forwards the request to the node with the data; knowledge about partitions is acquired through a gossip protocol.

Nightmare of data consistency

When two or more nodes accept writes, it is an open question how to handle writes for the same key:

  • Conflict-free replicated datatypes (two way merge functions)
  • Mergeable persistent data structure (three way merge functions)
  • Operational transformation 

One of the key approaches for conflict resolution in a leaderless multi-replica setup is the version vector. It contains a version number per key, per replica – so it can distinguish whether two operations are concurrent or one causally depends on the other. Lamport timestamps are used for total ordering – with them you can't say whether operations were really concurrent or causally dependent; the advantage is that they are more compact.

There are several tricks to improve data consistency:

  • Read-repair – during regular read operation, do read from multiple nodes and update stale value
  • Anti-entropy – background process that check for inconsistency between nodes
  • Sloppy quorums. A quorum is when an operation requires acknowledgement by a certain number of nodes – usually an absolute majority; a sloppy quorum is when non-designated nodes may accept requests.
  • Hinted handoff – parking writes at a coordinator node while the target node is not responding, to increase write availability

But even if we talk about a good old single-node database, various issues may happen:

  • dirty read – one transaction read uncommitted writes of another transactions
  • dirty write – one transaction overwrite (uncommitted) data from another transaction
  • non-repeatable reads – values for the same rows differ within the same transaction, sometimes called read skew. Reading old data before a commit may be harmful for backups and analytics.
  • phantom reads – during a single transaction new rows are added or removed for the same query; put another way, a write in one transaction changes the result of a read in another

A transaction is a set of reads and/or writes executed as a single operation: if it succeeds, all changes are committed; if it fails, it is aborted/rolled back. It is the key component of OLTP engines that offer the ACID guarantees:

  • Atomicity – is about handling halfway changes – transactional commits or rollback – all or nothing.
  • Consistency – data invariant always true i.e. db will be in a good state before and after any transactions, depending on application may rely on db’s isolation and atomicity
  • Isolation – handling simultaneous access to the same record – pretending that no one will concurrently accessing database (various forms are present). Note that it is different in comparison to single object write operations aka compare and set – so called “light weight” transactions in distributed systems. 
  • Durability – persisting state of committed transactions in case system failure 

 

Dbs provide concurrency control – different transaction isolation levels: read uncommitted, read committed, repeatable read, serializable – to satisfy the following conditions:

  • Reading your own writes – read after write consistency (for example to cross device access)
  • Monotonic reads – read from different replica (possible solutions is to read from the same)
  • Consistent prefix reads – if a sequence of writes happens in certain order anyone reading those writes will see them in same order

Read committed – read operation will see only committed values. Default transaction isolation level for many db – postgres, oracle. Implemented by row level locks for writes + memorised values before and after transaction, return old values for any read ops before new value actually committed.

Snapshot isolation – each transactions read from consistent snapshot of the db – at the start of transaction, some times called repeatable read.

MVCC – multi-version concurrency control – maintains the state of the db at different points in time. It can sometimes be used as an alternative to locking. Key principles: readers never block writers and vice versa; a set of old committed versions is kept for every transaction in progress. Updates are usually implemented as delete + insert with a transaction id for reference. A db index in this setup may refer to all versions of a value, or copy-on-write (as in B-trees) may be used.

One of the anomalies related to simultaneous writes is the read-modify-write cycle – lost updates (incrementing a counter or modifying a list). A possible workaround is an atomic write, but dbs usually do not support it for all data types. An alternative is an explicit lock taken by application code. Many db engines provide lost-update detection, which requires the transaction to be retried. For non-transactional dbs, an atomic compare-and-set operation can be used to prevent lost updates. For replicated data you need conflict resolution techniques: the common one – last write wins – is error prone; commutative atomic operations are good.
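A sketch of the difference using sqlite3 (table and column names are made up):

    import sqlite3

    db = sqlite3.connect(':memory:')
    db.execute('CREATE TABLE counters (id INTEGER PRIMARY KEY, value INTEGER)')
    db.execute('INSERT INTO counters VALUES (1, 0)')

    # Lost-update prone: read, modify in the application, write back.
    (value,) = db.execute('SELECT value FROM counters WHERE id = 1').fetchone()
    db.execute('UPDATE counters SET value = ? WHERE id = 1', (value + 1,))

    # Atomic write: the increment happens inside the database.
    db.execute('UPDATE counters SET value = value + 1 WHERE id = 1')

    # Compare-and-set flavour: the update applies only if nobody changed the row meanwhile.
    cur = db.execute('UPDATE counters SET value = ? WHERE id = 1 AND value = ?', (10, 2))
    if cur.rowcount == 0:
        pass  # somebody got there first – re-read and retry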

Write skew happens when the same objects are read by multiple transactions and some of them are then updated by different transactions; the best option is to rely on the serializable isolation level – transactions may actually execute in parallel, but the result is the same as serial execution. Examples where it actually matters: games – moving different figures to the same cell; booking systems that prevent double booking; double spending from an account. The common pattern behind write skew is: select + logic + write.

Options are:

  • rely on serial order (as in Redis), sometimes through procedures.
  • partition per cpu core => cross partition coordination very slow.
  • data fit in memory + anti-caching (abort transaction load in memory retry)
  • Two-phase locking (2PL) – readers and writers block each other (the opposite of snapshot isolation), using shared and exclusive locks. Predicate locks over whole objects are not present in modern dbs; index-range locking is an approximation of predicate locks.
  • optimistic concurrency control techniques – serialisable snapshot isolation. try to detect conflict and only if they present abort transactions. One of approach is to materialise conflict – create rows before hand and lock them. 

For distributed databases, in order to tackle atomic commits and offer consistency guarantees for situations like consistent snapshots or foreign key references across partitions (when a transaction fails on some nodes but succeeds on others), the two-phase commit protocol (2PC) is commonly used:

  • save data at nodes
  • send prepare requests to participating nodes
  • send commit if all participants responded with yes at the 1st stage

It involves a coordinator, or transaction manager – if it fails before sending the prepare/commit requests, it is not really clear what the involved nodes have to do. XA – extended architecture – is a standard for implementing 2PC across heterogeneous technologies, roughly a C API for talking to the transaction coordinator.

Total order broadcast (atomic broadcast) – reliable delivery of messages (if msg reach one node => it will be delivered to all) + totally ordered delivery (msgs delivered to nodes in the same order) – even if a node(s) or network are faulty.  ~ asynchronous append log

Transaction isolation levels are safety guarantees and are not the same as distributed consistency (coordinating the state of replicas in the face of delays and faults) – the latter is not about a set of operations within a transaction, but about the status of an individual object replicated among several nodes. Linearizability – a recency guarantee, so-called atomic|strong|immediate consistency – can be treated as the impression(!) of a single copy of the data with atomic operations. Even a strict quorum cannot guarantee it. Read repair can help for reads and writes, but not for compare-and-set/increment-and-get, as those require consensus among the data nodes.

Fundamental characteristics of consensus:

  • epoch sequence number with every request to filter stale data
  • choosing leader
  • vote on leader’s proposals.

Key differences in comparison with 2PC: in 2PC the coordinator is not elected, and consensus does not need all nodes to agree – a majority of votes is sufficient. Another feature of a good consensus algorithm is the ability to add new nodes – dynamic membership extension. Consensus services for writes – Zookeeper, etcd. Consensus algorithms: Paxos, Raft, ZAB.

Alternatives to ACID

BASE (Basically Available, Soft State, Eventual Consistency) – favour availability over consistency, weakening of acid

Another common concept is the CAP theorem – a system is either consistent or available when (network) partitioned (a network failure between nodes). Consistency here in reality means linearizability – updating an object by key in a distributed system is atomic, i.e. all reads after a write return the new value; partition tolerance actually means network partitions, i.e. network instability between nodes/racks/datacenters.

Hadoop 

A map-reduce cluster can be thought of by analogy with multiple unix nodes running simple command line tools: awk, sed, grep, sort, uniq, xargs. They usually do not modify the input – i.e. they have no side effects except producing output to a distributed filesystem. A mapper is a function called once for every record; it can generate many outputs, which are sorted by the engine. A reducer takes all records for the same key and iterates over the collection of values for each key. Each file on HDFS is considered a separate partition that can be processed by an independent mapper task.

The idea is to run compute near the data: mapper code (usually just jar files) is copied to the nodes holding the data. Partitioning by key guarantees that all values for a key end up at the same reducer. The reducer connects to the mappers and downloads the result of the map stage – sorted key-value pairs for its partition – this is called the shuffle. Success is when all mappers and reducers finish, i.e. the execution time depends on the slowest reducer. It provides an all-or-nothing guarantee – output from failed tasks is discarded. It is not like MPP, where the focus is on parallel execution of analytical queries on a cluster of machines. Map-reduce plus a distributed filesystem can be thought of as a kind of generic OS that can run arbitrary programs, for example: query engine – Hive, OLTP – HBase, MPP – Impala.

Dataflow engines (Spark, Tez, Flink) can be used to run workflows from Hive or Pig. A DAG – directed acyclic graph – represents the flow of data between operators. For graph processing there is the Pregel model (the bulk synchronous parallel model of computation): a vertex has state and at every iteration receives message(s) (fault tolerant and durable) from other vertexes (for example along its edges).

Joins:

  • Map-side joins. Hive require specifying hot keys explicitly and use map-side join
    • partitioned hash join (bucketed map joins) – both join input have the same number of partitions – hash join can be applied to each partition
    • map-side merge join – when both inputs are not only partitioned but also sorted on the same key
  • Reducer side joins:
    • skewed join or sharded join – hot keys to multiple reducer (replicate other join input to multiple reducers).
    • broadcast hash join – copy small tables to all reducers

Hive metastore contains info about encoding format, name of directories where data stored, number of partitions and the keys by which it is partitioned and sorted. Sort merge join – may use secondary sort (user-birth->[activity list])

Data streaming

Event stream: producer (publisher/sender) and consumer (subscriber/recipient). Events are grouped into topics or streams. Publish/subscribe model – a messaging system is used to notify consumers about new events. If producers are faster than consumers: dropping, backpressure or buffering. Direct messaging vs message brokers (queues). Multiple consumers can be used for load balancing – one msg goes to some consumer – vs fan-out, where one msg is delivered to all. It may be important to wait for acknowledgement from the consumer before deleting the msg from the queue. Msg reordering may happen due to network retries.

JMS/AMQP based (RabbitMQ) vs log-based (Kafka) brokers. A topic is a group of partitions ~ files (possibly on different machines). The offset is a sequence number within a partition – msgs within a partition are totally ordered; there is no ordering guarantee across different partitions. A consumer group keeps offsets per partition to mark what it has read so far. Scalability is achieved by assigning whole partitions to consumer nodes, i.e. at most we can have as many consumers as partitions. Slow msg processing = holding up all msgs in the partition.

Changelog or change data capture (CDC) – observing all changes to db and extracting them in form they can be replicated to other systems. Event sourcing – similar concept as cdc but not at the level of db operations – instead it will be app specific events. Difference is history of modification vs current state: cdc – complete row for key, event sourcing – later event not override prior, doesn’t contains mechanism of state update. Command -> validation(?) -> event~fact.

Command query responsibility segregation (CQRS) – separating form in which data is written from how it will be read ~ different read view.

Complex event processing (CEP) – query and data roles reversal – store query to find patterns but data are transient. Example is percolator feature of elasticsearch. Stream analytics use probabilistic algos – bloom filters for set membership, hyperloglog – for cardinality estimation.

Within stream processing topic of estimation time of event become not so straightforward: time of action + time of send to server + time of process by server.

Various time-based windows are possible:

  • tumbling – for example fall in wall clock 1-min interval
  • hopping – aggregated overlapping tumbling
  • sliding
  • session

Joins on streams require some state. If the ordering of events is not determined, the join becomes nondeterministic – slowly changing dimensions (SCD) address this by adding ids for the particular version of the join; the downside is that it breaks log compaction.

  • stream-stream
  • stream-table – cdc to sync point-in-time cache
  • table-table ~ mat view maintenance

Microbatching and checkpointing can fulfil exactly-once semantics IF there are no side effects; otherwise you need idempotence (running many times has the same effect as once, i.e. setting a value vs incrementing) or distributed transactions.

Control of load:

  • client side backpressure or flow control
  • exponential backoff to fight with overload
  • phi accrual failure detection (akka, cassandra), tcp retransmission – measure response time and their variability(jitter) adjust timeouts according to observed response time distribution
  • quality of service (QoS – prioritisation and scheduling packets) + admission control (rate limiting senders)

Byzantine fault tolerance

When a node tries to trick the system, that is a byzantine fault; reaching consensus in such an environment is the byzantine generals problem – nodes malfunction and do not obey the protocol. The partially synchronous model with crash-recovery faults is the common model for distributed systems. Safety (nothing bad happens) and liveness (eventually something good happens). Limping-but-not-dead nodes. A fencing token – an auto-incrementing sequence issued during lock acquisition – prevents inadvertently acting nodes; it works in an environment where nodes are honest.

Chronicles of one crypto-arbitrage bot


Some time ago a friend of mine, innocent in terms of exposure to the bloody details of executing mid-term IT projects, sold me yet another brilliant idea of how to effectively decrease the amount of my leisure time – by diving (again) into the horror-land of startups.

It was an exciting journey with ups, downs and a sort of Challenger Deep that reinforced some of my beliefs about general principles of software development, made me treat other "rules" in a more complimentary way, and definitely allowed me to grasp a few interesting tricks.

With this article I want to highlight the curious constraints we had to overcome and the real-life uncertainties shaping my architectural approach, and to illustrate how the project evolved as the code base grew and requirements changed.

As the years go by, I tend to think there are two main approaches to starting your own startup:

  • you try to foresee every possible outcome, address all edge cases, get horrified by the volume of work and end up with "Oh, hell, fuck it, no!"
  • you start with "Fuck it!", jump in with some awkward prototype written during lunch and see whether this shit works at all.

 

Bright Idea

It was early 2017; the term 'crypto' was starting to appear in mass media, and video cards were out of stock almost everywhere. ICOs promising various kinds of digital revolution were spreading like the bubonic plague in ancient times, raising crazy money in exchange for vaguely compiled PDFs. The gold rush of the modern century, as it is.

There was a bunch of crypto-exchanges offering to trade alt-coins: different subsets of trading pairs, complex fee rules, different volumes of digital "commodities". It was a distributed and (almost) non-regulated market where the price of the same coin might differ between exchanges.

The idea was pretty simple:


Deposit money at the supported exchanges, monitor the price difference for some coin, and as soon as it exceeds the profitability threshold – sell at one exchange, buy at another – just to maximise the absolute volume of all coins across all our accounts' wallets. The approximate opportunity window (based on manual trade execution) sometimes reached up to 15 minutes – i.e. it was even possible to send missing coins from another wallet to an exchange and trigger the necessary order.

All of this still sounded pretty easy – so we agreed on an investigation stage: collect price fluctuations throughout the day and analyse them to better understand the prospects of the project.

At that time I can’t distinct ask from bid, how order different from trade, what is the candle or ticker and how price regulated at all.

Very brief intro to terminology:

An order is when you register with the exchange your desire to buy or sell a particular cryptocurrency. If you want to buy, it is called a bid; to sell – an ask.
When someone else places an order for the same coin and the prices match, trade(s) are executed – i.e. coins travel from wallet to wallet, the exchange charges its fee, and the order's volume is updated to reflect the executed trade; when it reaches zero, the order is fully filled.
Volume (amount) is exactly how much you want to trade, while price, depending on the exchange, can encapsulate several strategies; the most common is to define an exact price.
It has one issue though – the price changes based on the current state of the order book – i.e. if no one wants to sell for the price you set, you can't buy anything.
In practice it means that if you set your price according to the current top order, then between the moment you click submit and the moment the exchange notices it, someone else may purchase everything, and the next lot in the order book will have a different price.
With no matching order, your order may hang and, depending on exchange rules and price movement, may expire in a couple of weeks.
To overcome this inconvenience, another common option is to use the 'market' price – when you definitely want to participate in a trade at the best real cost available (if the order matching engine is implemented properly).
A ticker is a summary that highlights trade changes over a fixed time period: the highest ask and lowest bid for a coin pair, plus other details that vary from exchange to exchange.
A candle usually carries more verbose information – open-close-high-low prices for a time period and volume-related info.

First blood

So, yeah, returning to the topic of getting data from the exchanges.
I started looking at the public APIs of the chosen exchanges. It could serve as a great illustration of Noah's Ark – a lot of very different beasts. In the world of classical exchanges the FIX protocol is a common way to get updates, but even now it is barely supported in the crypto world. Web sockets were complex to use and not available at the majority of exchanges. Okay – that meant we would work through REST: get data from tickers and save it to look at later. A piece of cake!

Initial tech stack and reasoning behind it:

  • I did not expect any crazy data – neither in terms of volume nor in terms of operations.
  • Database? But maybe this thing will not fly? Csv files are still very convenient to deal with! (yeah, right)
  • Language – speed of implementation was what mattered most – so no exotic I-want-to-try-something-new, no complex power of C++ or Scala, no verbosity of Java – I just stuck with Python. 2.7. Because, umm, well, it was installed by default and it has everything 3 has? (yeah, right)

That initial code – several Python modules and a single ugly ipython notebook – probably wasn't even committed, but looking at the tickers I couldn't believe my eyes – this was too good to be true. Either there was some very fundamental error in our reasoning (or in the code), or we should abandon all other life activities and dive into implementing the remaining functionality asap.

We will be rich soon!

After a discussion, the decision was made to start iteratively.

During the first stage I would create a prototype application that:

  • collects data for further analysis – ARIMA combined with the most advanced tricks from technical analysis should give us ideal predictions (yeah, right!)
  • triggers a notification if the observed difference is noticeable enough to execute manual trades.

Technicalities:

  • Notifications. It is strange, but not everyone finds it amusing to dig through log files or read emails. On the other hand, Telegram has a free bot API and mobile and desktop clients with a nice UI that simplifies analysis.
  • Every exchange chooses its own unique naming convention for coins. All right, we will create an intermediate naming layer and exchange-specific conversion routines.
  • Database. Yeah. Probably it is time to have one, as the collected csv started exceeding a Gb in size after a couple of days of running. What about constraints?
    • A hundred writes per second seemed to be a very safe upper bound (yeah, right).
    • Reads? Only when analysing data (yeah, right).
    • I do know about normal forms, so with a proper approach the unavoidable schema changes should be cheap.
    • I did not yet know which data dimensions would be the most important for us – so I wanted the flexibility to tweak data tuples along every dimension (column).
    • The implementation language might change in the future (yeah, right) – so a wide choice of _mature_ client libraries is definitely an advantage.
    • I had worked with Postgres, which seemed to satisfy all of the above and leave a lot of room to grow – another reason to use a known technology.
    • Just to keep in mind that things may still change, some simple abstraction layer that allows switching the database is always good.

When the plan is ready it is very easy to type the code – no matter whether it is after intense working or family hours – pick an issue and execute. But… yeah, it would be too easy without obstacles:

First obstacles

There is a famous quote, «Data is the modern oil», but exchanges are not really keen to share their data – and if you need granular, timed snapshots of market state, the only way to acquire them is to collect them yourself (or buy them). An order book may have dozens of thousands of bids and asks. If you request it every second, the majority of the data will be duplicated and its retrieval will be just an annoying overhead for the exchange. If every second you want info about hundreds of coins, it can be considered a kind of ddosing. There are a lot of policies for controlling end-point abuse: a weighted cost for every API invocation with a client balance renewed every quant of time; shared request quotas – per IP or API key; global hard limits, where as soon as a client exceeds the allowed number of connections/requests the exchange throws 429, 503 or even cuts off all communication with the offender. It was very interesting to learn such a diversity of load balancing approaches in practice – troubleshooting data retrieval processes by playing with timeouts between invocations and spreading applications among several VPSes.

All logging-related activities are under constant threat of becoming too noisy and being ignored. And it is always tricky to choose a balanced strategy that prevents the recipient from creating a special rule that automatically moves everything to the trash. As I mentioned earlier, the opportunity window sometimes reached 15 minutes, we analysed prices every second, and an arbitrage event might occur for multiple currencies. As a result our Telegram channel was bombarded with identical messages, making it unusable (and Telegram did not hesitate to ban us for a cool-down period, so we had to add a back-off mechanism as well). To make navigating the alert history in the Telegram channel simpler, we introduced a naming convention with the pair of exchanges and the currency names as tags. Additionally, we went through a lot of iterations on alert attributes: severity, depending on the size of the possible profit; time – our local time? (but we were in different timezones), exchange time (but due to their geographic distribution it would also differ), or UTC from the server where the process runs. And, yeah, this Telegram channel was muted. For the same reason – the huge number of notifications – we discarded the idea of adding simple controls to manually confirm order placement.

We decided to store details about observed arbitrage events as well; history retrieval was ready, with some precautions for timeout handling and restarting. Now, the question was where to actually store the data. If you are a lean startup and your free EC2 micro instance has expired, but you do not wish to invest time in investigating alternative options from competitors, your option is «in-house» infrastructure. I.e. just run `docker pull postgres`, map a volume for persistency and, to speed things up a bit (just in case – the so-called premature optimisation (C)), disable network virtualisation. Now, how to access it? The classical setup is the following: you have a router from your internet provider with a dynamically allocated public IP address. Depending on your luck and location (country of residence), the IP address is periodically changed by the ISP. Behind the router – in your home private network – IP addresses are usually assigned via DHCP, i.e. you restart the host where the database server runs and it may get a different private IP address. On top of that, it is your home network – i.e. a NAS with private photos or a laptop with work projects. There are a lot of services offering dynamic DNS – you run a special daemon on your server that updates the IP address bound to the chosen domain name – I chose https://www.dynu.com. The second simple step is to configure your workstation with a statically assigned IP. Many young hackers' bundles offer a convenient interface over nmap to scan for open ports and check whether, by any chance, you allow ssh access with login and password. To have at least some protection layer, port forwarding was set up on a non-standard port, ssh was configured for key-file-only access, and good old iptables lets you ban IP addresses after failed authorisation attempts. With all these precautions, you can initiate a tunnel with port forwarding in such a way that the remote postgres process is bound to your localhost:1234.

Real programmers test in production

A reactive asynchronous program dealing with the real world is not so easy to test. And the main reason is the infinite number of possible combinations of weird situations: a blackout in your home office, unnoticed maintenance of an exchange after a recent hacking, issues with the internet provider – you name it, I saw them all!

The first launch with real money went the following way:

  1. we ran multiple processes
  2. the telegram channel burst with endless messages detailing placed orders
  3. I saw that Telegram had started to exponentially increase the timeouts for forwarding messages from the queue to the channel – i.e. my code was doing something and it was not visible to us (oh, shiiiit)
  4. my partner was trying to keep up with the message flow – but it was impossible, as for every message he had to check two different exchanges, filtering by currency.
  5. We need to stop it! Now! Side note: let's add a new feature request to the backlog – we need some mechanism to stop all of them at once. At that time I just ran pkill on everything with python.

After 6 months of development, the first 20 minutes of running revealed various issues ranging from annoying to critical. And in order to investigate what had happened and find the root causes, we had to dive into logs and the history of orders and trades – the situation from the bot's perspective was quite different from what the exchanges thought.

Exchange performance

Kraken – it just doesn’t work. I am still have old script to issue placement of thousand orders and compute ratio of success. At that time it was like 30 something from the whole thousand. At least it was easy to verify and reproduce.

Maintenance of up to date balance state

You have to know how much money you have, as it defines the scale at which you can trade. In the crypto-exchange world there are primary coins – BTC, USDT, ETH – used as base currencies for trading, and secondary ones – myriads of alt-coins traded against the base. If many processes use the same wallet, any of them may change the remaining volume of the base currencies. So the issue of maintaining an up-to-date balance state was much trickier. Initially, every arbitrage process synced it independently with every exchange before computing the details of the order to place. In case of any network issue during balance retrieval, it relied on the previously retrieved state.

Some exchanges distinguish total and available balance by introducing a frozen balance, i.e. the portion of coins allocated to registered orders that have not yet executed. When you handle the response you have to pick the proper field to be sure to what extent you can go crazy with trades. The sad thing is that not all exchanges actually have it (Kraken). But anyway, I completely missed this bit during implementation – so we had several cases of «Not enough balance» errors during order placement while the paired order at the other exchange was successfully placed and executed – i.e. a direct loss.

Another issue related to balance – bursts of requests from the same IP. The balance API is user-specific and considered a private API with authorisation, i.e. more sensitive to load. When the price of some coin fluctuated, many processes requested it within the same time window and the response might come back delayed or even time out. Exchanges started completely banning the IP address, which was actually fine, as no trade was possible in that case anyway – even the order book couldn't be retrieved. Throwing timeouts, on the other hand, was catastrophic: the bot had to wait out the timeout period (i.e. someone else could place this order) or rely on an outdated balance state to place an order on one of the exchanges, failing to place the first one because of not enough balance.

As we had already started experimenting with multi-node deployment, the solution to the issues above was to move balance retrieval into a dedicated process that updates the balance state in a Redis cache every second. Every arbitrage bot may read it and check it; if the last update is too old, the bot immediately shuts itself down. After order placement, the process forcefully updates the balance in the cache. This approach was optimistic as well – even after execution it takes some (tiny, but possibly crucial) time for changes to be reflected – just a humble attempt to minimise the window of uncertainty.
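A rough sketch of that contract (key names, staleness limit and the JSON layout are made up, not the original code):

    import json
    import time
    import redis

    r = redis.Redis()
    MAX_STALENESS = 5  # seconds

    def publish_balance(exchange, balance):          # dedicated updater process
        r.set('balance:%s' % exchange, json.dumps({'ts': time.time(), 'data': balance}))

    def read_balance_or_die(exchange):               # every arbitrage bot
        raw = r.get('balance:%s' % exchange)
        if raw is None:
            raise SystemExit('no balance snapshot for %s' % exchange)
        snapshot = json.loads(raw)
        if time.time() - snapshot['ts'] > MAX_STALENESS:
            raise SystemExit('balance for %s is stale, shutting down' % exchange)
        return snapshot['data']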

Order book analysis

The initial approach for analysing the current state of the market was: take the top bid and ask, compute the price difference, compare, issue orders, move on to the second pair of bid and ask. But other players affect the order book at the same time – so whatever we are dealing with is already out of sync, and we saw situations where we tried to bet on already missing bids. So we decided to process each order book snapshot only once. The second issue – sometimes the price difference was there, but the volume wasn't enough to make a profit. It varies from coin to coin, so we had to pre-compute and maintain a volume cap, dependent on the most recent coin price and the exchange fee, to prevent placing orders leading to a loss, and to use for analysis not only the first bid/ask but the median of the top N of them, to account for processing delays and the activity of other players.

Reverse coins movement

The initial idea was just to maximise the total balance across all exchanges, but in that case it is possible (and it happened to us as well) to use up all coins of some currency at one exchange, so we simply couldn't trade it there anymore. So we added another mechanism to re-distribute balances evenly across all exchanges, using the same logic as for arbitrage but with a lower threshold – i.e. even if it ends with zero profit, we still meet our goal. This logic was triggered for a particular currency only if the observed imbalance on a pair of exchanges exceeded a configured threshold.

Busy days

At the peak we had 5 EC2 instances, each running up to 80 processes; history data was stored in Postgres on RDS (bills up to $600 for this one? Hell no, let's migrate back to a self-hosted solution!). The deals channel in Telegram beeped every day. It doesn't mean we had no issues, but the situation was stable enough to check it every evening to confirm that everything was fine or to re-prioritise the backlog.

However, the first days of live trading brought a few obstacles:

  • Logs tended to consume all disk space (yeah, it was very inefficient text-file logging), so processes tended to die on file writes (a simple logrotate configuration with cron solved it).
  • Deployment also became an issue – manually starting hundreds of processes is a bit overwhelming. Another thing: the processes were not daemonized (yeah, I know), in order to have some kind of heartbeat in the console as opposed to searching logs by PID and time, so to avoid any issues with SIGHUP they had to be deployed under screen sessions. First we create a new screen session named after the pair of exchanges; inside it we have dedicated consoles named after the trading pairs to simplify troubleshooting. The same goes for the other supporting services: history retrieval, balance updating, telegram notifiers, etc. So I drafted a few scripts and some middleware to automate it within a single server, plus a bash alias to stop processes without shutting down the screen session.
  • Almost every exchange requires a nonce as part of the payload for trading-related requests – an incrementing integer sequence. I experimented with various approaches, but we ended up relying on Redis to share it across the many processes using the same API key (see the sketch after this list).
  • Re-balancing was initially run as independent processes, and they sometimes competed with the direct arbitrage processes, so the decision was made to run them sequentially within the same process.
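For the shared nonce, the essential property is an atomic increment; a minimal sketch of the idea with Redis (the key name is made up):

    import redis

    r = redis.Redis()

    def next_nonce(api_key):
        # INCR is atomic, so concurrent processes never reuse or reorder a nonce
        return r.incr('nonce:%s' % api_key)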

Conquer new heights

  • When we started adding more coins and new exchanges, error handling became more and more important. On every request we may get a timeout or an error in response (which may mean that the order placement failed – but it may also have succeeded!). Errors may also be returned as part of the payload – i.e. the http response was 200, but the payload said "error something". Additionally, what the bot had been doing so far was just order placement – i.e. no guarantee that trades were executed – so manual work was required to review open orders and process them. Our solution to both these issues was to introduce two priority queues backed by Redis keeping track of all placed orders sorted by time. Each queue may be read by one or multiple consumer processes that check the status of an order with the exchange and process it properly.
  • From time to time speed became an issue – we noticed that we could not always issue the order in time. Python 2.7 has no async support, so I added gevent and pooled requests – not so convenient to deal with, and it doesn't completely eliminate the issue, as we still operate on top of a snapshot of the order book.
  • The architecture of the module, probably partly a bit verbose, proved flexible enough to fuse in new exchanges with all their quirks of API integration, and to add new currencies.

Problems of floating point arithmetic in Python

Rounding is a known pain in the ass in many programming languages (https://en.wikipedia.org/wiki/Round-off_error) and Python is no exception. Apart from situations when we sold more or less than expected, there were other cases when it affected us. Some exchanges are very strict when you submit a volume value with more precision than expected – if the API expects up to 6 digits but receives a number with 7 digits after the decimal point, the whole order is rejected (while the paired order at another exchange may be executed, as it may have more relaxed rules). These rules are specific to the exchange and vary from currency to currency – i.e. you can have 0.00001 of BTC but not of some obscure coin. And, guess what, these rules may change – so you have to monitor for any 'precision' errors on order placement. Another annoying thing is the implicit fallback to scientific notation in the string representation of small numbers. I understand the meaning of 1e-6, but the usual API is not so smart. To overcome all these issues it wasn't enough just to switch to the Decimal class; a whole new conversion layer had to be implemented to cut the redundant part of the string representation based on exchange-specific rules.
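A minimal sketch of such a conversion layer, assuming made-up per-pair precision rules (not the original code):

    from decimal import Decimal, ROUND_DOWN

    PRECISION = {'BTC/USDT': Decimal('0.000001')}      # hypothetical exchange rule

    def format_volume(pair, volume):
        quantised = Decimal(str(volume)).quantize(PRECISION[pair], rounding=ROUND_DOWN)
        return format(quantised, 'f')                  # plain notation, never '1e-06'

    print(format_volume('BTC/USDT', 0.0000012345))     # -> '0.000001'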

Brightest prospects

Many development activities happened outside the core module:

As the size of the database grew, we started thinking about ingesting news feeds and various forum rumours to check whether they correlated with price fluctuations. Inspired by this aim, an independent project was started to build web crawlers for twitter accounts, telegram groups, medium posts, reddit threads and a few more specific resources, to get all information within some date-time range. Every new source of information brought a new surprise – for example, in order to work with Telegram you have to build a native extension to use the Java wrapper through JNI. In order to efficiently get tweets with comments for curated accounts, we had to maintain a set of keys and rely on a key-rotation mechanism as soon as the API started throwing timeouts. An important aspect here is time handling. The world is big, complex and has many time zones, which may be reflected in a web page or a response in all their cheerful diversity, so we had to bring everything to a common denominator – UTC – as we had done for market data.

Another direction was simplifying bot management & deployment to decrease operational toil. As we had already started using Redis for sharing common information (the nonce) and maintaining alert messages and order queues, the next step was to take this mechanism to the next level – to build a sort of message bus for inter-process communication. Other important functions: automating deployment of services to a new node in a slightly more advanced way than just cloning the whole repo, updating the configuration of a particular server, and shutting down all processes on all registered nodes at once. To flexibly manage EC2 instances newly added to the system, we added the ability to deploy agents – special daemons that control the arbitrage and supplementary processes on a node and listen for commands from the UI of the command centre. From a monitoring perspective, it is much handier to review the state of processes by skimming through the colours of trading pairs per server, reflecting heartbeat delays, than by ssh'ing into the screen session of every server to watch for heartbeat messages in the console. Many challenges were overcome on this path as well: flask & dash integration, dynamic UI creation (I am not the biggest fan of frontend) with the ability to forward input from newly created forms into Redis.

The fall of Rome

And suddenly something changed – in a single day we were no longer able to place orders in time for a few currencies. Either someone was playing with scalping (which, in theory, exchanges try to prevent), or someone had built an efficient ML model to predict price movements within second-long intervals, or someone could simply do it faster. Even now this is not the case for all currencies on all exchanges – an individual or institution would have to pull in a lot of money to cover everything – but it affected profitability, so the direct arbitrage approach probably will not bring you much money anymore. The last attempt I made was to switch trading to a web-socket basis. The idea was very simple – you pull a snapshot of the order book, subscribe to updates to sync, and then maintain the actual state locally. Subscription means an event loop in dedicated threads. Asynchronous code with threads is not an ideal use case for Python 2.7, so I devoted a good amount of time to adding an option to show the top bids and asks from both order books in real time for debugging purposes. But several trial sessions with this approach didn't reveal many opportunities for profit and showed a general instability of the subscription itself, which could lead to re-syncing and was potentially error-prone.

Lessons learned

  1. Exchanges are just another piece of software with all the related issues of big complex systems – reliability, consistency, performance. So you have to deal with them with all possible precautions: stale orderbooks, delays in balance updates and order placement, error management.
  2. A VPS is not real hardware – you still share resources with other processes. It was the first time I saw the steal time metric in htop!
  3. Python for mid-size projects – great speed of development, rich ecosystem – google it and use it. Be careful, as you never know how that guy implemented the function you need inside a library – if you care about performance, try to minimise external dependencies as much as you can.
  4. The implementation of some built-in algorithms may not be what you expect – check set intersection or insort.
  5. It is useful to use some weird default values: -1, 100500 – when you stumble across them in logs you are (almost) 100% sure that something went wrong.
  6. If not sure – better just shut the process down.
  7. Shell programming – an ancient mortal art that can save a lot of time on supplementary tasks, for example analysing logs.
  8. Bugs related to the lack of types – stupid and annoying errors that require a long time to troubleshoot.
  9. Code indentation – when a block of code is moved inside or outside of a loop or condition by mistake – a unique «feature» of python – also requires thorough investigation and is not so easy to find.
  10. python and dates
  11. python and floats
  12. the cost of a function call may be high – inlining may lead to code duplication but may be faster

On interview preparation: nuances of python

From my perspective, mastering a particular language is still a secondary skill for a programmer.
For sure it will affect the architecture of the software and may add advantages for some use cases of handling data flows, but overall, in the long run, a decent system design coupled with carefully chosen algorithms is much more important.

Nonetheless, within this article I want to highlight several topics that usually distinguish a python veteran from a novice. Part of these questions I was asked during recent interviews; some of them I have asked myself, to tell people who call themselves python experts from those who really are, based on long exciting hours of debugging sessions on real products.

Why do I want to start with python and not serious languages like c++ or scala?
Currently it is the de-facto standard second language accompanying a more performant one, and
in many domains it is considered the primary language because of its low barrier to entry,
rich ecosystem and variety of libraries.

Interpreters:

  • CPython – the default one
  • PyPy – if you need a bit faster python (a JIT compiler with a faster version of the garbage collector)
  • Jython\IronPython – in case you need integration with the java or .Net world
  • Cython – you need more speed, numpy, pandas and scikit cannot fulfil what you need,
    but you are still afraid to accept that c++ may be your best friend.

Dependencies management:

The standard practice is to isolate dependencies within a dedicated sandbox of a virtual environment, with all required packages and the python interpreter itself installed in such a way that they do not conflict with system-wide installed libraries.

virtualenv is mainly used for this, but virtualenvwrapper may be more handy. In case it is necessary to maintain several python versions simultaneously, pyenv can help.

requirements.txt should contain the list of all dependencies.
Using --index-url you can specify your own private package server instead of the default one – pypi – including private repos with access through ssh, or just a local file path:

file:///path/to/package#egg=my_package

Distribution and Installation:

  • setup.py – during automatic package installation requirements.txt is not analysed
    at all – what really matters is the setup.py file used by setuptools. It is worth being aware of the install_requires directive, which should contain the list of all dependencies.
  • wheel (or egg) as common distribution formats.

Popular language’s caveats:

Mutable vs immutable datatypes

You can’t change the int or string value of some variable – you have to replace it with a new value. But you can add an element to a list or remove a key from a dict. Custom objects are usually mutable – you can go crazy with altering their fields or adding new methods.

Why does it matter?

Mutating default arguments – a perfect source of hard-to-find bugs:

def some_method(b=[]):     # WRONG, use b=None instead
    b.append(1)
    print(b)
    return sum(b)

for x in xrange(3):
    some_method()

Arguments are passed into methods by reference, i.e. the function does not receive a copy of the object.
If an argument is immutable and you try to modify it inside the function, a new local instance will be created – this procedure is called rebinding; if it is mutable, you can end up modifying the originally submitted value.

How to check whether two names actually reference the same object?

x = 100500
y = 100500

id(x) == id(y) # False, values are the same but objects are different

The id method returns the identity of an object (sort of the virtual address of the variable, if you are used to C terminology).

z = x
id(z) == id(x) # True, equal!
z = 127
id(z) == id(x) # False, as we re-assigned z to another value

Note however:

x = 250
y = 250

id(x) == id(y) # will be True 
"""because of optimisation of interpreter for small numbers:
https://docs.python.org/2/c-api/int.html"""

x = "STR1"
y = "STR1"

id(x) == id(y) # may be True as well
"""
interpreter's implementation specific
for immutable types, operations that compute new values may 
actually return a reference to any existing object with 
the same type and value:
https://docs.python.org/2/reference/datamodel.html#objects-values-and-types
"""

The is operator does exactly this – it compares the ids of two objects, i.e. whether they both reference exactly the same memory cell.

So, with these brief ideas about value and reference in mind, we can tackle the equality operation:

class A():
    def __init__(self, a):
        self.a = a

x1 = A(1)
x2 = A(1)

# Both will be False
x1 == x2
x1 is x2 # equal to id(x1) == id(x2)

By default the comparison operator == invokes the __eq__ method, whose default implementation falls back to comparing the ids of the objects.
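A minimal sketch of value-based equality (not part of the original examples, just to illustrate the point):

class A(object):
    def __init__(self, a):
        self.a = a

    def __eq__(self, other):
        # compare by value instead of by identity
        return isinstance(other, A) and self.a == other.a

print(A(1) == A(1))  # True – __eq__ now compares values
print(A(1) is A(1))  # False – still two different objects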

Truthiness vs “is None”

class A:
    def __init__(self, a):
        self.a = a

class B():
    def __bool__(self):  # __nonzero__ in Python 2
        return False

l = [float('nan'), 0, 13, 0.0001, "", "False", [], A(None), B(), False]

for entry in l:
    if entry:
        print("{} - True".format(entry))
    else:
        print("{} - False".format(entry))

is None is nothing more than a straightforward check of whether some variable references the same address as the None value. But what checks are performed during a truthiness check of some_object? Under the hood it just evaluates the object – checks for the existence of an implementation of __bool__ or __len__ on the object; if neither is present, the object is treated as True.

Hashing in python

A common question to confuse people – why, by default, is hash defined for immutable objects only?
The reasons become very obvious if you think about the example below:

class A():
    def __init__(self, a):
        self.a = a

    def __hash__(self):
        return hash(self.a)

    def __eq__(self, other):
        print("__eq__ invoked!")
        return self.__class__ == other.__class__ and self.a == other.a

container = set()
c = A(1)
d = A(10)
container.add(c)
c in container # True
d in container # False
c.a = 123
c in container # False

I.e. after we change the value of the object’s attribute, we can’t really check whether it is in the container. That makes sense! But what if we do this:

d.a = 1
A(1) in container # False, and call of eq method
d in container # False, and call of eq method

I.e. such checks always invoke __eq__ when the hashes are equal, in order to deal with possible hash collisions.

What does this mean from a practical point of view:

  • if object1 equals object2, their hashes should also be the same
  • in theory it is possible to use the identity of objects for hash computation and compare values in case of hash collisions

What will happen if we try to compare lists?

a = [1, 2, 3]
b = [3, 2, 1]
c = [1, 2, 3]

a == b # False
a == c # True

Equality is checked for every pair of elements with the same index – so order matters!

Copy – shallow and deep

What if we add this to our equation?

d = a
a == d # True
print("{} = {}".format(id(a), id(b)))

I.e. d and a reference the same address.
Because of that, if we add an element to a, d will also have this element.

It means that in case we need two independent instances of a similar list, we need a copy: for a flat list a shallow copy is enough, but for nested mutable structures we need deepcopy. And this applies to every mutable type (or collection).
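A short illustration of the difference, using the standard copy module:

import copy

a = [[1, 2], [3, 4]]
shallow = copy.copy(a)      # new outer list, but the same inner lists
deep = copy.deepcopy(a)     # fully independent copy

a[0].append(5)
print(shallow[0])  # [1, 2, 5] – the inner list is shared
print(deep[0])     # [1, 2]    – the deep copy is not affected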

Python’s sort of name mangling:

Python offers an interesting approach to preventing name collisions in case of inheritance.
Knowledge of this trick can save you time while working with someone else’s code.

class B():
    __attr1 = None

u = B()
u.__attr1 # AttributeError: B instance has no attribute '__attr1'
u._B__attr1 = 123 # All right then

Slicing tricks:

Reversing a string, the pythonic way:

a = "Acapulco"
a[::-1]

And familiarity with slice notation

array[start:end:step]

may be beneficial by providing very compact code:

g = [x for x in xrange(10)] # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
g[::2] # Even only [0, 2, 4, 6, 8]
g[-3:] # Last three elements [7, 8, 9]

Initialise a list with identical values:

b = [-13] * 10

But don’t fall into the trap of using:

b2d = [[-13] * 10] * 10  # WRONG! will create 10 references to the same list

This is a valid way of doing it:

b2d = [[-13] * n_cols for i in range(n_rows)]

How to work with resources using with notation:

with open("in.txt", "r") as input, open("out.txt", "w") as output:
    output.write(input.read())

Benefits – implicit handling of resource release in case of any exception.

And yeah, nothing stops you from adding similar functionality to your own objects –
just add __enter__ and __exit__ methods to them.
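A minimal sketch of such an object (the resource here is imaginary):

class ManagedResource(object):
    """Minimal custom context manager for a hypothetical resource."""

    def __enter__(self):
        print("acquire resource")
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        print("release resource")
        return False  # do not suppress exceptions

with ManagedResource() as res:
    print("working with", res)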

Talking about exceptions:

Keep in mind:

  • the possibility to rely on the exception’s inheritance hierarchy
  • using else when no exception is raised within the try block
  • finally is similar to what you may have seen in other languages – it will be executed anyway.
import random


class MySimpleException(Exception):
    """Base class"""


class MyMediumException(MySimpleException):
    pass


class MyCriticalException(MySimpleException):
    pass


class MyCustomException(Exception):
    pass


def some_method():
    res = random.randint(0, 10)

    if res == 0:
        return res

    if 1 <= res < 3:
        raise MySimpleException("MySimpleException")
    elif 3 <= res < 6:
        raise MyMediumException("MyMediumException")
    elif 6 <= res < 9:
        raise MyCriticalException("MyCriticalException")
        raise MyCriticalException("MyCriticalException")

    raise MyCustomException("Bad luck today!")


def exception_handling_example():
    try:
        some_method()
    except MySimpleException as e:
        print("Any exception within class hierarchy: " + e.message)
    except Exception as e:
        print("Any exception" + e.message)
    else:
        print("No exception at all")
    finally:
        print("This will be executed no matter what")


for x in xrange(10):
    exception_handling_example()

Iterators, generators & yield:

x = range(1000000) # Python 2.7.*
import sys
sys.getsizeof(x) # 8000072

def my_generator(n):
    num = 0
    while num < n:
        yield num
        num += 1

x = my_generator(1000000)
sys.getsizeof(x) # 80

range will create a list with all elements in memory.
my_generator(1000000) will create a special object – a generator – that can be used in a loop;
in our example it will produce values from 0 up to 999999 in a lazy fashion:

next(x) # 0
next(x) # 1
next(x) # 2

Many are familiar with the syntax of list comprehensions but may find this syntax a bit confusing:

a = (u * 2 for u in g)

but this is a comprehension as well – it just returns a generator.

This is not:

x = xrange(1000000) # <type 'xrange'>
next(x)             # TypeError: xrange object is not an iterator
x = iter(x)
next(x)             # 0

An iterator is an object that implements both the __next__ (next in Python 2) and __iter__ methods.
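A minimal custom iterator as a sketch – the class name is made up:

class Countdown(object):
    """Counts down from n to 1: implements __iter__ and next/__next__."""

    def __init__(self, n):
        self.n = n

    def __iter__(self):
        return self

    def __next__(self):          # Python 3
        if self.n <= 0:
            raise StopIteration
        self.n -= 1
        return self.n + 1

    next = __next__              # Python 2 compatibility

for value in Countdown(3):
    print(value)   # 3, 2, 1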

Generators are just one particular case of coroutines:

  • within a method we have a list of instructions,
  • when we hit return, all subsequent calls of this method will re-execute all those instructions.
  • when we hit yield – the current state of the method (last executed instruction and values of local variables) is saved, and subsequent calls to this method continue from the last statement.

A few words about import in python:

First of all, a bit of terminology:

  • module – just a *.py file (or a C-extension – *.pyd or *.so; built-ins are usually C-extensions as well)
  • package – a set of modules under some folder (for python 2.7.* it must contain an __init__.py file)

When we write:

import X                 # bind module X to the name X in the current namespace
import X.Y.Z as xyz      # just a shortcut to use: xyz.method(args)
from . import Y          # bring the method, variable or complete module Y into the current scope
from X import *          # bring EVERYTHING from module X into the current scope, including its sub-imports

the interpreter does the following:

  1. Import searches for the module in the sys.modules cache, which contains previously imported modules.
  2. If not found – it searches among built-in modules.
  3. And as the last hope – it searches by filename in every directory mentioned in sys.path, in order of appearance (if there are different versions of the same module in two paths, the first one found will be loaded).

When the module to be loaded is finally found, it is parsed, the interpreter sets a few variables before execution – for example __name__ – and finally executes it; hence the famous check for main:

# executed when we imported this module from other code:
# import x.y.z.name_of_this_script, 
# __name__ = "x.y.z.name_of_this_script"
print("module loaded")

if __name__ == '__main__':
    # executed when we run "python ./x/y/z/name_of_this_script.py"
    print("module executed as main")

Parsing may be slow, so the interpreter first checks for pre-compiled cached versions – *.pyc files (validating the source’s last-modified time against the timestamp of the cached version).

Importing changes global variables and therefore performs some locking to ensure that nothing
gets screwed up – even if the module to be imported is already in the sys.modules cache.

Usually two types of import are distinguished:

from a.b.c.d import Z   # absolute – can be a bit verbose
from . import Z         # relative – add Z from the current package

This is just syntactic sugar for more compact code:

import X.Y.Z as xyz
xyz.some_method()

from X.Y import Z
Z.some_method()

The module will still be loaded fully – not just Z, but X, all imports from X, Y, all imports from Z, and finally Z itself (with all of its imports as well).

Circular imports

Imagine following project structure:

project
    module1
        __init__.py
        script1.py – contains line from module2.script2 import method2
    module2
        __init__.py
        script2.py - contains line from module1.script1 import method1

python -m module1.script1

When you try to run script1, python will try to resolve all imports first:

  • ok, we need to import module2.script2 – let’s load all imports for it,
  • within that script we have to import module1.script1 – let’s load all imports for it,
  • within that script we have to import module2.script2 – let’s load all imports for it, …

In most cases it is a matter of proper design, but nonetheless changing the import to

import module2.script2

will load it only once.

Another common approach to fighting import issues is to use deferred importing:

# imports
# some method

def very_important_method():
    from another_module import smth
    # code

Lambdas

f = lambda x: x.some_attribute
type(f) # type "function"

A lambda is a single expression – i.e. it has to fit into one expression and it returns that expression’s value (no statements allowed inside). Typical usage: map, filter, sorted.
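A few illustrative one-liners (the data is made up):

users = [("bob", 31), ("alice", 27), ("eve", 42)]          # illustrative data
print(sorted(users, key=lambda u: u[1]))                    # sort by age
print(list(filter(lambda u: u[1] > 30, users)))             # keep users older than 30
print(list(map(lambda u: u[0].upper(), users)))             # upper-cased names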

Memory optimisation

Python is not the most efficient language when we talk about performance or RAM consumption:

import sys

class A(object):
    pass

a = A()

sys.getsizeof(A) # 904 0_o
A.__dict__
dict_proxy({'__dict__': <attribute '__dict__' of 'A' objects>, '__module__': '__main__', '__weakref__': <attribute '__weakref__' of 'A' objects>, '__doc__': None})

sys.getsizeof(a) # 64
sys.getsizeof(a.__dict__)   # 280
# but we have the freedom to add new attributes and methods to the object's instance
a.new_int_attribute = 123

The main approach is to explicitly define the attributes that should be available on the object, using the __slots__ specifier:

class B(object):
    __slots__ = ['a']
    def __init__(self, a):
        self.a = a

sys.getsizeof(B) # 944
B.__dict__
dict_proxy({'a': <member 'a' of 'B' objects>, '__module__': '__main__', '__slots__': ['a'], '__init__': <function __init__ at 0x109a287d0>, '__doc__': None})

b = B(1)
sys.getsizeof(b)  # 56
sys.getsizeof(b.__dict__) # AttributeError: 'B' object has no attribute '__dict__'

b.new_attr = 23 # AttributeError: 'B' object has no attribute 'new_attr'

Note that apart from __dict__, another attribute will be missing from objects with an explicit definition of __slots__: __weakref__ – the object containing ALL weak references to the current object.

Additionally it is possible to use something like namedlist – to create objects that do not participate in garbage collection – a full comparison of various approaches to saving a bit of memory is discussed here. Usually for profiling it is sufficient to use a dedicated library like profilehooks.

But nothing is stopping you from diving into the byte code as well:

def a():
    print("a")
print(a.__code__)    # code object a at 0x109b2bc30
import dis
dis.dis(a)
  2           0 LOAD_CONST               1 ('a')
              3 PRINT_ITEM
              4 PRINT_NEWLINE
              5 LOAD_CONST               0 (None)
              8 RETURN_VALUE

compile("print('Hello world')", '', 'single')   # code object &amp;lt;module&amp;gt; at 0x10db313b0
exec compile("print('Hello world')", '', 'single')   # Hello world

Python is a language with automatic garbage collection. It has two key components: reference counting and a generation-based garbage collector. When del is invoked or a variable goes out of scope, nothing is actually deleted right away – first the number of references is decremented, and when it becomes zero the memory occupied by the object becomes available to the interpreter again. A programmer can disable the generational GC completely, or effectively exclude instances of a particular class from cycle collection by defining a __del__ method (in Python 2 such objects caught in reference cycles end up in gc.garbage instead of being collected):

import gc
gc.disable()   # For everything
# OR
class A():
    # ...
    def __del__(self):
        # custom cleanup details
        pass
    # ...

The gc module allows you to manually trigger garbage collection or hunt down possible leaks. It is implementation specific, but generally memory will not be returned to the OS.

Threading

The Python interpreter is single threaded itself – the famous GIL. So it may not be the perfect choice for compute-bound applications. But for I/O-bound load, threads in python are good enough, especially with gevent in python 2.7 or asyncio in 3.7. An alternative is to use the multiprocessing module. Or some more suitable language!
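A minimal sketch of an I/O-bound workload with threads, assuming Python 3’s concurrent.futures and the stdlib urllib; the URL list is illustrative:

from concurrent.futures import ThreadPoolExecutor
import urllib.request

urls = ["https://example.com"] * 5  # illustrative list of URLs

def fetch(url):
    # the GIL is released while the thread waits on the network
    with urllib.request.urlopen(url) as resp:
        return len(resp.read())

with ThreadPoolExecutor(max_workers=5) as pool:
    print(list(pool.map(fetch, urls)))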

Function definitions:

Positional and named arguments – that should be simple:

def method(a, b, c):
    print("a:{a}b:{b}c:{c}".format(a=a, b=b, c=c))

method(b=1,c=5, a="whatever")   # named: will print a:whateverb:35c:1
method(1,5,"whatever")          # positional: will print a:1b:5c:whatever

*args and **kwargs are used to accept a variable number of arguments: *args collects positional arguments, **kwargs collects named ones.

def method(*args, **kwargs):
    for _ in args:
        print(_)

    for k, v in kwargs.items():
        print("{} = {}".format(k, v))


method(1,2,3, a="a", b="b", c="c")
1
2
3
a = a
c = c
b = b

Other

A few useful built-in modules to be aware of, so as not to re-invent the wheel:

  • argparse, configparser
  • itertools – permutations, combinations, product, imap
  • collections – defaultdict, Counter, OrderedDict, namedtuple
  • re – regular expressions
  • os, sys, random

Not so standard but handy:

  • requests – to talk with the web via POST & GET
  • scapy – for playing with tcp packets
  • scrapy\beautiful soup – to crawl the web
  • numpy\pandas for memory optimised collections
  • nose – if the standard unittest is not enough
  • dash – visualisation

P.S.: Which version to use 2 vs 3:

As of now – the middle of 2019 – this is more of a historical remark; what Python 3 brings:

  • chained exceptions in stack traces
  • TypeError for comparisons between unorderable types, e.g. 1 > ‘2’
  • yield from generator_method()
  • faulthandler – dumping tracebacks on fatal signals (except SIGKILL)
  • more handy modules: ipaddress, functools.lru_cache, enum, pathlib, dataclasses (from dataclasses import dataclass)
  • unicode strings by default
  • type hints & annotations
  • iterable unpacking
  • dict items order guaranteed (3.7+)
  • speed

On interview preparation: algorithms and data structures recap

With this article I want to go a little bit further than the widespread memos “X most common data structures for interviews”. I am not only interested in information about asymptotic complexity. The real value is understanding whether a particular algorithm or data structure can be beneficial in a particular case: real life tasks are rarely given with hints like “Use directed graphs, Luke!” or with the comparison function you should implement to have your favorite sort algorithm work according to the upper bound complexity from textbooks. On this page you can find not just a theoretical high level summary but also a few facts that I consider important, practical, or am just afraid to forget.

Sorting

A few considerations worth noticing before speculating about the “best” sorting algorithm:

  • How much data are we dealing with?
    • Dozens of numbers or billions?
  • What is the nature of the data – i.e. what is the cost of comparing elements?
    • Integers\strings\complex objects? For example, comparison may involve disk seeks, so we should aim for algorithms that minimize comparisons or even stick with non-comparison based algorithms.
  • What is the cost of transferring elements?
    • Should we aim for algorithms with a minimum number of swaps/data copies?
  • What is the current state of the data? Is it partially sorted?
    • In case of partially sorted data some algorithms may hit their upper bounds, but the adaptive class of algorithms can deal with such input efficiently
  • Does the data allow random access?
    • For example: linked list vs arrays. Classical quicksort on a linked list will always be inefficient.
  • Does the data fit in memory?
    • Otherwise we need modifications of algorithms that support external sorting.
  • Are we dealing with a memory-constrained environment?
    • Can we allow auxiliary space, or do we have to stick with in-place algorithms?
  • Are duplicate values possible in the array to be sorted?
    • Do we need stable sorting – i.e. should equal elements keep their initial order from the input array?

Comparison-based algorithms compare whole values of the input array (as opposed to individual bits of those values). For any comparison-based sorting algorithm the lower bound on worst-case complexity is N log(N).

Comparison-based algorithms that are out of our league – not used in reality, but sometimes asked about during interviews:

  • Bubble sort – Best: O(N), Average: O(N^2), Worst: O(N^2), Auxiliary space: O(1).
    In-place. Stable.
  • Selection sort – Best: O(N^2), Average: O(N^2), Worst: O(N^2), Auxiliary space: O(1).
    At the i-th iteration find the index MIN of the smallest remaining entry and swap the i-th element with MIN. N exchanges.
    In-place. Non-stable (stable for lists).

Practical comparison-based algorithms:

  • Quicksort – Best: O(N log N), Average: O(N log N), Worst: O(N^2), Auxiliary space: O(log N) (probabilistic guarantee).
    Pick a pivot, partition the array to group all elements less than the pivot in the left subarray and those which are bigger to the right, then apply recursively.
    The pivot may be chosen randomly, but there are better strategies like median of medians.
    In-place. Non-stable.
  • 3-way quicksort – Best: O(K * N), Average: ?, Worst: O((2 ^ K) * N), Auxiliary space: O(log N).
    Used in case of duplicate keys: pick a pivot element and partition elements into three sets – less than, equal to or greater than the pivot. There are variants with two pivots.
    In-place. Non-stable.
  • Mergesort – Best: O(N log N), Average: O(N log N), Worst: O(N log N), Auxiliary space: O(N) (for linked lists: O(1)).
    Divide the array into two sub-arrays, sort them recursively and merge in sorted order.
    In the simplest version – until the sub-array has size 1; in more optimised versions – until it reaches a threshold size (10-20 elements), at which point another sorting algorithm can be applied (insertion sort for example).
    Easy to modify for parallel and external sorting (i.e. data that does not fit in memory).
    Fewer comparisons than quicksort, but more element transfers.
    Not in-place. Stable.
  • Heapsort – Best: O(N log N), Average: O(N log N), Worst: O(N log N), Auxiliary space: O(1).
    Construct a max-heap on top of the array (viewing it as a complete binary tree), then repeatedly remove the root until the heap is empty.
    In-place. Non-stable.
  • Insertion sort – Best: O(N), Average: O(N^2), Worst: O(N^2), Auxiliary space: O(1).
    At the i-th iteration swap the i-th element with each larger entry to its left.
    Useful for partially sorted or small arrays – less than 10 elements.
    Adaptive. In-place. Stable.
  • Shell sort – Best: O(N), Average: ?, Worst: O(N^(3/2)) to O(N * (log N)^2), Auxiliary space: O(1).
    Like insertion sort, but relies on D-order sort, where D is the gap between sorted entries; all estimations depend on the chosen gap sequence.
    Used in embedded devices due to its non-recursive nature – no need to support a deep and memory-hungry call stack.
    Adaptive. In-place. Non-stable.
  • Timsort – Best: O(N), Average: O(N log N), Worst: O(N log N), Auxiliary space: O(N).
    Based on merge sort and insertion sort: tries to find consecutive runs of ordered elements and merges them to decrease the processing needed for the final result.
    Used in Java and Python as the default sorting algorithm.
    Adaptive. Not in-place. Stable.
  • Introsort – Best: O(N), Average: O(N log N), Worst: O(N log N), Auxiliary space: O(N).
    Hybrid between quicksort & heapsort. Used in c++ as the default implementation of sorting.
    Adaptive. Not in-place. Non-stable.
  • Treesort – Best: O(N log N), Average: O(N log N), Worst: O(N log N), Auxiliary space: O(N).
    Build a binary search tree from the elements and traverse it in-order.
    Variations with a self-balancing binary search tree or a Splay tree (adaptive); estimations are based on usage of balanced trees.
    Not in-place. Stable.
  • Cubesort – Best: ?, Average: ?, Worst: O(N log N), Auxiliary space: O(N).
    Parallel sorting algorithm.
    Not in-place. Stable.

Non-comparison based sorting algorithms. For such algorithms we do not compare whole keys (values), but use individual bits of the values (for example characters for strings or digits for numbers) – this allows us to achieve linear O(N) time complexity. The price for it is the necessity to tune it for every new type.

  • Radix sort – Best: O(N * K), Average: O(N * K), Worst: O(N * K), Auxiliary space: O(N + K).
    Examines individual digits (bits) of keys; K is the max number of digits in the array’s elements.
    Applicable to integer numbers, strings and floats in special formatting.
    Worth using when N >> K and K is fixed – the famous question about sorting a million 32-bit numbers.
    MSD vs LSD – most significant vs least significant digit first.
    Not in-place. Stable.
  • Bucket sort – Best: O(N + K), Average: O(N + K), Worst: O(N^2), Auxiliary space: O(N * K).
    Partition the array into a number of buckets; each bucket is sorted either by recursive application of bucket sort or by other algorithms.
    Can be used as an external sorting algorithm. Asymptotics given for cases when K ~ N.
    Applicable when the data is evenly distributed over a range.
  • Counting sort – Best: O(N + K), Average: O(N + K), Worst: O(N + K), Auxiliary space: O(N + K).
    Count the number of keys related to each item and use this count to determine the positions of the elements.
    Asymptotics given for cases when K ~ N. Applicable when N ~ K.
    Stable.

N-th biggest\smallest element

Just variations of partial sort:

  • QuickSelect – worst case O(n^2), best case and average O(n) – like quicksort with pivot-based partitioning, but recursing into one side only (a minimal sketch follows this list)
  • Median of medians – improves the worst case of quickselect to O(n) – partition the input into groups of several (5) elements, find the median within every group, and return the median of the n/5 medians
  • Introselect – backs up nth_element in the C++ STL – starts as classical quickselect and switches to the median of medians method in case the recursion depth gets too deep while the sizes of the sub-partitions observed so far are not halving.
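A minimal sketch of QuickSelect (0-based k, random pivot; not tuned for performance):

import random

def quickselect(items, k):
    """Return the k-th smallest element (0-based); average O(n)."""
    pivot = random.choice(items)
    less = [x for x in items if x < pivot]
    equal = [x for x in items if x == pivot]
    greater = [x for x in items if x > pivot]
    if k < len(less):
        return quickselect(less, k)
    if k < len(less) + len(equal):
        return pivot
    return quickselect(greater, k - len(less) - len(equal))

print(quickselect([7, 1, 5, 3, 9], 2))  # 5 – the 3rd smallest element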

Hashing

Family of hashing functions: H(x) = ((a * x + b) mod p) mod N, where a, b < p, a != 0, and p is a prime number – used to reduce the number of collisions (how well it works depends on your key distribution).
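A tiny sketch of drawing a random member of this family (the prime and table size below are illustrative):

import random

def make_universal_hash(p, n):
    # p – a prime larger than any key, n – table size
    a = random.randrange(1, p)
    b = random.randrange(0, p)
    return lambda x: ((a * x + b) % p) % n

h = make_universal_hash(p=10**9 + 7, n=1024)
print(h(42), h(100500))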

Two main approaches: separate chaining vs open addressing (linear probing, double hashing).

  • Separate chaining – an array of lists; in the very worst case – O(N) to retrieve a record!
  • Open addressing – a circular buffer – i.e. in case of collisions try to insert the record into the next unoccupied cell, which leads to poor performance in case of a badly chosen hashing function. Load factor – how it performs when 25\50\80% of the array cells are occupied. Used for the hashtable implementation in python.

Separate chaining advantages:

  • Easy implementation of delete method
  • Performance degrades gracefully
  • Clustering is less sensitive to poorly designed hash functions

Linear probing advantages:

  • Less wasted space (pointers)
  • Better cache performance

There are various use cases & requirements for hashing that affect the characteristics of a good hash function:

  • Quickly insert\return records with a mapping of Key to Value – cheap computation of the hash and a low number of collisions
  • Find similar or close entries – need to maximise collisions when keys meet some criteria – locality sensitive hashing (for example GeoHash), with a rolling hash to speed up computations; another famous example: the Rabin-Karp string search algorithm.
  • Make it difficult to find the key corresponding to a particular hash value – a one-way function – cryptography-related hashing: fixed size output for any input, no collisions, and at the same time it is almost impossible (time and computation wise) to find an input that produces a particular hash
  • Cornerstone of more advanced data structures: bloom filter, merkle tree, distributed hash table.
  • An interesting overview of the details of hash implementations in various languages can be found here

Heaps

The root always contains the element that satisfies the heap property: min-heap or max-heap. Elements are not sorted, only partially ordered. Heaps may be constructed in O(N) using Floyd’s algorithm. Usually used as the underlying structure for priority queues. The leftist heap is an example of a pure functional data structure.
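For reference, python’s standard heapq module implements a binary min-heap on top of a plain list, including the O(N) construction mentioned above:

import heapq

data = [5, 1, 9, 3]
heapq.heapify(data)              # O(N) bottom-up construction (Floyd's algorithm)
heapq.heappush(data, 0)          # O(log N)
print(heapq.heappop(data))       # 0 – min-heap, the smallest element comes first
print(heapq.nsmallest(2, data))  # [1, 3]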

Heap implementations – time complexity of operations (Find Max, Extract Max, Increase Key, Insert, Delete, Merge):

  • Linked list (unsorted) – Find Max: O(N), Extract Max: O(N), Increase Key: O(1), Insert: O(1), Delete: O(1), Merge: O(1).
  • Array (sorted) – Find Max: O(1), Extract Max: O(1), Increase Key: O(N), Insert: O(N), Delete: O(1), Merge: O(N + M).
  • Binary heap – Find Max: O(1), Extract Max: O(log N), Increase Key: O(log N), Insert: O(log N), Delete: O(log N), Merge: O(N + M). Heapify: O(N).
    Binary heap (array representation) – an ordered complete binary tree where the parent’s key is no smaller than the children’s keys.
  • Binomial heap – Find Max: O(log N), Extract Max: O(log N), Increase Key: O(log N), Insert: O(log N), Delete: O(log N), Merge: O(log N).
    A binomial heap of order K has a root whose children are roots of binomial trees of orders K-1, K-2, …, 2, 1, 0 – in this order.
    Each binomial tree is a heap obeying the minimum-heap property: the key of a node is greater than or equal to the key of its parent.
    There can only be either one or zero binomial trees for each order, including zero order.
    Supports quick merging of two heaps.
  • Fibonacci heap – Find Max: O(1), Extract Max: O(log N), Increase Key: O(1), Insert: O(1), Delete: O(log N), Merge: O(1).
    A collection of trees, used for priority queues. The degree of a node (the number of its children) is kept low;
    every node has degree at most O(log N) and the size of a subtree rooted in a node of degree K is at least F(K) + 2, where F(K) is the K-th Fibonacci number.
    Amortised constant factors affect practical usage. High memory consumption.

Tree

Full tree – every node other than the leaves has two non-null children. Complete tree – every level is filled except possibly the last one, where all nodes lean to the left.

Tree traversal:

  • Pre-order – via dfs
  • In-order – via dfs
  • Post-order – via dfs
  • Level-order – via bfs

Search data structures

Things to consider when choosing:

  • Size of an element (read: the cost of access\copy\compare operations)
  • Volume of data: can it fit into RAM?
  • Usage patterns:
    • evenly mixed reads & writes, or a skew towards particular operations?
    • access by a single key or by a range?
    • access to a value using the whole key, or is it interesting to find all values where even parts of the key match?
For each structure below: average and worst time complexity (indexing, searching, insertion, deletion), auxiliary space, and comments.

  • Arrays – Average: indexing O(1), search O(N), insertion O(N), deletion O(N); Worst: the same; Space: O(N).
    Cache friendly, low memory overhead. Pre-allocated vs dynamic.
  • Linked list – Average: indexing O(N), search O(N), insertion O(1), deletion O(1); Worst: the same; Space: O(N).
    Size of data vs size of pointer(s). One-directional vs bi-directional vs circular. Sequential access vs random. Internal or external storage (i.e. the node contains only a reference to the actual data). Foundation of stack, queue and dequeue.
  • Skip list – Average: O(log N) for indexing, search, insertion and deletion; Worst: O(N) for each; Space: O(N log N).
    Allows fast search using an additional hierarchy of references: next, the next after next, every 4th, …; the i-th level should contain 2^i elements, but probabilistic skips are used – i.e. at the i-th level the next reference may point to the 2^i-th element or further.
    An alternative to hash tables and balanced trees – the theoretical worst-case performance can be worse, but in practice it is usually faster, depending on the nature of the data and the cost of the comparison operator.
    Foundation of various lock-free implementations of dictionaries and priority queues. Used in Redis, LevelDB and Lucene.
  • Hash table (symbol table, associative array) – Average: search, insertion and deletion O(1); Worst: O(N) for each; Space: O(N).
    Not the best in case of a small number of keys in comparison to a direct mapping of key -> array index. In practice a particular operation can be slow because of array resizing, and the cache-unfriendly nature may lead to poor performance even in comparison with the O(N) brute-force lookup of a simple array. C++: unordered_map, unordered_multimap. Python: dict. Java: HashMap.
  • Binary search tree – Average: O(log N) for indexing, search, insertion and deletion; Worst: O(N) for each; Space: O(N).
    Complete if it is perfectly balanced, except for the bottom level. In the worst case (depending on insertion order) it can be of height N. Construction is analogous to quicksort. Example application: 1D range search.
  • Cartesian tree – Average: search, insertion and deletion O(log N); Worst: O(N) for each; Space: O(N).
    It is heap-ordered and a symmetric (in-order) traversal of the tree returns the original sequence. Used for RMQ (range minimum query) kinds of tasks.
  • B-tree – Average: O(log N) for indexing, search, insertion and deletion; Worst: O(log N) for each; Space: O(N).
    Not just two children anymore => the height of the tree is lower => lookup is faster. Various modifications.
    B+-tree – actual data is stored in the leaves, intermediate nodes contain copies of the keys. A core data structure for relational databases, as it allows disk seeks to be minimised – implemented in Berkeley DB.
  • 2-3 tree – all operations O(C * log N), both average and worst case.
    Maintains symmetric order and perfect balance.
  • Red-black tree – O(log N) for indexing, search, insertion and deletion, both average and worst case; Space: O(N).
    Balance is preserved by painting nodes in red\black colours to satisfy some predefined properties; re-arranging and re-colouring can be performed efficiently.
    For example: no node has 2 red links connected to it; red links lean left; every path from the root to a null link has the same number of black links – perfect black balance.
    C++: set and map. Java: TreeMap and TreeSet (and in some situations HashMap falls back on them as well!)
  • Splay tree – O(log N) amortised for indexing, search, insertion and deletion; Space: O(N).
    A self-adjusting binary search tree with the additional property that recently accessed elements are quick to access again.
    Applications: caches, garbage collectors, memory allocators.
  • AVL tree – O(log N) for indexing, search, insertion and deletion, both average and worst case; Space: O(N).
    Balance in an AVL tree – the heights of the two child subtrees of any node differ by at most one. AVL trees are better for lookup-intensive applications than red-black trees.
  • Kd tree – Average: search, insertion and deletion O(log N); Worst: O(N) for each; Space: O(N); construction O(N log N).
    A space-partitioning binary tree. Used for range search and nearest neighbour search. Not suitable for high-dimensional spaces – better to use when N >> 2^k. Octree for 3D space.

More advanced structures worth being aware of:

  • Trie (Prefix Tree) and its variations, mainly Radix Tree for lookup intensive applications. Worst case – O(M), where M is the length of string to search. DAFSA – can be a great alternative for cases with limited memory.
  • Generalized Suffix Tree – search efficiently with mismatches, applications for various string related tasks: longest common substrings, longest repeated substrings, longest palindromic substring,
  • Fenwick Tree (Binary indexed tree) – common use case: modify an element of an array and return the result of some invertible operation for a range of elements (for example a partial sum, but not max). Simpler and faster in comparison with segment trees. O(N) build time.
  • Segment Tree – classical approach for RMQ kind of tasks, operation not necessary to be inverse. Double memory requirement in comparison to fenwick tree. Constant factor affect speed in practice, especially in cases of multiple dimensions.
  • Implicit Treap – a sequence of pairs, heap-ordered by the 1st entry, BST-ordered by the 2nd. If we use the index of the element in the array as the first entry, then we can
    • insert\remove element at any position
    • move subset of elements in any position
    • computation of some function for elements within some interval [i, j] – sum, min – including range updates using lazy propagations
    • foundation for Rope implementation – to have similar asymptotic for very long strings
    • and all of the above in O(Log N) – with huge constant factor though
  • vEB Tree – ordered dictionary or associative array much faster (exponentially) than BST, memory overhead
  • Heavy-Light Decomposition: split tree on multiple not-crossing paths. Edge from A to B is considered heavy if number of vertexes reachable from B is at least half as big as number of vertexes reachable from A. Heavy-Light Decomposition – union of light vertexes from bottom of tree to the root. Such decomposition allows us to serve queries like “Get some metric on the path between A and B” – for example sum or max. Not needed in case we do not have update queries. Usually used with segment tree and LCA.

Union Find (Disjoint set)

Approaches: colouring; trees on top of arrays using ranks\randomisation\path compression, merging either randomly or using the ranks of the trees. Additional information can be stored with the leaders of a set.

Applications: offline LCA, offline RMQ, dynamic connectivity, variable dependencies during program analysis, variations of the percolation problem – connections in social networks, fluid or electricity flow.
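Not from the original text – a minimal array-based sketch with path compression and union by rank:

def find(parent, x):
    # path compression (halving) while walking to the root
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(parent, rank, a, b):
    ra, rb = find(parent, a), find(parent, b)
    if ra == rb:
        return
    if rank[ra] < rank[rb]:
        ra, rb = rb, ra
    parent[rb] = ra                 # attach the shallower tree under the deeper one
    if rank[ra] == rank[rb]:
        rank[ra] += 1

n = 10
parent = list(range(n))
rank = [0] * n
union(parent, rank, 1, 2)
union(parent, rank, 2, 5)
print(find(parent, 1) == find(parent, 5))  # True – 1 and 5 are connected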

Graph

Representation: adjacency matrix vs adjacency list. Generally the adjacency list is preferable for sparse graphs, but there is a variation of the matrix representation based on an associative array where a pair of vertexes is used as the key.

Possible variations:

  • Complete
  • Directed
  • Weighted
  • Planar
  • Bipartite 

Cut – a set of vertexes of the graph which, if removed, makes the graph disconnected. The size of the min cut – connectivity or K-connectivity.

Graph related tasks

  • Find cycles in a graph – union find (complexity – inverse Ackermann function of the number of vertexes) or DFS – proportional to the number of vertexes
  • Determine whether a graph is bipartite – dfs
  • Eulerian path – a route visiting every edge exactly once (exists if the graph is connected and zero or exactly two vertexes have odd degree) – the search for such a path can be done in linear time O(|E|)
  • Hamiltonian path – a route including every vertex exactly once (as in the Travelling salesman problem)
  • Comparison of graphs – graph isomorphism (its exact complexity class is still an open question)
  • Graph representation such that there are no edge crossings, except in vertexes (upward planar graph) – there are some special cases where such checks can be done in O(N), but generally it is an NP-hard problem.
  • Minimum Spanning Tree (MST) – applications are network design (for example, to avoid cycles of packets in Ethernet) or clustering with a custom distance function.
    • Prim – add the lowest-weight edge connecting the current tree to a new vertex
    • Kruskal – gradually merge trees from all vertexes, using the lowest-weight edge for each connection

 

For directed graphs:

  • Topological sorting – tasks with ordering limitations, schedule-related tasks (DFS) – create a schedule in case we do not have a cycle – re-draw the graph so that all paths follow from bottom to top – reverse dfs postorder (additional stack) = result
  • Path finding – Dijkstra; if weights are negative – Bellman-Ford or its optimised version, the Shortest Path Faster Algorithm (SPFA)
  • Shortest paths from multiple sources – for a dense graph with negative weights – Floyd-Warshall; for a sparse one – Johnson’s algorithm or Dijkstra
  • Search for cycles – DFS – detection of deadlocks in concurrent systems, or topological sort
  • Maximum flow – Ford-Fulkerson algorithm, the bipartite matching problem, baseball elimination
  • Dynamic connectivity
  • 2-SAT – satisfy a system of constraints over boolean variables – scheduling, medicine testing. Can be represented as a directed implication graph. Checking whether a variable and its negation end up in the same strongly connected component allows us to answer whether all conditions can be satisfied.

Substring search

  • Knuth-Morris-Pratt (KMP) – based on a DFA (deterministic finite state automaton) that has to be pre-computed for the pattern. O(R + N), where R is the length of the pattern and N is the length of the text. O(R * alphabet size) space.
  • Boyer-Moore – scan from the end of the pattern and in case of a mismatch skip up to R characters – the worst case is like brute force, O(R * N), but on average it is considered the most efficient algorithm – O(N / R). O(R) space.
  • Rabin-Karp – compute a sliding hash for every substring of length R; for hashing one can use modular hashing with some prime number. For efficiency’s sake Horner’s method can be utilised to re-use the computation from previous steps – O(R). A hash match can still be a collision, so there are two variations of handling hash matches. 2D search, or search for multiple patterns using a pre-computed table of hashes. O(1) space.
  • Aho-Corasick – in case we are dealing with multiple patterns and different texts in which we search for them. Build a trie of all the words and extend it to a special finite state machine without backtracking – using suffix links for failed matches. Used in grep – the -F flag.
  • Z-function: given a string A, consider the array of all possible suffixes: A[len(A)], A[len(A) – 1], …, A[1]. Compare symbols from the start of the string with every suffix – the number of matched symbols gives the corresponding component of the Z-function. For substring search: concatenate the pattern, a unique delimiter and the original string, and check whether any value of the Z-function equals the length of the pattern (a minimal sketch follows this list).
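A small illustrative sketch of the Z-function and its use for substring search (the delimiter choice is arbitrary):

def z_function(s):
    # z[i] – length of the longest common prefix of s and s[i:]
    n = len(s)
    z = [0] * n
    l, r = 0, 0
    for i in range(1, n):
        if i < r:
            z[i] = min(r - i, z[i - l])
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > r:
            l, r = i, i + z[i]
    return z

def find_all(pattern, text):
    # positions in text where pattern occurs
    z = z_function(pattern + "\x00" + text)
    m = len(pattern)
    return [i - m - 1 for i, v in enumerate(z) if v == m]

print(find_all("aba", "abacababa"))  # [0, 4, 6]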

Playgrounds for preparations

Programmer interviews in 2019 – what to prepare for

Autumn and winter 2018-2019 turned out to be unusually busy for me: roughly from October to February I took part in a good hundred interviews for Senior/Principal/Lead level positions. Four of them were with companies from Russia (I was honestly curious how things are over here nowadays), the rest were from Europe and the UK. For various reasons I didn't want to go further afield, although I did talk to a couple of companies from sunny Australia. In this note I want to share the observations I accumulated, for those who are thinking about changing jobs and aiming primarily at foreign companies. First some dry statistics, then personal conclusions.

First contact: in about 70% of cases company representatives or recruiters found me through my linkedin profile. The rest I contacted myself – mostly by responding to vacancies that looked interesting to me on stackoverflow jobs\hacker news who is hiring\angel.co.

Interview stages

Getting acquainted

An optional chat with the company's HR or a recruiter. The goal is to:

  • check that you can string a couple of words together in English (it is useful to prepare a short monologue on the topic "the last 5 years of my career")
  • clarify the scope of responsibilities for the position – whether you only have to write code, help others write it, or, before all of that, first discuss with clients what exactly is being built
  • discuss the desired location – there were cases when the stated country suddenly changed to another one, for example for visa reasons or because the most hardcore developers historically settled there
  • make sure you don't have any particular quirks when it comes to communication.
  • in a number of cases – figure out the compensation level, so that there is no mutual disappointment at the end over pointlessly wasted time. This is convenient when the company has a strictly defined budget it cannot exceed, and anything lower is not particularly interesting for you either. After the first time I was handed an offer about 40% below what I was looking for, I always raised this question with less well-known companies before any technical interviews.
  • discuss relocation – what the visa situation is (especially relevant for the US right now), whether accommodation is paid for the initial period – and how long that initial period is, medical insurance for family members, whether a shipping container is provided in case you want to move with your favourite sofa and beer mug collection, and whether they help with the accompanying paperwork and the search for housing.

Tech screening – first blood

Top-tier companies – the Googles, Facebooks, Amazons and the like – do not discuss filthy lucre at the first stage, but they do screen you meticulously on several topics (depending on the position; for a programmer it is always algorithms plus the domain area – the specifics of your favourite language, maths, machine learning, networking, Linux internals and so on) – to make sure it is worth unleashing their engineers on you. There are also small ambitious companies that try to arrange an exam on knowledge of acronyms from your domain.

The vast majority, though, prefer to rely on automated tests at https://hackerrank.com, https://www.codility.com/. Usually it is 1-3 problems that you have to solve (write the code, run it and make sure all tests pass) with a relentlessly ticking timer. The problems range from purely algorithmic ones (easy\medium from https://leetcode.com/) to "build us a simplest blockchain" or a photo cataloguer. The correctness of the solution is checked by pre-prepared tests on the platform itself. And how will your code behave if a long arrives instead of an int? What if a single negative number sneaks into a sequence of positive ones? The tests may all be visible to you (i.e. you run them, the system shows which test failed, you download the data to check), partially visible (i.e. you see that 99 out of 100 tests failed, but why and what is in them is left to your imagination), or not shown at all (you think everything looks nice – press the "I'm done" button). Everyone's timers are different – Amazon gave two hours for two long, tedious problems with regexes, Yelp 20 minutes for one. There may also be multiple-choice questions, as at Optiver, or even a practical session on a live Linux VM, as at Booking.

Some companies offer you a form with questions – along the lines of "how can we use technology X", or "a program in language Y runs slowly – describe how you would speed it up". This kills two birds with one stone – it checks basic knowledge of the fundamentals (the questions are usually a notch deeper than superficial googling allows) and your talent for conveying your thoughts briefly and to the point in business English.

Online assessment

If you managed to cope with the jitters about racing against the clock, the second stage is either online coding, or a so-called take-home assignment and its subsequent discussion.

In the first case several sessions await you, where you will be asked to write something algorithmic and discuss variations of the solution, with additional constraints thrown in on the fly to assess how extensible your code is. Most companies do not fuss over a specific language and are primarily interested in a correct algorithm, on the theory that if the person is good, the company's stack can always be learned. For example problems – in addition to the already mentioned https://hackerrank.com and https://leetcode.com, you can also look at https://careercup.com. The difficulty and number of problems always depends on two factors: the position and your seniority – you may well be given one harder problem instead of two or three easy ones.

The take-home assignment is already something more or less close to reality – as a rule it uses the very stack you will be expected to work hard with. Auto-tests, auto-generated documentation and all the best practices applicable to the language, build system and documentation – the ones that exist so cosily in developers' wildest dreams – are welcomed here. The code is usually placed in a private repository, so that your competitors don't repeat your mistakes but make their own, and so that employers don't rack their brains over yet another dilemma: how to test the candidate in near-combat conditions without forcing them to burn more than two evenings on it. Then a date is set and the results are discussed – whether you solved the right problem at all (I had a case where, in a hurry, I misinterpreted the statement), whether you actually solved it (again, I had a rather annoying experience where I tested superficially, not on all data sets, and missed a fat fatal bug), and how ready the code is to be deployed on a production server – many requests, a lot of data, low data quality – while in parallel details of the language\framework\style are touched upon.

Onsite interview aka face to face

In case of success, the last frontier remains. Simpler companies limit this stage to a conversation with your future direct manager and, again, HR – behavioral interview, cultural fit – a talent shortage still doesn't mean anyone wants to trample on the sacred "No asshole!" rule. Some example questions:

  • Describe a situation where you screwed up – what lessons did you learn from it
  • What do you think your previous boss\colleagues\subordinates would name as your strengths and weaknesses
  • What will you do if you feel that you have a tense relationship with someone on the team
  • What will you do if you disagree with some technical decision? What if it has already been approved?

Those whose finances allow it may invite you to the office, so that you can chat on unrelated topics with representatives of different teams and see with your own eyes how an ordinary working day goes – how people interact with each other. This can introduce some delay due to the need to obtain a visa – especially if you are not from Moscow.

Hard mode: on

In addition to all of the above: if the company is big, and its name makes recruiters stand to attention and lets you add an ex-SomeBigName prefix to your title – that is where the most interesting part begins. You will spend a whole day in the company's office. You will be talking to fellow techies – representatives of several adjacent departments. Several sessions of solving already familiar algorithmic problems – to confirm the level you demonstrated earlier, but now surrounded by bored interviewers, on a whiteboard, a sheet of paper or a chromebook. A question-and-answer session (with you answering) on topics specific to the position. Specialised interactive sections (here you have to constantly ask leading questions to pin down the constraints of the task) – design the architecture of some application, troubleshoot a bug in production. It will be an intense, but not the easiest, day of your career.

After that it gets less interesting – salary negotiation, discussion of relocation terms and endless paperwork.

Amusing bits

Because of mixed-up time zones I had to answer questions over the phone while jogging with my dog, between rounds in the sauna, and from my car in the first parking lot I could find.

In the UAE everyone applying for a government job is, among other things, tested via the TalentQ platform – my first attempt at passing it made me deeply respect every civil servant of that country. A similar test – the Criteria Cognitive Aptitude Test (CCAT) – is requested, for example, at Crossover before any kind of technical tests. The questions there are split into three subgroups: logical – find the odd squiggle in a list, insert the missing petroglyph into a sequence; semi-mathematical – based on a table with a pile of numbers, compute the arithmetic mean or the percentage difference of something versus something else; and reading comprehension – after a couple of paragraphs pick the main idea or someone's point of view, or, in simpler cases, find an antonym or synonym for some tricky word.

One of the take-home assignments required pulling some public data and, despite the triviality of the task, the server stubbornly spat out timeouts. I imagined some cunning anti-crawler protection and enthusiastically started randomising the user agent, deployed the service locally (luckily it was open-sourced), checked CORS and cookies. The apotheosis was analysing a tcp dump in wireshark and a belated traceroute to the site. It turned out the site was hosted on an IP from banned ranges, and all my browsers had vpn plugins installed.

Sometimes a CV review ended with a rejection, and a couple of weeks later a recruiter from the very same company invited me to have a chat.

As of the publication date, the longest pause between my application and a response is three months. The winners in this category requested a whole set of documents (ASAP!11) just to schedule an interview.

Positioning myself as an architect sometimes misled recruiters, and I was offered positions related to yet another construction project of the century in some Persian Gulf countries.

Some conclusions and observations

https://www.timeanddate.com/worldclock/converter.php is your friend and helper when the biggest headache is figuring out what time it is for whom.

The number of companies offering remote work at rates independent of your location is growing. The tasks are finally starting to move beyond "the in-house folks don't want to deal with the dreary legacy" towards genuinely cutting edge work – designing data pipelines for financial platforms, SRE\DevOps for seriously high-load systems, ML for price prediction and recommendations in various industries.

Planning timelines – for interviews and for considering offers – lets you set tangible deadlines both for yourself and for the companies. It keeps everyone disciplined.

It is better to arrange the interview schedule so that you first warm up with companies you are less interested in, and only then, at peak form, move on to the dream companies (but before you have burned out from the endless "given two integer arrays …").

There will be a lot of calls and all sorts of fuss from recruiters – they will disappear, suddenly call at the crack of dawn or late at night, and you may well mix up their names and the companies they represent. Interviewing is also work, and here you should help them however you can, understand and forgive. A good recruiter can help you negotiate the package offered by the employer and squeeze out comfortable relocation terms, or offer you something interesting a year later when you have completely forgotten about them.

Russian companies (of those I talked to) are much more fixated on a specific stack, with the possible exception of Yandex. The level of the questions is quite comparable to the worldwide one. Taking taxes into account, and even without factoring in the cost of living, in Moscow and St. Petersburg you can earn more than in Europe, even without managerial stripes.

Strangely enough, you have to prepare. And there are no miracles here – take Cormen, watch Sedgewick's lectures, and write code in vim (well, a plain text editor) in the language you plan to interview in. For philosophical conversations, Architecture of Open Source Applications goes down well; for high load – DDIA; for SRE – Site Reliability Engineering; and my favourite book with the turtle on the cover – Robachevsky's "Операционная система UNIX".

P.S. И просто побрюзжать

Современные IDE отупляют разработчиков – иначе я не могу объяснить причину появления сонма фундаментальных серьезных руководств как писать программы в текстовых редакторах (без авто-комплита, подсветки, подсказок и возможности скомпилировать или запустить программу чтобы высветился номер строчки с ошибкой).

Я еще помню времена, когда вопросы частенько задавались заведомо не правильные, на знание стандартов C++, undefined behaviour, depend on compiler implementation. Сейчас, к моему удивлению, ряд вопросов служат тому, чтобы выяснить что вы будете делать если нет интернета или стековерфлоу лежит, а с кодом надо что-то решать.

Sometimes formalism wins – when the person asking a general question wants to hear their favourite answer, or tests your memory with terminology questions (many people would be quite surprised to learn that they use some of the well-known design patterns, or certain normal forms in their database). Memorable questions of this kind: the CAP theorem, what each letter in SOLID stands for, or listing those very twelve factors.

Whether the ability to solve the rabbits problem, or to reduce the Penguin-Avia problem to finding a minimum spanning tree within a half-hour interview, correlates with how the candidate will actually work – I do not know. Whether knowing about the existence of the implicit treap will help you on the job – depends heavily on what you will be working on. Will algorithmic preparation expand your mind – undoubtedly!

Interviews with sane companies (the ones that bother to read your CV and give feedback without being prodded) are a great way to keep yourself in shape. 🙂

[docker][cassandra] Reaching mixed load – 750,000 op\sec

The cart goes nowhere because the swan wants to fly in the air, the pike wants to swim underwater and the crawfish wants to crawl backward.

Cassandra performance tuning – a challenge

Cassandra is one of the workhorses of the modern high-load landscape – a nosql database that can be plugged into Spark for distributed computing, into Titan for playing with graph data representations, and even into Elastic as a search backend.
And if you really care about raw write performance – this is the de-facto choice in the world of open source solutions: production proven, with a big community that has already generated plenty of outdated and controversial answers on SO & mailing lists.

 

No single point of failure, scalability, high availability, retention periods, … – but those marketing claims hide a few principal caveats. Actually, Cassandra has only a single drawback(*) – it will not reach its limits with default settings on your hardware. A lonely, single-node configuration is not a use case for Cassandra; it shines in a multi-node, clustered setup.

If you really want to see full utilization of endless cores and a crazy amount of RAM, you have to use some virtualisation technology to manage the hardware resources.

Let me start with some conclusions and recommendations, based on two months of extensive testing and on observing trouble tickets after migrating this approach to production. With those considerations in mind I managed to configure it in such a way that it tolerated 750k mixed operations per second. The load was generated for more than 8 hours to check pressure tolerance and emulate peaks. It was a mixed execution of async inserts without future processing, synced inserts, and read requests.

Frankly speaking, I am sure it is still far from its limit.

Bear in mind that I am talking about Cassandra 2.1.*.

About disks

  1. Use SSD disks as mapped volumes to the docker container. Single container = single dedicated disk.
    It is possible to use multiple disks per container, but that leads to a 5-15% slowdown.
  2. If you use an SSD you can map all Cassandra directories to it (saved_cache, data, commit_logs) and adjust cassandra.yaml with higher throughput values, in particular compaction_throughput_mb_per_sec and trickle_fsync (see the sketch after this list).
  3. It really depends on data distribution and your data model, but be ready for disk utilization to vary from one node to another by up to 20%.
  4. Docker should be configured NOT to use the host’s root partition. Don’t be mean: allocate a separate drive for logs and choose a proper storage driver – devicemapper in direct-lvm mode.
  5. In practice, the cluster starts struggling when any of the nodes runs out of space. Surprisingly, in my experiments it stayed stable even with 3% free, but in real life it is better to configure your monitoring system to alert at 15-20%.
  6. Choose compaction and compression strategies wisely when you design your DB.
  7. Be careful with column naming – it will be stored with every goddamn row!
  8. Do sizing when you think about the number of nodes (and disks).
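To make points 1 and 2 concrete, here is a minimal sketch of what such a launch can look like; the image name, host path and exact yaml values are illustrative placeholders, not tuned recommendations:

# one dedicated SSD (/mnt/ssd1 here) mapped into a single container (image name is a placeholder)
docker run -d --name cassandra-node1 \
  -v /mnt/ssd1/cassandra:/var/lib/cassandra \
  my-cassandra-2.1-image

# cassandra.yaml knobs worth raising once everything sits on SSD (starting points only):
#   compaction_throughput_mb_per_sec: 128
#   trickle_fsync: true
#   trickle_fsync_interval_in_kb: 10240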

About CPUs:

  1. More CPUs per node is NOT always good. Stick with 8 cores per node.
    I have experimented with a single fat supernode per physical server – 48 cores, 4×12, 6×8.
    Six nodes with 8 CPU cores each outperformed all other layouts in 6 kinds of stress load scenarios.
  2. If you play with the core count you have to adjust a few settings in cassandra.yaml to reflect that number: concurrent_compactors, concurrent_reads, concurrent_writes (see the sketch after this list).
  3. Cassandra in most cases ends up being CPU-bound. Don’t forget to leave 8-16 cores for the host system, and allocate CPUs exclusively for containers using --cpuset-cpus.
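As an illustration of points 2 and 3, a sketch of pinning one node to 8 dedicated cores together with the matching cassandra.yaml values; the core range and numbers are examples to adapt to your host:

# pin the container to cores 8-15, keeping the first cores for the host system
docker run -d --name cassandra-node1 \
  --cpuset-cpus="8-15" \
  my-cassandra-2.1-image

# cassandra.yaml settings that should follow the core count (example values for 8 cores):
#   concurrent_compactors: 8
#   concurrent_reads: 32
#   concurrent_writes: 32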

About RAM:

  1. cassandra-env.sh has a built-in calculation of free memory, which adjusts the JVM settings by parsing the output of the free command. Of course that does not work for a docker-based setup. Bear this in mind and tweak your startup scripts to substitute proper values there (one option is sketched after this list).
  2. Disable swap within docker using --memory-swappiness=0.
  3. Effectiveness of memory usage depends on the CPU count, on how effectively multithreaded compaction is implemented in Cassandra, and on the reader/writer/compactor settings in your cassandra.yaml – i.e. you can have hundreds of gigabytes of RAM and still end up with OOM. But even with 8 GB of RAM per node you can already see benefits. More RAM means more memtables, a bigger key cache and more effective OS-based file caching. I would recommend 24 GB of RAM per node.
  4. Disable huge pages on the host system, or at least tune your JVM settings:
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
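One way around the auto-calculation from point 1 – assuming your entrypoint simply execs bin/cassandra so that environment variables reach cassandra-env.sh – is to set the heap explicitly; when both MAX_HEAP_SIZE and HEAP_NEWSIZE are set, cassandra-env.sh skips its own sizing. The values and image name below are placeholders:

# set the heap explicitly so cassandra-env.sh does not size it from the host's total RAM;
# --memory-swappiness=0 covers point 2
docker run -d --name cassandra-node1 \
  --memory-swappiness=0 \
  -e MAX_HEAP_SIZE=8G \
  -e HEAP_NEWSIZE=2G \
  my-cassandra-2.1-image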

About network

  1. It is mandatory to use the network stack of the host OS, via the docker flag --net=host (a full launch example follows this list).
  2. Most likely the network will not be the bottleneck for your load, so you can stick with virtual interfaces on top of a single real one.
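Putting the previous pieces together, a sketch of a full node launch; every name, path and number is a placeholder to be adapted to your hardware:

# host network stack, dedicated cores, no swap, dedicated SSD volume
docker run -d --name cassandra-node1 \
  --net=host \
  --cpuset-cpus="8-15" \
  --memory-swappiness=0 \
  -v /mnt/ssd1/cassandra:/var/lib/cassandra \
  my-cassandra-2.1-image

With --net=host the container shares the host’s interfaces, so listen_address / rpc_address in cassandra.yaml have to point at the host’s IPs.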

Testing:

  • 3 physical servers: each has 72 cores and 400 GB of RAM
  • Cassandra 2.1.15
  • Docker: 1.10
  • Host OS: Centos 7.5
  • Guest OS: Centos 7.5
  • Java 8 from Oracle, with JNA

Cassandra 3.* is a completely different story – in my opinion mainly because of the storage engine change, but the full list of differences is huge.

DB overview:

  • A dozen keyspaces, each with up to 20(?) tables.
  • A few indexes – better yet, just do not use secondary indexes, design your schema properly instead.
  • Replication factor = 3, gossip through file (a possible snitch config is sketched after this list).
  • Each physical server represents a dedicated rack within a single datacenter.
  • Row cache was disabled in cassandra.yaml, i.e. the first priority was to focus on a write-oriented workload.
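My reading of "gossip through file" above is one of the property-file snitches; if that matches your setup, the "one physical server = one rack" layout boils down to something like this on each node (the dc and rack names are just examples):

cassandra.yaml:
endpoint_snitch: GossipingPropertyFileSnitch

cassandra-rackdc.properties on a node sitting on physical server #2:
dc=DC1
rack=RACK2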

Tools:

  1. datastax stress tool with an artificial table – very interesting, but useless: using your own schema is what matters.
  2. datastax stress tool + your own table definition – nice, gives hints about production performance. But you are still testing a single table, which is usually not the case in real life (see the example after this list).
  3. A self-written in-house stress tool that generates data according to our data model in a randomized fashion, plus a set of dedicated servers for the ddos, with the ability to switch between async inserts (just do not use batches) with and without acknowledgment.
    Once again: no batch inserts, as they should not be used in production.
  4. Probably you can adapt the Yahoo! Cloud Serving Benchmark. I haven’t played with it.
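For option 2, the stock tool can be pointed at your own schema through a user profile. A hedged example of the invocation – the profile file and the read1 query name are hypothetical and have to be defined in that yaml:

cassandra-stress user profile=./stress-profile.yaml "ops(insert=3,read1=1)" n=1000000 -rate threads=100 -node 10.0.0.1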

 

That’s it folks, all the crap below is just my working notes and bookmarks.

How to get C++11 on CentOS 7 for stress tool compilation:

Install a recent version of the compiler on CentOS 7: devtoolset-4
To update gcc 4.8 on CentOS 6: https://gist.github.com/stephenturner/e3bc5cfacc2dc67eca8b

scl enable devtoolset-2 bash

RAM & swap:

How to clean the OS buffer cache:

echo 3 > /proc/sys/vm/drop_caches
Check the system-wide swappiness setting:
more /proc/sys/vm/swappiness
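And to actually lower it – the exact value is a matter of taste, 1 just keeps swapping to a minimum without disabling it outright:

sysctl -w vm.swappiness=1
echo "vm.swappiness=1" >> /etc/sysctl.conf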

Docker related:

If you are not brave enough to play with DC/OS or OpenStack, you may find docker-compose useful for manipulating a homogeneous set of containers.

Installation:

sudo rpm -iUvh http://dl.fedoraproject.org/pub/epel/7/x86_64/e/epel-release-7-5.noarch.rpm
rpm -qa | grep docker

If you fucked up with partition settings:

wipefs -a /dev/sda1

Docker and mapped volumes: http://container-solutions.com/understanding-volumes-docker/

If docker info | grep loopback shows you anything – you have already screwed up the storage driver configuration.

How to check the journal to see what is happening:

journalctl -u docker.service --since "2017-01-01 00:00:00"

Full flags described here.

Useful commands to inspect docker images and volumes:

docker inspect
lvdisplay
lsblk
dmsetup info /dev/dm-13

Cassandra:

How to check heap memory consumption per node:

nodetool cfstats | grep 'off heap memory used' | awk 'NR > 3 {sum += $NF } END { print sum }'

How to check which neighbours we can see from a node:

nodetool -h ring

How to find processes that use swap:

for file in /proc/*/status ; do awk '/VmSwap|Name/{printf $2 " " $3}END{ print ""}' $file; done | awk '$2 {print $1 FS $2}'

Check how much disk space is used, from Cassandra’s perspective:

nodetool cfstats | grep 'Space used (total)' | awk '{s+=$NF} END{print s}'

Determine disk usage from the OS point of view:

du -ch /var/lib/cassandra/data/

Cassandra health check:

 ssh nodetool status

Network:

How to open ports on CentOS 7:

firewall-cmd --zone=public --add-port 9160/tcp --permanent
firewall-cmd --zone=public --add-port 9042/tcp --permanent
firewall-cmd --zone=public --add-port 7200/tcp --permanent
firewall-cmd --zone=public --add-port 7000/tcp --permanent

Open ports for spark, master:

firewall-cmd --zone=public --add-port 7077/tcp --permanent
firewall-cmd --zone=public --add-port 8081/tcp --permanent

Apply the changes to iptables:

firewall-cmd --reload

or do this – useful in case NetworkManager misbehaves:

systemctl restart network.service
systemctl status network.service

And as a bonus –

How to migrate data from an old cluster to a brand new one

  1. sstableloader
  2. cassandra snapshots
  3. For a tiny dataset, to get a cql file with inserts – cassandradump

The first two approaches represent the standard ways of data migration.
The limitation of the first one is speed and the necessity to stop the old node (see the example right after).
The limitation of the second one is the necessity to manually deal with the token ring on a per-node basis.
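For reference, sstableloader is pointed at a table’s data folder and at one or more live nodes of the target cluster; the path below mirrors the scp example further down, and the IPs are placeholders:

# on the old node: make sure recent writes are on disk as sstables
nodetool flush your-keyspace your-table
# stream that table's sstables into the new cluster (-d takes a comma-separated list of target nodes)
sstableloader -d 10.0.1.1,10.0.1.2 /old/cluster/cassandra-folder/data/your-keyspace/your-table-e31522b0e2d511e6967a67ec03b4d2b5/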

If life was really cruel to you, you can play with data folders per node.

NOTE: if you can replicate the exact same setup – in terms of assigned IPs – it is enough to just copy cassandra.yaml from the old nodes to the new ones and use exactly the same mapped folders within docker as on the old cluster.

If not – you can still do it by copying the data folders following the steps below, but it is better to just use sstableloader.

  1. In order to do it you have to run the following command on every node, to drain the node from the cluster and flush all data to the filesystem:
nodetool drain

NOTE: this is an unofficial, not recommended way to deal with data migration.
NOTE 1: it requires you to have a similar number of nodes in both clusters.
NOTE 2: there is no need for the same datacenter/rack config or IP addresses.

2. Deploy the docker-based setup according to the HW configuration. The total number of nodes should be equal to the total number of nodes in the old cluster. On the new cluster, deploy exactly the same schema that was deployed on the old cluster.

3. Stop the new cluster.
Within every node’s data folder of the OLD cluster you will have the following folders:
system
system_traces

NOTE: do not touch system* tables.

4. Under the folder /your/cassandra/data-folder/your-keyspace
you should have a set of folders corresponding to the tables of that keyspace, under which the data is stored.

5. You have to copy the content of these folders (*.db, *.sha1, *.txt) for every node from the OLD cluster to the corresponding folder of the NEW cluster node. The table UUID in the folder name WILL be different.
I.e. from old cluster, node 1, to new cluster, node 2:
scp /old/cluster/cassandra-folder/data/your-keyspace/your-table-e31522b0e2d511e6967a67ec03b4d2b5/*.* user@ip:/new/cluster/cassandra-folder/data/your-keyspace/your-table-c56f4dd0e61011e6af481f6740589611/

6. The migrated node of the OLD cluster must be stopped, OR you have to use `nodetool drain` on the node being processed, so that all data ends up in sstables ~ the data folders.

Performance monitoring:

  • general system overview: atop or htop
  • Be sure that you understand memory reporting.
  • JMX based monitoring: jconsole
  • Jconsole connection string: service:jmx:rmi:///jndi/rmi://:/jmxrmi
  • dstat – network & disk io
  • strace – shows every system call; slows things down; can attach to a running process
  • netstat -tunapl, lsof -i -P – network/ports per process
  • docker stats – reports cpu/mem/io for a container
  • perf + perf-map-agent for java monitoring:
    for example cache misses:
perf stat -e L1-dcache-load-misses


* Cassandra has only a single drawback – it has no idea about your data model, whether you configured your schema correctly, or what your load patterns are. That is why you have to dive into a wonderland of controversial recommendations in blog posts like this one, instead of thoroughly reading the documentation first.

 

P. S. do not believe anyone, measure!

 

 

How to stop being a junior – 7 hints of programmer productivity

0) If you don’t know something – don’t be afraid to ask.

Especially if you have already checked the first page of google search results and are pretty sure that nobody has asked that question on stackoverflow.
Reinventing the wheel and breaking the stalemate can be a good exercise for your home projects, but in a production environment it is better to check the idea with your mentor before diving into implementation details.

1) Don’t be afraid to show that you have no idea.

real programming – do not leave open questions, ever

Just add a note for yourself to figure it out later.
If you do not understand how something works – obviously this is a gap in your knowledge.
And you can’t just skip it – you are a software engineer – it is your obligation to be aware of what is happening under the hood.

And yes, sometimes you have to do it in your own time.

Professional growth is very simple:

  • step 1 – find something that you don’t know,
  • step 2 – start investigating, discover a bunch of additional mysteries, and repeat step 1

2) Split the problem into several simple questions and address them one by one.

Don’t try to troubleshoot a huge multi-component system with real data to reproduce the problem.
Forget for a minute about the overall complexity of your enormous project and analyze suspicious functions one by one, independently of each other.
Use an online fiddle for your language to check obscure parts of the language with fake data returned by a mock api.

3) Stop wasting your gorgeous time.

If you find yourself googling the same commands over and over again – start making notes.
Create a txt file with the most useful commands and update it whenever you catch yourself googling again.
Search for a ready-to-use cheat sheet, or even put one on your desktop wallpaper.

4) Do not stop investing time in reading proper books. Ever.

Pressure, deadlines, laziness.
This is not for the company, your boss, or for bragging about.
This is your main and primary investment in the future – your knowledge is your treasure.
30 minutes of reading every day is not too much,
but in the long run you will notice that you have become more capable and can tackle problems that previously seemed hard to solve.

from junior to senior – bite by bite

5) Digest advice, articles and opinions properly.

There will always be people who are picky about your code.
There will always be deadlines where you have to make compromises.
There will always be people who haven’t seen the big picture or are just too stubborn to like other people’s ideas.
Be wise and pragmatic:
first and foremost, you earn money for getting your job done.
Focus on that target; along the way try to do it concisely and efficiently, but at the same time meet your own deadlines.
New technologies, new languages, new paradigms and new patterns – give them a try only once this shit is actually working.

6) Do not ever, ever, ever do work that you do not like or are not interested in.

You will do it mediocrely at best.
Your work should be your passion and your pride, not a number of hours behind the desk exchanged for a paycheck.
If you are not happy in the morning before the workday – you have to change something. Urgently.
Excitement, challenge and satisfaction should be your main motivation for every day.
Money and career opportunities always follow the three guys above 🙂