Chronicles of one crypto-arbitrage bot

TLDR;

Chronicles of one crypto-arbitrage bot

Some time ago friend of mine, innocent in terms of exposure to bloody details of executing mid-term IT projects, sold me yet another brilliant idea of how to effectively decrease amount of leisure time – by diving (again) in horror-land of startups.

It was excited journey with up, downs and sort of Challenger Deeps that empowered some of my beliefs in regards of general principles of software development, made me treat another “rules” in a more complimentary way, and definitely allow to grasp over few interesting tricks.

With this article I want to highlight curious constraints that we have to overcome; real-life uncertainties, shaping my architectural approach and illustrate, how project evolved during growth of code base and changing requirements.

As years went by I tend to think that there is two main approaches for starting your own startup:

  • you try to foresee any possible outcomes, address all edge cases, horrified by volume of work and end up with “Oh, hell, fuck it, no!”
  • you start with “Fuck it!” jump in with some awkward prototype written during the lunch to see whether this shit work at all.

 

Bright Idea

It was early 2017, term ‘crypto’ starts appears in mass media, video cards were out of stocks almost everywhere. ICOs, promising various kind of digital revolutions, wide spreading as bubonic plague during ancient times, raising crazy money in exchange for vaguely compiled pdfs. Gold rush of modern century as it is.

There was bunch of crypto-exchanges offering to trade alt-coins: different subset of trading pairs,
complex rules of fees, volumes of digital “commodities”. It was distributed and (almost) non-regulated market where price at the same coin may be different among exchanges.

Idea was pretty simple:


Deposit money at supported exchange, monitor price difference for some coin, as soon as it exceeds profitability threshold – sell at one exchange, buy at another exchange – just to maximise absolute volume of all coins in all our account’s wallets. Approximated opportunities window (based on manual trade’s execution) sometimes reach up to 15 minutes – i.e. it was possible to send missing coins from another wallet to exchange and trigger necessary order.

All of this still sounded pretty easy – so we agreed on investigation stage: collect fluctuation of prices along the day and analyse them to better understand prospects of project.

At that time I can’t distinct ask from bid, how order different from trade, what is the candle or ticker and how price regulated at all.

Very brief intro to terminology:

Order it is when you register within exchange your desire to buy or sell particular crypto currency. If you want to buy – it called bid, sell – ask.
When someone else decided to put order for the same coin and price be matched – trade(s) will be executed – i.e. coins travel from wallet to wallet, exchange charge their fee, order’s volume will be updated to reflect executed trade, when it become zero it mean that order fully filled.
Volume (amount) it is exactly what you want to trade, but price – depending on exchange, can encapsulate several strategies, most common of them is to define exact price.
It has one issue though – price is changed based on current state of order book – i.e. if no one want to sell something for price that you set – you can’t buy anything.
On practice it mean that if you set price according to current top order, then during period since you click submit – till the moment exchange noticed it – someone else may purchase everything and the next lot in order book would have another price.
That means no matching order – your order may hang, and, depending on exchange rules, may be expired in couple of weeks, depending on price movement.
To overcome this inconvenience another common option is to use ‘market’ price – when you definitely want to participate in trade on best possible real cost (if order matching engine implemented properly).
Ticker – summary, that highlight changes of trades for fixed time period: highest ask, lowest bid for coin pair, and other details that varies from exchange to exchange.
Candle – usually have more verbose information – open-close-high-low prices for time period and volume related info.

First blood

So, yeah, returning to the topic of getting the data from exchange.
I have started looking at public api for chosen exchanges. Can be a great illustration for Noah’s Ark – a lot of very different beasts. In the world of classical exchanges Fix protocol is a common way to get updates from exchanges but even now it is almost not supported in crypto world. Web sockets were complex to use and not available at majority of exchanges. Orkay – it meant we are working through REST. Get data from tickers and save it to look later. A piece of cake!

Initial tech stack and reasoning behind it:

  • I do not expect to have some crazy data not in terms of volume not in terms of operations.
  • Database? But maybe this thing will not fly? Csv files are still very convenient to deal with! (yeah, right)
  • Language – speed of implementation that was what matter the most – so no exotic I-want-to-try something new, no complex power of c++ or scala, no verbosity of java – I just stick with python. 2.7. Because, umm, well, it was installed by default and it has everything as 3? (yeah, right)

Those initial code several python modules and single ugly ipython notebooks probably not even committed, but looking at tickers I can’t believe my eyes – this is too good to be true. Either there were some very core errors in our reasoning (or in the code), or we should abandon all other life activities and dive in implementation of remaining functionality asap.

We will be rich soon!

After discussion decision had been made to start iteratively.

During the first stage I will create prototype application that

  • collect data for further analysis – Arima combined with the most advanced tricks from technical analysis should give us ideal prediction (yeah, right!)
  • trigger notification if observed difference noticeable enough to execute manual trades.

Technicalities:

  • Notifications. It is strange but not all people find amusing digging through log files or read emails. On the other hand telegram has free bot api, mobile and desktop clients with nice UI to simplify analysis.
  • Every exchanges choose to have their own unique naming convention for coins. All right we will create intermediate naming layer and exchange specific conversion routines.
  • Database. Yeah. Probably it is time to have it as collected csv starts exceeds Gb in size after couple of days of running. What about constraints?
    • Hundred writes per second seems to be very safe upper bound (yeah, right).
    • Reads? Only when analysing data (yeah, right).
    • I do know about normal forms, so with proper approach un-voidable schema changes should be cheep.
    • I do not yet know which data dimensions will be the most important for us – so I want flexibility to tweak data tuples based on every dimensions (columns).
    • Implementation language may be changed in future (yeah, right) – so wide choice of _mature_ client’s libraries is definitely an advantage.
    • I have worked with postgres – which seems to satisfy all the above and have a lot of room to go above – thats another reason to use known technology.
    • Just to keep in mind that things still may be changed so some simple abstraction layer that allow to switch database are always good.

When plan is ready it is very easy to type code – no matter that it is after intense working or family hours – pick issue and execute. But… yeah it would be to easy without them:

First obstacles

There is famous quote «Data is modern oil» but exchanges not really keen to share their data – and in case you need granular timed snapshot of market state the single way to acquire it is to collect itself (or buy). Order book may have dozen thousands of bids and asks. If you request it every second – majority of data will be duplicative and its retrieval will be just annoying overhead for exchanges. If every second you want to get info about hundreds of coins – it can be considered as some kind of ddosing. There is a lot of policies how to control end-point abusing: weighted cost for every API invocation and maintaining client’s balance, renewable every quant of time; share quotes of request – per IP or api key; specify global hard limits and as soon as client exceed allowed number of connections\requests throw 429, 503 or even disrupt all communications with offender. It was very interesting to learn such diversity of load balancing approaches by practice – troubleshooting data retrieval processes by playing with timeouts between invocations and spreading applications among several VPS.

All logging related activities are under constant threat of becoming too noisy and be ignored. And it is always very tricky to choose balanced strategy to prevent recipient from creation of special rule for incoming mails to automatically move everything to the trash. As I mention earlier time gap for opportunities window sometimes used to reach 15 minutes, we analyse price every second and arbitrage event may occurs for multiple currencies. As result our telegram channel were bombarded with identical messages making it un-usable (and telegram doesn’t hesitate to ban us for cool down period as well so we have to add some back-off mechanism as well). In order to make navigation through alerts history in telegram channel more simple we introduce naming convention to have pair of exchange and name of currencies as tags. Additionally we have to went through a lot of iterations in regards of alert attributes: severity, depending on size of possible profit; time – our local? (but we were in different timezones), exchange time (but due to their geographic distribution it also be different), UTC from server where we run process. And, yeah, this channel at telegram were muted. For the same reason of huge amount of notification we disregarded idea of adding simple controls to be able to manually confirm order placements.

We decided to store details about observed arbitrage events as well, history retrieval were ready with some precautions for timeout handling and restarting. Now question where actually store data. If you are lean startup and your free EC2 micro instance were expired, but you do not wish to invest time in investigating of alternative options from competitors your option is to have «in-house» infrastructure. I.e. just run `docker pull postgres`, map volume for persistency and to speed up things a bit (just in case for so cold premature optimisation(C)) disable network virtualisation. Now how to access it? Classical setup is following – you have router from internet provider with dynamically allocated public ip address. Depending on your luck and location (country of residence) ip address will be periodically changed by ISP provider. Behind router – in your home private network ip addresses usually assigned by dhcp protocol – i.e. you restart your host where database server is running and it may get another private ip address. In addition to this it is your home network – i.e. NAS with private photos or laptop with work’s projects. There is a lot of various services offering dynamic dns – you have to run special daemon at your server that will update bonded ip address for chosen domain name – I have chosen https://www.dynu.com. Second simple step is to configure your workstation to have statically assigned ip.  Many young hacker’s bundles offer convenient interface over nmap to scan for open ports and check whether by any chance you allow ssh access using login and password.  In order to have at least some protection layer port forwarding were setup using not standard port, ssh was configured to provide access by key file only and old good ip tables allow you to configure banning ip address for failed authorisation attempts. With all this pre-cautions – you can initiate a tunnel with port-forwarding in such way that remote postgres process will be bind to your localhost:1234.

Real programmers test in production

Reactive asynchronous program dealing with real world not so easy to test. And the main reason for it – infinite number of possible combinations of weird situations to happen. Blackout in your home office, un-noticed maintenance of exchange due to recent hacking, issues of internet provider – you name it I saw them all!

First launch with real money went in following way:

  1. we run multiple processes
  2. telegram channel were bursted with endless messages with details of order placed
  3. I saw that telegram started exponentially increase time outs to forward messages from the queue to the channel – i.e. my code do something and it is not visible for us (Oh, shiiiit)
  4. my partner was trying to keep up with message flow – but it was impossible as for every message he have to check two different exchanges filtering by currency.
  5. We need to stop it! Now! Side note: Let’s add to backlog new feature request – we need some mechanism to stop all of them once. At that time I just run pkill to everything with python.

After 6 months of development first 20 minutes of running reveals various issues ranging from annoying to critical.  And in order to investigate what was happening and find out root causes we have to dive in logs and history of orders and trades – situation from bots perspective were quite different in comparison to what exchanges thoughts.

Exchange performance

Kraken – it just doesn’t work. I am still have old script to issue placement of thousand orders and compute ratio of success. At that time it was like 30 something from the whole thousand. At least it was easy to verify and reproduce.

Maintenance of up to date balance state

You have to know how many money you have – as it define scale to what you can trade. In crypto-exchange world there are primary coins – BTC, USDT, ETH – they used for trading as base currency; and secondary – myriads of alt-coins that traded for base. If there are many processes using same wallet it mean that any of them may change remaining volume of base currencies. So issue related to maintaining up to date state of balance was much more trickier. Initially, every arbitrage process sync it independently with every exchange before computing details of order to place. In case of any network issues during balance retrieval – rely on previously retrieved state.

Some of exchanges distinct total and available balance by introducing freezed balance i.e. portion of coins were allocated to registered orders, that does not yet executed. When you deal with response you have to pick proper field to be sure to what extend you can go crazy with trades. Sad thing that not all exchanges actually have it. (Kraken) But anyway I completely missed this bit during implementation – so we have several cases of «Not enough balance» errors during order placement but paired order at another exchange successfully placed and executed – i.e. direct loss. 

Another issue related to balance – burst of requests from the same ip. Balance API is specific for user, and considered as private API with authorisation i.e. more sensitive to load. When price on some coins fluctuating we experience situation that many processes requested it within same time window and response may return with delay or even timeouted. Exchanges start completely ban ip address which was actually fine as no trade were possible in this case because even order book can’t be retrieved. Throwing timeouts, on another hand, was catastrophic as bot have to wait for timeout period (i.e. some one else can place this order) or rely on outdated balance state to place order on one of exchanges, failing to do it at first because of not enough balance.

As we already started experimenting with multi-node deployment – solution to issues above were inject balance retrieval to dedicated process that every second update balance state in redis cache. Every arbitrage bot may access it and check, if last date of update became too old – immediately shutdown itself. After order placement – process forcefully update balance at cache. This approach was optimistic as well – even after execution – it take some (tiny but still may be crucial) time to reflect changes – just humble attempt to minimise time window of uncertainty.

Order book analysis

Initial approach for analysis of current state of market was take the top bid and ask, compute price difference, compare with issue orders, take a look at second pair of bid and ask. But at the same time other players may affect order book – so whatever we are dealing with is already out of sync, and we saw situation when we try to bet on already missing bids, So we decided to do processing of order book only once. Second issue – sometimes price were different, but volume wasn’t enough to get profit. It varies from coin to coin, so we have to pre-compute and maintain volume cap, dependent on most recent coin price and exchange fee, to prevent placement of order leading to loss and use for analysis not only first bid\ask but median of topN of them to approximate possible delay of processing and activity of other players.

Reverse coins movement

Initial idea was just to maximise total balance across all exchanges, but in this case it will be possible (and it happens to us as well) to use all coins from one exchange so just can’t trade it anymore. So we decided to add another mechanism to re-distribute balance evenly across all exchanges by using the same logic as for arbitrage but with lower threshold – i.e. even if it will be zero profit at the end – we still meet our goal. Those logic were triggered for particular currency only if observed dis-balance on pair of exchanges exceed configured threshold.

Busy days

On the rise we have 5 EC2 instances, each running up to 80 processes, history data were stored at Postgres at RDS (bills up to 600$ for this one? Hell no, lets migrate back on self-hosted solutions!). Deals channel at telegram beeped every day. It doesn’t mean we don’t have issues but situation were stable enough to check it every evening for confirmation that everything were fine or re-prioritise backlog.

However first days of live trading brings few obstacles:

  • Logs tend to consume all disk space (yeah, it was very inefficient text file logging) so processes tend to dies on file write (simple logrotate configuration with cron solve it)
  • Deployment also become an issue – manually starting hundreds of processes is a bit overwhelming. Another thing that processes are not daemonized (yeah, I know) in order to have some kind of heartbeat within console as opposite to search in logs by PID and time, so to avoid any issue with SIGHUP they have to be deployed under screen sessions. First we will create new screen, named by pair of exchanges, inside of it we will have dedicated console named by trading pairs to simplify troubleshooting. Same things for other supporting services: history retrieval, balance updating, telegram notifiers, etc. So I drafted few scripts and middleware to automate it within single server. And bash alias to have command to stop processes without screen shutdown.
  • Almost every exchange required to have nonce as part of payload for trading related requests – incrementing sequence of integer number. I have experimented with various approaches but we end up with relying on redis as well to share it across many processes relying on the same api key.
  • Re-balancing initially were run as independent processes and sometimes they compete with direct arbitrage processes, so decision have been made to run them sequentially within the same process

Conquer new heights

  • When we start adding more coins and new exchanges – errors handling become more and more important. On every request we may get in response timeouts or errors (and it actually may mean that order placement were failed, but also it may succeeded!). Errors may also be returned as part of payload – i.e. http response was 200, but payload said error something. Additionally what bot was doing so far was just order placement – i.e. no any guarantee that trades were executed – i.e. manual work required to review open orders and process them. Our solution to both these issue were introducing two priority queue backed up by redis to keep track all placed orders sorted by time. Every queue may be read by one or multiple consumer process that check status of order with the exchange and properly process it.
  • From time to time speed become an issues – we noticed that not always can issue order in time python 2.7 doesn’t have support for async so I have added gevent and pool request, not so convenient to deal with and it doesn’t completely eliminate issue – as we are still operate on top of snapshot of order book
  • Architecture of module, probably, partly a bit verbose, but proof itself to be flexible enough to fuse new exchanges with their quirks in terms of new exchanges’s API integration and addition of new currencies.

Problem of floating point arithmetic in python.

Rounding is a known pain in the ass in many programming language (https://en.wikipedia.org/wiki/Round-off_error) and python not an exception here. Apart of situations when we sell more or less than expected, there were other cases when it affected us. Some exchanges are very strict in cases when you submit volume value with more than expected precision – when API expect up to 6 numbers but receive number with 7 digits after decimal point – the whole order will be rejected (and paired order at another exchange may be executed as they may have more relaxed rules). Those rules are specific to exchange, vary from currency to currency – i.e. you can have 0.00001 of BTC but not for some obscure coins. And, guess what, those rules may be changed – so you have to monitor for any ‘precision’ errors on order placement. Another annoying thing – implicit fallback for scientific notation for string representation for small numbers. I do understand meaning of 1e-6 but usual api is not so smart. In order to overcome all those issues it wasn’t enough just to switch on Decimal class but the whole new layer of conversions have to be implemented to cut redundant part of string representation based on exchange specific rules.

Brightest prospects

Many development activities happens outside core module:

As size of database grows we start thinking about encapsulating news feed and various forum’s rumours to checks whether it correlated with fluctuation of prices. Inspired by this aim independent project were started to create web-crawlers for twitter accounts, telegram groups, medium posts, reddit threads and few more specific resources to get all information within some date-time range. Every new source of information brings new surprise – for example in order to work with telegram you have to build native extension to use java wrapper through JNI. In order to efficiently get tweets with comments for curated accounts we have to maintain set of keys and rely on key rotation mechanism as soon as api start throwing timeouts. Important aspect there – time handling. World are big, complex and have many time zones which sometimes may be reflected with web-page or response in all its cheer differences, so we have to adjust everything to common denominator – UTC – as we have done for market data.

Another direction was to simplify bot management & deployment to decrease operational toil. As we already started use redis for sharing common information (nonce) and maintain alerting messages and order’s queue, next steps was to extend this mechanism to the next level – build sort of message bus for inter-process communication. Other important functions are: automation of deployment of services to a new node in a bit more advanced way that just cloning the whole repo, update configuration for particular server, shutdown all processes at all registered nodes at once. In order to flexibly manage newly added EC2 instances to the system we add ability to deploy agents. It is a special daemon that control arbitrage and supplementary processes at node, listen commands from UI of command centre. From monitoring perspective – it would be much more handy to review state of processes by skim through colour of trading pairs per servers reflecting delays of heartbeat, than ssh’ing inside screen of every server to watch for heartbeat messages in console. Many challenges were overcome on this path as well: flask & dash integration, dynamic UI creation (I am not the biggest fan of frontend) with ability to forward input from newly created forms into redis.

The fall of Rome

And suddenly something changed – just in one single day we no longer were able to place orders in time for few currencies. Either some one plays with scalping (which, in theory, exchanges tries to prevent), or someone were able to prepare efficient ML model to predict price movements within second intervals, or someone just can do it faster. Even now it is not the case for all currencies for all exchanges – as individual or institution have to pull a lot of money to cover everything, but it affects profitability so direct arbitrage approach probably will not bring you too many money. The last attempt that I have made was to switch trading on web-socket basis. Idea was very simple – you pull snapshot of orderbook and subscribe for updates to sync and then maintain actual state locally. Subscription – means event loop in dedicated threads. Asynchronous code with threads not ideal use case for python 2.7 so I devoted good amount of time to add option to show in real-time top bid and asks from both order books for debug purpose. But several trial sessions with this approach doesn’t reveal too many opportunities for profit and show general instability of subscription itself that may lead to re-syncing and potentially error-prone.

Lessons learned

  1. Exchanges – just another piece of software with all related issues of big complex systems – reliability, consistency, performance. So you have to deal with it with all possible precaution: stale orderbook, delays with balance updating and order placement, errors management.
  2. Vps is not real hardware – you still share resources with other processes. First time I saw at htop metrics steal time!
  3. Python for mid size projects – great speed of development, rich ecosystem – google it and use. Be careful as you never know how that guy implemented those function that you need inside library – if you care about performance – try to minimise external dependencies as much as you can.
  4. Implementation of some builtin algorithms may be not as you expected – check set intersection or insort.
  5. It is useful to use some weird default values: -1, 100500 – when you stumble across them in logs you (almost) 100% sure that something went wrong.
  6. If not sure – better just shutdown process.
  7. Shell programming – ancient mortal art that can save a lot of time for supplementary task for example to analyse logs.
  8. Bugs related to lack of types — stupid and annoying errors that requires long time to troubleshoot
  9. Code indentation – when by mistake block of code moved inside or outside of cycle or condition – unique «feature» of python – also require thorough investigation and not so easy to find
  10. python and date
  11. python and float
  12. cost of function call may be high – inlining may lead to code duplication but may be faster

4 comments

Skip to comment form

    • Ipnon on September 18, 2019 at 5:13 am
    • Reply

    I learned so much.

    • coogan on September 18, 2019 at 3:09 pm
    • Reply

    total profit / loss ??

      • admin on September 18, 2019 at 10:20 pm
        Author
      • Reply

      Best month – around 10k profit.
      Worst month – zero, as we tend to stop operations till we fully understand root cause.

    • David on September 19, 2019 at 3:32 am
    • Reply

    Fascinating read! Congrats on being able to move quickly and carpe the diem.

Leave a Reply

Your email address will not be published.