Hype wave, pragmatic evidence vs need to move fast
The tech side of a startup can be very fluid and full of unknowns. What tech stack should you use? Which components might be overkill for now but worth keeping an eye on for the future? How do you balance the pace of feature development with a quality bar high enough to keep the code base maintainable?
Here I want to share our experience of building https://cleanbee.syzygy-ai.com/ from the ground up: how we shaped our processes based on our needs, and how those processes evolved as we extended our tech stack with new components.
Business wants to conquer the market; engineers want to try cool stuff and stretch their brains. Meanwhile the industry produces new languages, frameworks and libraries in such quantities that there is no way you can check them all. And usually, if you scratch the shiny surface of the NextBigThing, you will find a good old concept. A good one, if you are lucky.
One of the most exciting topics to argue about is process: whether you rely on trunk-based development or prefer the more monstrous GitHub flow, whether you are a fan of mobbing or find it more efficient to spend time in PR-based code reviews.
I have worked in an environment where artifacts were thrown at users without any standardized process and, in case of issues, developers had a lot of fun (nope!) trying to figure out which version of each component was actually deployed.
At the other end of the spectrum is the never-ending queue to CI: after you create a PR, you have to entertain yourself for the next 30 minutes by betting on whether the CI cluster will find the resources to run tests on your changes. Sometimes the platform team introduced new, exciting and certainly very useful features that broke compatibility with the existing CI boilerplate, which resulted in all your checks failing at the last minute, after an hour of waiting.
I strongly believe that, as usual, it all depends: on team maturity, on the kind of software you are building, and on various business constraints such as the existence of an error budget and the importance of time-to-market versus SLXs (SLAs, SLOs and the like).
I think what is really important is to have some agreed process in place that everyone is aware of and follows, and to have the guts to challenge and change it if there is evidence of a better alternative.
Start shaping the process
What we had at the start:
- fewer than a dozen developers: an in-house team and temporary contractors
- who are willing and able to work asynchronously
- a completely greenfield project: not a single line of code written yet, requirements a bit vague but already starting to take shape
- tech-wise, a clear need for a backend that talks to mobile clients
- and some simple web frontend: static pages should be enough! (nope)
We started simple: code on GitHub and a PR-based flow with a single requirement, that tickets be split so that each can be delivered in 1-3 days. This required some practice in story slicing, and it seems that a sense of visible, fast progress (the ability to move tickets to Done) can be a great motivational factor for getting the team on board with that idea.
We added linters and static analyzers to skip exciting discussions a-la how many arguments per method are too many (6!), and gradually added auto-tests. We also tried CodeScene: they have a very promising approach of highlighting the important parts of the code (those bits that change frequently should definitely have a higher maintainability bar!) and identifying complexity by looking at the level of nesting in the code. It is probably a bit expensive for startups at the initial stage, but it 100% provides a decent level of hints for engineers.
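The nesting-level idea is easy to approximate locally. Below is a minimal sketch for Python sources using the standard `ast` module. The function name `max_nesting` and the choice of which statements count as a nesting level are our illustrative assumptions, not how any particular tool computes its score.

```python
import ast

# Statement types that we treat as opening a new nesting level (an assumption).
NESTING_NODES = (ast.If, ast.For, ast.While, ast.With, ast.Try)


def max_nesting(source: str) -> int:
    """Return the deepest level of nested control-flow blocks in the source."""

    def depth(node: ast.AST, current: int) -> int:
        deepest = current
        for child in ast.iter_child_nodes(node):
            # Note: an `elif` branch parses as an `If` nested in the parent's
            # else-part, so it counts as one level deeper.
            level = current + 1 if isinstance(child, NESTING_NODES) else current
            deepest = max(deepest, depth(child, level))
        return deepest

    return depth(ast.parse(source), 0)


# Hypothetical sample: for -> if -> while gives three levels.
SAMPLE = """
def handler(items):
    for item in items:
        if item:
            while item > 0:
                item -= 1
"""

if __name__ == "__main__":
    print(max_nesting(SAMPLE))
```

Running something like this over the files that change most often in git history gives a rough, free first approximation of where the maintainability bar should be highest.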
On the architecture side, there was a temptation to dive deep into the wonderland of microservices. But looking at the horrifying diagrams of connections between them from the big players, and at the need to trace requests across them, it really seems a suicidal approach for early-stage teams that want to move fast.
Analysis of the requirements allowed us to identify three groups of jobs:
- core API with the usual CRUD-like activities
- search and recommendations
- temporary workloads that do something very useful on a schedule (roughly on time; occasional delays are OK)
Choice of tech stack: when time is a bit limited and expectations are high, use what you know and have mastered (yeah, maybe for someone it is boring technology). Hence FastAPI, REST, stateless, Python, Redis and Postgres are our best friends (yeah, we like Go and Rust, but we need to pay our dues a bit more first!).
With mobile clients the situation was a bit different: we foresaw a lot of screens with state and interactions with remote services, but not too much custom, platform-specific tweaking. Hence the idea of having a single code base for both iOS and Android was very appealing.
Nowadays the choice of frameworks is really wide, but again, thanks to some prior experience with Flutter, we decided to give it a go. In mobile development, one of the important decisions to get right is state management, and here you will find a nice abundance of acronyms to be puzzled about, from various languages and frameworks: MVC, MVVM, VIPER, TCA, RIBs, BLOC, … Our motto: start with the simplest (*) solution sufficient to support the necessary functionality. (*) Simple: well, let’s put it this way, we think that we understand it.
However, we definitely made a mistake after building the MVP: we decided to build on top of it instead of throwing it away. Hence, one wonderful (nope!) sunny day I was questioning my sanity: I had commented out code, cleaned all possible caches, and still didn’t see my changes on the new screen. Yeah, dead code should be removed!
After those initial formalities were settled, the next necessary thing was to be able to check client-server interactions.
An API contract is for sure a great thing, but it is much more obvious that something is wrong when a real server throws a “schema validation error” at you or miserably fails with an HTTP 500 error code.
Backend services were initially split into two groups: the API monolith and Search & Recommender. The first contains more or less straightforward logic to interact with the DB; the second contains CPU-intensive computations that might require a specific hardware configuration. Every service has its own scalability group.
As we were still thinking about the rollout strategy (and arguing about which domain to buy), the solution was simple: to minimize the struggles of mobile engineers dealing with the backend (an alien stack to them), let’s pack everything into Docker.
Once everything was prepared to be deployable locally, mobile engineers could run docker-compose commands and have everything ready (after a few painful attempts that revealed flaws in the documentation; the real value of such exercises is to react to every “WTF!11” and improve it).
`Everything` is good, but what is the point of an API running on top of an empty DB? Entering the necessary data manually quickly leads to depression (and risks lengthening development cycles). Hence we prepared a curated dataset that is inserted into the local DB so there is something to play with. We also started using it for auto-tests. Win-win!11 Auth also becomes less of a problem when defining test scenarios if you have dozens of dummy users with similar passwords!
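A seeding step like this can be very small. The sketch below is only an illustration: it uses the standard-library `sqlite3` module as a stand-in for our Postgres, and the table layout, e-mail addresses and password scheme are all hypothetical. Plain-text passwords are acceptable here only because the data is throwaway and local.

```python
import sqlite3

# Hypothetical curated dataset: a dozen dummy users with predictable passwords.
DUMMY_USERS = [(f"user{i}@example.com", f"password{i}") for i in range(1, 13)]


def seed(conn: sqlite3.Connection) -> int:
    """Create the users table if needed, insert the curated users, return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS users ("
        "email TEXT PRIMARY KEY, password TEXT NOT NULL)"
    )
    # INSERT OR IGNORE makes the script safe to re-run on an already seeded DB.
    conn.executemany(
        "INSERT OR IGNORE INTO users (email, password) VALUES (?, ?)",
        DUMMY_USERS,
    )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]


if __name__ == "__main__":
    with sqlite3.connect(":memory:") as connection:
        print(seed(connection))
```

Keeping the seeding idempotent matters in practice: the same script can then run on every local start-up and before every test session without special cases.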
Trying new things or choosing 3rd party providers
Dealing with new technology is always a bit dangerous: you and your team can’t know everything (and sometimes things that you think you know can fool you, but that’s another story). And still, it is often necessary to assess and investigate something that no one on the team has touched.
Payments, email, chat, SMS, notifications, analytics, etc.: every modern application is usually a bunch of business logic glued to a number of 3rd party providers.
Our approach to choosing whom to work with: time-capped try-to-build-with-it activities with the most promising candidates, shortlisted by features, supported languages and, in the case of providers, pricing.
How did we get into Terraform?
The backend, apart from the DB, should also have some object/file storage. Sooner or later we would also need DNS, so that our services are ready to play with the big cruel world.
The choice of cloud provider was again based purely on existing expertise within the team: we already use AWS for other projects, so we decided to stick with it. For sure it is possible to do everything in the AWS console, but as time goes by, things become a classic big ball of mud that everyone is terrified to touch, and no one remembers why this bit exists at all.
Okay, it seems the paradigm of infrastructure as code can be handy here.
Based on experience with Terraform… you already got the idea of how we choose things? 😀
Yeah, the initial setup will take some time (and without control it can easily become the same big ball of mud in TF as well 😀 ), but at least it gives you some documentation of the infra and visibility into WHY it is there. Another major advantage: whatever you manage through TF will be updated automatically (well, when you or CI/CD run the corresponding commands).
Within AWS itself, given that we run everything inside AWS, we can rely on IAM and assumed roles by attaching the necessary policies to our VMs. But we need integration with 3rd party services, as well as some way to pass secrets to our apps, for example the DB password. So we need some solution for secret management. AWS has KMS, GitHub Actions has its own secrets, and apart from those there are a bunch of other providers, so the real question is: what do YOU need from secret management:
- path based access
- integration with Kubernetes
- ability to issue temporary credentials
- Web UI
- secrets versioning
- … ?
KMS was very handy, and we managed to add it to GitHub Actions, but the UI of Vault and the ability to use it for free (if you run it yourself) tipped the decision in Vault’s favor.
Path to Kubernetes:
Once we had a dockerized app, we started considering Kubernetes, as it offers a few goodies out of the box. The most important ones are the ability to spin up the necessary number of pods to meet performance demands and the ability to define all your needs in a declarative fashion, so that, given a sufficient level of automation, no human being should have to run kubectl apply. AWS offers EKS to start with, and it can be managed via Terraform.
On the other hand, a steep learning curve (to grasp the idea that you define exactly what should be up and running) and somewhat specific tooling were fair reasons to think twice about it.
If we are talking Kubernetes and already have apps in Docker that are released on every merge to main, Helm charts become the next step in adopting a modern infra stack: we plugged in AWS ECR to keep track of every new release and publish Helm charts to a dedicated S3 bucket, which became our internal Helm chart registry.
Plugging it all together was not as straightforward as expected: Kubernetes nodes initially couldn’t connect to ECR and pull the necessary Docker images, the Terraform module (aws-ssm-operator) intended to work with secrets in AWS KMS was deprecated and didn’t support the recent Kubernetes API, and secrets and config maps weren’t in the mood to be exposed to pods.
The first rollout of services brought happiness to the mobile folks: no need to care about instructions for local setup! For the initial week or so it was not really stable, but after that there was one less thing to care about.
Do you need all of it? Not necessarily.
I must admit that this mix (Kubernetes with Vault via Terraform & Helm) is probably not for everyone, and you most likely will not need it at the initial stage. A simple docker push to ECR on merge to main, followed by ssh into EC2 && docker pull && docker-compose stop-start during release from CI/CD, can work well (at least for the happy path) AND will be clear to everyone at first glance. That’s exactly how we redeploy our static websites at the moment: CI builds a new version and just copies it into the corresponding S3 bucket.
Maturing the infrastructure
AWS is nice enough to offer credits to those who are crazy enough to explore the shady paths of the startup world. Can we use them to save a few bucks on GitHub minutes and expose fewer secrets and less infra to GitHub VMs?
How about self-hosted runners, i.e. when you open a PR, it is not GitHub’s VMs but your own Kubernetes cluster that allocates a pod to run your CI checks? Sure, it is not easy to prepare everything for iOS releases (more about that below), but Android and backend surely should work on good old Linux?!
Observability and Co
There is a lot of marketing fluff around terms like monitoring & alerting.
In some companies those things are implemented just for the sake of bragging “We have X for that!”, although engineers are still blind to what is happening in production when there are real issues, or alert channels have to be muted because they contain non-actionable noise.
And I must say, here we still have a looong way to go.
The first thing you will find as soon as you search for this kind of solution is the ELK stack and a bunch of paid providers. Now, after measuring the time and effort needed to maintain our own setup, I am starting to think that a paid solution might really be worth it, if and only if you can really delegate to it the burden of squeezing out the most important info about your apps and the state of your infra. It all depends on whether they have presets of metrics, log parsers and index mappings that you can easily adapt to your project.
For logging we currently rely on ELK. Yeah, it is more or less straightforward to set up, and most likely there are people who find the Elastic query language very convenient to use on a day-to-day basis.
Here we are still exploring options, as it seems that good old kubectl logs with grep produces insights for questions like “What is the last error from app1 pods?” in a much more timely fashion, without getting lost among endless UI controls. But most probably the Kibana UI still hides the levers that we should pull to add a proper ingestion pipeline and choose the corresponding mapping of the Elastic index for Filebeat?
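For comparison, the question “what is the last error?” needs very little machinery once the log lines are in hand. The sketch below is an assumption-laden illustration: the function name `last_error`, the `ERROR` marker and the sample log format are all hypothetical, and the lines could come from any source, for example standard input when the output of kubectl logs is piped in.

```python
import sys
from typing import Iterable, Optional


def last_error(lines: Iterable[str], marker: str = "ERROR") -> Optional[str]:
    """Return the last line containing the marker, or None if there is none."""
    found = None
    for line in lines:
        if marker in line:
            found = line.rstrip("\n")
    return found


# Hypothetical log lines, only to show the expected shape of the input.
SAMPLE_LOG = [
    "10:00:01 INFO app1 request served\n",
    "10:00:02 ERROR app1 schema validation error\n",
    "10:00:03 INFO app1 request served\n",
    "10:00:04 ERROR app1 upstream timeout\n",
    "10:00:05 INFO app1 request served\n",
]

if __name__ == "__main__":
    # Reads piped input line by line, so it also works on long logs.
    print(last_error(sys.stdin))
```

This is exactly why the command-line route feels faster: the whole "query" is one substring check, with no index mapping or ingestion pipeline to get right first.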
For alerting we set up Prometheus and integrated it with Slack, again mainly because we had prior experience with it.
Now, why with all of that do we need Azure?!
As usually happens when a product evolves, new requirements introduce new kinds of things: now, apart from having something publicly visible, we need some resources available to the team only, to manage feature flags, access the Vault UI, or struggle with Elastic to figure out the last API error.
Sure, there are paid solutions for that. Or you can combine an Identity as a Service provider (Azure Active Directory) for authenticating your teammates with a VPN provider (we chose OpenVPN because of its free tier), exposing the necessary services to the internal network only, so that those who should have access can log in with their credentials. This has one clear advantage over using the AWS stack: it is free (for a limited number of connections).
Okay, why do we need Google Cloud!?!
So far we have mainly discussed the backend part of things. But there is more. The thing that you see first: mobile apps! Flutter or something else, they also have to be built, linted and tested. And published somehow, somewhere, so stakeholders can immediately be in awe of the new features (and find new bugs).
For rolling out to production, you need to pass through a bunch of formalities (screenshots, change log = what’s new, review) that will delay your audience from enjoying those pieces of art.
I must say that the API of the stores is not really friendly to frequent releases: once you have built and signed the app, publishing can easily take 15+ min. The API of the app stores, like every other API, may and will fail sooner or later. Yeah, and signing might be a nightmare, as it differs between platforms. And it would be REALLY nice if engineers didn’t waste time on all of those things by preparing releases from their laptops.
The first (and probably the only?) thing you should consider is fastlane. Initially I had some prejudice against all those new terms like gems (I like that name though!) and bundle, but it really works. Yes, in order to run it from CI, some effort will be required to deal with secrets: jks for Android or match for iOS.
Towards the “dark” side
Next you will start thinking about app distribution: TestFlight is a handy tool in the iOS world, but what about Android? We ended up using App Distribution, a solution from Firebase, mainly because it worked for us on the first try, but there are other options (that actually claim to work for both platforms). What is important: you can do everything from fastlane! As your app evolves, you start adding various extras (analytics, chats, maps, geo, …), and many of ours came from Google directly or from Firebase. As Firebase offers many goodies, it was a natural step to collect analytical events there and, after a few tweaks to their IAM policies, set up an export of raw events into gs-buckets to be able to play with BigQuery.
Prod vs Staging – The Great Split!
For the backend we had auto-tests right from the start, and various practices like test doubles proved quite efficient at preventing regressions, even in cases of complex business logic with integrations with external services. On the mobile side we were a bit limited by the coexistence of code from the MVP, and auto-tests were not so helpful for complex business scenarios, like someone wanting to use our services when we can’t charge their bank card.
Manual testing was very time-consuming and error-prone, especially when business logic evolved dynamically over time and the state of data in the database after recent updates became impossible from the point of view of the domain rules.
Yeah, so it would be nice to be able to run e2e tests that click through the app with data that we maintain (and are sure is valid). It would be REALLY nice if those tests didn’t pollute the actual database, S3 buckets and 3rd party providers.
We started with a single main branch and a single environment (RDS, Redis, k8s namespace and S3) that was used by the first testers and developers. We were not exposed to the public, but as we moved closer and closer to release, it became clear that some kind of distinction was necessary between places where we can break things and a stable environment.
In the mobile applications it was a matter of changing the API URL at build time. On the backend, a few things had to be done to support deploy-specific configuration: infrastructure-wise, creating dedicated policies and resources, and code-wise, parameterizing the few bits where specific URLs were expected. Apart from that, there are several repositories; some of them are independent, but some depend on others, as in the case of shared functionality.
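On the backend, "parameterizing the few bits" usually boils down to reading deploy-specific values from the environment instead of hard-coding them. A minimal sketch of that idea follows. The variable names `APP_ENV`, `API_URL` and `S3_BUCKET` and the `Settings` class are hypothetical, not our actual configuration.

```python
import os
from dataclasses import dataclass
from typing import Mapping

# Hypothetical variable names; every environment must provide all of them.
REQUIRED = ("APP_ENV", "API_URL", "S3_BUCKET")


@dataclass(frozen=True)
class Settings:
    env: str
    api_url: str
    s3_bucket: str


def load_settings(environ: Mapping[str, str] = os.environ) -> Settings:
    """Build settings from environment variables, failing fast if any is missing."""
    missing = [name for name in REQUIRED if name not in environ]
    if missing:
        raise RuntimeError("missing configuration variables: " + ", ".join(missing))
    return Settings(
        env=environ["APP_ENV"],
        api_url=environ["API_URL"],
        s3_bucket=environ["S3_BUCKET"],
    )
```

Failing at start-up when a variable is missing is a deliberate choice in this sketch: a pod that refuses to start is much easier to notice than one that quietly talks to the wrong bucket.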
You know what will happen when you update shared functionality without immediately redeploying and testing all dependent apps? After a few days, when you have completely forgotten about it, you make some innocent, purely cosmetic change somewhere else in a dependent repo, which leads to a redeployment that pulls the latest dependency. Surely, during an important demo right after it, you will see some stupid errors caused by incompatibility in a single condition “that no way can happen” and that you forgot to double-check.
- So the first important consideration for splitting environments: automate the rollout of all dependent applications whenever a base repo is updated. You may ask the team to do it and everyone will agree, but someone will forget to run pull.
- Second aspect: what do we actually need to deploy? Do we need to maintain all apps in every environment, including temporary jobs that are responsible for sending emails or notifications? It seems that some kind of flag to include a job in or exclude it from a deployment might be helpful.
- E2E and, later, probably Staging do not necessarily have to be reachable by everyone on the internet.
- promoting a new release to E2E and Staging has to be automated
- promoting a new release to Prod is, at least for now, better kept controlled and manual
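The first point, redeploying everything that depends on an updated base repo, can be sketched as a walk over a dependency graph. The repository names and the `DEPENDS_ON` mapping below are hypothetical, and a real CI setup would trigger pipelines rather than return a list.

```python
from collections import deque
from typing import Dict, List

# Hypothetical repositories: each key lists the repos it depends on.
DEPENDS_ON: Dict[str, List[str]] = {
    "shared-lib": [],
    "api": ["shared-lib"],
    "search": ["shared-lib"],
    "email-job": ["api"],
    "website": [],
}


def repos_to_redeploy(changed: str, depends_on: Dict[str, List[str]]) -> List[str]:
    """Return every repo that depends on `changed`, directly or transitively.

    The result is in breadth-first discovery order, which is not guaranteed
    to be a valid build order for graphs with diamond-shaped dependencies.
    """
    # Invert the graph: repo -> repos that depend on it.
    dependents: Dict[str, List[str]] = {}
    for repo, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(repo)

    order: List[str] = []
    seen = {changed}
    queue = deque([changed])
    while queue:
        for repo in dependents.get(queue.popleft(), []):
            if repo not in seen:
                seen.add(repo)
                order.append(repo)
                queue.append(repo)
    return order


if __name__ == "__main__":
    print(repos_to_redeploy("shared-lib", DEPENDS_ON))
```

With the graph kept in one place, "someone forgot to run pull" stops being a human problem: a change to the shared repo mechanically produces the list of pipelines to trigger.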
Currently we have three envs, which fulfill all the things above:
- E2E: the environment where integration tests run on curated data to ensure base functionality is still in place
- Staging: where core development happens and where beta testers can try to break what we build
- Prod: happy to greet new users
The Kubernetes cluster is still a single one: everything is split at the namespace level. It is a similar story with RDS, where several databases live together in one RDS instance.
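One way to keep these decisions explicit is a small, declarative description of each environment that deployment scripts can read. The sketch below is illustrative only: the field names, namespace and database names are hypothetical, and the `scheduled_jobs` and `public` values are assumptions rather than a description of our actual setup.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Environment:
    name: str
    namespace: str        # Kubernetes namespace inside the single shared cluster
    database: str         # database name inside the shared RDS instance
    public: bool          # reachable by everyone on the internet?
    auto_promote: bool    # promoted automatically by CI, or manually?
    scheduled_jobs: bool  # deploy email/notification jobs here?


# Hypothetical values; only the auto_promote split mirrors the list above.
ENVIRONMENTS = [
    Environment("e2e", "e2e", "app_e2e", public=False, auto_promote=True, scheduled_jobs=False),
    Environment("staging", "staging", "app_staging", public=False, auto_promote=True, scheduled_jobs=True),
    Environment("prod", "prod", "app_prod", public=True, auto_promote=False, scheduled_jobs=True),
]


def auto_promotion_targets(envs: Iterable[Environment]) -> List[str]:
    """Names of environments that CI may promote a release to without a human."""
    return [env.name for env in envs if env.auto_promote]


if __name__ == "__main__":
    print(auto_promotion_targets(ENVIRONMENTS))
```

Because the split is at namespace and database level, the only hard rule such a table must satisfy is that no two environments share a namespace or a database name.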
On the side of mobile test automation, the choice is not really big. The first thing you have to choose: are you going to use a device-in-the-cloud provider or run tests yourself? For sure you can plug a smartphone into a laptop and run tests, but wouldn’t it be nice (and right!) if CI did it instead? When you start considering vendors that provide emulators and real devices to play with, you will find that the choice of testing frameworks for mobile is not really wide either, and that is the second choice you have to make (the choice of provider might limit you here). Another important consideration is whether there are any specific hardware requirements, e.g. use of a GPU or NPU; we had none, hence any emulator was sufficient for us.
We identified two main options for the mobile e2e testing framework: flutter integration tests and Appium-based pytests. Firebase Test Lab supported flutter integration tests, although it required some tweaking to allow requests from their IP ranges (the VMs running the emulators) to reach our E2E API. Appium, apart from its Python API, was very promising because with something like TestProject (you guys rock!) you can record all the clicks through the application per scenario, so it doesn’t require specific programming knowledge (but allows you to gradually learn it). So far Appium is much more comprehensive in our setup in terms of scenario coverage.
E2E tests have one tiny (nope!) issue: a cold start of the app in an emulator is not very fast, and if we add on top of it the time necessary to build the app and to copy the debug build to the provider, it becomes a real bottleneck to moving fast.
So far we are experimenting with running them twice a day, but let’s see how it goes.
Many interesting tasks are still on our todo list:
- on the infra side: performance testing, security testing, trying out Flutter for web
- on the development side: serving and updating ML models for the recommendation engine and for predicting cleaning duration, building a cache of feature vectors for recommendations, a mix of optimisation problems for the matching engine (job scheduling and game theory), …
And most importantly, nothing can replace real-world usage: many crazy things you will only be able to see once you start collecting real data about users’ behavior, so we are looking forward to the upcoming launch!