Machine Learning at the edge

A few lessons from migrating a computer vision pipeline from a Python prototype on AWS to C++ on smartphones

One of the projects we have been working on is Smart Mirror: we want to build a technology that hints to users how to look better in a photo.

It has a bunch of fascinating tasks to tackle, from the philosophical (what is “better”, and what does “better” mean for a particular user?) to more down-to-earth issues like scale-independent comparison of facial expressions between frames. So we decided to split it into several stages to roll out one by one, and started with an application that shows real-time hints on how to look similar to the best photo, which the user chooses themselves.

Currently, under the hood, our computer vision pipeline has 10 neural networks in place and can chew through 10-20 frames per second, depending on the smartphone’s hardware.

In this article I want to share not only what we are doing at Syzygy AI, but mainly highlight our journey of bringing machine learning from the cloud to smartphones.

Data Science\AI is still a hot buzzword for attracting attention to what you are doing, but statistics show that the majority of prototypes don’t reach production. Maybe something dangerous is lurking behind the curtains of the golden cage of Jupyter notebooks?

When we started exploring the idea, we started simple: let’s run an MVP on a laptop and use a webcam to grab video frames to play with!

We need real-time processing (speed, speed, speed), so let’s use our skills in low-level programming and do everything in C++ and maybe CUDA!

There are mature libraries that you can use to plug in the required algorithms. And there are a lot of pre-trained models (pay attention to the license!) that can be combined comparatively easily to sketch a prototype. If the thing you want to do has already been done, you just run a face detector, after that a face landmark detector, and then … .

A couple of weeks later it was time to share results with stakeholders: an early feedback loop is important. So how can they check it out? Well, that’s easy: clone the code, make sure that you have cmake and the required dependencies installed…

You are asking what “clone” is? Do you have an NVIDIA GPU card? No, we can’t prepare an app for iPad yet. Ummm, yeah, let us think about something else!

We measured the frame processing time, and the biggest contribution to the duration was the inference time of the models (and most likely IO to and from the GPU), so the language of the surrounding code hardly mattered. Hence the pipeline was rewritten in Python for the sake of development speed. Cloud providers were happy to share a VM with a powerful GPU (just pay money, it is easy!), so we don’t depend on the user’s hardware anymore! Let’s prepare a simplistic frontend and expose the UI via a web page (but still work with a web camera).

It was a significant breakthrough in shaping the initial idea and, along the way, it revealed two issues:

1. Usually, ready-to-use models perform best when the data on which we run inference is similar to what was used for training. Hence studio-quality portraits of bearded white men in their 40s might not be the best training set when your target audience is teenage girls taking selfies in environments with non-uniform lighting. If what you want to do is not a standard cat-or-dog classifier, you had better be aware of what ML is really about: digging deep into papers and being able to experiment fast.

2. When your software is running in the cloud, it can bring all kinds of surprises from the palette of distributed-system challenges, from CPU steal time to client-side data buffering before you flush it to the socket. And of course you have to set up compatible versions of the drivers and libcuda*.so libraries so that your Python code will work. Ah, yeah, the hardware there might be a bit outdated in terms of compute capabilities in comparison with recent consumer video cards. As a result, the classic “it works on my machine” may contradict the user experience in the cloud.

As for the first point, the solution was obvious (not simple though!): we need data. Specific data. So let’s crawl! With our own custom annotation (thanks to CVAT, you can go crazy there).

But the second point poses a fair question: what options do we have for productizing our solution? Or, to put it simply, is there any alternative to a complex and expensive client-server system with meaty GPU servers?

In the end, what we want is to run dozens of neural networks with different architectures (a few examples of the tasks we are solving: facial landmark detection, head pose estimation, segmentation, lighting condition estimation) and classical image manipulations (resizing, cropping, filtering) with close to real-time requirements.

It was time to look at what we actually have inside modern smartphones. Can we offload some work to them? At least to decrease network IO?

The first bit that grabbed our attention was a comparison of performance between Intel i5 and Snapdragon-backed laptops. TL;DR: Qualcomm was surprisingly fast.

The second was AI benchmarks for mobile hardware. At a glance it is challenging to understand how to treat those numbers, but the fact that there is a benchmark for mobile was quite intriguing!

Hence we dived deeper:

Let me add here an exact extract from one of our Jira tickets from 2020:

Mystery of TFLOPs: surprising power in our pockets

Snapdragon is not a single CPU but an SoC, i.e. a system on a chip: it contains various components.

Snapdragon 865

  • CPU: Kryo 585, up to 2.84 GHz, 64-bit, instruction set ARMv8-A (NOTE: derivative of ARM’s Cortex-A77 and Cortex-A55)
  • GPU: Adreno 650, OpenCL 2.0 FP, Vulkan 1.1, OpenGL ES 3.2, DX12
  • DSP: Hexagon 698
  • RAM: LPDDR5, speed: 2750MHz


Apple A-series

  • 2019 A13: architecture – A64 – ARMv8.4-A, six-core CPU, 2 cores at 2.65 GHz + 4 energy-efficient cores
  • 2020 A14: architecture – A64 – ARMv8-A, six-core CPU, 2 up to 3.1 GHz + 4 up to 1.8 GHz, LPDDR4X


Kirin

  • end of 2019: Kirin 990: 4 ARM Cortex-A76 + 4 ARM Cortex-A55 Cores, Mali-G76 M16
  • 2020: Kirin 9000: 4 ARM Cortex-A77 up to 3.13 GHz + 4 ARM Cortex-A55 Cores up to 2.05 GHz, Mali-G78 MP24


Exynos

  • Q4 2019: Exynos 980: 2 ARM Cortex-A77 + 6 ARM Cortex-A55 Cores, LPDDR4
  • 2020: Exynos 990: 2 custom up to 2.73 GHz + 2 ARM Cortex-A77 up to 2.5 GHz + 4 ARM Cortex-A55 Cores up to 2 GHz, Mali-G77 MP11, LPDDR5X, dedicated Dual core NPU

But Cortex-A* itself is just a family of ARM core designs that implement an instruction set; the actual CPU is a particular vendor’s implementation. The GPU is also clear, but what the hell are DSP\NPU\TPU\VPU?!

There is a class of neural networks, convolutional ones, that rely on heavy matrix multiplication operations. They are often used for tasks related to computer vision. If a matrix has not two dimensions but three (a cube) or more, it is called a tensor.

As many tasks on devices are related to images, manufacturers decided to create a specialized compute unit, essentially an ASIC, to perform those operations efficiently.

Regarding the difference between GPU and TPU:

A TPU is very limited in terms of supported operations (and, as we learned later, in input size), and its memory organization is very simple, while a GPU has a much larger set of supported commands and a memory hierarchy with caches. Modern higher-end smartphones usually have dedicated NPU units, so it is worth explicitly listing the corresponding capabilities of top-tier devices (as of 2020).

Kirin: the NPU is split into two compute parts: 1 Lite + 1 Tiny

  • Da Vinci Lite features 3D Cube Tensor Computing Engine (2048 FP16 MACs + 4096 INT8 MACs), Vector unit (1024bit INT8/FP16/FP32)
  • Da Vinci Tiny features 3D Cube Tensor Computing Engine (256 FP16 MACs + 512 INT8 MACs), Vector unit (256bit INT8/FP16/FP32)

Snapdragon 865: 15 TOPS

For comparison (*) – characteristics of accelerators used for AI tasks:

  • Google Coral: 4 TOPS
  • JETSON Xavier NX: 6 TFLOPS (FP16) & 21 TOPS (INT8)
  • JETSON Xavier AGX: 5.5-11 TFLOPS (FP16) 20-32 TOPS (INT8)

(*) Nice numbers! But what do they really mean?

In computing, the type of data you operate on (8-bit integer or 32-bit float) really matters, as you might have hardware tuned to perform operations on specific data types very fast. Or not.

Hence a bit of terminology is necessary:

  • TOPS: tera operations per second ~ throughput.
  • MAC = multiply-accumulate operation.
  • FMA (fused multiply-add): most modern hardware architectures use such instructions for operations on tensors. FMA computes a*x+b as one operation. Roughly, GMACs = 0.5 * GFLOPs, since one MAC counts as two floating-point operations (a multiply and an add).
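To make the MAC vs FLOP relation concrete, here is a minimal sketch that counts the multiply-accumulate operations of one standard 2D convolution and turns them into a best-case time on an accelerator. The layer dimensions are hypothetical, and the estimate assumes the vendor's TOPS figure counts one MAC as two operations and that the chip sustains its peak rate, which real workloads rarely achieve.

```python
def conv2d_macs(out_h, out_w, out_channels, kernel_h, kernel_w, in_channels):
    """MACs for one standard 2D convolution (bias and padding effects ignored)."""
    return out_h * out_w * out_channels * kernel_h * kernel_w * in_channels


def macs_to_flops(macs):
    """One multiply-accumulate counts as two operations: a multiply and an add."""
    return 2 * macs


def ideal_seconds(macs, tops):
    """Best-case time at peak throughput; assumes TOPS counts a MAC as 2 ops."""
    return macs_to_flops(macs) / (tops * 1e12)


# Hypothetical layer: 112x112 output, 64 filters, 3x3 kernel, 32 input channels.
macs = conv2d_macs(112, 112, 64, 3, 3, 32)
print(f"{macs / 1e9:.3f} GMACs = {macs_to_flops(macs) / 1e9:.3f} GFLOPs")
print(f"best case at 15 TOPS: {ideal_seconds(macs, 15) * 1e6:.1f} microseconds")
```

The gap between such a best-case figure and the measured latency is a quick indicator of how much time goes into everything that is not arithmetic.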

Bonus point: you can’t directly compare those numbers between different hardware. Besides their synthetic nature, other factors drastically affect performance, such as copying data back and forth and the capabilities of the other hardware that pre- or post-processes that data.

As in many such cases, it was a good reason to just try it, so we prepared a Raspberry Pi with a Google Coral and the fun started.

ML frameworks for mobile – is there really a choice?

First of all, let’s distinguish two use cases:

  • inference – when you have a model and want to run it on device
  • training – when you do not have a model (but want to have one!)

The majority of mobile ML frameworks don’t support training on device.

There is a wide variety of devices (read: different hardware): top-tier devices have an NPU and a GPU, while low-end ones sometimes can rely on the CPU only. The presence of the necessary hardware doesn’t necessarily mean that the software layer can properly utilize it. I.e. it is YOUR duty to make sure that the app will run, one way or another, on the user’s device.

There are a number of vendor-specific libraries and SDKs that can help you accelerate specific computations on vendor-specific hardware. With one tiny caveat: they all work in the following fashion:

you create a model -> then convert it to a framework-specific format -> then deploy a framework-specific runtime on the device, one that knows how to squeeze extra operations out of the supported hardware underneath.

Obviously, you usually want to have your app installed on as many devices as possible rather than maintain a list of supported processors.

One possible workaround here is to prepare a set of models tailored for different devices. When the app starts, you can ask that existential question, “Where am I running and what do I have access to?”, pull the best fit from your model zoo, and fall back to CPU only in case no optimized version is available.
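A minimal sketch of that lookup, written in Python for readability (on a phone this logic would live in the app or the native library). The zoo layout, the backend names, and the file names are all hypothetical; detecting which accelerators a device really offers is platform-specific and is left out here.

```python
# Preference order: fastest accelerator first, CPU as the universal fallback.
PREFERENCE = ("npu", "gpu", "cpu")


def pick_variant(zoo, task, available_backends):
    """Return (backend, model_path) for the best variant this device can run.

    zoo maps (task, backend) -> model file path; every task is expected
    to ship at least a "cpu" variant.
    """
    usable = set(available_backends) | {"cpu"}  # the CPU is always present
    for backend in PREFERENCE:
        if backend in usable and (task, backend) in zoo:
            return backend, zoo[(task, backend)]
    raise LookupError(f"no model variant for task {task!r}")


# Hypothetical zoo: the file names are made up for illustration.
ZOO = {
    ("landmarks", "gpu"): "landmarks_fp16.tflite",
    ("landmarks", "cpu"): "landmarks_int8.tflite",
    ("head_pose", "cpu"): "head_pose_int8.tflite",
}

# The device reports an NPU, but no NPU-tuned model exists, so the GPU one wins.
print(pick_variant(ZOO, "landmarks", ["npu", "gpu"]))
# Only a CPU variant was prepared for this task.
print(pick_variant(ZOO, "head_pose", ["gpu"]))
```

Keeping the CPU variant mandatory for every task is what guarantees the app runs "one way or another" on any device.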

There are open-source (TVM) and proprietary solutions that try to do exactly that: take your model and optimize it for different hardware.

Sometimes it is not even necessary to take that thorny route: if what you want to do via ML is semi-standardish, you can probably try your luck with semi-ready solutions from the industry: ML Kit, Create ML, MediaPipe. Some of them can serve your models via an API, and you can also try to train them by submitting labelled data: Firebase ML, MakeML.

If you are keen to explore what options you have for a vendor-agnostic setup that runs completely offline, the choice is really not so big.

What is the golden rule of choosing a tech stack for an important project? Use things that you are familiar with! That’s how we started with TensorFlow Lite.

Dark art of model conversions

First things first: you can’t just run your Keras model in TF Lite, you have to convert it to the tflite format.

And guess what? After conversion it will be another model. With different precision and recall.

Because what the converter actually does is take the computational graph of the network, try to throw away redundant operations, and replace the remaining ones with those supported by the TFLite interpreter.

The term “supported operations” is an ambiguous one. Do you remember that somewhere above I mentioned that hardware may support an operation while the software does not? Well, the opposite may also happen (but we will get to it in a minute)!

In theory, conversion should be straightforward, but in practice, if your model is somewhat advanced, you may need to drop the first or last layers if they are not supported, or dive into model refactoring to help the converter not screw up your model’s characteristics too much.

When the model is finally converted, nothing is actually finished: you now need to make sure that

  • it can run on a real device
  • it produces outcomes more or less close to what your Keras model did

Actually, you have the option to run the TF Lite interpreter on your laptop\desktop. It is not optimized for such a case at all, and inference time might contribute heavily to your procrastination, but you will be able to get a first grasp of the actual quality of inference.

Another false assumption we had was that we could test the converted model using available single-board computers (read: Raspberry Pi) and a TPU that supports TensorFlow Lite (read: Google Coral). Initially the idea was promising: a friendly Linux system is up and running, and the CPU architecture, ARMv8, is compatible. But the characteristics of the CPU are much lower than what a modern phone has: fewer cores, no mobile GPU, and the performance of the TPU seems to be overkill in comparison with even the best ordinary GPU. I.e. we can’t reasonably assess what we can or cannot do on the real clients.

But the main thing, and a lesson we learned the hard way: when something runs on a Raspberry Pi with Coral, it doesn’t necessarily mean it will run on smartphones.

Damn, okay, I have Python and C++ code, a TF Lite model, and two devices with Android and iOS. What’s next?

This bit might be surprisingly easy (*): you just need to build the corresponding benchmark tool from the TensorFlow source tree.

(*) easy, if you are keen to get familiar with bazel as the go-to tool for building a monorepo, like the flexibility of adb, and do not mind tweaking the Xcode project, for example to exclude the emulator from the targets.

I am not sure I found the iOS tool very useful, as it works as a black box: it does not allow you to specify where exactly you want to run your model. But one useful outcome of that exercise was learning that on iOS you do not run the tflite model directly: after the app starts, it gets converted to the CoreML format, and that model is used for inference.

On the other hand, with the Android benchmark tool you can directly explore what will happen if you try to run your model on the NPU\GPU\CPU.

And it will reveal the sad truth:

ERROR: Attempting to use a delegate that only supports static-sized tensors with a graph that has dynamic-sized tensors.


ERROR: NN API returned error ANEURALNETWORKS_OP_FAILED at line 3779 while completing NNAPI compilation.
ERROR: Node number 536 (TfLiteNnapiDelegate) failed to prepare.

or else you can find something more specific in logcat:

07-17 15:44:42.210 14380 15006 I ExecutionPlan: Device qti-dsp can't do operation MAX_POOL_2D

In some cases it can be even worse, i.e. some operations are supported on the NPU, some on the GPU, and the remaining ones run on the CPU.

INFO: CoreML delegate: 41 nodes delegated out of 315 nodes, with 46 partitions.

From my past experience working with GPUs, and more recently with Spark and distributed computing, I remember one crucial thing: IO time, i.e. the time to transfer data to the accelerator\compute node, can negate the gains and drastically decrease the performance of such a computation.

I.e. it might be faster to run everything on the CPU than to run one portion of the computation on the CPU and another portion on the GPU.
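A toy cost model shows why a fragmented graph can lose to plain CPU execution. All timings below are hypothetical, and the model assumes a fixed transfer cost every time execution hops between devices; real transfer costs depend on tensor sizes and on the driver.

```python
def pipeline_ms(segments, transfer_ms):
    """Total latency of consecutive graph partitions.

    segments: list of (device, compute_ms) in execution order.
    A transfer cost is paid whenever two neighbouring partitions
    run on different devices.
    """
    total = 0.0
    previous_device = None
    for device, compute_ms in segments:
        if previous_device is not None and device != previous_device:
            total += transfer_ms
        total += compute_ms
        previous_device = device
    return total


# Hypothetical numbers: the whole graph takes 30 ms on the CPU alone.
cpu_only = [("cpu", 30.0)]
# Split version: GPU partitions are fast, but unsupported ops bounce back to the CPU.
split = [("cpu", 2.0), ("gpu", 3.0), ("cpu", 2.0), ("gpu", 3.0), ("cpu", 2.0)]

print(pipeline_ms(cpu_only, transfer_ms=5.0))  # 30.0
print(pipeline_ms(split, transfer_ms=5.0))     # 12 ms of compute + 4 hops * 5 ms = 32.0
```

With 46 partitions, as in the CoreML log line above, the number of hops rather than the raw compute speed can easily dominate.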

Another interesting bit is the “delegate”. Let’s say that your network has some operation, e.g. ReLU. TF Lite can try to use its own code to express it via simple math operations, or, if you are lucky (?), you can delegate those computations to an optimized library that knows how to use the processor’s instructions to increase performance. So you have the NNAPI delegate for Android and the CoreML delegate for iOS, which try to choose the best hardware for your operations, or dedicated delegates for GPUs or CPUs.

To overcome conversion-compatibility hiccups there is no straight path to success; it is really a matter of trial and error. When you start from a different format (SavedModel, h5, or pb, which is just a protocol buffer), the resulting model might be different (very!).

I have heard funny stories where people had to run a chain of conversions to get a compatible operation set (PyTorch -> ONNX -> CoreML), specify a particular opset version, or do amazing things like rewriting an operation from a - b into a + (-b) to make the model run successfully.

In our case we had to give up the idea of using the NPU for computations, first of all because of operation compatibility issues: for example, 2D max pooling in NNAPI works only with an input tensor of rank 4, and the data layout has to be the one that the NNAPI implementation expects.
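Constraints like these can be checked before a model ever reaches a device. The sketch below is a hypothetical pre-flight check, not part of TF Lite or NNAPI: it assumes you have already extracted a list of operations with their input shapes (with None marking a dynamic dimension) and flags only the two problems mentioned in this article, dynamic-sized tensors and max pooling on inputs that are not rank 4.

```python
def nnapi_blockers(ops):
    """Return human-readable reasons why ops may not be delegated.

    ops: list of dicts with keys "name", "type" and "input_shape";
    None inside a shape marks a dynamic dimension. This format is an
    assumption made for this sketch.
    """
    problems = []
    for op in ops:
        shape = op["input_shape"]
        if any(dim is None for dim in shape):
            problems.append(f"{op['name']}: dynamic-sized input {shape}")
        if op["type"] == "MAX_POOL_2D" and len(shape) != 4:
            problems.append(f"{op['name']}: MAX_POOL_2D needs rank 4, got rank {len(shape)}")
    return problems


# Hypothetical graph description.
graph = [
    {"name": "conv1", "type": "CONV_2D", "input_shape": [1, 224, 224, 3]},
    {"name": "pool1", "type": "MAX_POOL_2D", "input_shape": [224, 224, 3]},
    {"name": "resize", "type": "RESIZE_BILINEAR", "input_shape": [1, None, None, 3]},
]

for problem in nnapi_blockers(graph):
    print(problem)
```

Such a check only covers the rules you already know about; the benchmark tool on a real device remains the final judge.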

Hence you can’t prepare a single model and be sure that it will run on all devices. The effort required to tune a model for a particular NPU does not (yet?) outweigh its possible speed benefit. The GPU, though, showed itself to be quite an impressive option: in our benchmarks we observe, on average, a ~5x speedup in comparison with the CPU. Now the majority of our networks run successfully on the GPU (well, at the moment we do not have bug reports related to it!).

But when a model runs, it doesn’t necessarily mean it is going to work!

All those delegates have to support the operations in your network and, with all optimizations in place, must return similar results. Otherwise the accumulated deviation in precision can lead to a wrong class label at the end, which was exactly the case that bit us along the journey.

Which implies that we should not just run the model on the device with random noise as input to test compatibility and performance, but run it on data from the test set and compare the results.
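A minimal sketch of such a comparison, assuming you have collected per-sample class scores from the reference model and from the converted model on the same test inputs. The function name and the sample numbers are made up for illustration; it reports how often the predicted label agrees and the largest score deviation.

```python
def argmax(scores):
    """Index of the highest score (first one wins on ties)."""
    return max(range(len(scores)), key=scores.__getitem__)


def compare_outputs(reference, candidate):
    """Return (label_agreement, max_abs_deviation) over paired outputs.

    reference and candidate are equally long lists of score vectors
    produced by the original and the converted model on the same inputs.
    """
    if not reference or len(reference) != len(candidate):
        raise ValueError("need two non-empty result lists of equal length")
    agree = 0
    max_dev = 0.0
    for ref, cand in zip(reference, candidate):
        if argmax(ref) == argmax(cand):
            agree += 1
        max_dev = max(max_dev, max(abs(r - c) for r, c in zip(ref, cand)))
    return agree / len(reference), max_dev


# Hypothetical scores: the second sample sits near the decision boundary,
# so a deviation of only 0.03 is enough to flip its label.
keras_scores = [[0.10, 0.90], [0.51, 0.49], [0.80, 0.20]]
device_scores = [[0.12, 0.88], [0.48, 0.52], [0.79, 0.21]]

agreement, deviation = compare_outputs(keras_scores, device_scores)
print(f"label agreement: {agreement:.2f}, max deviation: {deviation:.3f}")
```

Tracking both numbers matters: a small maximum deviation can still hide flipped labels, which is the failure mode random-noise inputs will never reveal.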

And to do that, a more interactive application can be quite handy.

But how do you actually run it? I.e. you have business logic in Java\Kotlin or Swift\Objective-C, or maybe Flutter\React Native, on the client, and then there is the Python code with the actual pipeline, which is a bit more than just “run this model”.

Our approach was to embed everything into an old(?) good(?) C++ library, i.e. a single code base and a single entry point for clients to interact with the ML functionality.

A single code base was a good idea in theory, but in reality it leads to a few consequences:

  • you have to compile it for the target architecture: welcome to the world of cross-compilation!
  • in order to use it on a device, you also have to link it with all the necessary dependencies, i.e. TensorFlow Lite and OpenCV and their third-party dependencies!

Did I mention that in the TensorFlow repo Bazel is a first-class citizen and old (?) good (?) CMake is not so well supported? So we had to tweak TensorFlow itself a bit to make it happen (and sadly that was not everything that was necessary; we didn’t have a chance to prepare the remaining bits).

After some fun with Docker, we prepared an image with everything necessary to use it in our CI for tests.

And then we started paying attention to the actual benchmark results and model size. Usually they are quite related.

You are surely aware of TensorRT, which can be handy for model optimization on CUDA devices. For the edge you can still use it to prepare a model before conversion, and apply similar tweaks for model optimization:

  • Quantization
  • Model Pruning
  • Knowledge Distillation

What was a real breakthrough for us was quantization-aware training, which helped narrow the model size down from 150 MB to ~1 MB without noticeable accuracy loss.
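Quantization-aware training works by simulating quantization during the forward pass, so the weights learn to tolerate the rounding. The sketch below shows only that simulated step, often called fake quantization, as a plain affine 8-bit scheme in pure Python. It is an illustration of the idea, not the training procedure we used, and the weight values are made up.

```python
def quant_params(values, num_bits=8):
    """Scale and zero point of an affine mapping from floats to unsigned ints."""
    qmax = 2 ** num_bits - 1
    lo = min(min(values), 0.0)  # the range must contain zero
    hi = max(max(values), 0.0)
    scale = (hi - lo) / qmax or 1.0  # avoid a zero scale for all-zero input
    zero_point = round(-lo / scale)
    return scale, zero_point


def quantize(values, scale, zero_point, num_bits=8):
    """Map floats to integers in [0, 2**num_bits - 1], clamping outliers."""
    qmax = 2 ** num_bits - 1
    return [min(qmax, max(0, round(v / scale) + zero_point)) for v in values]


def dequantize(quantized, scale, zero_point):
    """Map the integers back to approximate float values."""
    return [(q - zero_point) * scale for q in quantized]


def fake_quantize(values):
    """Quantize then dequantize: what the forward pass sees during QAT."""
    scale, zero_point = quant_params(values)
    return dequantize(quantize(values, scale, zero_point), scale, zero_point)


weights = [-1.0, -0.5, 0.0, 0.25, 1.0]  # hypothetical layer weights
restored = fake_quantize(weights)
worst = max(abs(w - r) for w, r in zip(weights, restored))
print(f"worst rounding error: {worst:.5f}")
```

Storing 8-bit integers instead of 32-bit floats accounts for roughly a 4x size reduction on its own, so the much larger reduction reported above presumably also reflects changes that are not detailed here.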

While we are still arguing with UX designers about the user flow through the app, we haven’t touched the topic of protecting the model on the device (encryption, obfuscation), which is definitely something that cloud solutions don’t have to deal with. However, given everything written above, people who can work out from the structure of the model and the disassembled C++ code inside a mobile app package how it should be used should probably be working with us?
