I am a solutions engineer, mostly on the traditional ML side of things, but with good knowledge of K8S/GKE. The most fun I had last year was helping a customer serve their models at scale. They thought it was cost-prohibitive (500k inferences/second with a hard requirement of 7ms at p99), so they were basically serving from a cache that was both lossy (the combinatorial explosion of features meant full coverage would have needed exabytes of RAM) and prone to staleness.

We focused on the serving first. After their data scientists trained a new PyTorch model (a small one, roughly 50k parameters), we compiled it to ONNX (the model is small enough that CPU inference is actually faster), grafted the preprocessing layers onto the model so inference never leaves the ONNX C++ runtime (no Python in the hot path), and deployed it to GKE. An 8-core node with AMD Genoa CPUs managed 25k inferences per second. After a bit of fiddling with NUMA affinity, GKE DNS replication, Triton LRU caches and a few other things we got to 30k inferences per second. Scaled up to their full traffic, it would cost them a few thousand dollars per month, which is less than their original cache approach.
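
For anyone curious what "grafting the preprocessing onto the model" can look like, here's a minimal sketch (my own illustration, not their actual code; the feature count, layer sizes, wrapper names and the normalization/clipping steps are made up) of wrapping a small PyTorch model so the preprocessing gets exported as ONNX ops alongside it:

    import torch
    import torch.nn as nn

    class TinyModel(nn.Module):
        # Stand-in for the ~50k-parameter model; sizes here are arbitrary.
        def __init__(self, n_features=64, hidden=512):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(n_features, hidden),
                nn.ReLU(),
                nn.Linear(hidden, 1),
            )

        def forward(self, x):
            return self.net(x)

    class ModelWithPreprocessing(nn.Module):
        # Wraps the trained model so normalization/clipping become graph ops,
        # i.e. the server feeds raw features and never touches Python.
        def __init__(self, model, mean, std):
            super().__init__()
            self.model = model
            self.register_buffer("mean", mean)  # constants baked into the export
            self.register_buffer("std", std)

        def forward(self, raw):
            x = (raw - self.mean) / self.std    # preprocessing inside the graph
            x = torch.clamp(x, -5.0, 5.0)       # outlier clipping inside the graph
            return torch.sigmoid(self.model(x))

    n = 64
    wrapped = ModelWithPreprocessing(TinyModel(n), torch.zeros(n), torch.ones(n)).eval()
    torch.onnx.export(
        wrapped,
        torch.randn(1, n),
        "model_with_preproc.onnx",
        input_names=["raw_features"],
        output_names=["score"],
        dynamic_axes={"raw_features": {0: "batch"}},  # variable batch size
        opset_version=17,
    )

On the serving side you then load that single file in the ONNX Runtime C++ API (or a Triton onnxruntime backend) and hand it raw features directly, which is what keeps the request path out of Python.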

Now they are working on continuous learning so they can roll out new models quickly (it is a very adversarial line of business and the models go stale in O(hours)). For that part I only helped them design the thing, no hands-on work. It was a super fun engagement TBH.
