Stateless inference wastes a lot of compute, because every request reprocesses the full context. We set out to solve that with a new computational model for transformers that processes only the delta between requests. That’s how LayerScale achieves very low-latency tool calling. Check it out at https://layerscale.ai; technical whitepapers are coming next month!