As a result it's really hard to read about real-world use cases online. I think a lot of people would love to hear more details - at least I know I would!
Also, for certain use cases there are constraints, like embedded hardware systems with no internet access. These LLMs have to be trained to specialize in clearly defined use cases under hardware constraints.
Frontier LLMs also rarely function in isolation; instead they orchestrate a system of specialized units, aka subsystems and agents.
While costs and effort are one thing, being able to downsize these monster LLMs through finetuning in the first place is extremely valuable.
I am not an expert in this topic, but I am wondering whether a large cached context is actually cheap to run, and whether frontier models would be cost-efficient too in such a setting?
There might be future optimizations. Like, have your small model do CoT to figure out where to look for the memory that is relevant.
I've tried too. Wasted a few days trying out even high-end paid models.
Unless your game states have a combinatorial explosion, would it not be better to generate all of that pre-build? If templated, you can generate a few hundred thousand templates to cover any circumstance, then instantiate and stitch together those templates during the game runtime.
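To make the instantiate-and-stitch idea concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the `OPENERS` and `MOODS` pools stand in for text generated offline (by an LLM or by hand), and `render` is just one possible way to combine them at runtime.

```python
import random
from string import Template

# Hypothetical pools that would be generated before the build and shipped
# with the game. A real game would have far more entries per key.
OPENERS = {
    "enter": [
        Template("You step into the $place."),
        Template("The $place opens up before you."),
    ],
    "combat": [
        Template("A $enemy blocks your path."),
        Template("You hear a $enemy closing in."),
    ],
}
MOODS = {
    "calm": ["Nothing stirs.", "The air is still."],
    "tense": ["Your pulse quickens.", "Something is watching."],
}


def render(event, mood, rng, **slots):
    """Pick one opener and one mood line, fill the slots, and stitch them."""
    opener = rng.choice(OPENERS[event]).substitute(**slots)
    closer = rng.choice(MOODS[mood])
    return f"{opener} {closer}"


if __name__ == "__main__":
    rng = random.Random(42)  # seeded so a save game can replay the same prose
    print(render("enter", "calm", rng, place="cellar"))
    print(render("combat", "tense", rng, enemy="wolf"))
```

No model runs at game time here, so the runtime cost is a dictionary lookup and a string substitution. The trade-off is that anything not anticipated by the pools cannot be described.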
I dunno, for game prose I expect that a tiny highly quantized model would be sufficient (generating no more than a paragraph), so 300MB - 500MB maybe? Running on CPU not GPU is feasible too, I think.
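A back-of-envelope check on that size guess, as a rough sketch. The parameter counts and bit widths below are my own illustrative assumptions, not figures from any specific model, and the estimate covers raw weights only (it ignores quantization scales, any layers kept at higher precision, and the KV cache).

```python
def quantized_size_mb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in megabytes (1 MB = 10**6 bytes)."""
    return n_params * bits_per_weight / 8 / 1e6


if __name__ == "__main__":
    # Assumed model sizes: 0.5B and 1B parameters, at 4-bit and 8-bit weights.
    for n_params in (0.5e9, 1.0e9):
        for bits in (4, 8):
            size = quantized_size_mb(n_params, bits)
            print(f"{n_params / 1e9:.1f}B params @ {bits}-bit: ~{size:.0f} MB")
```

Under those assumptions, the 300MB - 500MB range would correspond to something like a 0.5B to 1B parameter model at roughly 4 bits per weight.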