I ran some experiments exposing the raw latent states of a small 1B Gemma model (captured with hooks) to a large model as the small model processed data. I'm curious whether the large model can nudge the smaller model's latents to get the outputs it wants. I desperately want to get thinking out of tokens and into latent space; it's something I've been chasing for a bit.
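Roughly, the hook side looks like this. A minimal sketch assuming PyTorch and Hugging Face transformers; the checkpoint name, the layer choice, and the placeholder delta in the nudge hook are all illustrative, not my actual setup:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoint; any small causal LM with exposed decoder layers works.
name = "google/gemma-3-1b-it"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16)

captured = {}

def capture(module, args, output):
    # Decoder layers return hidden states (sometimes wrapped in a tuple):
    # shape (batch, seq_len, hidden_dim). These are the "raw latents".
    hs = output[0] if isinstance(output, tuple) else output
    captured["latents"] = hs.detach()

def nudge(module, args, output):
    # Returning a non-None value from a forward hook replaces the layer output.
    # The random delta stands in for whatever correction the large model proposes.
    hs = output[0] if isinstance(output, tuple) else output
    hs = hs + 0.05 * torch.randn_like(hs)  # placeholder nudge
    return (hs,) + output[1:] if isinstance(output, tuple) else hs

layer = model.model.layers[len(model.model.layers) // 2]  # mid-depth, arbitrary
h1 = layer.register_forward_hook(capture)  # read latents out
h2 = layer.register_forward_hook(nudge)    # write a nudge back in

ids = tok("The capital of France is", return_tensors="pt")
with torch.no_grad():
    model(**ids)
h1.remove(); h2.remove()

print(captured["latents"].shape)  # what gets serialized for the large model
```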
As an aside, 95 pages into the system card for Claude Opus 4.6, Anthropic acknowledges that they have disabled prompt prefill.
As far as I understand, its caches are not a "next-turn" thing but a TTL thing.
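For context, here's roughly what I mean, as a minimal sketch of prompt caching in the Messages API (the model id and prefix text are placeholders): everything up to a cache_control marker gets cached, and the entry expires on a TTL (around five minutes, refreshed on each hit) rather than being scoped to the next turn.

```python
import anthropic

client = anthropic.Anthropic()

BIG_STABLE_PREFIX = "<the long, stable context goes here>"  # placeholder

resp = client.messages.create(
    model="claude-sonnet-4-5",  # illustrative model id
    max_tokens=512,
    system=[
        {
            "type": "text",
            "text": BIG_STABLE_PREFIX,
            # Everything up to this marker is cached. The entry expires on a
            # TTL (~5 min, refreshed on hits), not when the next turn arrives;
            # prefixes below a minimum token count aren't cached at all.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "What changed since yesterday?"}],
)
print(resp.usage)  # cache_creation_input_tokens / cache_read_input_tokens
```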
I made the "retrieve" tool, which pulls back previously removed content, append to the conversation rather than put it back where it previously was. But it's a bit premature to know whether that's a real optimization.
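Why appending might help, as a sketch with hypothetical names (the real tool differs): a TTL prefix cache matches byte-for-byte from the start of the conversation, so splicing content back into the middle invalidates everything after that point, while appending leaves the cached prefix untouched.

```python
from typing import Dict, List

Message = dict
Archive = Dict[str, Message]  # previously removed messages, keyed by id

def retrieve(messages: List[Message], archive: Archive, item_id: str) -> List[Message]:
    """Re-add removed content by appending, not splicing.

    Every earlier message stays byte-identical, so a TTL-based prompt
    cache can still match the existing prefix. Reinserting at the
    original position would shift the prefix and miss the cache.
    """
    restored = archive.pop(item_id)
    return messages + [{
        "role": "user",
        "content": f"[retrieved #{item_id}]\n{restored['content']}",
    }]
```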