I find it fascinating to give LLMs huge stacks of reflective context. It's incredible how good they are at handling huge amounts of CSV-like data, and I imagine they would be good at trimming their own context down.

I did some experiments exposing the raw latent states of a small 1B Gemma model (via hooks) to a large model as it processed data. I'm curious whether it's possible for the large model to nudge the smaller model's latents to get the outputs it wants. I desperately want to get thinking out of tokens and into latent space; it's something I've been chasing for a while.
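
For reference, the hook setup was roughly like the sketch below (Hugging Face transformers + PyTorch; the checkpoint name and the model.model.layers path are assumptions on my part, so adjust for whatever Gemma variant you use):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Assumed checkpoint; any small Gemma text model works the same way.
    MODEL_ID = "google/gemma-3-1b-it"

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)
    model.eval()

    captured = {}  # layer index -> latent tensor from the last forward pass

    def make_hook(idx):
        def hook(module, inputs, output):
            # Decoder layers return a tuple; element 0 is the hidden-state
            # tensor of shape (batch, seq_len, hidden_dim).
            hidden = output[0] if isinstance(output, tuple) else output
            captured[idx] = hidden.detach()
        return hook

    # Register a forward hook on every decoder layer so the raw latents can be
    # read out (returning a modified tensor from a hook is one way to nudge them).
    handles = [layer.register_forward_hook(make_hook(i))
               for i, layer in enumerate(model.model.layers)]

    inputs = tokenizer("Summarize the following CSV rows:", return_tensors="pt")
    with torch.no_grad():
        model(**inputs)

    for i, h in captured.items():
        print(f"layer {i}: {tuple(h.shape)}")

    for handle in handles:
        handle.remove()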

reply
Yes, I think there is untapped potential in figuring out how to understand and use the latent space. I'm still at the language layer. I occasionally stumble across something that seems to tap into something deeper, and I'm getting better at finding those. But direct observability and actuation of those lower layers is an area that I think is going to be very fruitful if we can figure it out.
reply
I'm sure you're aware, but it's worth pointing out that you will lose all your cache-hit discounts with some providers. The next turn will incur the cost of the whole trajectory, billed at fresh input-token rates.

As an aside, 95 pages into the system card for Claude Opus 4.6, Anthropic acknowledges that they have disabled prompt prefill.

reply
Yes, I have already made some deliberate cache decisions and plan to do more once it's working the way I imagine. I think the trimmed-down context will have a much bigger effect than the cache stuff, though.

As far as I understand, its caches are not a "next-turn" thing but a TTL thing.
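
Roughly how I understand it with Anthropic's API, sketched below (the model alias and the default TTL are from memory, so treat them as assumptions):

    # Sketch of a cache breakpoint with the Anthropic Python SDK. The discount
    # only applies if a later request re-sends the exact same prefix before the
    # cache TTL (roughly five minutes by default) expires; trimming or
    # reordering that prefix means paying fresh input-token rates again.
    import anthropic

    client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set

    big_context = "...the whole stack of CSV-like data..."  # placeholder

    response = client.messages.create(
        model="claude-sonnet-4-5",  # assumed alias; any caching-capable model
        max_tokens=1024,
        system=[{
            "type": "text",
            "text": big_context,
            # Everything up to this block is cached for subsequent requests.
            "cache_control": {"type": "ephemeral"},
        }],
        messages=[{"role": "user", "content": "Summarize the data above."}],
    )
    # usage reports cache_creation_input_tokens / cache_read_input_tokens,
    # which is the easiest way to check whether a turn actually hit the cache.
    print(response.usage)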

I made the "retrieve" tool, which is what pulls previously removed content back, append to the conversation rather than putting it back where it originally was. But it's a bit premature to know whether that's a real optimization.
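
To make that concrete, here's a stripped-down sketch of the idea (names like Trimmer and retrieve are hypothetical, not my actual code):

    from dataclasses import dataclass, field

    @dataclass
    class Trimmer:
        """Keeps trimmed-out messages in a side store, keyed by a short id."""
        conversation: list = field(default_factory=list)
        removed: dict = field(default_factory=dict)

        def trim(self, index: int, key: str) -> None:
            # Replace the message with a stub so the model knows what it can ask for.
            self.removed[key] = self.conversation[index]
            self.conversation[index] = {
                "role": "user",
                "content": f"[trimmed: call retrieve('{key}') to restore]",
            }

        def retrieve(self, key: str) -> None:
            # Append the removed content to the end of the conversation instead
            # of splicing it back in place, so the already-sent prefix stays intact.
            self.conversation.append(self.removed[key])

    # Usage sketch
    t = Trimmer(conversation=[
        {"role": "user", "content": "Here are 10k rows of CSV..."},
        {"role": "assistant", "content": "Summary of the rows..."},
    ])
    t.trim(0, "csv-rows")
    t.retrieve("csv-rows")  # comes back as the newest message, not at index 0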

reply