Outputting "filler" tokens also basically doesn't require much "thinking" from an LLM, so the "attention budget" can be spent computing something else during the forward passes that produce those tokens. So besides the additional constraints imposed, you're also removing one of the ways it thinks. Explicit CoT helps mitigate some of this, but if you want to squeeze out every drop of computational budget you can get, I'd think it beneficial to keep the filler as-is.

If you really wanted to remove the filler, just have a separate model summarize the output.

This is true, but I also think the input context isn't the only function of those tokens...

As those tokens flow through the QKV transforms, on 96 consecutive layers, they become the canvas where all the activations happen. Even in cases where it's possible to communicate some detail in the absolute minimum number of tokens, I think excess brevity can still limit the intelligence of the agent, because it starves its cognitive budget for solving the problem.
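The "canvas" idea can be sketched with a toy single-head attention layer (purely illustrative; the projections, sizes, and values are made up and bear no relation to any real model's weights). The point is just that every extra token position, filler or not, contributes key/value vectors that later positions can read from:

```python
# Toy single-head self-attention (illustrative sketch only).
import numpy as np

rng = np.random.default_rng(0)
d = 8  # hidden size, arbitrary for the sketch

def attend(hidden):
    """hidden: (seq_len, d). Simplified QKV with identity projections."""
    scores = hidden @ hidden.T / np.sqrt(d)
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    # Each position's output mixes in every position's value vector.
    return weights @ hidden

short = rng.normal(size=(4, d))                      # terse prompt: 4 positions
long = np.vstack([short, rng.normal(size=(8, d))])   # same prompt + 8 extra positions

# The final position of the longer sequence aggregates over 12 value
# vectors instead of 4 -- more intermediate state for attention to use,
# and this repeats at every layer of the stack.
print(attend(short).shape, attend(long).shape)  # (4, 8) (12, 8)
```

In a real transformer this happens at every layer, so each additional token adds a full column of per-layer hidden states that downstream positions can attend into.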

I always talk to my agents in highly precise language, but I let A LOT of my personality come through at the same time. I talk to them like a really good teammate would: one who has a deep intuition for the problem and knows me personally well enough to talk with me in rich abstractions and metaphors, while still having an absolutely rock-solid command of the technical details.

But I do think this kind of caveman talk might be very handy in a lot of situations where the agent is doing simple obvious things and you just want to save tokens. Very cool!
