That's where frontier pulls ahead for sure, at least on the big frontier models - though I haven't formalized those findings because...time.

Necessary disclaimer: forge isn't technically concerned with model quality, just the execution of tool calls. Now for the actual answer...

What I found to be the limiting factor with small models in the 14B range was "effective attention". Beyond a certain point, still well within their trained context window, I start to see degradation. I don't have hard numbers for it, but that's where Opus and the like can just keep going for ages. I did come up with a tool-call message-history collapse that I might bring into forge one day (effectively cleaning up the message history intelligently so the model doesn't lose track as easily).

That being said, my coding eval suite for my agentic coding harness does have some refactor tasks and feature additions (everything is done on an actual sandboxed repo) and the small models can knock out those tasks even while pushing the 50-60 tool call mark. But I wouldn't trust them to do more than 1 of those in the same session.

by jonnyasmar 5 hours ago

The "effective attention" framing nails what I keep noticing too. Sonnet's official context is huge in principle, but in a real coding session where the agent is reading 30+ files, running grep, processing test output, emitting diffs — somewhere around 60-80k effective tokens I can feel it start to "skim" earlier context rather than reason over it. The thing it forgot isn't out of window; it's just not weighted highly enough anymore.

The tool-call history collapse is a problem I'd pay real money to have solved cleanly. My crude manual version: keep the function calls but drop or summarize the responses for anything older than ~15 turns. Most of the "what was I doing" signal lives in the calls, not the outputs. Letting the model itself mark "I'm done with that thread, compress the responses" feels like the right abstraction, but I haven't seen anyone ship it well yet.

A per-model "compaction aggressiveness" knob in Forge could be interesting — the small-model effective-attention cliff might respond to earlier/heavier trimming.
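For what it's worth, the crude manual version above (keep the calls, stub out old responses) can be sketched in a few lines. This assumes OpenAI-style message dicts where tool results carry `role == "tool"` and one assistant message counts as one turn; `collapse_old_tool_outputs`, `keep_recent_turns`, and the stub text are made-up names for illustration, not anything from Forge.

```python
from typing import Any

Message = dict[str, Any]


def collapse_old_tool_outputs(
    messages: list[Message],
    keep_recent_turns: int = 15,
    stub: str = "[tool output elided]",
) -> list[Message]:
    """Return a copy of the history with old tool responses replaced by a stub.

    Assistant messages (including their tool calls) are always kept verbatim;
    only the content of tool responses older than the recent window is dropped.
    """
    # Treat every assistant message as the start of one turn.
    assistant_idx = [i for i, m in enumerate(messages) if m.get("role") == "assistant"]

    if keep_recent_turns <= 0:
        cutoff = len(messages)  # nothing counts as recent
    elif len(assistant_idx) <= keep_recent_turns:
        cutoff = 0  # the whole history is still recent
    else:
        cutoff = assistant_idx[-keep_recent_turns]  # first index of the recent window

    collapsed: list[Message] = []
    for i, message in enumerate(messages):
        message = dict(message)  # shallow copy so the caller's history is untouched
        if i < cutoff and message.get("role") == "tool":
            message["content"] = stub
        collapsed.append(message)
    return collapsed
```

The `keep_recent_turns` argument is the obvious place for a per-model aggressiveness setting: a smaller window for a 14B model, a larger one for a frontier model. Swapping the stub for a summary would need an extra model call, which this sketch leaves out.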

by noosphr 5 hours ago

>The tool-call history collapse is a problem I'd pay real money to have solved cleanly.

It's general attention collapse and it happens everywhere once you start noticing it.

The simplest example, which even frontier models fail at, is something of the form `A and not B', which they keep insisting means `A and B' after the text gets pushed far enough back in the context.

The only solution, I think, that is even theoretically capable of fixing this is using a different form of attention. One which innately understands tree-like structures and binds tree nodes close together regardless of overall distance from the end of the stream.

Incidentally this is what I'm also working on at $job.

by zambelli 5 hours ago

Forge does have tiered compaction, and it's configurable! Defaults are currently probably a bit on the high side for catching effective attention, but that might be a part of the code that interests you the most.

src/forge/context/ - specifically TieredCompact in strategies.py. That's the furthest I took it. The tool-call collapse in particular has been useful in agentic coding, but I haven't formalized/generalized it yet. I think within forge it'll be a callable tool that will rely on the model knowing when to trigger it (as you said - "I'm done with the task, can collapse"). That's the part I need to abstract out of my bespoke implementation.

by zambelli 5 hours ago

At the moment TieredCompact is naive. It uses context thresholds the consumer determines and fires when those thresholds are hit. It just does different things at different threshold levels.
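Roughly, threshold-tiered compaction of that shape could look like the sketch below. To be clear, this is not the actual TieredCompact code from forge: `Tier`, `compact`, `estimate_tokens`, and both example actions are hypothetical names, and the 4-characters-per-token estimate is a crude assumption.

```python
from dataclasses import dataclass
from typing import Any, Callable

Messages = list[dict[str, Any]]
Action = Callable[[Messages], Messages]


@dataclass(frozen=True)
class Tier:
    threshold: float  # fraction of the context budget in use (0.0 to 1.0)
    action: Action    # what to do once usage reaches this threshold


def estimate_tokens(messages: Messages) -> int:
    # Crude assumption: about 4 characters per token.
    return sum(len(str(m.get("content", ""))) for m in messages) // 4


def stub_tool_outputs(messages: Messages) -> Messages:
    # Lighter tier: keep every message, but drop tool response bodies.
    return [
        {**m, "content": "[elided]"} if m.get("role") == "tool" else m
        for m in messages
    ]


def keep_system_and_tail(messages: Messages, tail: int = 4) -> Messages:
    # Heavier tier: keep system messages plus only the last few other messages.
    head = [m for m in messages if m.get("role") == "system"]
    rest = [m for m in messages if m.get("role") != "system"]
    return head + rest[-tail:]


def compact(messages: Messages, budget_tokens: int, tiers: list[Tier]) -> Messages:
    """Apply the action of the highest tier whose threshold has been reached."""
    usage = estimate_tokens(messages) / budget_tokens
    for tier in sorted(tiers, key=lambda t: t.threshold, reverse=True):
        if usage >= tier.threshold:
            return tier.action(messages)
    return messages  # below every threshold: leave the history alone
```

The consumer picks the thresholds, and nothing here knows about the task or the model, which is what "naive" means in this context.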

I think your idea of using task shape to set those thresholds dynamically (or even moving to model-triggered compaction) is the key, but it's a trickier implementation. That's what I haven't gotten around to yet.

Definitely on my todo list but happy to check out a PR if you have something in mind.

Some additional info on my current public hack is also at: https://github.com/antoinezambelli/forge/blob/main/docs/USER...

by jonnyasmar 5 hours ago

Honestly probably not a PR from me right now — I'm in the middle of shipping something else — but the design idea I keep returning to is splitting the trigger into two signals:

1. Runtime-computed "context pressure" — tokens-since-last-compaction, depth of tool-call nesting, response/call ratio in recent turns. The runtime computes this; the model never sees it.

2. Model-emitted "natural breakpoint" — a tool call the model fires when it perceives it's done with a thread (file closed, task complete, branch abandoned).

Compaction fires on the AND of both. Keeps the model from compacting mid-reasoning-chain, and keeps the runtime from waiting until 90% context for the model to notice on its own.
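A minimal sketch of that AND gate, with some loud assumptions: pressure is reduced to tokens-since-last-compaction only (nesting depth and response/call ratio are left out), a breakpoint signal only counts on the turn it was emitted, and `CompactionGate`, `on_turn`, and the 0.5 threshold are hypothetical names and values, not part of any existing library.

```python
from dataclasses import dataclass


@dataclass
class CompactionGate:
    budget_tokens: int
    pressure_threshold: float = 0.5   # assumed default; tune per model
    tokens_since_compaction: int = 0  # runtime-side state, never shown to the model

    def on_turn(self, new_tokens: int, model_signaled_breakpoint: bool) -> bool:
        """Record one turn and report whether compaction should fire now."""
        self.tokens_since_compaction += new_tokens
        pressure = self.tokens_since_compaction / self.budget_tokens

        # Fire only when BOTH signals agree: the runtime sees enough pressure
        # and the model has just declared a natural breakpoint.
        fire = model_signaled_breakpoint and pressure >= self.pressure_threshold
        if fire:
            self.tokens_since_compaction = 0  # the caller compacts; counter restarts
        return fire
```

Because the breakpoint is not remembered across turns, a signal sent while pressure is low is simply ignored, so compaction never fires later in the middle of a reasoning chain on the strength of a stale signal. A hard ceiling for the case where the model never signals would have to be added separately.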

by jonnyasmar 5 hours ago

The "model triggers it" pattern is exactly the right shape, but there's a subtle failure mode in it: models are notoriously bad at perceiving their own context pressure. Asking "are you done with that thread?" lands well; asking "would compacting now help you?" doesn't, because the model lacks a reliable internal signal for "I'm starting to skim." You almost have to tie the compaction trigger to task-shape signals (file closed, test passed, agent reports a milestone hit) rather than self-assessment.

Going to actually go read TieredCompact tonight — curious whether you've ended up tying triggers to task signals or kept them on model self-report.

by hedgehog 2 hours ago

That's a very insightful observation. How could you explain that using the analogy of a pancake breakfast?

by 5 hours ago

deleted

by Retr0id 4 hours ago

I almost said "it's jarring to see a human speaking fluent claude" but then I realized you're just a spambot.

by henry2023 4 hours ago

Generated comments are not allowed.

https://news.ycombinator.com/newsguidelines.html#generated
https://news.ycombinator.com/item?id=47340079

by arijun 4 hours ago

Why do you think their comment is AI generated? I didn’t get that from it but I’m no expert.

by fc417fc802 1 hour ago

The general tone (it just feels like it's an LLM) but also check the account history. It's a 2018 account that had never commented until today's flood of suspicious comments.

by klipt 3 hours ago

Maybe the em dash?

by jaboostin 4 hours ago

AI slop