Depending on how you measure "improvement" they already have or they never will :-/
Measuring capability of the model as a ratio of context length, you reach the limits at around 300k-400k tokens of context; after that you have diminishing returns. We passed this point.
Measuring capability purely by output, smarter harnesses in the future may unlock even more improvements in outputs; basically a twist on the "Sufficiently Smart Compiler" (https://wiki.c2.com/?SufficientlySmartCompiler=)
That's the two extremes but there's more on the spectrum in between.
you can also execute larger tasks than this using subagents to divide the work so each segment doesn’t exceed the usable context window. i regular execute tasks that require hundreds of subagents, for example.
in practice the context window is effectively unlimited or at least exceptionally high — 100m+ tokens. it just requires you to structure the work so it can be done effectively — not so dissimilar to what you would do for a person
How to organize code like you said, and how agents interact with it, to keep the actual context window small is the fundamental challenge.
I looked at that response by GP (rgbrenner) and refrained from replying because if someone is both running hundreds of agents at a time AND oblivious to what "context window" means, there is no possible sane discourse that would result from any engagement.
Doesn't change my point: the amount of code the agent can operate on is very large, if not unlimited, as long as you put even a little bit of thought into structuring things so it can be divided along a boundary.
If you let the codebase degrade into spaghetti, then the LLM is going to have the same problem any engineer would have with that. The rules for good code didn't disappear.
It's like like if your context window with one agent is n, your context window with 10 agents is n/10. It is some skill, but that is also where a lot of the advances are coming in.
Assistant: “I propose A”
User: “Actually B is better”
Assistant: “you’re absolutely right”
User: “actually let’s go with C”
Assistant: “Good choice, reasons”
User: “wait A is better”
Assistant: “Great decision!”