This is the exact point I make whenever people say LLMs aren't deterministic and therefore not useful.
Yes, they are "stochastic". But you can use them to write deterministic tools that produce machine-readable output the LLM can consume. As you mention, you keep building more of these tools and tying them together, and then you have a deterministic "network" of "lego blocks" that you can run repeatably.
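A minimal sketch of what one of those "lego blocks" might look like (the tool and pipeline names here are made up for illustration): a pure function that takes text and emits machine-readable JSON, so identical input always yields identical output, and blocks can be chained into a repeatable network.

```python
import json

def line_count_tool(text: str) -> str:
    """Deterministic 'lego block': same input always yields the same
    machine-readable JSON, which an LLM or another tool can consume."""
    lines = text.splitlines()
    report = {
        "lines": len(lines),
        "blank_lines": sum(1 for l in lines if not l.strip()),
        "longest_line": max((len(l) for l in lines), default=0),
    }
    return json.dumps(report, sort_keys=True)

def pipeline(text: str) -> dict:
    """Chain blocks: every step is deterministic, so re-running the
    whole 'network' on the same input gives identical results."""
    report = json.loads(line_count_tool(text))
    report["passes_style"] = report["longest_line"] <= 80
    return report
```

The LLM only has to call the tool and read the JSON; the stochastic part never touches the check itself.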
OTOH, there's a lot you can do for evaluation before a human ever sees the artifact: does the site load, does it behave the same, did anything major change on the happy path, etc. There's a recent-ish paper where, instead of classic "LLM as a judge", they used LLMs to come up with rubrics, and then had other instances check the original prompt plus rubrics on a binary scale. They saw improvements across a lot of evaluations.
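Roughly, the rubric approach could look like this sketch (the `llm` callable is a hypothetical stand-in for whatever model API you use, not a real library; the paper's exact prompts will differ):

```python
def make_rubric(task_prompt: str, llm) -> list[str]:
    # One LLM instance turns the task into concrete pass/fail criteria.
    raw = llm(f"List binary pass/fail criteria for: {task_prompt}")
    return [line.strip("- ") for line in raw.splitlines() if line.strip()]

def grade(task_prompt: str, artifact: str, rubric: list[str], llm) -> float:
    # Other instances answer each criterion on a binary scale;
    # the score is simply the fraction of criteria that pass.
    passed = 0
    for criterion in rubric:
        answer = llm(
            f"Task: {task_prompt}\nOutput: {artifact}\n"
            f"Criterion: {criterion}\nAnswer strictly YES or NO."
        )
        passed += answer.strip().upper().startswith("YES")
    return passed / len(rubric)
```

Binary questions are much easier for a model to answer consistently than "rate this 1-10", which is presumably where the improvement comes from.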
Then there's "evaluate by having an agent do it" for documentation tracking. Say you have a project, you implement a feature, and you document the changes. You can then have an agent take that documentation and "try it out". That should give you much faster feedback loops.
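The simplest version of "try it out" could be purely mechanical, before any agent judgment is involved: pull the example commands out of the docs and run them. A sketch, assuming docs that prefix runnable commands with `$ ` (that convention, like everything else here, is just an assumption for illustration):

```python
import subprocess

def try_documentation(doc_text: str) -> list[tuple[str, bool]]:
    # Hypothetical sketch: extract the '$ '-prefixed example commands
    # from the feature docs and run them, reporting which still work.
    results = []
    for line in doc_text.splitlines():
        if line.startswith("$ "):
            cmd = line[2:]
            proc = subprocess.run(cmd, shell=True, capture_output=True)
            results.append((cmd, proc.returncode == 0))
    return results
```

An agent can then take the failures as its starting context: either the feature regressed or the docs drifted, and both are worth flagging.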
Another thing that gets quantized is video preferences, to maximize engagement.
Larger composition, though, starts to run into typical software design problems, like dependency graphs, shared state, how to upgrade, etc.
I've been working on this front for over two years now too: https://github.com/smartcomputer-ai/agent-os/
> once they unlock one capability,
What does it mean to unlock? It's an LLM; nothing is locked. The output is only as good as the context, the model, and the environment. Nothing is hidden or locked.