I am doing something similar: I have a parser that looks for changes in documentation, matches them against the GraphQL schema, and generates code using Apollo. In a nutshell, it's a code generator written with Claude that generates more code; on failure it goes back to Claude to fix the generator and asks a human for review.
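The loop I mean looks roughly like this. A minimal sketch, assuming a dict-based doc snapshot and a pluggable validator; every name here (`detect_doc_changes`, `generate_stub`, etc.) is illustrative, not my actual code, and the real version calls Apollo codegen and Claude where the stubs are:

```python
def detect_doc_changes(old_docs, new_docs):
    """Return fields whose documentation text changed between snapshots."""
    return {k: v for k, v in new_docs.items() if old_docs.get(k) != v}

def match_schema(changes, schema_fields):
    """Keep only changed fields that actually exist in the GraphQL schema."""
    return {k: v for k, v in changes.items() if k in schema_fields}

def generate_stub(field, description):
    """Stand-in for the Apollo codegen step: emit a typed resolver stub."""
    return f"// {description}\nexport const {field} = () => {{ /* TODO */ }};"

def codegen_loop(old_docs, new_docs, schema_fields, validate):
    """Generate code for matched doc changes; collect failures for review.

    In the real pipeline, items in `needs_review` would go back to Claude
    to fix the generator, then to a human for sign-off.
    """
    generated, needs_review = {}, []
    changes = match_schema(detect_doc_changes(old_docs, new_docs), schema_fields)
    for field, desc in changes.items():
        stub = generate_stub(field, desc)
        if validate(stub):
            generated[field] = stub
        else:
            needs_review.append(field)
    return generated, needs_review
```

The point of the `validate` hook is that the generated code never ships unchecked: anything that fails validation is routed to the fix-the-generator step instead of production.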
I'm not going to trust a scripted codegen without any logic for something like API integration.
The post provides a lot of good food for thought based on experience, which is exactly what the title conveys.
> We chose the second because we didn’t want to overfit our assumptions.
> Some of it went better than expected.
> But they also broke in very unexpected ways, sometimes spectacularly.
You clearly missed the whole point of the article, which is to experiment with agents and explore the limits of having them run wild.
Efficient use of tokens and deciding which tasks to delegate are secondary to the experiment. Optimizing these is in any case premature if you don't understand the limits of the models.
I think you completely missed the point - they built a product purely using agents and deployed it to production for others to use. Read what the product actually does first.
What evidence? There is zero evidence. It's deployed to production, but that doesn't mean it works correctly or is free of bugs - which is exactly my point, and why you use algorithms for these types of things: they're testable, repeatable and scalable.
With LLM slop it's just that - slop.