I also think having granular, tightly controlled steps is much friendlier to smaller, cheaper, more specialized models than to some ginormous behemoth of a model that can automate your tests or crank out 5 novels of CSI fan fic in a snap.
I think you are on to something. But I also think this sort of system lends itself to not needing really good LLMs to do impressive things. I've noticed that the quality of a lot of these LLMs just gets worse the more datapoints they need to track. But if you break it up into smaller, easier-to-consume chunks, all of a sudden you need a much less capable LLM to get results comparable to or better than the SOTA.
Why pay extra money for Opus 4.7 when you could run Qwen 3.6 35b for free and get similar results?
It's one thing for a model to be very clearly instructed to add a REST endpoint to an existing Django app and hook up a button to it on the front end, vs. "Design me a youtube". The smaller models can pretty dependably do the first and fall flat on the second.
You prompt for what you want it to do, and it will write e.g. python scripts as needed for the looping part, and for example use claude -p for the LLM call.
You can build this in 10 minutes.
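Roughly this shape, if it helps (the paths, prompt, and retry cap are made up, and I'm leaving out whatever permission flags the claude CLI would need to actually edit files): the loop and the pass/fail check are plain python, and only the "fix it" step shells out to `claude -p`.

```python
import subprocess

MAX_ATTEMPTS = 5  # arbitrary cap so the loop can't run forever

for attempt in range(MAX_ATTEMPTS):
    # Deterministic part: run the test suite and check the exit code.
    tests = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
    if tests.returncode == 0:
        print("tests green, done")
        break

    # Non-deterministic part: one scoped LLM call via the claude CLI.
    prompt = (
        "The test suite is failing with the output below. "
        "Fix the code (not the tests).\n\n" + tests.stdout + tests.stderr
    )
    subprocess.run(["claude", "-p", prompt])
else:
    print("giving up after", MAX_ATTEMPTS, "attempts")
```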
I don’t use a cloud platform, so I can’t comment on that part. I’d say just run it on your own hardware; it’s probably cheaper too.
Everyone misses this pattern with skills: you can just drop code alongside a SKILL.md to guarantee certain behaviors, but for some reason everyone's addicted to writing prompts. You don't even need to build a CLI. A simple skill.py with tasks does it. You can even have helpers that call `claude -p`!
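As a rough illustration of that pattern (the file layout and function names are my own, not any official skills API): a skill.py next to the SKILL.md that does the deterministic work in plain code and only reaches for `claude -p` when a step genuinely needs a model.

```python
# skill.py -- lives next to SKILL.md; the skill instructions just say
# "run `python skill.py <task>`", so the behavior is guaranteed by code.
import subprocess
import sys

def format_and_test():
    """Deterministic task: no LLM involved at all (ruff/pytest are just examples)."""
    subprocess.run(["ruff", "format", "."], check=True)
    subprocess.run(["pytest", "-q"], check=True)

def ask_claude(prompt: str) -> str:
    """Helper for the rare LLM-shaped step: shell out to the claude CLI."""
    result = subprocess.run(["claude", "-p", prompt], capture_output=True, text=True)
    return result.stdout

def summarize_diff():
    diff = subprocess.run(["git", "diff"], capture_output=True, text=True).stdout
    print(ask_claude("Summarize this diff in three bullet points:\n" + diff))

TASKS = {"format_and_test": format_and_test, "summarize_diff": summarize_diff}

if __name__ == "__main__":
    TASKS[sys.argv[1]]()
```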
couldn't you "just" have it orchestrate a bunch of subagents? a la the superpower skill
definitely a worse solution: non-deterministic orchestration + way higher token usage (unless there's a way to hide the subagent output from the orchestrator agent? i haven't used any of these platforms), but it could work in some cases
I feel like I'm falling out of whatever is popular these days. I've been using prepaid tokens and custom harnesses for a long time now. It just seems to work. I can ignore most of the news. Copilot & friends are currently dead to me for the problems I've expressly targeted. For some codebases it's not even in the same room of performance anymore, despite using the exact same GPT5.4 base model.
At the time I had access to only 4o and there was no way to guarantee that the agent would follow the flowchart if I just mentioned it in its prompt. What I ended up doing was wrapping the agent in a loop that kept feeding it the next step in the flowchart. In a way, a custom harness for the agent.
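Something like this, with `call_agent` standing in for whatever chat-completion call you're using (the step list is invented for illustration; the point is that the harness, not the model, decides what comes next):

```python
FLOWCHART = [
    "Step 1: restate the user's request in one sentence.",
    "Step 2: list the files you need to inspect.",
    "Step 3: propose a change for each file.",
    "Step 4: write the final patch.",
]

def call_agent(step_instruction: str, history: list[str]) -> str:
    # Swap this stub for the real model call (it was 4o at the time).
    return f"(model output for: {step_instruction})"

history: list[str] = []
for step in FLOWCHART:
    # The loop, not the model, advances the flowchart.
    reply = call_agent(step, history)
    history.append(f"{step}\n{reply}")
```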
Sorry, you thought a prompt was a suitable replacement for a testing suite?
Codex's short context and todo-list system combined somehow help here, though. Because of the frequent compaction, the model is forced to recheck which todo-list items haven't been done yet and which workflow skill it has to use. I used to leave it running for multiple hours to do a big cleanup and it would finish without obvious issues.
Aren’t some benchmarks giving the model multiple shots at a problem and only keeping the successful result if one appears, ignoring the failure rate?
This is cool. Can you elaborate on it? Is it flaky? Does it take a long time?
You could still use an LLM to write and extend the tests, but running the tests would be deterministic and would use fewer resources.
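Concretely, that split might look something like this (the path and prompt are illustrative): the LLM is only invoked to write the test file in the first place, and every run after that is just pytest.

```python
import pathlib
import subprocess

TEST_FILE = pathlib.Path("tests/test_invoices.py")  # illustrative path

# One-off, LLM-assisted step: only runs when the test file doesn't exist yet.
if not TEST_FILE.exists():
    prompt = (
        "Write pytest tests for invoices.py covering rounding and currency "
        f"edge cases, and save them to {TEST_FILE}."
    )
    subprocess.run(["claude", "-p", prompt])

# Everyday step: deterministic, cheap, no model in the loop.
subprocess.run(["pytest", str(TEST_FILE)], check=True)
```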
AI is being pushed so much at work right now. For non-dev stuff even. The number of things that people think are "awesome, never seen this before" is staggering.
Just because you haven't seen file format X converted to file format Y before, and now you asked the LLM to do it and it worked, doesn't mean you needed an LLM for it nor that it's remarkable. The LLM knew how to do it because it learned from a bazillion online sources for deterministic converters that cost nothing (and are open source). But now you're paying, every single time, for a non-deterministic version of it and you find it cool. It's magic ...
But I guess they deserve it.
you'll be surprised by how many people are comfortable attributing something they do not understand to Magic.
more than anything, ai lets people who couldn't, and wouldn't bother to, learn to write simple code sidestep the ones who can, and build solutions to scratch their own itch, and do it faster.
now human behavior kicks in, and they don't want to hand control back to the people who can code to solve problems.
put this together and you have a good model for understanding the AI sales pitch... it's magic.
like all magic, it's but a trick.
Technology is no different: someone has never even considered that this thing could be possible, and now they see it with their own eyes? Incredible! They don't realise that it's mundane and has been possible (in much cheaper ways) for a long time. It was like a few years ago when some journalist posted an animation showing how Horizon Zero Dawn does frustum culling and all the non-tech people were all "wow! This game unloads the game world when it's not in view! Incredible!", like... yeah? That's how games have worked since the advent of 3D?
Looking at the Mythos benchmarks, it doesn't seem like the models are that close to being truly reliable for agentic tasks.
Is it a year away, or five? That's a big difference in deciding what to build today.
So far we've seen agents spawn subagents directly, but that still means leaving the final flow control to the non-deterministic orchestrator model, and so your case is a perfect example of where it would probably fail.
Probably not explaining it very well but I think it's pretty effective at reducing token usage.
LLMs are pretty good at reasoning about workflows; it's just that when they have to apply them directly, the workflow context gets muddled with your actual task's context. That's why using an orchestration agent that delegates work to worker agents works so much better.
I still think there's a huge amount of value in having the workflow executed in a deterministic way (as code, or by a workflow engine) because it saves tokens, eliminates any possibility of not following it, and unlocks other cool things, like being able to give each step in the workflow its own focused task-specific context, splitting plans into individual actions and feeding them through a workflow one by one, and having workflow-step specific verification.
But that workflow absolutely CAN be created by an LLM, it just shouldn't be executed by one.
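A sketch of that split, assuming the LLM has already emitted the workflow as data (the JSON shape and file names are invented): the steps are executed by plain code, each LLM call sees only its own step's context, and each step carries its own verification.

```python
import json
import subprocess

# workflow.json was written by an LLM during planning; here it is *data*, not a prompt.
# Each step has a focused prompt plus a deterministic verification command, e.g.:
# [{"prompt": "Add a /health endpoint to app.py",
#   "verify": ["pytest", "tests/test_health.py"]}, ...]
steps = json.load(open("workflow.json"))

for i, step in enumerate(steps, start=1):
    # Focused, task-specific context: the model never sees the other steps.
    subprocess.run(["claude", "-p", step["prompt"]])

    # Step-specific verification, enforced by the harness, not by the model.
    result = subprocess.run(step["verify"])
    if result.returncode != 0:
        raise SystemExit(f"step {i} failed verification, stopping the workflow")
```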
"What we're not open sourcing (yet) is the runtime. "
I tell it "write a program that goes over this bunch of files and do this".
Sometimes "do this" can be invoking another claude instance.
Like, most of the stuff needed to make AI better is stuff that could have been written by hand in 2015, so why hasn't anyone used their agents to do so?
To be fair, there is probably a way to make it work the way you want. You could add an MCP server for a task queue and let the model work through each item in the queue. The tasks could be added by a deterministic system, i.e. your harness.
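A bare-bones version of that idea, leaving out the MCP plumbing entirely (the queue here is just SQLite and the prompt wording is made up): the harness enqueues work deterministically, and the model only ever sees the single item it's handed.

```python
import sqlite3
import subprocess

db = sqlite3.connect("tasks.db")
db.execute(
    "CREATE TABLE IF NOT EXISTS tasks "
    "(id INTEGER PRIMARY KEY, prompt TEXT, done INTEGER DEFAULT 0)"
)

def enqueue(prompt: str):
    """Called by the deterministic harness, never by the model."""
    db.execute("INSERT INTO tasks (prompt) VALUES (?)", (prompt,))
    db.commit()

def work_queue():
    while True:
        row = db.execute(
            "SELECT id, prompt FROM tasks WHERE done = 0 ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            break
        task_id, prompt = row
        # The model works exactly one item; it has no say over what comes next.
        subprocess.run(["claude", "-p", prompt])
        db.execute("UPDATE tasks SET done = 1 WHERE id = ?", (task_id,))
        db.commit()
```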
> Be kind. Don't be snarky. Converse curiously; don't cross-examine. Edit out swipes.