I also think having granular, tightly controlled steps is much friendlier to smaller, cheaper, more specialized models than to some ginormous behemoth of a model that can automate your tests or crank out 5 novels of CSI fan fic in a snap.
I think you are on to something. But I also think this sort of system lends itself to not needing really good LLMs to do impressive things. I've noticed that the quality of a lot of these LLMs just gets worse the more datapoints they need to track. But if you break it up into smaller, easier-to-consume chunks, all of a sudden you need a much less capable LLM to get results comparable to or better than the SOTA.
Why pay extra money for Opus 4.7 when you could run Qwen 3.6 35b for free and get similar results?
It's one thing for a model to be very clearly instructed to add a REST endpoint to an existing Django app and hook up a button to it on the front end, vs. "Design me a youtube". The smaller models can pretty dependably do the first and fall flat on the second.
You prompt for what you want it to do, and it will write e.g. python scripts as needed for the looping part, and for example use claude -p for the LLM call.
You can build this in 10 minutes.
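Roughly this shape, if it helps (the paths, prompt, and retry cap are made up, and I'm leaving out whatever permission flags the claude CLI would need to actually edit files): the loop and the pass/fail check are plain python, and only the "fix it" step shells out to `claude -p`.

```python
import subprocess

MAX_ATTEMPTS = 5  # arbitrary cap so the loop can't run forever

for attempt in range(MAX_ATTEMPTS):
    # Deterministic part: run the test suite and check the exit code.
    tests = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
    if tests.returncode == 0:
        print("tests green, done")
        break

    # Non-deterministic part: one scoped LLM call via the claude CLI.
    prompt = (
        "The test suite is failing with the output below. "
        "Fix the code (not the tests).\n\n" + tests.stdout + tests.stderr
    )
    subprocess.run(["claude", "-p", prompt])
else:
    print("giving up after", MAX_ATTEMPTS, "attempts")
```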
I don’t use a cloud platform, so I can’t comment on that part. I’d say just run it on your own hardware; it’s probably cheaper too.
Everyone misses this pattern with skills: you can just drop code alongside a SKILL.md to guarantee certain behaviors, but for some reason everyone's addicted to writing prompts. You don't even need to build a CLI. A simple skill.py with tasks does it. You can even have helpers that call `claude -p`!
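As a rough illustration of that pattern (the file layout and function names are my own, not any official skills API): a skill.py next to the SKILL.md that does the deterministic work in plain code and only reaches for `claude -p` when a step genuinely needs a model.

```python
# skill.py -- lives next to SKILL.md; the skill instructions just say
# "run `python skill.py <task>`", so the behavior is guaranteed by code.
import subprocess
import sys

def format_and_test():
    """Deterministic task: no LLM involved at all (ruff/pytest are just examples)."""
    subprocess.run(["ruff", "format", "."], check=True)
    subprocess.run(["pytest", "-q"], check=True)

def ask_claude(prompt: str) -> str:
    """Helper for the rare LLM-shaped step: shell out to the claude CLI."""
    result = subprocess.run(["claude", "-p", prompt], capture_output=True, text=True)
    return result.stdout

def summarize_diff():
    diff = subprocess.run(["git", "diff"], capture_output=True, text=True).stdout
    print(ask_claude("Summarize this diff in three bullet points:\n" + diff))

TASKS = {"format_and_test": format_and_test, "summarize_diff": summarize_diff}

if __name__ == "__main__":
    TASKS[sys.argv[1]]()
```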
couldn't you "just" have it orchestrate a bunch of subagents? a la the superpower skill
definitely a worse solution: non-deterministic orchestration + way higher token usage (unless there's a way to hide the subagent output from the orchestrator agent? i haven't used any of these platforms), but it could work in some cases
I feel like I'm falling out of whatever is popular these days. I've been using prepaid tokens and custom harnesses for a long time now. It just seems to work. I can ignore most of the news. Copilot & friends are currently dead to me for the problems I've expressly targeted. For some codebases it's not even in the same room of performance anymore, despite using the exact same GPT5.4 base model.
At the time I had access to only 4o and there was no way to guarantee that the agent would follow the flowchart if I just mentioned it in its prompt. What I ended up doing was wrapping the agent in a loop that kept feeding it the next step in the flowchart. In a way, a custom harness for the agent.
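Something like this, with `call_agent` standing in for whatever chat-completion call you're using (the step list is invented for illustration; the point is that the harness, not the model, decides what comes next):

```python
FLOWCHART = [
    "Step 1: restate the user's request in one sentence.",
    "Step 2: list the files you need to inspect.",
    "Step 3: propose a change for each file.",
    "Step 4: write the final patch.",
]

def call_agent(step_instruction: str, history: list[str]) -> str:
    # Swap this stub for the real model call (it was 4o at the time).
    return f"(model output for: {step_instruction})"

history: list[str] = []
for step in FLOWCHART:
    # The loop, not the model, advances the flowchart.
    reply = call_agent(step, history)
    history.append(f"{step}\n{reply}")
```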
Sorry, you thought a prompt was a suitable replacement for a testing suite?
Codex's short context and todo-list system combined somehow help here, though. Because of the frequent compaction, the model is forced to recheck which todo-list items haven't been done yet and which workflow skill it has to use. I used to leave it running for multiple hours to do a big cleanup and it would finish without obvious issues.
Aren’t some benchmarks giving the model multiple shots at a problem and only keeping the successful result if one appears, ignoring the failure rate?
This is cool. Can you elaborate on it? Is it flaky? Does it take a long time?
You could still use an LLM to write and extend the tests, but running the tests would be deterministic and would use fewer resources.
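Concretely, that split might look something like this (the path and prompt are illustrative): the LLM is only invoked to write the test file in the first place, and every run after that is just pytest.

```python
import pathlib
import subprocess

TEST_FILE = pathlib.Path("tests/test_invoices.py")  # illustrative path

# One-off, LLM-assisted step: only runs when the test file doesn't exist yet.
if not TEST_FILE.exists():
    prompt = (
        "Write pytest tests for invoices.py covering rounding and currency "
        f"edge cases, and save them to {TEST_FILE}."
    )
    subprocess.run(["claude", "-p", prompt])

# Everyday step: deterministic, cheap, no model in the loop.
subprocess.run(["pytest", str(TEST_FILE)], check=True)
```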
AI is being pushed so much at work right now. For non-dev stuff even. The number of things that people think are "awesome, never seen this before" is staggering.
Just because you haven't seen file format X converted to file format Y before, and now you asked the LLM to do it and it worked, doesn't mean you needed an LLM for it nor that it's remarkable. The LLM knew how to do it because it learned from a bazillion online sources for deterministic converters that cost nothing (and are open source). But now you're paying, every single time, for a non-deterministic version of it and you find it cool. It's magic ...
But I guess they deserve it.
you'll be surprised by how many people are comfortable attributing something they do not understand to Magic.
more than anything, ai lets people who couldn't, and wouldn't bother to, learn to write simple code sidestep the ones who can, and build solutions to scratch their own itch, and do it faster.
now human behavior kicks in, and they don't want to hand control back to the people who can code to solve problems.
put this together and you have a good model for understanding the AI sales pitch... it's magic.
like all magic, it's but a trick.
Technology is no different: someone has never even considered that this thing could be possible, and now they see it with their own eyes? Incredible! They don't realise that it's mundane and has been possible (in much cheaper ways) for a long time. It was like a few years ago when some journalist posted an animation showing how Horizon Zero Dawn does frustum culling and all the non-tech people were all "wow! This game unloads the game world when it's not in view! Incredible!", like... yeah? That's how games have worked since the advent of 3D?
Looking at the Mythos benchmarks, it doesn't seem like the models are that close to being truly reliable for agentic tasks.
Is it a year away, or five? That's a big difference in deciding what to build today.
So far we've seen agents spawn subagents directly, but that still means leaving the final flow control to the non-deterministic orchestrator model, and so your case is a perfect example of where it would probably fail.
Probably not explaining it very well but I think it's pretty effective at reducing token usage.
LLMs are pretty good at reasoning about workflows; it's just that when they have to apply them directly, the workflow context gets muddled with your actual task's context. That's why using an orchestration agent that delegates work to worker agents works so much better.
I still think there's a huge amount of value in having the workflow executed in a deterministic way (as code, or by a workflow engine) because it saves tokens, eliminates any possibility of not following it, and unlocks other cool things, like being able to give each step in the workflow its own focused task-specific context, splitting plans into individual actions and feeding them through a workflow one by one, and having workflow-step specific verification.
But that workflow absolutely CAN be created by an LLM, it just shouldn't be executed by one.
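A sketch of that split, assuming the LLM has already emitted the workflow as data (the JSON shape and file names are invented): the steps are executed by plain code, each LLM call sees only its own step's context, and each step carries its own verification.

```python
import json
import subprocess

# workflow.json was written by an LLM during planning; here it is *data*, not a prompt.
# Each step has a focused prompt plus a deterministic verification command, e.g.:
# [{"prompt": "Add a /health endpoint to app.py",
#   "verify": ["pytest", "tests/test_health.py"]}, ...]
steps = json.load(open("workflow.json"))

for i, step in enumerate(steps, start=1):
    # Focused, task-specific context: the model never sees the other steps.
    subprocess.run(["claude", "-p", step["prompt"]])

    # Step-specific verification, enforced by the harness, not by the model.
    result = subprocess.run(step["verify"])
    if result.returncode != 0:
        raise SystemExit(f"step {i} failed verification, stopping the workflow")
```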
"What we're not open sourcing (yet) is the runtime. "
I tell it "write a program that goes over this bunch of files and do this".
Sometimes "do this" can be invoking another claude instance.
Like, most of the stuff needed to make AI better is stuff that could have been written by hand in 2015, so why hasn't anyone used their agents to do so?
To be fair, there is probably a way to make it work the way you want. You could add an MCP server for a task queue and let the model work through each item in the queue. The tasks could be added by a deterministic system, i.e. your harness.
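A bare-bones version of that idea, leaving out the MCP plumbing entirely (the queue here is just SQLite and the prompt wording is made up): the harness enqueues work deterministically, and the model only ever sees the single item it's handed.

```python
import sqlite3
import subprocess

db = sqlite3.connect("tasks.db")
db.execute(
    "CREATE TABLE IF NOT EXISTS tasks "
    "(id INTEGER PRIMARY KEY, prompt TEXT, done INTEGER DEFAULT 0)"
)

def enqueue(prompt: str):
    """Called by the deterministic harness, never by the model."""
    db.execute("INSERT INTO tasks (prompt) VALUES (?)", (prompt,))
    db.commit()

def work_queue():
    while True:
        row = db.execute(
            "SELECT id, prompt FROM tasks WHERE done = 0 ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            break
        task_id, prompt = row
        # The model works exactly one item; it has no say over what comes next.
        subprocess.run(["claude", "-p", prompt])
        db.execute("UPDATE tasks SET done = 1 WHERE id = ?", (task_id,))
        db.commit()
```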
> Be kind. Don't be snarky. Converse curiously; don't cross-examine. Edit out swipes.