This is such a new and emerging area that I don't understand how this is a constructive comment on any level.
You can be skeptical of the technology in good faith, but I think one shouldn't be against people being curious and engaging in experimentation. A lot of us are actively trying to see what exactly we can build with this, and I'm not an AI influencer by any means. How do we find out without trying?
I still feel like we're at a "building tools to build tools" stage in multi-agent coding. A lot of interesting projects are springing up to see if they can get many agents to effectively coordinate on a project. If anything, it would be useful to understand what failed and why, so one can have an informed opinion.
To put a statement like that into perspective (50 times more productive): in the first week of the year, about as much would be accomplished as in the whole previous year put together.
The jury is still very far out on how agentic development affects mid/long term speed and quality. Those feedback cycles are measured in years, not weeks. If we bother to measure at all.
People in our field generally don't do what they know works, because by and large nobody really knows beyond personal experience, and I suspect a critical mass doesn't really care. We do what we believe works. Programming is a pop culture.
Most tests people write have to be changed if you refactor.
Where does one get started?
How do you manage multiple agents working in parallel on a single project? Surely not the same working directory tree, right? Copies? Different branches / PRs?
You can't use your Claude Code login and have to pay API prices, right? How expensive does it get?
Set an env var and ask it to create a team. If you're running in tmux, it will take over the session and spawn multiple agents, all coordinated through a "manager" agent. I recommend running it sandboxed with --dangerously-skip-permissions; otherwise it's endless approvals.
Churns through tokens extremely quickly, so be mindful of your plan/budget.
Obviously, have them work on things that don't affect each other; otherwise you'll be asking them to look across PRs, and that gets messy.
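On the "same working directory vs. copies vs. branches" question: one common approach is to give each agent its own git worktree on its own branch, so parallel edits never collide in a single checkout. A minimal sketch (the helper name and path layout are my own, not from any agent tool):

```python
# Hypothetical helper: give each agent its own git worktree and branch,
# so parallel agents never share a working directory tree.
import subprocess


def make_agent_worktree(repo: str, agent: str) -> str:
    branch = f"agent/{agent}"
    path = f"{repo}-wt-{agent}"
    # `git worktree add -b <branch> <path>` creates a new branch checked
    # out in a separate directory backed by the same repository.
    subprocess.run(
        ["git", "-C", repo, "worktree", "add", "-b", branch, path],
        check=True,
    )
    return path
```

Each agent then works in its own directory and opens its own PR; merging back is the usual branch workflow.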
Now these things are being made. I can justify spending 5-10 minutes on something without being upset if AI can't solve the problem yet.
And if not, I'll try again in 6 months. These aren't time sensitive problems to begin with or they wouldn't be rotting on the back burner in the first place.
We have 500+ custom rules that are context-sensitive, because I work on a large, performance-sensitive C++ codebase with cooperative multitasking. Many good practices are non-intuitive, and commercial code review tools don't get 100% coverage of the rules. Reviewing for them took a lot of senior engineering time.
Anyway, I set up a massively parallel agent infrastructure in CI that chunks the review guidelines into tickets, adds them to a queue, and has agents post GitHub code review comments. Then a manager agent validates the comments/suggestions using scripts and posts the review. Since these are coding agents, they can autonomously gather context or run code to validate their suggestions.
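The fan-out/validate shape described above can be sketched in a few lines. This is a toy model, not the real system: the rule list, chunk size, and the `agent_review`/`manager_validate` placeholders are all my own stand-ins for the actual agents and validation scripts.

```python
# Toy sketch: chunk review rules into tickets, let "agent" workers draft
# comments for each ticket, then have a manager step validate them
# before anything would be posted as a review.
from queue import Queue

RULES = [f"rule-{i}" for i in range(12)]   # stand-in for the 500+ guidelines
CHUNK = 5                                  # rules per review ticket


def chunk_rules(rules, size):
    # Split the guideline list into fixed-size tickets.
    return [rules[i:i + size] for i in range(0, len(rules), size)]


def agent_review(ticket):
    # Placeholder for a coding agent that reads the diff and drafts
    # one comment per rule in its ticket.
    return [{"rule": r, "comment": f"check {r}"} for r in ticket]


def manager_validate(comments):
    # Placeholder for script-based validation before posting.
    return [c for c in comments if c["comment"]]


tickets = Queue()
for t in chunk_rules(RULES, CHUNK):
    tickets.put(t)

review = []
while not tickets.empty():
    review.extend(manager_validate(agent_review(tickets.get())))

print(len(review))  # one validated comment per rule in this toy run
```

In the real setup each ticket would be handled by a separate agent in parallel, with the manager as the single writer to GitHub.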
Instantly reduced mean time to merge by 20% in an A/B test. Assuming 50% of time on review, my org would've needed 285 more review hours a week for the same effect. Super high signal as well, it catches far more than any human can and never gets tired.
Likewise, we can scale this to any arbitrary review task, so I'm looking at adding benchmarking and performance-tuning suggestions for menial profiling questions like "what data structure should I use?"
That sounds like a completely made up bullshit number that a junior engineer would put on a resume. There’s absolutely no way you have enough data to state that with anything approaching the confidence you just did.
It is based on $125/hr, and it assumes review time is inversely proportional to the number of reviewer hours available.
Then time to merge can be modelled as
T_total = T_fixed + T_review
where fixed time is stuff like CI. For the sake of this argument, T_fixed = T_review, i.e. 50% of time is spent in review. (If 100% of time were spent in review, it's more like $800k, so I'm being optimistic.)
T_review is proportional to 1/(review hours).
We know T_total was reduced by roughly 23.4% in an A/B test due to this AI tool, so I calculate how much equivalent human reviewer time would've been needed to get the same result under the above assumptions. This gives the following system of equations:
T_total_new = T_fixed + T_review_new
T_total_new = T_total * (1 - r)
where r = 23.4%. This simplifies to:
T_review_new = T_review - r * T_total
since T_review / T_review_new = capacity_new / capacity_old (because inverse proportionality assumption). Call this capacity ratio `d`. Then d simplifies to:
d = 1/(1 - r/(T_review/T_total))
T_review/T_total is the fraction of total time spent in review, so we call that `a` and get the expression:
d = 1 / (1 - r/a)
Then at 50% of total time spent on review a=0.5 and r = 0.234 as stated. Then capacity ratio is calculated at:
d ≈ 1.8797
Likewise, we have about 40 reviewers devoting 20% of a 40 hr workweek, giving us 320 hours. Multiply by (d - 1) and get roughly 281.5 hours of additional time, or $35,188/week, which over 52 weeks is $1.8 million/year.
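The arithmetic above can be checked in a few lines, using the figures from the thread (r = 0.234, a = 0.5, 320 baseline review hours/week, $125/hr):

```python
# Back-of-the-envelope check of the capacity-ratio model from the thread.
r = 0.234          # measured reduction in mean time to merge
a = 0.5            # assumed fraction of total time spent in review
hours = 320        # 40 reviewers * 20% of a 40 hr week
rate = 125         # assumed $/hr fully-loaded cost

d = 1 / (1 - r / a)            # required review-capacity ratio
extra_hours = hours * (d - 1)  # additional human review hours per week
weekly = extra_hours * rate    # weekly dollar equivalent
annual = weekly * 52           # annualized

print(round(d, 4), round(extra_hours, 1), round(weekly), round(annual))
```

This reproduces d ≈ 1.8797, about 281.5 extra hours/week, and roughly $1.8M/year under the stated assumptions.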
Of course, I think we might pay more than $125/hr once you consider health insurance and all that; likewise, our reviewers are probably not spending 20% of their time consistently. But all of those adjustments would make my dollar value higher.
The most optimistic assumption I made is 50% of time spent on review, but even this might be pessimistic.
Overall effort was a few days of agentic vibe-coding over a period of about 3 weeks. It would have been faster, but the parallel agents burn through tokens extremely quickly and hit Max plan limits in under an hour.
If you have a really big test suite to build against, you can do more, but we're still a ways off from dark software factories being viable. I guessed ~3 years back in mid 2025 and people thought I was crazy at the time, but I think it's a safe time frame.
Obviously no users will see a benefit directly but I reckon it'll speed up delivery of code a lot.
The long tail of deployable software always strikes at some point, and monetization is not the first thing I think of when I look at my personal backlog.
I also am a tmux+claude enjoyer, highly recommended.
Trying workmux with claude. Really cool combo
I actually had a manager once who would say Done-Done-Done. He’s clearly seen some shit too.
There is a component to this that keeps a lot of the software being built with these tools underground: a lot of very vocal people are quick with downvotes and criticism of things built with AI tooling, criticism that wouldn't have been applied to the same result (or even a poorer one) if a human had produced it.
This is largely why I haven't released one of the tools I've built for internal use: an easy status dashboard for operations people.
Things I've done with agent teams:
- Added a first-class ZFS backend to Ganeti.
- Rebuilt our internal "icebreaker" app (largely to add special effects and make it more fun).
- Built a "filesystem swiss army knife" for Ansible.
- Converted a Lambda function that does image manipulation and watermarking from Pillow to pyvips, and had it build versions in Go, Rust, and Zig for comparison's sake.
- Built tooling for regenerating our cache of watermarked images using new branding.
- Had it connect to a pair of MS SQL test servers and identify why log shipping was broken between them.
- Built an Ansible playbook to deploy a new AWS account.
- Made a simple video poker web app (a demo for the local users group; someone there was asking how to get started with AI).
- Had it brainstorm and build 3 versions of a crossword-themed daily puzzle (just to see what it'd come up with; my wife and I are enjoying TiledWords and I wanted to see what AI would come up with).
Those are the most memorable things I've used the agent teams to build in the last 3 weeks. Many of those things are internal tools or just toys, as another reply said. Some of those are publicly released or in progress for release. Most of these are in addition to my normal work, rather than as a part of it.
For 3-4 years I've been toying with this in various forms. The idea is an "fsbuilder" module that makes a single task that logically groups filesystem setup (as opposed to grouping by operation, as the ansible.builtin modules do).
You set the defaults (mode, owner/group, etc.) in the main part of the task, then in your "loop" you list the fs components and any necessary overrides for the defaults. The simplest could, for example, be:
  - name: Set up app config
    linsomniac.fsbuilder.fsbuilder:
      dest: /etc/myapp.conf

which defaults to a template with a source of "myapp.conf.j2". But you can also do more complex things like:

  - name: Deploy myapp - comprehensive example with loop
    linsomniac.fsbuilder.fsbuilder:
      owner: root
      group: myapp
      mode: a=rX,u+w
      loop:
        - dest: /etc/myapp/conf.d
          state: directory
        - dest: /etc/myapp/config.ini
          validate: "myapp --check-config %s"
          backup: true
          notify: Restart myapp
        - dest: /etc/myapp/version.txt
          content: "version={{ app_version }}"
        - dest: "/etc/myapp/passwd"
          group: secrets

I am using this extensively in our infrastructure and run ~20 runs a day, so it's fairly well tested. More information at: https://galaxy.ansible.com/ui/repo/published/linsomniac/fsbu...
They built the popular compound-engineering plugin and have shipped a set of production-grade consumer apps. They offer a monthly subscription and keep adding to it by shipping more tools.