It's 2026 and the idea that even with detailed-enough requirements you can one-shot even a workable (let alone perfect) solution also needs to die. Anthropic failed to build even something as simple as a workable C compiler, not only with a perfect spec (and reference implementations, both of which the model trained on) but even with thousands of tests painstakingly written over many person-years. Today's models are not yet capable enough to build non-trivial production software without close and careful human supervision, even with perfect specs and perfect tests. Without a perfect spec and a perfect human-written test suite the task is even harder. Maybe in 2027.
"It lacks the 16-bit x86 compiler that is necessary to boot Linux out of real mode. For this, it calls out to GCC (the x86_32 and x86_64 compilers are its own).
It does not have its own assembler and linker; these are the very last bits that Claude started automating and are still somewhat buggy. The demo video was produced with a GCC assembler and linker.
The compiler successfully builds many projects, but not all. It's not yet a drop-in replacement for a real compiler. The generated code is not very efficient. Even with all optimizations enabled, it outputs less efficient code than GCC with all optimizations disabled.
The Rust code quality is reasonable, but is nowhere near the quality of what an expert Rust programmer might produce."
For faffing about with a multi-agent system, that seems like a pretty successful experiment to me.
Source: https://www.anthropic.com/engineering/building-c-compiler
Edit: Like I think people don't realize not even 7 months ago it wasn't writing this at all.
Anthropic said the experiment failed to produce a workable C compiler:
- I tried (hard!) to fix several of the above limitations but wasn’t fully successful. New features and bugfixes frequently broke existing functionality.
- The compiler successfully builds many projects, but not all. It's not yet a drop-in replacement for a real compiler.
(source: https://www.anthropic.com/engineering/building-c-compiler)
Software that cannot be evolved is dead software. That in some PR communications they misrepresented their own engineer's report is beside the point.
> It compiled multiple projects successfully albeit less optimized.
150,000x slower (https://github.com/harshavmb/compare-claude-compiler) is not "less optimised". It's unworkable.
> Like I think people don't realize not even 7 months ago it wasn't writing this at all.
There's no doubt that producing a C compiler that isn't workable and is effectively bricked as it cannot be evolved but still compiles some programs is great progress, but it's still a long way from autonomously building production software. Can today's LLMs do amazing things and offer tremendous help in software development? Absolutely. Can they write production software without careful and close human supervision? Not yet. That's not disparagement, just an observation of where we are today.
Does anyone know how the 158000x slowdown happened? That's quite ridiculous.
I never claimed they could! I just view this as a successful experiment. I don't think Anthropic was making that claim with their experiment either.
It feels reflexive to the moment to argue against that claim, but I tend to operate with a bit more nuance than "all good" or "all bad".
The overall impression given was inaccurate and the implicit claim of a fully working end-to-end generated compiler was inaccurate. The headlines were incomplete in a way that was intentionally misleading. It was an interesting experiment and somewhat impressive but the claims were overblown. It happens.
You can call that a success (as it did something impressive even though it failed to produce a workable C compiler) but my point in bringing this up was to show that today's models are not yet able to produce production software without close supervision, even when uncharacteristically good specs and hand-written tests exist.
Edit: Maybe uncharitably is too strong, but we're talking past each other.
> It's 2026 and the idea that even with detailed-enough requirements you can one-shot even a workable (let alone perfect) solution also needs to die.
and brought up the failed Anthropic experiment as proof of that. Yes, you are talking past each other, but that is not pron's fault. It is your fault.
I don't think they tried to do that though.
> today's models are not yet able to produce production software without close supervision, even when uncharacteristically good specs and hand-written tests exist.
That's a good point anyway
Their compiler fails to compile (well, at least link) some C programs altogether, and in other cases it produces code that is 150,000x slower than a real C compiler with optimisations turned off (interestingly, the model trained on the real compiler's source code). That's not "not competitive" but "cannot be used in the real world". But even more importantly, the compiler cannot be fixed or evolved. It's bricked (at least as far as today's models' capabilities go). For any kind of software, not being able to improve or fix anything or add any new feature means it's effectively dead.
You could not use it in production even if no other C compiler existed.
- John Carmack embedded a C compiler and interpreter/runtime into Quake back in the mid 1990s as a scripting language! It was so efficient that it could be used in a real time 3D shooter. That's a solo effort as a minor component of a much larger piece of software.
- I've seen university CS courses hand out "implement a C compiler" as a homework / project exercise for students. It's not particularly difficult.
Sure, a modern C compiler like GCC has to handle inline assembly, various extensions, pragmas, intrinsics, etc... but like you said, all of those are thoroughly documented and have open source implementations to reference.
Similarly, the Rust compiler is implemented in Rust and could be used as an idiomatic reference for a generic compiler framework with input handling, parsing, intermediate representations, and so forth.
I would bet that those things are also true of at least one expensive commercial C compiler.
Try it yourself.
I've been using Claude to make a project over the last few weeks. It's written ~70k LOC to solve a complex problem. I've found that it can get surprisingly far in a 1-shot, but about 90% of the work I've had it do (measured in time and tokens) is cleaning up the junk it outputs in its first pass. I'm finding my Claude sessions have a rhythm like this:
1. Plan and implement some new feature.
2. Perform a code review of what you just did. Fix obvious problems. Flag bugs, issues, poor factoring, messy abstractions, etc. Make a prioritised list of things to fix (then fix them).
3. (Later) fixes:
- Write tests for the code you wrote and fix the bugs you find.
- Run the code through memory leak checks, and fix bugs.
- Do a performance analysis using benchmarks and profiling tools, and make any high priority performance improvements.
- Read the whole program, looking for ways in which the code you've just written could fit in better with the rest of the program. Fix any issues.
- In directory X is the full documentation for the library you're using. Reread it then review the code you wrote. Are there better ways we could make use of the library?
And so on.
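As a rough sketch of that rhythm, the snippet below queues one feature prompt followed by the review and polish passes. `run_agent` is a hypothetical callable standing in for however you drive the model, not a real API, and the prompts are paraphrased from the list above.

```python
from typing import Callable, List

# Polish passes, paraphrased from the list above.
POLISH_PASSES: List[str] = [
    "Perform a code review of what you just did. Fix obvious problems "
    "and make a prioritised list of the rest, then fix them.",
    "Write tests for the code you wrote and fix the bugs you find.",
    "Run the code through memory leak checks, and fix bugs.",
    "Benchmark and profile the code; make any high priority performance improvements.",
    "Read the whole program and fix places where the new code fits in poorly.",
    "Reread the library documentation, then review how the new code uses the library.",
]


def session_prompts(feature: str) -> List[str]:
    """Return the prompts for one feature: the feature itself, then every polish pass."""
    return [f"Plan and implement: {feature}"] + POLISH_PASSES


def run_session(feature: str, run_agent: Callable[[str], str]) -> List[str]:
    """Send each prompt in order and collect the agent's replies."""
    return [run_agent(prompt) for prompt in session_prompts(feature)]
```

With six polish passes per feature prompt, the prompt count alone is already 1:6 before any of the passes has to be repeated.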
Claude's 1-shot output is often usable, but it's consistently chock full of problems. Bugs. Memory leaks. Bad factoring. Too many globals. Poor use of surrounding code. And so on. It's able to fix many of these problems itself if you prompt it right. (Though even then the code is often still pretty bad in many ways that seem obvious to me).
At the moment I think I'm spending tokens at about a 1:9 ratio of feature work to polish. Maybe its 1-shot output is good enough quality for you. To me it's unacceptable. Maybe a few models down the line. But it's not there yet.
https://github.com/anthropics/claudes-c-compiler/issues/1
> Apparently compiling hello world exactly as the README says to is an unfair expectation of the software.
As an example, I did an exploratory attempt to add custom software over some genuinely awful Windows software for a scientific imaging station with a proprietary industrial camera. Five days later Claude and I had figured out how to USB-pcap sample images, and it's been operationalized and running smoothly for months now. 100% of the code was written by Claude, and it's all clean (I reviewed it myself). Pretty much all I did was unstick it in a few places ("hey, based on the file sizes it looks like the images are being sent as a 16-bit format").
For day to day work, I'll often identify a bug, "hey, when I shift click on this graphical component, it's not doing the right thing". I go tell Claude to write a RED (failing) integration test, then make it pass.
Zero lines of code manually written. Only occasionally do I have to intervene and rearchitect. Usually this involves me writing about ten lines of scaffold code, explaining the architectural concept, and telling it to just go.
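For anyone who hasn't worked this way, here is a minimal sketch of the red-then-green step for the shift-click example. The `Selection` class is hypothetical, made up for illustration, and is not from the project described above.

```python
class Selection:
    """Hypothetical selection model for a list of items (illustrative only)."""

    def __init__(self):
        self.selected = set()
        self.anchor = None

    def click(self, index):
        # A plain click selects exactly one item and moves the anchor.
        self.selected = {index}
        self.anchor = index

    def shift_click(self, index):
        # Shift-click selects the inclusive range from the anchor to index.
        if self.anchor is None:
            self.click(index)
            return
        lo, hi = sorted((self.anchor, index))
        self.selected = set(range(lo, hi + 1))


def test_shift_click_selects_range():
    # Written first: it fails (RED) for as long as shift_click behaves
    # like a plain click, and passes once the range logic is in place.
    sel = Selection()
    sel.click(2)
    sel.shift_click(5)
    assert sel.selected == {2, 3, 4, 5}
```

The point of writing the test first is that the bug report becomes something the agent can run, rather than a description it has to interpret.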
Assembler and linker are not part of a compiler. They are separate tools. They are also generally much simpler.
My first thought when reading Anthropic's description of the experiment was that it is unrealistically easy. It's hard to come up with realistic jobs in the 10-50KLOC range that would be this easy for an LLM. That it failed only shows how much further we still have to go.
I get that it's "novel" creation vs porting, but given that they reported that the C compiler cost them $20k in API costs, the Bun rewrite must be at least $200k, maybe even closer to a million. Pure madness.
Anthropic can always fire the Opus/Mythos token machine gun on any problem (bugs, features, security) to ensure PR success, and there would be plenty of AI-sphere startups already drinking the kool-aid that would consider the whole vibe-coding thing to Bun's benefit.
Can they, though? They tried and failed to do it in their C compiler experiment. The experimenter wrote: "I tried (hard!) to fix several of the above limitations but wasn’t fully successful. New features and bugfixes frequently broke existing functionality."
Does Firefox not have tests? Then how were over 200 CVEs found?
Are we going to be comfortable running a piece of software that has 1M lines, with who knows how many zero-days in it?
Yes, sure, they are going to use LLMs to find the CVEs, and so will the hackers. You need a day or two to fix the security issue; a hacker just needs to put it to use.
And good luck debugging a million line code base.
1M LOC == already failed.
- "CCC compiled every single C source file in the Linux 6.9 kernel without a single compiler error (0 errors, 96 warnings). This is genuinely impressive for a compiler built entirely by an AI. However, the build failed at the linker stage with ~40,784 undefined reference errors." (https://github.com/harshavmb/compare-claude-compiler)
- Overall it’s an interesting experiment, and shows the current bleeding edge of Claude’s Opus 4.6 model. However the resulting product is also a clear example of the throwaway nature of projects generated almost entirely by AI code agents with little human oversight. The prototype is really impressive, but there is no real path forward for it to be further developed. It can build the Linux kernel [for RISC-V], which is impressive. It can also build other things… if you are lucky, but you really cannot rely on it to work. (https://voxelmanip.se/2026/02/06/trying-out-claudes-c-compil...)
Anthropic themselves said that the codebase was effectively bricked and that their agents could not salvage it.
I can make a c compiler in a couple weeks just by looking up open source libraries and copying them.
I can't make any software that people will pay me money to use without taking months/years of development, research, experimentation and iteration.
Just because the original people who invented compilers had to be geniuses, doesn't mean anyone has to spend much time or thought in copying that work now.
If you can truly write a C compiler in weeks then kudos to you. How many compilers have you written so far for how many languages?
I work for big tech and I would say a large % of developers are incapable of producing a working C compiler on any reasonable time scale, certainly not weeks, even with looking at open source. I'm sure they can download one and run it. Most developers today don't even know C or assembler. They don't know how to approach the C language spec. The top 5-10% of developers/engineers can do it but even for them it's non-trivial.
Maybe if you include every application ever written, including every variation of "hello world", but if you are claiming that most serious production quality software could be written by a CS student who is simultaneously working on other classes, I'm gonna have to disagree with you.
There are plenty of open source compilers that I can copy and paste whatever I need to. I don't get why you think this would have any level of difficulty?
Of course I couldn't make a brand new compiler that was better than what's out there...
Just like a game engine, I could clone one of the thousands of engines out there pretty easily - making something better or novel would be difficult. Just making a bare bones clone of what already exists by referencing documentation and pre-existing code is relatively easy now.
Yeah, when I made a mediocre 3d game engine 20 years ago, it was brain breaking difficult work. I can make one infinitely better in a micro fraction of the time now because most of the hard stuff is done and can just be looked up now.
Do you not agree?
Sure. You can clone gcc and build it. You can clone a game engine and use it.
That depends on how you count. By number of programs that may well be right, but that's not what matters in terms of impact on the industry, as software value roughly corresponds to the number of people working on a particular piece of software (or lines of code, if you wish). By number of people/LOC most software is not in the "simpler than a C compiler" category.
But yeah, this is not a "one shot" project, none of it is. One shot doesn't work even with humans - after all, this is exactly what killed waterfall as a methodology.
Of course. The point is that a full, detailed spec isn't enough (even in the rare situations it does exist, like for a C compiler). At least for the moment, you need expert humans to supervise and direct the agents.
Vibe coders usually also let the agents write the tests, which means that the only independent human validation of the software is some cursory manual inspection. That also obviously isn't enough to validate software.
> One shot doesn't work even with humans - after all, this is exactly what killed waterfall as a methodology.
You can one-shot a C compiler with humans. LLMs' software development ability is impressive and helpful, but it is not human-level yet, even if at some tasks the agents are better than most human programmers. And while many waterfall projects failed, many succeeded (although perhaps not as efficiently as they could have). So far I don't believe agents have been able to produce any non-trivial production software autonomously.
The most difficult part of any non-trivial engineering is understanding the problem, and the first versions of a piece of software are how you reach that understanding.
That's why I do not think that AI-powered "software factories" will ever work. It's waterfall development all over again. An architect writing UML diagrams and handing them off to the team of programmers to do the essentially mundane task of implementing... the wrong thing.
AI is, however, very good at helping you go fast from the wrong first version to the less wrong second one. But you need to remember that your main task is to understand the problem that you are trying to solve.
That means EVERY role needed to develop the product was in that team. No separate corporate wide QA function, infrastructure and operations function, sales function, project management function, or domain expertise function. All the people performing those functions for that project were part of the project team.
Now this is somewhat hyperbolic, since if there is no sharing of resources whatsoever you don’t really have a single corporation.
But the idea is clarifying and helps to eliminate silos and tighten communication and feedback loops.
I miss that style of working. Although I try to break those barriers where I can as an individual contributor by just figuring out who needs to talk to who to make things happen and opening those channels of communication.
I regularly get pieces of work some product guy has thought up in an afternoon. They only care about the happy path, and sometimes only part of the happy path. I work for a global company that has to abide by rules and regulations in each country we operate in. The product guy thinks up some feature, we implement the feature, then we're told "actually, we legally aren't allowed to do this in 90% of the markets we operate in". Cool, so we add an ability to disable it in those markets. Then they come back "We can do this in some of those markets if it's implemented with [regulatory bureaucracy], so can you do that please".
Then we have to hack away at the solution because the deadline is right around the corner.
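To make that concrete, the end state of this back-and-forth is usually a three-state switch per market rather than the simple on/off flag the feature started with. A minimal sketch, with made-up feature names and market codes (nothing here comes from a real system):

```python
from enum import Enum


class Availability(Enum):
    ENABLED = "enabled"
    DISABLED = "disabled"
    ENABLED_WITH_COMPLIANCE = "enabled_with_compliance"


# Hypothetical rules: off by default, on in one market, and on with the
# extra regulatory handling in another. All names are placeholders.
FEATURE_RULES = {
    "new_feature": {
        "default": Availability.DISABLED,
        "markets": {
            "AA": Availability.ENABLED,
            "BB": Availability.ENABLED_WITH_COMPLIANCE,
        },
    },
}


def availability(feature: str, market: str) -> Availability:
    """Look up how a feature may run in a market, falling back to its default."""
    rule = FEATURE_RULES[feature]
    return rule["markets"].get(market, rule["default"])
```

Had the three states been known up front, the lookup would have been designed in from the start instead of bolted on a week before the deadline.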
This is not software engineering! None of this is related to the software. The job of a software engineer is to take a list of requirements and figure out the way we accomplish those requirements. Requirements gathering is NOT a software engineering problem. Software is implementation, product is behaviour. That's the split. The behaviour of the thing we're building needs to be known before we even try to seriously build it.
If someone had just held back for a week and done their due diligence, we would have been able to architect a solution that is scalable, extensible, and easy to maintain, and that makes the future easier.
That's a theory but I've never seen this work in practice. A piece of software is unique. If it weren't, we'd just use the cp command.
What usually happens is you get a set of requirements that looks simple. Then you start thinking about a design and see 10 different possibilities, each corresponding to a slightly different interpretation of the requirements set. You iterate a few times, reviewing the designs with whoever set the requirements and a few peers, and see more possible variations to the requirements. You need to double check its parent requirements up to the master requirements. Then you need to make time/feature/quality tradeoffs, affecting the fulfillment of requirements.
Once you start implementing, you see dependencies on other software (framework, sdk, drivers, language features,...) and understand that the other software is not what you thought, or has bugs. Or you see an issue with performance or see that one particular feature becomes unfeasible.
That's where all the complexity goes. AI doesn't change that, but can make prototyping iterations and bug hunting faster, as long as someone holds it on a leash and understands its decisions.
It has to be someone's job to push back on the Product Guy's stupid idea and answer all the awkward questions about the not-so-happy path with it. Unfortunately, because of the way we've ended up with this process, that person is often the engineer tasked with building it, without any effective political power to challenge the design process.
My senior year software engineering class had a whole section on requirements gathering.
I start with something like this prompt:
"This is a research project around <vague statement>. What do competitors, like <x>, <y>, <z> do around this, are there any blog posts or tech talks?
Are there any academic approaches or recent papers around the topic?
Can you survey any related open source projects? I know of <x> and <y>. Please include analysis of activity, github stars, number of downloads on npm/pypi/crates, and search the web for reviews or complaints or positive or negative blog posts from developers.
All claims should have links to the original sources, preferably with quoted text where appropriate.
We are going to write a research plan for how to produce this report.
The implementation of the plan will spawn subagents to survey breadth, then spawn subagents for each depth topic in detail"
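If you reuse this prompt often, it can be kept as a small template. A minimal sketch: the function name and parameters are my own invention, and the wording is the prompt above with the `<vague statement>`, `<x>`, `<y>`, `<z>` placeholders filled in.

```python
from typing import Sequence


def build_research_prompt(
    topic: str, competitors: Sequence[str], known_projects: Sequence[str]
) -> str:
    """Fill the research prompt quoted above with concrete names."""
    paragraphs = [
        f"This is a research project around {topic}. "
        f"What do competitors, like {', '.join(competitors)} do around this, "
        "are there any blog posts or tech talks?",
        "Are there any academic approaches or recent papers around the topic?",
        "Can you survey any related open source projects? "
        f"I know of {' and '.join(known_projects)}. "
        "Please include analysis of activity, github stars, number of downloads "
        "on npm/pypi/crates, and search the web for reviews or complaints or "
        "positive or negative blog posts from developers.",
        "All claims should have links to the original sources, preferably with "
        "quoted text where appropriate.",
        "We are going to write a research plan for how to produce this report.",
        "The implementation of the plan will spawn subagents to survey breadth, "
        "then spawn subagents for each depth topic in detail",
    ]
    return "\n".join(paragraphs)
```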
How do you translate "send an email to users" as a feature without a Document? ... also an incredibly waterfall thing. We Don't Do That Anymore. Thank goodness. Because it is incredibly inefficient (and not any less error-prone). And the chances that Some Guy who wrote the Document six months ago really understood the actual problem are... practically zero.
One of my favorite waterfall stories: a friend of mine who does contract programming for <big company> said that her projects were always delivered exactly on time, so you never had to apply the "double the estimates rule".
"So your projects always finish exactly on the delivery date originally given?!" Incredulity!
"Oh no. They usually take twice as long, but the difference is that, first we deliver what they asked for (which arrives exactly on the original schedule date, but is completely unusable); and then we charge them 3 times as much to deliver what they actually wanted (which takes twice as long)."
If I got detailed specs, I’d just be a coding robot. I push that work off onto juniors.
Developers are unlikely to be doing only development these days. There's ops and support to do as well, so more back and forth means less time for those things and for development.
We need to meet in the middle about requirements otherwise developers will end up doing someone else's job for them.
Then I see a solution! Why don't we simply put the entire company on one big team?
The only other observation is that as you grow teams, the number of communication channels grows quadratically, and at over 6-8 people communication starts breaking down.
So instead you make small "companies" and set a few ground rules which software they build needs to follow, and you are back at a working org producing complex software.
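For a sense of scale: if every pair of people on a team counts as a potential channel, a team of n people has n(n-1)/2 of them. The sketch below shows only that pairwise count; it assumes everyone may need to talk to everyone and ignores group channels.

```python
def channels(n: int) -> int:
    """Number of distinct pairs among n people: n * (n - 1) / 2."""
    return n * (n - 1) // 2


for size in (3, 6, 8, 16, 50):
    print(f"{size:>3} people -> {channels(size):>4} possible pairwise channels")
```

A team of 6 has 15 possible pairs and a team of 8 has 28, while 50 people in one team would have 1225, which is the arithmetic behind splitting into small "companies" with a few shared ground rules.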
Improved collaboration. Says every new CEO and manager. The notion that this is ever going to be solved, especially with different experience, views, agendas, etc., needs to die too. AI is surely not going to help, and with that roadblock iterating faster doesn’t help, because then people want to try things just for the sake of trying.
And yes, architecture and how to actually implement the designs are also part of the requirements.
The code is just the implementation, the actual problem that needs solving is one abstraction level higher.
> My take is that to accelerate processes we should reduce coordination overhead and empower individuals and teams to make decisions and execute on them.
This is funny because it's exactly what the agile/scrum training taught me 20 years ago.