That's an insane level of throughput. What's a good baseline? Prior to agentic coding, what was the typical number of PRs engineers were expected to push? Maybe 2-10?
Do people feel the software has gotten better in the last 6 months? The number of engs is prob the same so we should expect maybe a 5x faster cycle in major software apps, but I don't see it. The AI apps do change very fast, but given it's a very new field, I'd expect as much. But outside of that, I don't see it.
You end up with about 3 lines added per commit, which is not ridiculous when you consider that most would be edits rather than full additions.
Here, we have 1500 PRs and 1M LOC, which is about 650 added LOC per PR. Remember, not 650 lines total in the PR, but +650 balance after additions-removals.
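For anyone who wants to redo the arithmetic, here is a quick sketch using only the rounded figures quoted above (1500 PRs, 1M net LOC). With those round numbers it comes out nearer 667 than 650, so the exact inputs were presumably a bit different.

```python
# Rounded figures quoted in the comment above; treat them as approximations.
prs = 1_500
net_loc = 1_000_000  # additions minus removals, summed over all PRs

# Average net growth per PR (not the size of each PR's diff).
net_loc_per_pr = net_loc / prs
print(f"net LOC per PR: {net_loc_per_pr:.0f}")
```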
Fun questions for attentive readers:
- What does a project growing at a rate of one full firefox-codebase worth of LOC per year look like, a decade down the line?
- What does the line count say about the verbosity of the tool, and what does it say about outcomes that the purpose of the project isn't clearly disclosed?
- Do we have reasons to care about LOC in a world where we don't write code manually? What happens to token usage numbers when the codebase is significantly larger?
- If it was confirmed that LLM usage blows up your line count, what's the implication for codebases that want to return to manual coding after months of usage? (Say, because the tool gets expensive).
Yes, at least to the extent that we care about context windows and tokens consumed by coding agents processing code that is ultimately irrelevant to their assigned task.
Anecdotally, I've found that keeping file sizes small matters for agentic coding, not just for human readability but also for agent performance. Agents generally load entire files rather than just the part relevant to their current assignment, as a human might, so smaller files limit the incidental context they load while working a problem. That reduces input noise, so the LLM generates a tighter solution, which in turn reduces input noise for future solutions. Or at least this strategy avoids a death spiral into exploding context length.
I expect (but cannot currently prove) that keeping overall LOC down yields similar benefits even when file sizes are kept small because it spares the LLM from parsing potentially relevant files that prove irrelevant to its current task.
A notable flaw here is that I’ve not tried large vs small files in a large codebase. Most of my experimentation there has been on personal projects where even a small file contains a significant part of the project. I could see degradation when it has to load 5 files to figure out how something works.
Total LOC (tokens, really, literal lines probably don’t matter) is interesting as a factor. That might go some way towards explaining why LLMs are weirdly good at Clojure.
E.g., last I checked, Anthropic's one-shot performance on Clojure was about the same as Python or Go despite Clojure almost certainly being less represented in training data. The combination of density and simple primitives might be easier for an LLM to wrangle, ameliorating the impact of a less popular language.
There might be tons of confounding factors there. One that comes to mind is the quality of the data: it might well be that the average Clojure snippet is higher quality, due to user demographics. Very few people start writing code with Clojure, whether in college or during bootcamps.
For some reason most of the uses of "agents" are to build yet other AI products, it's turtles all the way down. Maybe that says more about the field of harnesses than it does about the power of "agents".
There is of course another sense in which the output quality is the only thing that matters. “Can I use agents to build a 1m line codebase that I want to maintain going forward.”
I take this as being exclusively a tech demo of the former. Quality (feature velocity, bugs, scalability) is not demonstrated.
Agents help a ton with the discovery, but the act of building a product needs a deeper level of thought and validation to make it actually better than what came before. So IMO what you see is people still learning what needs to be understood and crafted first hand to make a product better (including economics)
We’ll get there if more of us try
Looking at MS Office I notice a lot of small changes recently that are mostly annoying. Things like Word comments losing the focus after you @-tagged a colleague, needing to click the Outlook search field twice before you can enter text, Outlook mobile date picker losing its ability to show your and attendee's availability.
So it looks like lots of throughput, but unfortunately breaking features that work. Or wasting time on things that don’t matter such as the status bar of OneDrive search circling around the input field.
I do use Claude Code totally hands-off too, however. Mostly for UI tasks, like theming CSS, or data grids and CRUD with all the bells and whistles. I hate that stuff, and cc gets it done in minutes and mostly right. It’s also super nice to say things like “user profile in the upper right hand corner” without having to fight CSS.
/if it’s not clear, I hate dealing with css and related frameworks.
The dopamine hits are core to why people even do vibecoding (or vibecoding-in-a-dress/spec-driven development) and why they tend to overestimate its output so much. Hell, it's core to all forms of LLM-assisted development (because it feels like magic), but most of the other forms are more value, less delusion.
First hit on Google
But I’m not dismissing your concern, because it is one of the reasons I’m making this decision. I’m a professional. I’m not just here to feel good; I’m here to do a good job over the course of a career. All in, when you put together writing good, maintainable software, learning, staying mentally sharp, and speed, vibe coding could be less effective and maybe even, in the aggregate, “slower”.
The root cause is that the acceleration is pareto distributed so the modern engineering team at the moment looks like one 10x engineer, one 5x engineer, and the rest are approximately 1.5x engineers.
Prior to AI autocomplete I did 500 LoC a day, then with AI autocomplete I could do 2500 a day, and now 50k is pretty normal. Walking around tech week with my phone yielded 150k this week.
This almost reeks of "I've never cleaned up our code base because there is too much code, and didn't even bother having agents/LLM cleaning them up".
You almost never need a million lines of code - this includes your software, infra, testing and operational tools. You didn't ship the Linux kernel in 3 weeks and you know it. The code is already spaghetti, and it achieves the basic functions OK, but it will get harder and harder to simplify, untangle and maintain.
To what end and what would that even look like though? Enshittifying everything at maximum speed? The apps/platforms I use regularly - GitHub, Spotify, Google maps (just to name a few), have gotten noticeably shittier in recent times.
We easily forget that the great majority of software engineering is fixing the mistakes of other highly capable software engineers.
It's just so easy to blame the machine instead of admitting no one here is an expert on anything and they count their hits and not misses. If they did, we would find the probability of making a mistake to be higher than a frontier coding agent's.
It's a hard-headed crowd and everyone, LLM pilled or not, suffers from the Dunning-Kruger effect. All of us.
Just look at the comments. Everyone is perfect when they do things themselves.
What if AI lets you create new versions of those tools, but without the enshittification?
I say that being in the "soaking" stage of using AI to rebuild a shitty software project in 70KLOC over about 2 weeks of spare time, so this may not be as theoretical as you might think.
It's just that creating great software isn't really the SV/VC/big tech business model or main goal.
I'm not sure I fully understand what you're saying here. Isn't the value of these tools almost entirely independent of their actual software? That is, we have many good open source, self-hostable forges (Forgejo, sr.ht, etc.), lots of great music player software (Jellyfin, Symphonium, etc.), and decent maps software (OsmAnd and Organic Maps). People use GitHub, Spotify, and Google Maps -- perhaps even _put up_ with their often bad/glitchy software -- because of network effects (all three) and content/licensing partnerships (Spotify/GMaps). That proprietary data isn't something AI can help you with, right?
If you think about it, successful products rely on designing well-thought-out experiences and on customer discovery (see all the Forward-Deployed Engineer job listings at OpenAI), so code velocity becomes somewhat irrelevant.
If you’re solving the right problem and you’ve got a good team then competitive advantage comes from somewhere OUTSIDE of code velocity.
The more important question I think is does faster code yield more value long-term? At the moment, it’s like yeah we do 3.5 pull requests per day.
I’m thinking, great, good for you. You could also combine three pull requests into one and then you’re doing 1 per day. This is quantitative data that doesn’t really mean anything tangible.
My point in working on tsz is to learn how to do very big projects with AI. Eventually the same workflows and attitude can be leveraged to build customer product apps with UI as well. I see that OpenAI is leveraging automated browser testing and even videos as part of their workflow. I think as models get better this direction for making software would eventually make sense. I don't think we're there yet though. But at least, unlike OpenAI's vague claims, I can share the output with you to see!
Most of the solutions that offer a very high level of automation like Lovable are a bit too optimistic and solutions are not tightly coupled with lots of automated testing.
- Give Claude/Codex a way to verify its own work (browser, smoke tests, e2e tests, high-fidelity local environment)
- Keep all context (issue tracking, docs, ideas, plans, worklogs) in-repo (https://github.com/shepherdjerred/monorepo/tree/main/package...)
- Give Claude/Codex access to observability (Grafana, Prometheus, Tempo, PagerDuty)
- Have Claude/Codex follow good engineering guidelines like fail-fast, type safety, parse at boundaries
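To make the last guideline concrete, here is a minimal sketch of "parse at boundaries" combined with fail-fast. The `Alert` type, its fields and the 1-5 severity range are all hypothetical, made up for illustration; the idea is only that untrusted input is validated once, at the edge, and everything downstream works with a typed value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    """Hypothetical typed value used inside the system after parsing."""
    name: str
    severity: int


def parse_alert(raw):
    """Validate untrusted input once, at the boundary, and fail fast on bad data."""
    if not isinstance(raw, dict):
        raise ValueError(f"expected an object, got {type(raw).__name__}")
    name = raw.get("name")
    severity = raw.get("severity")
    if not isinstance(name, str) or not name:
        raise ValueError("name must be a non-empty string")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(severity, bool) or not isinstance(severity, int):
        raise ValueError("severity must be an integer")
    if not 1 <= severity <= 5:
        raise ValueError("severity must be between 1 and 5")
    return Alert(name=name, severity=severity)
```

Code past the boundary can then take an `Alert` and skip re-checking, which also gives an agent a single obvious place to look when input handling breaks.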
I haven't yet been able to achieve full autonomy due to cost and CI load on my homelab.
I've found this to be really helpful, e.g. "you did this last week, and now some other thing is happening" or "you tried this approach before to solve alert X but it didn't work" -- except it can discover this itself.
https://github.com/shepherdjerred/monorepo/tree/main/package...
I've also used it to store TODOs and plans. For example I might want to explore some idea and defer it for later, or some weekend have it execute on some tech debt I've put off. One last use case is asking "what did I work on in the last 2-3 weeks, is it healthy, and what additional quality checks can/should I do; is there any follow-up work?"
Essentially preserving logs extends the context window with all related problems.
but over time, if you adjust your verification rubric, it’s not too bad, gets pretty good. if you do make it do TDD, it gets kinda crazy and you’ll have 2000-3000 tests after a while or, in my common case, 6000-7000 lines of code in single files (i usually have a cron to audit files for decomposition and create tickets)
i wouldn’t use it at my job yet, but it’s been fun to use for personal projects - it’s like modded minecraft automation or factorio
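The file audit mentioned above could look something like the sketch below. This is not the commenter's actual cron job; the 500-line threshold and the suffix list are arbitrary assumptions, and ticket creation is left out.

```python
from pathlib import Path


def oversized_files(root, max_lines=500, suffixes=(".py", ".ts")):
    """Return (path, line_count) for source files longer than max_lines, largest first."""
    results = []
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in suffixes:
            # Count lines without loading the whole file into memory.
            with path.open(encoding="utf-8", errors="replace") as fh:
                count = sum(1 for _ in fh)
            if count > max_lines:
                results.append((str(path), count))
    return sorted(results, key=lambda item: item[1], reverse=True)
```

Calling `oversized_files(".")` from a scheduled job and feeding the result to whatever tracker you use would give the "audit and create tickets" loop.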
For test growth, maybe use a coverage tracker and remove redundant tests?
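A rough sketch of what that could mean in practice, assuming you already have per-test line coverage (coverage.py's dynamic contexts can record which test executed each line). The `coverage` mapping and test names here are hypothetical. Note that a test adding no new covered lines can still assert different behaviour, so treat the result as review candidates, not a delete list.

```python
def redundant_tests(coverage):
    """Flag tests whose covered lines are already covered by larger tests.

    coverage maps test name -> set of (file, line) pairs.
    Greedy pass: visit tests from largest to smallest coverage and flag
    any test that adds no line beyond those covered by the tests kept so far.
    """
    covered = set()
    redundant = []
    for name in sorted(coverage, key=lambda n: (-len(coverage[n]), n)):
        if coverage[name] <= covered:
            redundant.append(name)
        else:
            covered |= coverage[name]
    return redundant


example = {
    "test_full_flow": {("m.py", 1), ("m.py", 2), ("m.py", 3)},
    "test_step_two": {("m.py", 2)},
    "test_edge_case": {("m.py", 4)},
}
print(redundant_tests(example))
```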
Have you been able to extract libraries or tools from this project yet? If so how was that experience?
That is, do you see yourself releasing a metric harness, or sub-projects that are the equivalent of ActiveRecord, zod, or similar open source tooling that frequently originates in a large in-house project and is then exported as a stand-alone tool, utility, library or framework?
Because while AI can reimplement minor tools, its utility entirely depends on the existence of solid tools, libraries and frameworks.
Can you share what type of project that was? On the spectrum from a database engine to cat picture sharing web site (very high demand for correctness vs very lax).
- are other teams adopting this approach? What are the blockers if not?
- have there been problems where the models alone were not enough to debug and the devs had to fix it themselves?
- as the rate of changes has increased with more devs how have you dealt with concurrent writers with merge conflicts?
- if there was anything you could change in the approach you started with, what would it be?
2. Hmm, kind of. There have definitely been issues the models can’t one shot. But we still use Codex to write all the actual code with human guidance.
3. More agents :) Some teams are experimenting with centralized Agent mediated integration queues, others use normal merge queues, many have local Codex threads that monitor CI to resolve and land conflicts or failures.
4. Today’s models and codex app. We started doing all this with gpt-5 and codex-cli. The tools today, 9 months later, are so much better than what we had then.
PRs are not like this because a single bad PR can be catastrophic for your business in a way that a single bad e-vape cannot.
I would also argue that the current output from the AIs when sampled by software engineers regularly doesn't meet the bar of quality we want in our product, hence the need to review every PR and fix a substantial fraction.
If you can start to bound the impact of changes and the outputs begin to be generally acceptable unsupervised, such that all you're doing is double checking that nothing has regressed in the factory, then the sampling approach can work.
As someone who used the $20 plan, I found this pure agentic approach impossible, because I’d hit the limit fast and end up with less to show for it.
What I found works incredibly well is to provide human-written code as a reference and ask it to extend it. So I scaffold the entire thing, architect it, and write a few code samples (controllers, services, models, components, database schema, how auth works, etc.) so the LLM can have a head start on its attention (pun intended).
I usually write a stub with a lot of details on how to implement it, something like higher-abstraction pseudocode, then ask the LLM to implement it.
When it fails, it is often better to undo all of the changes, adjust the stub so it catches what failed before, and try again.
Or, commit the changes, and use a new fresh context and only address what went wrong.
-
Whenever I tried this agentic from scratch approach, I always end up disappointed; both on the outcome and on the limit that I hit before an hour even passed.
Upgrade to $200/month and you should see more usage, but even for a hardcore user like me, one can never have enough.
I'm still very jealous of those guys that got 200x usage simply by RSVP'ing to openai party
For example, actually doing a walkthrough of how to set up these allegedly super powered workflows and concrete demonstrations.
I’m not an AI skeptic. Rather, I don’t want to miss out on any actual super powers.
Basically, I am moving from “I build products without writing or reading the code” to “I build products without writing or reading the harness.”
Once the new implementation harness is prepared, I start it, but I keep the original session open. In that original session, “we” monitor the implementation harness from the outside: how effective it is, where the bottlenecks are, what breaks down, and what could be improved. From time to time, the monitoring session suggests changes to the implementation harness. We apply those changes, restart the harness, and monitor it again.
The overall approach is not to spend X hours understanding an article like this in detail, because another similar article will appear in 3 weeks. Instead, I take immediate action, learn on the fly, and replace the harness when a better pattern emerges. And yes, I still have to spend X hours on setting up, monitoring and fine tuning the new harness, but at the end I have the latest fancy "thing" working for me.
- write gherkin features for new features; update them for enhancements; don't touch them for refactors. Label your PRs with these nouns.
- use pre-push hooks for type checks, linting, unit tests, and other quick, scriptable validations.
- make a VitePress subsite in your repo, have the agents maintain it - document important principles, architecture, etc.
- make a cli command which lists all pages along with the yaml frontmatter description so agents can choose what to read without blowing up the context window.
- use ddd and monorepo - write your logic in headless layers, and compose layers into apps. agents navigate layers very successfully.
- use zod (or your language equivalent) and contract-first API development; this is my favourite bit tbh, I use orpc
- make a single skill called "code" which describes the lifecycle: open a worktree, setup .env to guarantee no conflict with other agents (choose unused ports etc - docker is good here), write or update feature file (this is where you negotiate the spec), implement, validate (e.g. using playwright mcp), pre-push checks, push and wait for review, tear down and fast forward main
- testcontainers is great for ensuring multiple agents can run tests that don't conflict
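The "cli command which lists all pages along with the yaml frontmatter description" from the list above might be sketched like this. It is a guess at the shape, not the commenter's tool: it assumes Markdown pages with a leading `---` block and a single-line `description:` key, and it deliberately uses a tiny hand-rolled parser instead of a full YAML library.

```python
from pathlib import Path


def frontmatter_description(text):
    """Return the single-line `description:` value from a leading `---` block, or None."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return None
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of frontmatter
        key, sep, value = line.partition(":")
        if sep and key.strip() == "description":
            return value.strip().strip("\"'")
    return None


def list_pages(docs_root):
    """Return (relative_path, description) pairs for every Markdown page, sorted by path."""
    root = Path(docs_root)
    pages = []
    for path in sorted(root.rglob("*.md")):
        desc = frontmatter_description(path.read_text(encoding="utf-8"))
        pages.append((path.relative_to(root).as_posix(), desc or "(no description)"))
    return pages
```

Printing one `path: description` line per page gives an agent a cheap index to pick from before it opens any full document.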
Seriously I only have one skill that's it. Everything else is in the docs. I'm feeling very productive like this, in a "making good software" sense not a LoC sense.
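For the "setup .env to guarantee no conflict with other agents" step, one possible sketch is to let the OS hand out free ports per worktree. The variable names are hypothetical, and there is a small race window: a port reported free here can be taken by another process before your service binds it, which is part of why Docker or testcontainers can be the more robust choice.

```python
import socket


def free_port():
    """Ask the OS for a currently unused TCP port on localhost."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("127.0.0.1", 0))  # port 0 means "pick any free port"
        return sock.getsockname()[1]


def render_env(names):
    """Return .env text assigning a distinct free port to each variable name."""
    ports = {}
    for name in names:
        port = free_port()
        while port in ports.values():  # avoid handing out the same port twice
            port = free_port()
        ports[name] = port
    return "".join(f"{name}={port}\n" for name, port in ports.items())
```

Writing `render_env(["API_PORT", "DB_PORT"])` into the worktree's `.env` when the worktree is created keeps parallel agents off each other's ports most of the time.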
I'm building a skill + CLI tool along those lines (for solo devs not corporates). Here is what my "lifecycle" type skill looks like right now: https://github.com/bitkentech/shipsmooth/blob/releases/dist/... (warning, heavily work in progress). You can see a demo here: https://shipsmooth.net/
I was not happy with the default code quality generated by Claude Code. So I've been adding some skill-file rules to address that, and so far happy with the results: https://github.com/bitkentech/shipsmooth/tree/main/skills/ex.... There was a similar one on HN yesterday called opencodereview: https://news.ycombinator.com/item?id=48406358
There are many such workflows out there! Matt Pocock gave a good talk about how he approaches it: https://www.youtube.com/watch?v=-QFHIoCo-Ko
Also, a skill is not a harness.
Many people use the term harness to refer to the agent coding software (e.g. Opencode, Claude Code...). I use this term more broadly to refer to the environment (set of skills, system prompts, constraints, memory, hooks, etc.). What the OP is referring to is not just one giant skill. It's usually a comprehensive ecosystem of skills, bespoke tools to make certain agent tasks deterministic (e.g. localization), and so on.
I've seen someone post GitHub repos in this thread. These can be very useful, especially if you use the same tech stack, but you won't reach the level of productivity reported by successful teams unless you invest substantial time in building your own harness. The way to do so is progressively: start with something simple that addresses the need you have on day 1. Then turn recurring prompts into skills, turn recurring coding patterns and coding style recommendations into guidelines, turn repetitive tasks for which the LLM tends to build a Python script that it occasionally gets wrong into a deterministic tool documented in a skill, etc.
And after a couple of days, weeks, and months, you'll have a very dependable harness giving you optimal productivity, without needing to invest weeks of work upfront or take the fun out of agent-assisted coding.
Hope this helps.
To do this, I "simply" asked the agent, every time it encountered an issue, how to resolve it using a validation tool or script. I also asked it to code these tools during audits. As a result, I now have over 30 rules [2] for validating its commits. It's working pretty well now.
[1] https://github.com/gildas-lormeau/rebuild-and-ruin (let the timer expire to see the "demo" mode)
[2] https://github.com/gildas-lormeau/rebuild-and-ruin/blob/a4c3...
and he gave a talk version of it in london: https://www.youtube.com/watch?v=am_oeAoUhew
Another tip is to condense the doc files into the minimal required. Sometimes I’ll end up with 5 to 6 floating around in various states of staleness. Condensing to 2-3 and removing completed tasks seems to help a lot
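If those doc files track tasks as Markdown checkboxes (an assumption, the comment doesn't say how tasks are recorded), removing the completed ones is easy to script. A minimal sketch:

```python
import re

# Matches checked task-list items such as "- [x] done" or "  * [X] done".
DONE = re.compile(r"^\s*[-*]\s+\[[xX]\]\s")


def drop_completed(markdown):
    """Remove checked-off task-list lines and keep everything else unchanged."""
    kept = [line for line in markdown.splitlines() if not DONE.match(line)]
    return "\n".join(kept) + ("\n" if markdown.endswith("\n") else "")
```

Merging several stale files into two or three still needs judgment; this only handles the mechanical part.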
And if I am working on an existing codebase, then isn't a good commit often a negative sum between added and removed lines? I don't want to bloat my codebase but make it more polished and elegant. After reading that I wonder if what they have done could have been accomplished with a far smaller LoC budget.
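If you want to check that "negative sum" on your own commits, `git diff --numstat` (and `git log --numstat`) print added and deleted counts per file, tab-separated, with `-` for binary files. A small sketch that sums them from captured output; the sample text and file paths are invented:

```python
def net_lines(numstat_output):
    """Sum added minus deleted lines from `git diff --numstat` text.

    Each line looks like "<added>\t<deleted>\t<path>"; binary files
    report "-" for both counts and are skipped.
    """
    added = deleted = 0
    for line in numstat_output.splitlines():
        parts = line.split("\t")
        if len(parts) < 3 or parts[0] == "-":
            continue
        added += int(parts[0])
        deleted += int(parts[1])
    return added - deleted


sample = "10\t2\tsrc/a.py\n0\t40\tsrc/old.py\n-\t-\tlogo.png\n"
print(net_lines(sample))
```

A negative result means the change shrank the codebase, which is the kind of commit the comment is describing.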
A lot of the focus has been on AI recently.
Three years ago we didn't have software where a non-software engineer can describe what they want in English and get working(-ish) software generated by other software. Is that not "software has gotten a lot better"?
Other than that I'm not sure how we measure "software has gotten better". New applications? More features? How do we measure sloppier? Is Google Maps suddenly taking you the wrong way more often? I'm not really doubting your subjective experience, but seriously, how do you tell? I mean a doc is a doc and a spreadsheet is a spreadsheet.
We're also only about 10 months into models that are powerful enough to potentially make a bigger difference and we are still figuring out how to use them best.
That being said, while I agree that measuring better quality of software is vague (part of the reason it is hard for models as well), there are universal things I believe every engineer will agree on. Reliability, uptime, customer feedback, legibility of your engineering, performance: these are things we often optimized for. Google Maps is a bit of a strawman because neither of us (unless you work on it) knows how much agent code there is. I think it is likely that it's little, since it was working fine prior to 2023. I could bring up GitHub reliability as an example, given how much Copilot usage they promote at MS, but once again only folks there know for certain. I do, however, see scores of AI-powered SaaS products that look like they are in a perpetual MVP state. I think you are right in that even if agents give us "good enough" results and we can swallow failure rates and our ever weaker understanding of what we, or more so the model, created, then it is still progress overall. But this is progress not toward human-AI collaboration but toward AI-only engineering IMO, which is good or bad depending on how you view the future.
I'm a scientist and most of the code I currently write is somewhere at the intersection of critical software and machine learning. Squaring these two is not easy, and I guess the way I was taught to reason about engineering informs my opinions on this. Maybe it's just a matter of time before Codex can help here in an unconstrained manner as well, but I am skeptical at the moment.
A terrible metric is _worse_ than no metric. A terrible metric can _only_ lead you in the wrong direction. "No metric" means saying we don't know, and that leads us to stop and reconsider. But we've taken "move fast and break things" as a mantra, and we'd rather run towards any direction than stay still.
Using LoC as a metric for quality of LLMs will promote LLMs that write more code. It's better to say we have no way to compare different LLMs than it is to say "let's use the LLMs that produced more LoC because at least we can measure that". We, as an industry, should be focusing on developing better metrics for quality, not on improving LLMs based on known-bad metrics. We should be turning to the computer scientists, not to the venture capitalists.
When a pundit talks about how many lines of code an LLM has created, we should lose all respect for them. It's as if someone talking about physics measured the phlogiston, or as if a doctor started measuring our skulls. We know these theories don't work, and anyone using them should be mocked.
Funny you mention that, because I had that issue in a cab just yesterday. Google decided to drive us off the main road onto a series of small roads which happened to end in a dead end. My guess is that the AI decided that this is a shorter road? A less busy road?
That being said, Google Maps has been gradually degrading. Most notably, its search function is quasi-broken now.
But also, bigger projects need some amount of loc written and it's a bit silly to pretend that this is not the case or a bad thing.
So the answer to the question is roughly: Establishing that an agent can work in a large-ish code base is valuable, because 1) them not being able to do so has been a critique and 2) it's something that is required for a lot of software projects.
Lines of Code is a meaningless measure. It should also be easy to count function points using AI.
So if anything, we should find a way to aim for as few lines of code as possible. If you have two agents, and one can build exactly the same program as the other but with half the LoC, then most likely that first agent is better at software engineering and particularly software design.
Of course, as the author of an experiment that investigated exactly this, I'm slightly biased. Cursor's browser had millions of lines of code, which sounded weird to me based on the features and functionality it had. Meanwhile, I built the same thing but actually thought about the design with the agent, and ended up with ~20K lines of code instead.
(To state it in AI lingo:)
It's not about the best measure for "amount of code".
It's about whether "amount of code" is a good metric to begin with.
Such as a 4D raytracing engine in Metal? Or integrating APIs for features first released months after their knowledge cut-off date?
LLMs have shown an ability to transfer "knowledge" and capabilities across domains, languages, and use-cases outside their training data.
Case in point: GPT-2 "learning" to translate English to French and vice versa despite non-English examples having been deliberately (and almost entirely) removed from the dataset.
3.7 Translation
> Performance on this task was surprising to us, since we deliberately removed non-English webpages from WebText as a filtering step. In order to confirm this, we ran a byte-level language detector on WebText which detected only 10MB of data in the French language […]
[0]: https://cdn.openai.com/better-language-models/language_model...
The actual "code" is everything driving the harness.
The current problem with this is that the harness is not (yet) deterministic, so it's sort of like having a compiler where your output program works slightly differently every build, and then the compiler tries to just patch the binary programs when you recompile to minimise this problem, or even worse, disassembles the whole thing to figure out what it does, makes the change, and then recompiles it.
> Because the repository is entirely agent-generated, it’s optimized first for Codex’s legibility
I asked the question from the perspective of a human engineer, as in, I will have to read the code, understand it, and fix it once it breaks. OpenAI's approach is the opposite: even if it is breaking, it is the agent that will be doing the fixing, so millions of lines and inelegant designs don't matter because human readability doesn't matter. In any case you use more tokens, so you fork over more money.
I will say, however, that IMHO there is objectively bad and good code in terms of what it can do and how it performs. If I can do the same thing in 50 lines as opposed to 1000 lines, this difference still matters for the model: smaller context usage, and a better approach that informs downstream generation.
I created docs-cli (pypi) to manage the index of specs as source code: the framework that goes with it will first create tests for as much as it can, so reproducibility becomes the goal, not readability.
I have also grown skeptical of token usage in order to run up my bill! But since I feel like it takes me MORE effort to write LESS lines of code myself, I'd expect a quick and dirty AI-generated solution to be MORE lines of code and cost LESS to generate than a concise/elegant solution in LESS lines of code.
Maybe you have access to some other model?
They often do, but they often don’t. I regularly have to push for more elegant, or less lazy solutions.
Insisting on writing code by hand when LLMs are available is not software engineering in 2026. Engineers find the most cost-effective solution for the problem at hand that meets the requirements.
The only people I know that have LoC/token use/etc metrics imposed on them work for big corps where such things are (or used to be) en vogue.
Whether or not that complexity is warranted is a different story.
The codebase may be bloated by a factor of 10 but if the costs associated with that are less than the costs of developing the software from a business standpoint the choice is clear.
The what now? Search engines failed me here.
https://worrydream.com/refs/Kay_2007_-_STEPS_2007_Progress_R...
This is the final report:
It is a metric. It is often not a good metric. But it is easy to measure.
The simple answer is that promoting locs as a relevant metric is also reward hacking. Is it easier to promote big loc counts as a key metric, or is it easier to prove agentic engineering against harder metrics?
On a more general note, software practice marketers have been pushing in that direction for quite a while. "You need cloud", "Here's how to do agile at scale", "microservice everything", etc.
Generating elegant code under more restrictions means more thinking tokens and stronger adherence to instructions. So the naive view that they are doing it for billing is wrong.
Everyone is over-complicating the explanation. The answer for "why are we fixating on this bad metric" is almost always the same pattern.
Broad audiences need simple metrics to talk about. If the metric itself requires nuance, it's hard to communicate and hard to reason about. It's easier to push the need for nuance from understanding the metric itself down the road to where the metric is applied, which allows everyone to ignore it in immediate conversation.
Now someone can argue that lines of code are not a good proxy of engineering productivity, but I wouldn’t be surprised if the audience they target with this content is not the HN commenters of this thread.
Compare machine to machine (as these headlines come) and discount that by a factor.
This is a problem of conflicting incentives that exists today, in my opinion. Companies will market greater human-AI collaboration in science and engineering but focus on releasing things like this, where it is clear that the downstream goal is complete agent ownership over the product, from inception to testing to monitoring. Maybe the speculative future agents will use their own very efficient language to code, one that won't be readable for people at all. They focus on agent code being readable by agents in the article, as you've said. But in my mind, in at least the near future, there is a case where your prod will break and you won't be able to understand it or the attempted fixes. Maybe the agent will fail to fix it at all and start a massive rewrite. In any case, is this different from kicking technical debt down the road, along with worse interpretability of what you have built?
I do think there is a way for agents to write great, solid code that we can read, but with the way LLMs are built this requires something new in terms of reward that accounts for "taste" and constant refinement, so it might take more than 1/10th of the time to produce something good.
That is a business win. That is really all that matters in capitalism.
The flex is a direct insult to your face. He is shitting on the faces of all software engineers (me included). It is equivalent to saying we don't need you to code anymore. One man can produce 10x the code.
So why am I voting him up even though he's shitting on my face? Because what he says is true. I value honesty and people who tell it like it is. Yes, my identity as a software engineer is getting dismantled before my very eyes. But the solution to this problem isn't some delusional statement about not understanding what he's flexing about. We're not stupid. Everyone on this thread understands his flex. The difference is some people like you don't want to understand it.
Like seriously. He literally wrote it was completed in 1/10th of the time and you expect me to believe that YOU don't know what HE is flexing about? Be real. You're not stupid.
I’ve worked with 20-year-old codebases and products that grew organically over decades and still sit well below a million lines of code. Using LOC as some kind of health or success metric makes me more suspicious than impressed.
It's like the difference between doing stock price predictions with binary "up" or "down" histories and trying to figure out how to normalize actual price histories (basically impossible). The binary work gives a well-defined signal.
Many times those updates are not properly tested. For example, one update completely changed the model selector, and then the next hotfix restored the original.
There isn't anything that was not already experienced and factored into constructs in the repo.
And I also find that all of the bits created for an effective agentic engineering project match perfectly with mainstream engineering best practices. That has been one of my primary reasons to go all in on agentic engineering; prior to this, applying best practices was always too costly and conflicted with the team's daily priorities.
A. The code is absolute garbage and is speed for speed's sake B. They’re using an internal model that is a generation beyond GPT 5.5
I say this because we’ve attempted to do something similar using the latest gen Claude models and a significantly larger team. The code is probably along the lines of millions of LoC but is an absolute mess because of vibing. There’s a price you pay for speed.
I find there’s a ton of slop unless hard guardrails are added, eg step 1 is just around syntax, step 2 is to enforce mental models
You still need someone steering direction who has a logically consistent idea of what you actually want to build
Q2 - I find that vibe coding really accelerates FE projects because it’s possible to run everything locally and check results
For pure distributed infra backend more investments have to be made into the devloop to be able to shift left the feedback loop and decouple it from humans or real deploys
Additionally it’s an internal tool, which is likely much more amenable to slop.
Can anyone give me a simplified explanation of what they’re saying here? Having some trouble understanding.
For example, if you had a `backend`, `common`, and `frontend` package, you would be OK having backend/frontend depending on common, but you wouldn't want common depending on backend/frontend or backend/frontend depending on each other.
If you think about JavaScript, there is nothing stopping your dependency graph from becoming spaghetti. It sounds like they built static analysis to enforce rules.
Some languages have this built in like Java (Project Jigsaw), Go, and Rust. JavaScript, Python, etc. have no such feature.
It's really nothing special -- it has existed before. It just becomes a _lot_ more important with agents since they produce a lot of code, and it is good to have lots of static analysis when heavily utilizing agents.
They mention this in the article:
> This is the kind of architecture you usually postpone until you have hundreds of engineers. With coding agents, it’s an early prerequisite: the constraints are what allows speed without decay or architectural drift.
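A rough sketch of what such a dependency-direction check could look like, using the `backend` / `common` / `frontend` example from above. This is not the article's tooling; the `ALLOWED` table and both function names are made up for illustration, and it only inspects absolute imports in a single source string.

```python
import ast

# Hypothetical layering rules: each package may import only the packages listed.
ALLOWED = {
    "common": set(),
    "backend": {"common"},
    "frontend": {"common"},
}


def imported_packages(source: str) -> set[str]:
    """Return the top-level package names imported by a module's source."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            found.add(node.module.split(".")[0])
    return found


def violations(package: str, source: str) -> list[str]:
    """List imports of other known packages that the layering forbids."""
    allowed = ALLOWED[package]
    return sorted(
        dep
        for dep in imported_packages(source)
        if dep in ALLOWED and dep != package and dep not in allowed
    )
```

Run in CI over every file, a non-empty result fails the build, so an agent gets a deterministic signal instead of relying on a human reviewer to notice that `common` started importing `backend`.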
Eg UI cannot reach down and directly read config files
Configs must be only read by (im assuming) a storage interface layer called repo
There’s a strict directionality of dependency
Somewhat similar to ports and adaptors but presumably more strictly enforced by deterministic linters
1. What’s the job satisfaction like day to day being an engineer on this project? How have they adapted to this way of working?
2. How much did it cost? Work is being done whilst the engineers sleep but if that 6 hours overnight task cost $300 and could have been done by a person in 2 hours is it a real saving?
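For the cost question, a back-of-the-envelope sketch using only the hypothetical figures from the comment ($300 for the overnight task, 2 hours for a person). The function names are invented here, and the comparison deliberately ignores review time, rework, and the fact that the agent runs while nobody is waiting on it.

```python
def break_even_hourly_rate(agent_cost: float, human_hours: float) -> float:
    """Loaded hourly rate at which the human and the agent cost the same."""
    return agent_cost / human_hours


def agent_is_cheaper(agent_cost: float, human_hours: float, hourly_rate: float) -> bool:
    """True when the agent run costs less than the person's time would."""
    return agent_cost < human_hours * hourly_rate


# With the comment's numbers, the break-even loaded rate is $150/hour.
rate = break_even_hourly_rate(300, 2)
```

Under these assumptions the overnight run only saves money if the engineer's fully loaded cost is above $150 per hour, or if the two hours freed up are worth more than the difference.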
The job satisfaction is looking at the bank account every time you feel your job sucks.
You end up spending at least 5x the amount of tokens for, maybe, a prediction machine to find a discontinuity?
I would say a way better approach is 1.123x to generate code + tests + passing analysis tools + human review + 1x "simplify as much as possible", rather than letting the snake eat its own tail without boundaries.
Just like .vimrc and .zshrc, the harness "code" itself can be easy and personal, provided that it's built on working, existing constructs such as tmux.
I'll start with a premise: I don't like where software engineering is heading, at all. I have never been unhappier to work in this field since AI came out. And no, it is not possible to opt out of AI, especially when your teammates are all great engineers whose productivity increased a lot without any drops in quality code-wise (in fact the opposite has happened). You need to keep up. But it's tiring and the fun/interesting parts are disappearing.
That being said, it's clear that harness engineering is the most important part of our job and that task is going to take increasingly more of our time. And thus a glimpse of how an AI company handles it is interesting by any measure.
EDIT: found the button, all the way down in the bottom of the page... I hate this so much, give me the original content, I will decide if and when I need translation
Anyone know some?
These people are so delusional it feels like a mental disease by now.
I really hope no one gets hurt by all this slop code in the future by these wannabe engineers.
If you're a more senior person in tech, this post is effectively saying that a large portion of your skillset is about to become completely worthless. This goes beyond the skills involved in writing the code. Everything that you've learned over years about how to determine whether code is good or bad, and what practices make an engineering team effective is not just obsolete, it's fundamentally counter-productive because it assumes a slow, human-centric process that requires you to actually review and understand the code. Even your ability to mentor junior engineers is now obsolete, because all that experience you've built up is now worthless to them.
If this is the approach the industry takes, particularly when combined with a lack of interest in quality from the business (and let's face it, consumers have shown us that they're happy to pay for cheap crap), it's hard to see much of a future for software engineers. You don't need thousands of people with deep technical expertise, you need a handful of manager-types, who will focus on defining product and business requirements and configuring how the AI gets enough context to implement the requirements.
Maybe, if we're extremely lucky, there's so much demand for software that total employment doesn't fall off a cliff, but the nature of the work will change so much that many older, more expensive engineers will become unemployable. Those who remain will have to accept that the skills they spent decades developing are now worthless, that younger engineers no longer respect or listen to them, that the business no longer sees them as experts worthy of respect, but as old fogies who grew up in a different world.
Joe Biden liked to say that a job is more than just a paycheck, it's part of your identity and your sense of self-worth. We're all very used to a certain level of respect (and commensurate remuneration). If you don't think that's true, compare how a software engineer is treated to how a warehouse worker is treated. What happens when we lose that?
I'm not convinced of that.
I watched a video of an architect using AI to create architectural drawings. It became very clear to me that he has a lot of skills and terminology that helped him produce something very specific, in a few minutes. I've been working on some home improvement stuff including a studio/shed and I've struggled to produce even something simple (currently trying to get a conversation packet on the roof trusses to take to the permit department to get started). Even with my high school architecture class.
After watching that I wonder how much of what I'm doing with AI that looks easy is because I have deep technical knowledge, plus 3 years of heavy work with AI.
But that's not what this specific article is describing. The world this article is describing is one where you describe the business requirements, and you don't think about how it's implemented. You don't write the code, you don't review the code, you don't test the code. You give the AI business requirements and you give it access to sources of context (slack, meeting notes, etc). Every place where the human would act as a gate reduces throughput, so it should be eliminated through building harnesses and providing context.
What they're doing here is the equivalent of taking a factory where you have 2 process engineers and 100 operators, and replacing all the operators with robots. They want to automate the whole process of making the software and just leave the part that figures out how to make the automation work effectively.
In this world, the average software company doesn't need people who know how to write good software, because writing, reviewing, maintaining, and testing the software will be entirely automated. There will be a small number of people at companies like OpenAI that need to know how to write good software in order to supervise training the models, and there will be a small number of people at the software companies who have expertise in setting up the automation.
That right there is what I'm talking about: that architect would write the requirements for a building way different than I would.
Just because I'm not typing "strcat(); strcpy(); sprintf()" doesn't mean I'm not thinking about problems. I'm still doing critical thinking all over my stack, and I don't see that going away. I'm just doing different thinking.
There are people who think, and AI just isn't going to change that. There are people who don't think, and they've existed long before AI. Back in the 90s when I worked at the phone company, man, I worked with some people who didn't do a lick of work (along with some really sharp people).
Software engineers have always adapted to new technologies. New languages, frameworks, native apps, browser apps etc. So far this doesn't seem to be close to completely removing us from the loop.
If you are smart, educated, and can adapt, you'll figure it out. The economy has to find some stable equilibrium and it's not a zero sum game. Everyone in the economy getting a paycheck is also a consumer. With no consumers there is no business. The companies who are using AI and become more productive can do more things that before were not profitable but now are. Some of the people who are getting laid off are going to start new businesses and hire people. These things always cycle, and they basically have to.
I don't have a crystal ball though.
Artists and writers are unionized, which is why they have a more powerful collective voice.
Second, there are enough people whose jobs are very well paid and too cozy for them to dare rock the boat.
The economy and job market aren't so hot at the moment either, so people can't quickly jump ship.
Can you even be sure that you find a tech company that isn't jumping head first onto the AI hype train? Even politicians can't have enough of AI in their mouth.
artists overvalue their own outputs
casual gaslighting
It's interesting this was submitted to HN over 15 times since it was published in February: https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...
But this is the only submission that's had any traction. Since the content is nearly the same for all submissions, it highlights how getting to the front page can be a bit random. (Though this is the only one that capitalized 'Leveraged' so maybe that's the secret)
This is such a common thing among software engineers nowadays that I was very surprised that OpenAI would open with that line as if it were mind blowing.
But then I saw it was published in February and OP is just reposting it to farm karma.
Forcing readers to wade through an unceasing string of LLM clichés demonstrates the opposite of the point you’re trying to make—that the consumers of your work are worse off because you exercised no human judgment in creating it.