A lot of the focus has been on AI recently.
Three years ago we didn't have software where a non-engineer could describe what they want in English and get working(-ish) software generated by other software. Is that not "software has gotten a lot better"?
Other than that, I'm not sure how we measure "software has gotten better". New applications? More features? How do we measure sloppier? Is Google Maps suddenly taking you the wrong way more often? I'm not really doubting your subjective experience, but seriously, how do you tell? I mean, a doc is a doc and a spreadsheet is a spreadsheet.
We're also only about 10 months into models that are powerful enough to potentially make a bigger difference and we are still figuring out how to use them best.
That being said, while I agree that measuring software quality is vague (part of the reason it is hard for models as well), there are universal things I believe every engineer will agree on: reliability, uptime, customer feedback, legibility of your engineering, performance. These are things we have often optimized for. Google Maps is a bit of a strawman because neither of us (unless you work on it) knows how much agent-written code there is; I think it is likely little, since it was working fine prior to 2023. I could bring up GitHub reliability as an example, given how much Copilot usage they promote at MS, but once again only folks there know for certain. I do, however, see scores of AI-powered SaaS products that look like they are in a perpetual MVP state. I think you are right that even if agents give us "good enough" results, and we can swallow the failure rates and our shrinking understanding of what we, or more so the model, created, then it is still progress overall. But this is progress not toward human-AI collaboration but toward AI-only engineering IMO, which is good or bad depending on how you view the future.
I'm a scientist and most of the code I currently write is somewhere at the intersection of critical software and machine learning. Squaring these two is not easy, and I guess the way I was taught to reason about engineering informs my opinions on this. Maybe it's just a matter of time before Codex can help here in an unconstrained manner as well, but I am skeptical at the moment.
If AI today can make you more productive that's already progress. If it can't then maybe it makes other people more productive.
A terrible metric is _worse_ than no metric. A terrible metric can _only_ lead you in the wrong direction. "No metric" means saying we don't know, and that leads us to stop and reconsider. But we've taken "move fast and break things" as a mantra, and we'd rather run towards any direction than stay still.
Using LoC as a metric for quality of LLMs will promote LLMs that write more code. It's better to say we have no way to compare different LLMs than it is to say "let's use the LLMs that produced more LoC because at least we can measure that". We, as an industry, should be focusing on developing better metrics for quality, not on improving LLMs based on known-bad metrics. We should be turning to the computer scientists, not to the venture capitalists.
When a pundit talks about how many lines of code an LLM has created, we should lose all respect for them. It's as if someone talking about physics measured the phlogiston, or as if a doctor started measuring our skulls. We know these theories don't work, and anyone using them should be mocked.
Funny you mention that, because I had that issue in a cab just yesterday. Google decided to drive us off the main road onto a series of small roads which happened to lead to a dead end. My guess is that the AI decided that this was a shorter road? A less busy road?
That being said, Google Maps has been gradually degrading. Most notably, its search function is quasi-broken now.
But also, bigger projects need some amount of LoC written, and it's a bit silly to pretend that this is not the case or that it's a bad thing.
So the answer to the question is roughly: Establishing that an agent can work in a large-ish code base is valuable, because 1) them not being able to do so has been a critique and 2) it's something that is required for a lot of software projects.
Lines of Code is a meaningless measure. It should also be easy to count function points using AI.
So if anything, we should find a way to aim for as few lines of code as possible. If you have two agents, and one can build exactly the same program as another, but with half the LoC, then most likely the first agent is better at software engineering and particularly software design.
Of course, as the author of an experiment that investigated exactly this, I'm slightly biased. Cursor's browser had millions of lines of code, which sounded weird to me given the features and functionality it had. Meanwhile, I built the same thing, but actually thinking about the design with the agent, and ended up with ~20K lines of code instead.
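To make the "same program, half the LoC" comparison concrete, here is a minimal sketch. Everything in it is hypothetical: the two `agent_a` / `agent_b` snippets stand in for the output of two agents, and `count_sloc` is a deliberately crude counter (non-blank lines that are not pure `#` comments), not a real SLOC tool. It only illustrates the idea of checking that behavior is identical first and comparing size second.

```python
def count_sloc(source: str) -> int:
    """Count lines that are neither blank nor pure '#' comments (crude, assumed definition)."""
    return sum(
        1
        for line in source.splitlines()
        if line.strip() and not line.strip().startswith("#")
    )


def load_total(source: str):
    """Execute a snippet and return the `total` function it defines."""
    namespace = {}
    exec(source, namespace)
    return namespace["total"]


# Two hypothetical agent outputs implementing the same function.
agent_a = '''
def total(xs):
    # sum a list the long way
    result = 0
    for x in xs:
        result = result + x
    return result
'''

agent_b = '''
def total(xs):
    return sum(xs)
'''

# Step 1: confirm both versions behave the same on a few sample inputs.
samples = [[], [1], [1, 2, 3], [-5, 5, 10]]
same_behavior = all(
    load_total(agent_a)(s) == load_total(agent_b)(s) for s in samples
)

# Step 2: only then compare size.
sloc_a = count_sloc(agent_a)
sloc_b = count_sloc(agent_b)
print(f"same behavior: {same_behavior}, agent_a: {sloc_a} SLOC, agent_b: {sloc_b} SLOC")
```

The ordering is the point: size only says something once the two programs are known to do the same thing, and a handful of sample inputs is of course a much weaker check than a real test suite.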
(To state it in AI lingo:)
It's not about the best measure for "amount of code".
It's about whether "amount of code" is a good metric to begin with.
Such as a 4D raytracing engine in Metal? Or integrating APIs for features first released months after their knowledge cut-off date?
LLMs have shown an ability to transfer "knowledge" and capabilities across domains, languages, and use-cases outside their training data.
Case in point: GPT-2 "learning" to translate English to French and vice versa despite non-English examples having been deliberately (and almost entirely) removed from the dataset.
3.7 Translation
> Performance on this task was surprising to us, since we deliberately removed non-English webpages from WebText as a filtering step. In order to confirm this, we ran a byte-level language detector on WebText which detected only 10MB of data in the French language […]
[0]: https://cdn.openai.com/better-language-models/language_model...
The actual "code" is everything driving the harness.
The current problem with this is that the harness is not (yet) deterministic. It's sort of like having a compiler where your output program works slightly differently on every build, and the compiler then tries to just patch the binary when you recompile to minimise this problem, or even worse, disassembles the whole thing to figure out what it does, makes the change, and then recompiles it.
> Because the repository is entirely agent-generated, it’s optimized first for Codex’s legibility
I asked the question from the perspective of a human engineer, as in, I will have to read the code, understand it, and fix it once it breaks. OpenAI's approach is the opposite: even if it is breaking, it is the agent that will be doing the fixing, so millions of lines and inelegant designs don't matter because human readability doesn't matter. In any case you use more tokens, so you fork over more money.
I will say, however, that IMHO there is objectively bad and good code in terms of what it can do and how it performs. If I can do the same thing in 50 lines as opposed to 1000 lines, this difference still matters for the model: smaller context usage, and a better approach that informs downstream generation.
I created docs-cli (pypi) to manage the index of specs as source code: the framework that goes with it will first create tests for as much as it can, so reproducibility becomes the goal, not readability.
I have also grown skeptical of token usage in order to run up my bill! But since I feel like it takes me MORE effort to write LESS lines of code myself, I'd expect a quick and dirty AI-generated solution to be MORE lines of code and cost LESS to generate than a concise/elegant solution in LESS lines of code.
Maybe you have access to some other model?
They often do, but they often don’t. I regularly have to push for more elegant, or less lazy solutions.
Insisting on writing code by hand when LLMs are available is not software engineering in 2026. Engineers find the most cost-effective solution for the problem at hand that meets the requirements.
Whether or not that complexity is warranted is a different story.
The codebase may be bloated by a factor of 10 but if the costs associated with that are less than the costs of developing the software from a business standpoint the choice is clear.
The only people I know that have LoC/token use/etc metrics imposed on them work for big corps where such things are (or used to be) en vogue.
The what now? Search engines failed me here.
https://worrydream.com/refs/Kay_2007_-_STEPS_2007_Progress_R...
This is the final report:
It is a metric. It is often not a good metric. But it is easy to measure.
The simple answer is that promoting LoC as a relevant metric is also reward hacking. Is it easier to promote big LoC counts as a key metric, or is it easier to prove agentic engineering against harder metrics?
On a more general note, software practice marketers have been pushing in that direction for quite a while. "You need cloud", "Here's how to do agile at scale", "microservice everything", etc.
Generating elegant code under more restrictions means more thinking tokens and stronger adherence to instructions. So the naive view that they are doing it for billing is wrong.
Everyone is over-complicating the explanation. The answer for "why are we fixating on this bad metric" is almost always the same pattern.
Broad audiences need simple metrics to talk about. If the metric itself requires nuance, it's hard to communicate and hard to reason about. It's easier to push the need for nuance from understanding the metric itself down the road to where the metric is applied, which allows everyone to ignore it in immediate conversation.
Now someone can argue that lines of code are not a good proxy of engineering productivity, but I wouldn’t be surprised if the audience they target with this content is not the HN commenters of this thread.
Compare machine to machine (as these headlines come) and discount that by a factor.
This is a problem of conflicting incentives that exists today, in my opinion. Companies will market greater human-AI collaboration in science and engineering but focus on releasing things like this, where it is clear that the downstream goal is complete agent ownership of the product, from inception to testing to monitoring. Maybe the speculative future agents will code in their own very efficient language that won't be readable by people at all. They focus on agent code being readable by the agent in the article, as you've said. But in my mind, at least in the near future, there is a case where your prod will break and you won't be able to understand it or the attempted fixes. Maybe the agent will fail to fix it at all and start a massive rewrite. In any case, is this different from kicking technical debt down the road, along with worse interpretability of what you have built?
I do think there is a way for an agent to write great, solid code that we can read, but with the way LLMs are built this requires something new in terms of reward that accounts for "taste" and constant refinement, so it might take more than 1/10th of the time to produce something good.
That is a business win. That is really all that matters in capitalism.
The flex is a direct insult to your face. He is shitting on the faces of all software engineers (me included). It is equivalent to saying we don't need you to code anymore. One man can produce 10x the code.
So why am I voting him up even though he's shitting on my face? Because what he says is true. I value honesty and people who tell it like it is. Yes, my identity as a software engineer is getting dismantled before my very eyes. But the solution to this problem isn't some delusional statement about not understanding what he's flexing about. We're not stupid. Everyone on this thread understands his flex. The difference is some people like you don't want to understand it.
Like seriously. He literally wrote it was completed in 1/10th of the time and you expect me to believe that YOU don't know what HE is flexing about? Be real. You're not stupid.
I’ve worked with 20-year-old codebases and products that grew organically over decades and still sit well below a million lines of code. Using LOC as some kind of health or success metric makes me more suspicious than impressed.
It's like the difference between doing stock price predictions with binary "up" or "down" histories and trying to figure out how to normalize actual price histories (basically impossible). The binary work gives a well-defined signal.
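A minimal sketch of what the binary version looks like, assuming made-up price series and a hypothetical helper `to_binary_moves` (treating a flat move as "down" is an arbitrary choice made here for simplicity). Two histories on completely different price scales collapse to the same up/down sequence, which is why no normalization step is needed.

```python
def to_binary_moves(prices):
    """Label each step between consecutive prices as 'up' or 'down'.

    Assumption for this sketch: an unchanged price counts as 'down'.
    """
    return ["up" if later > earlier else "down"
            for earlier, later in zip(prices, prices[1:])]


# Two invented price histories on very different scales.
prices_a = [100.0, 101.5, 101.0, 103.2]
prices_b = [2.00, 2.03, 2.02, 2.06]

moves_a = to_binary_moves(prices_a)
moves_b = to_binary_moves(prices_b)
print(moves_a)
print(moves_b)
```

The trade-off is that the magnitude of each move is thrown away, so the signal is well-defined but also much coarser than the raw history.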