Complaining about every one-off issue with LLMs ignores the bigger picture: they are getting better every month, and there is no fundamental reason why they wouldn't surpass humans in coding. Everything else is secondary.

All I would need from an LLM doubter is evidence that LLMs are not improving at tractable software engineering tasks. The strongest argument against the increasing general capabilities of LLMs is the ARC-AGI tasks; however, their creators admit that each generation of LLMs exceeds their expectations, and that AGI will be achieved within the decade.

reply
Your logic is flawed because a thing can improve for an infinite amount of time while never surpassing a certain limit. It's called an asymptote.

That being said, I don't even think that arguing about this from a mathematical perspective is a worthwhile use of time. Calling something an asymptote in the first place requires defining a quantifiable "X" and "Y", which we don't even have. What we have are a bunch of synthetic benchmarks. Even ignoring the fact that the answers to the questions are known to regularly leak into the training data (in other words, it's possible for scores to increase while capabilities remain the same), there's also the fundamental fact that performance on benchmarks is not the same thing as performance in the real world. And being able to answer some arbitrary set of questions on a benchmark which the previous model couldn't does not have a quantifiable correlation to some specific amount of real-world improvement.

The OP article focuses on research papers which assess real-world impact of LLMs within software organizations, which I think are more representative.

I wouldn't call myself an "AI doubter" - I use LLMs every day. When you say "doubter", you're not referring to "AI" in general, or to the fact that AI is helpful or boosts productivity (which I believe it does). You're rather referring to the very specific, very extraordinary claim that LLMs will surpass humans in coding. If that's the case, then yeah, I'm a doubter, at least on any foreseeable timescale.

reply
1. There’s no reason to believe AI capability improvement is approaching an asymptote; METR time horizons, improvements on benchmarks, and ARC-AGI scores are all at least linear.

2. Even if it were asymptotic, it would be a huge assumption to assert that the asymptote sits below general human intelligence, as though human pattern recognition and cognition were some sort of universal limit like c.

Also, if LLMs weren’t really getting better in general but just benchmaxxing, it would be extremely lucky that this also happens to lead to the general increase in coding capabilities observed in more recent models.

AI has already surpassed 99% of humans in coding in narrow domains. The question is, how wide does the domain have to be before models no longer ever surpass humans? I’d wager we’d have to wait until scaling of compute infrastructure stops, wait 6 months, then see.

reply
> Your logic is flawed because, a thing can improve for an infinite amount of time while never surpassing a certain limit. It's called an asymptote.

Have you ever once looked at a METR chart? https://files.civai.org/assets/METR_Chart.jpg

That's not an asymptote.
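For context, the trend METR reports is roughly exponential. A sketch, using their published doubling time of roughly seven months and a hypothetical one-minute baseline (both figures here are illustrative, not their exact fit):

```python
DOUBLING_TIME_MONTHS = 7.0  # approximate doubling time from METR's published trend

def time_horizon(months_elapsed, h0_minutes=1.0):
    """Illustrative exponential fit: the task length (in minutes) an agent
    completes at 50% reliability, doubling every ~7 months."""
    return h0_minutes * 2 ** (months_elapsed / DOUBLING_TIME_MONTHS)

# An exponential has no horizontal asymptote: it grows without bound.
assert time_horizon(14) == 4.0   # two doublings from a 1-minute baseline
assert time_horizon(70) > 1000   # ten doublings: over 1000x the baseline
```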

> there's also the fundamental fact that performance on benchmarks is not the same thing as performance in the real world

Again, yes, you're correct in the general case but it has very little to do with the specific case.

Would you find it convincing if I simply said "some internet arguments are wrong"? It's certainly a true statement, and you've made an internet argument here, so clearly you should accept that you're wrong, right?

reply
You're scoring rhetorical points while talking past my entire comment. Hard to say if you even read it.

I'm not "convincing" anyone of anything. I'm stating the reasons that I, personally, am unconvinced of a specific claim being made to me.

reply
I mean, I quoted multiple passages and established why I think your logic is flawed. If you're convinced by bad logic, so be it.
reply
You say this as though performance has not followed a very clear and extremely rapid improvement in a startlingly short amount of time.

You’re definitely right that people adopt agentic workflows and are disappointed or worse, but the point is the disappointment has already reduced substantially and will continue to do so. We know this because we know the scaling laws, and also because learning theory has been around for many decades.

reply
What rapid improvement has occurred? In this six-month AI coding fever dream we've been living in, I really haven't seen anything new in a while, either in terms of new ideas for AI coding or in new consumer products or services.

I'll grant that the coding harnesses themselves are better, because that was a new product category with a lot of low-hanging fruit, but have the models actually improved in a way that isn't just benchmaxxing? I'd argue the models seem to be regressing. Even the most AI-pilled people at my company have complained that Opus 4.7 is a dud. Anecdotally, GPT 5.5 seems decent, but it's rumored to be a 10T-parameter model, isn't noticeably better than 5.4 or 5.3, is insanely expensive to use, and seems to be experiencing model collapse, since the system prompt has to beg the thing not to talk about goblins and raccoons.

reply
Uninformed opinion of someone who clearly doesn't consistently use AI coding tools. And why are you limiting it to 6 months? What's wrong with you?
reply
Why does this _always_ happen in agentic coding convos?

> I don't find $MODEL useful

> CLEARLY you're doing it wrong

It's so dumb.

(I write code w/ agents btw, I'm just also skeptical)

reply
How many years of real-life, in-production problem solving/coding have you done? That's what I base how informed you are on, not how much you use your favorite new $100/month token-prediction subscription.
reply
15 years. But that's irrelevant to this point. The person I'm replying to clearly doesn't use the tools if they think there hasn't been constant improvement. "Token-prediction subscription" is funny, coming from a glorified biological token predictor.
reply
I'm starting to think the AI maxis are just misanthropes.
reply
Ah yes, another feeble fool who thinks his $100 subscription is equivalent to 4 billion years of evolution, simply because he is stupid and watches a lot of sci-fi.
reply
Nope, not at all, but it's most certainly superior to the tokens your neural net outputs.
reply
say that you are alive

"i am alive"

OH MY GOD!!

reply
deleted
reply
I’m going to parrot back what you’re saying and you tell me if I’m getting close

- AI coding is a disappointing fad (“fever dream?”)
- that has not made meaningful progress in… 6 months?
- the coding harnesses are improving
- model improvements are lies: it’s just businesses “benchmaxxing” and misleading people; real performance has not meaningfully improved
- “Opus 4.7 is a dud”
- 5.5 is suffering from “model collapse” (I’ve never heard this term before)

Since you asked and I assume you are rational and really are interested to know:

- we have many measures of performance and have studied how one particularly important but unintuitive measure (pretraining perplexity) scales with data, compute, and model size. These laws continue to hold and have satisfying theoretical origins.

- whatever the scale of 5.5, consider we have far more room to go on the scaling front. Probably another 2-3 orders of magnitude before we hit limiting bottlenecks.

- that’s also fine because scaling is only part of the puzzle. RL on verifiable rewards is virtually guaranteed to get you optimal performance and that’s the entirety of the excitement around coding agents

- while you are right that benchmarks and measurement science have a ton of weaknesses, they are not at all garbage. There are probably around 40,000 benchmarks in the literature (this is not a made-up number, by the way; it really is around that many). Epoch made a great composite measure using good stats (IRT) called the Epoch Capability Index, and METR has done and redone their time-horizon measure, which holds up beautifully. There is a ton of signal in many benchmarks, and together they tell a pretty compelling story.

- additionally, this is not some unknowable thing. It strikes me as odd that people’s prior on HN a lot of the time is “it’s all dumb rich people putting way too much dumb money into this”. Sorry, but the world is not that dumb. Trillions in CapEx is usually pretty rationally allocated. And it is!

- why? Because it is already known what happens when you do what we’re doing. When you have a verifiable reward system, a certain amount of compute available, and seed data to get you to where you can do RL, you are almost guaranteed to get superhuman performance.

reply
I'm pretty sure their mindset is pure cope. All the top AI labs are coding agentically 100% now. There's a reason for that. Anyone not on that paradigm yet is either slow to act or purposefully resistant (excluding workplace policies that hamstring you, of course).
reply
Yeah, that’s what I just can’t wrap my mind around. It’s a cacophony of engineers with authoritative-sounding blog posts explaining a subject they seem to have barely a tenuous grasp of. It’s hard to watch a population of tech people I used to really revere getting things so wrong. I thought “surely once we’re <literally where we are today, which is what you describe>, no one with any self-respect would still claim AI is a useless fad or that it shouldn’t be used”, and yet, to my disappointment, that’s where we seem to be.
reply
Perhaps you are confusing performance with instability?
reply
No. The time horizon I’m talking about spans years. “We don’t know” is just wrong; we’ve had scaling laws for many years and they continue to hold up. Benchmarks, in all their ugliness, tell a consistent story.
reply
> very clear and extremely rapid improvement in a startlingly short amount of time.

We're almost 6 months into all this AI-code madness and I've yet to see that "rapid improvement" you mention. As in software products that are genuinely better compared to 6 months ago, or new software products (and good ones, at that) which would not have existed had this AI craze not happened.

reply
Way more than six months. You may be talking about how the world looks from your vantage point, as well you should. But there’s a reason why the world doesn’t allocate trillions of dollars of capital based on that.

I really value skeptical people and skepticism generally. But what I think skeptical people would prefer to consider themselves is: rational and reasonable, with their beliefs well calibrated.

You’re not the only one to think that literally nothing major or significant has happened with AI, but that’s simply wrong. Every major tech company - the ones poised to get the first best rewards - has already gotten good incremental revenue from AI via ads ranking/recommendations (Google, Meta, etc.), plus good productivity increases due to the scale of their workforces and advanced in-house tooling. You won’t see these numbers and you don’t have to believe them. But I have seen them and I believe them, and I, like you, hate bullshit.

reply
Classic argumentum ad populum fallacy. The world allocated the equivalent of trillions to the dotcom bubble shortly before it became the dotcom bust, mortgage CDOs before the 2008 debt crisis, and the cryptocurrency mania before its bubble popped. The world has allocated vast sums of money to rather stupid things many, many times in the past.
reply
> Every major tech company - the ones poised to get the first best rewards, have already gotten good incremental revenue from AI via ads ranking/recommendations (Google, Meta, etc.)

That's just software evolving. It happened before LLMs, it would happen without LLMs.

> good productivity increases due to scale of workforce and advanced in house tooling.

Exactly same case.

reply
But I don’t really understand: the ask is for evidence that AI is generating meaningful returns, and it demonstrably is, even while we have only partially integrated these tools. “Just software evolving” - um, yes, I agree, except now it happens faster and more efficiently. It is also more than that: the models that power advertising and content recommendation at TikTok, Google, Facebook, Instagram, etc. are not just “software evolving”; they are meaningful model improvements that are only possible with good AI.
reply
Can you name literally any other technology that had hundreds of millions of users within the first six months of being invented?

Six months after the internet was invented, you could send email between a few universities.

Six months after the computer was invented, they still hadn't actually built one.

The first transcontinental railroad took about six YEARS just to build.

reply
GPT did not have hundreds of millions of users when it was invented almost a decade ago…
reply
If you want to move the goalposts, that's fine, but acknowledge it: the original claim I'm responding to was "We're almost 6 months into all this AI-code madness"

If you want to set GPT as the target, that's even easier! In that decade it has passed the Turing Test, solved novel open math problems, generates audio, video, and music, and can write coherent code. Again, there is no technology that has improved more rapidly than LLMs.

reply
You say this as though AI company debt has not followed a very clear and extremely rapid ballooning in a startlingly short amount of time.

It's the "YOLO" of business strategies.

reply
Always amazes me how we’re on a platform with “ycombinator” in the URL and people don’t understand how private companies scale to capture market share. You’re right, Uber was that company that ran at a loss for so long and collapsed; another YOLO business strategy. Or maybe it was Amazon, or… hmm, I forget.
reply
Yes but we don't know the shape of the curve and where we are on it.
reply
See the Chinchilla scaling laws; we have the functional form of the curve and know the constants (though they change and are domain- and model-specific):

L(N, D) ≈ 1.69 + 406 / N^0.339 + 411 / D^0.285

where L is the loss (pretraining test loss), N is the number of model parameters, and D is the scale of the data.
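A quick numeric sketch of that formula, using the constants quoted above (the parameter and token counts in the example are illustrative, not any particular model's):

```python
def chinchilla_loss(n_params, n_tokens):
    """Chinchilla-style loss estimate L(N, D), with the constants quoted
    above: ~1.69 nats of irreducible loss plus two power-law terms."""
    return 1.69 + 406.0 / n_params**0.339 + 411.0 / n_tokens**0.285

# Scaling up parameters and data lowers the predicted loss, but with
# diminishing returns: the curve approaches the irreducible 1.69 term.
l_small = chinchilla_loss(70e9, 1.4e12)   # roughly Chinchilla-scale (70B / 1.4T)
l_big = chinchilla_loss(700e9, 14e12)     # 10x the params and 10x the data
assert l_big < l_small
assert l_big > 1.69   # never drops below the irreducible loss
```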

reply
You need to touch grass dude, seriously.
reply
Why deflect from the conversation and attempt to insult someone? What I’m saying is literally canonical, extremely well-known literature.
reply