by julianlam19 hours ago |

upvote

by kgeist18 hours ago|

[-]

Every new proprietary model is "groundbreaking" and "look, it just solved task X that no other model could solve," only to be referred to as "that crappy previous-generation model" a month later.

So yeah, I'm totally fine using Kimi-2.7, GLM-5.2 or Deepseek-v4. I think we've already hit the ceiling and most improvements now seem to be from harness improvements and slightly better RL to improve reasoning/tool calling.

reply

upvote

by jbverschoor17 hours ago|

[-]

Not only that, but to me it seems that after a week the intelligence is being downscaled or routed. Maybe because of lack of capacity

reply

upvote

by conception6 hours ago|

[-]

You can check https://marginlab.ai/trackers/codex/

It’s pretty good at catching when performance is degraded. It was for a week or so before Fable launched for instance, probably due to a/b testing or capacity as you noted.

reply

upvote

by matheusmoreira16 hours ago|

[-]

There's at least the possibility that they intentionally degrade the models as time passes. We can't really verify that we're getting what we're paying for all of the time. All the more reason to invest in local inference.

reply

upvote

by inigyou15 hours ago|

[-]

What if the new model is exactly as good as the last model on launch day but better than the last model was on the new model's launch day because it was degraded? Every single time?

reply

upvote

by foo4212 hours ago|

[-]

Makes me think of [Shepard tones](Shepard tone - Wikipedia https://share.google/xooRbF7wIIhcsTt2J), which sound like they're rising in pitch indefinitely

reply

upvote

by inigyou2 hours ago|

[-]

why are you linking to Wikipedia in invalid markdown format, which wouldn't work on HN even if it was valid, to a site called share dot google?

reply

upvote

by no-name-here12 hours ago|

[-]

There are lots of benchmarks to compare the absolute values of different models on the same scale (as opposed to vibes (my apologies for the shorthand), etc.).

reply

upvote

by matheusmoreira13 hours ago|

[-]

The thought has definitely crossed my mind. I don't think it's true because there's definitely an improvement when new models are released.

Maybe the truth is the newest models aren't actually as impressive as we thought. Maybe our perception of progress is being manipulated via months of gradual, silent and unverifiable degradation.

reply

upvote

by LPisGood14 hours ago|

[-]

People talk about this a lot. What I have never seen is a discussion of methods they might employ to degrade the models.

Let’s say I’m a bad faith LLM operator, and I want to degrade my model so the next release looks better and people want to switch to the more expensive one. How would I do that?

reply

upvote

by nessex14 hours ago|

[-]

They would quantize the model. That'd make it cheaper to run, and have slightly worse output but it would still generate outputs with a similar feel, derived from a compressed version of the same knowledge base etc.

They wouldn't even need to do this uniformly: quantized versions of the model could be served to only a subset of the requests. They could do this to nerf the old model, or more likely just to free up hardware for the new one by handling more requests on less hardware. Or to handle increased request volume as traffic ramps up faster than hardware can be provisioned.
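To make the trade-off concrete, here is a minimal sketch of what quantization does to a set of weights. It assumes plain symmetric uniform quantization with a single scale over random Gaussian "weights"; real deployments use per-block scales and smarter schemes, and nothing here reflects what any provider actually runs. The function names are made up for illustration.

```python
import random

def quantize_roundtrip(weights, bits):
    """Symmetric uniform quantization: snap floats to a signed integer grid, then map back."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for 8 bits, 7 for 4 bits
    scale = max(abs(w) for w in weights) / qmax     # one scale for the whole tensor (assumption)
    if scale == 0:
        return list(weights)
    return [round(w / scale) * scale for w in weights]

def mean_abs_error(a, b):
    """Average absolute difference between original and round-tripped values."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

rng = random.Random(0)
weights = [rng.gauss(0.0, 1.0) for _ in range(10_000)]  # stand-in for a weight tensor

for bits in (8, 4, 2):
    err = mean_abs_error(weights, quantize_roundtrip(weights, bits))
    print(f"{bits}-bit mean abs error: {err:.4f}")
```

Fewer bits means a coarser grid and a larger round-trip error, but the values stay recognizably the same, which is why the output keeps a similar feel.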

Playing with local models at various quants, the degradation can be hard to spot. Sometimes it's only noticeable in aggregate. And even then, you never really know if you just got unlucky with a bad response due to RNG.

I've had Opus 4.6 fall into some weirdly incoherent loops that I rarely see from even Sonnet, that felt like the kind of thing I got frequently with Qwen3.5 9B on local. And the above applies... Was that just bad RNG? Or was my request to Opus routed to some lower quality variant? There's no great way for me to tell for any given request, nor any way to guarantee Anthropic _didn't_ do that.
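A rough way to separate "bad RNG" from a real drop is to rerun a fixed prompt suite and compare pass rates in aggregate. The sketch below uses a standard two-proportion z-test; the pass counts are invented for illustration, and it assumes independent trials with a simple pass/fail grader, which real evals only approximate.

```python
import math

def two_proportion_z(pass_a, n_a, pass_b, n_b):
    """Two-sided z-test for a difference between two pass rates."""
    p_a, p_b = pass_a / n_a, pass_b / n_b
    pooled = (pass_a + pass_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the normal distribution
    return z, p_value

# Hypothetical numbers: the same 90% vs 80% gap, seen with few and with many runs.
z_small, p_small = two_proportion_z(9, 10, 7, 10)
z_large, p_large = two_proportion_z(450, 500, 400, 500)
print(f"10 runs each:  z={z_small:.2f}, p={p_small:.3f}")
print(f"500 runs each: z={z_large:.2f}, p={p_large:.6f}")
```

With a handful of requests the gap is indistinguishable from noise; only with hundreds of runs does it become clearly significant. That matches the experience that a single weird response tells you almost nothing.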

reply

upvote

by OccamsMirror12 hours ago|

[-]

I have had the same experiences you've had with 4.6 and it was ever since they brought out 4.7. It's fairly obvious they're doing something like you've said here.

reply

upvote

by nessex12 hours ago|

[-]

Forgot to mention, but it was after the 4.7 release when I was still using 4.6 that I saw those loops too... Before that, 4.6 had been a pretty seamless experience.

reply

upvote

by tsss8 hours ago|

[-]

And guess what all the providers of open models do: They quantize, badly.

reply

upvote

by csunbird7 hours ago|

[-]

This is why you pay a premium for trusted providers, who are verified to not quantize

reply

upvote

by maybe_pablo14 hours ago|

[-]

Weight quantization, n-expert capping, routing to smaller model, context window truncation, aggressive sampling constraints, lossy speculative decoding and probably more.

reply

upvote

by trollbridge9 hours ago|

[-]

I can't prove any of it, but it sure feels like that happens sometimes on Anthropic's platform.

I don't seem to get any of this with GPT-5.5 or GPT-5.5-Pro (not that I use 5.5-Pro enough to know for sure, but when I do use it, it never seems nerfed).

reply

upvote

by alfiedotwtf12 hours ago|

[-]

I'm pretty sure you could do n-expert capping on any MoE model with only a handful of lines changed in ik_llama.cpp, but yeah... my bet is they have various quantisations and run the lower ones at peak (along with different system prompts, i.e. "we're GPU-bound right now. Get to the point with less chatter")
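For anyone unfamiliar with the idea, n-expert capping can be sketched as lowering k in top-k expert routing. This is a toy illustration, not ik_llama.cpp code: the experts are scalar functions, and the names `route` and `moe_output` and the gate logits are all made up.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route(gate_logits, k):
    """Keep the k highest-scoring experts and renormalize their weights to sum to 1."""
    probs = softmax(gate_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}

def moe_output(x, experts, gate_logits, k):
    """Weighted sum of the selected experts' outputs."""
    weights = route(gate_logits, k)
    return sum(w * experts[i](x) for i, w in weights.items())

# Toy experts: each just scales its input (hypothetical stand-ins for FFN blocks).
experts = [lambda x: 1 * x, lambda x: 2 * x, lambda x: 3 * x, lambda x: 4 * x]
gate_logits = [0.1, 2.0, 1.0, -1.0]

print("k=2:", moe_output(3.0, experts, gate_logits, k=2))
print("k=1:", moe_output(3.0, experts, gate_logits, k=1))  # capped: only the top expert runs
```

Dropping k cuts the compute per token because fewer experts run, at the cost of losing the contribution of the lower-ranked ones.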

reply

upvote

by Tepix14 hours ago|

[-]

Use quantisation.

reply

upvote

by manyatoms15 hours ago|

[-]

Unless what you're getting is really explicitly spelled out in a contract, you should flatly assume that they're doing whatever they like whenever they like.

reply

upvote

by OtomotO14 hours ago|

[-]

Even if it's in the contract, but can't be verified.

reply

upvote

by taytus16 hours ago|

[-]

At current prices, and considering these OS Models' performance, investing in local inference sounds like a bad idea.

reply

upvote

by matheusmoreira16 hours ago|

[-]

Current prices are insane but at this point I'm starting to feel like it's an existential issue. I'm not a US citizen. At any point the USA could come up with some arbitrary export controls. Not having a computer capable of running at least Qwen is starting to actually seem risky to me.

At least it's going to be usable as a very high end gaming PC.

reply

upvote

by awakeasleep15 hours ago|

[-]

Why would you buy and build everything before the low probability catastrophe strikes, though? You don’t get any benefit from switching early and you pay a big opportunity cost.

reply

upvote

by Lapel274214 hours ago|

[-]

> low probability catastrophe

There is also a low probability that someone enters peace negotiations solely to threaten the negotiators with death, yet here we are. With these guys it is: Better safe than sorry.

reply

upvote

by inigyou15 hours ago|

[-]

because as soon as it strikes computer hardware will be completely unavailable to buy?

reply

upvote

by CamperBob214 hours ago|

[-]

Also, there's a nontrivial learning curve involved in running your own inference server, once you move past the casual-goofing-around-with-llama-server stage. If you care about not being a sharecropper on Sam's or Dario's plantation, you should consider learning the ropes. Even if you don't put these skills to immediate use in your day job.

I didn't appreciate this until I started down that road myself.

reply

upvote

by matheusmoreira13 hours ago|

[-]

> If you care about not being a sharecropper on Sam's or Dario's plantation

Couldn't have put it better myself. That's what all this comes down to. Owning the hardware, owning the inference. Not perpetually renting them out on a meter like in the dystopian future they're envisioning.

reply

upvote

by inigyou8 hours ago|

[-]

You also have the option to not use AI

reply

upvote

by matheusmoreira4 hours ago|

[-]

Yeah but the truth is I don't want to go back to the pre-LLM world. I've been programming alone for over ten years. Having a coding buddy to talk to, collaborate with or just bounce ideas off of quite literally changed my life. I don't want to go back to solo programming, and my projects aren't exactly swimming in a sea of active contributors.

reply

upvote

by CamperBob25 hours ago|

[-]

Not in the future, not if you want to get paid.

reply

upvote

by OtomotO14 hours ago|

[-]

Because you will not be the only one struggling to get the hardware in the "unlikely" case the POTUS blurts out another fart.

reply

upvote

by alfiedotwtf12 hours ago|

[-]

> At any point the USA could come up with some arbitrary export controls

lol this already happened with Fable!

reply

upvote

by jrm416 hours ago|

[-]

At current "proprietary inference company behavior," investing in local inference sounds like the exceedingly far more rational option.

Long term predictability ought to far outweigh a few more cycles of performance.

reply

upvote

by laserlight6 hours ago|

[-]

Don't forget the fact that you'll be questioned to death when you criticize the current generation of models, but somehow, when the new models arrive you'll be questioned to death if you don't find them better than the old ones.

reply

upvote

by trollbridge9 hours ago|

[-]

There are open models with groundbreaking innovations, like MiMo-2.5-Pro-UltraSpeed, which you simply can't get anywhere else (there is no other model with those capabilities that I can get at 1000 tokens/second).

reply

upvote

by realusername16 hours ago|

[-]

There's also a lot of benchmark trickery going on, it's becoming harder to see how the latest models really improved.

The top models also seem to have inconsistent performance depending on the time of day and how far we are from the next release.

reply

upvote

by bonesss16 hours ago|

[-]

I’m an LLM fan, but from an engineering perspective the idea of building atop services that palpably fluctuate in capacity, performance, and capability is nutty.

Even with minor automation I feel like I can watch OpenAI and Anthropic engineers fiddling in real-time. Tuesday’s behaviour changes by Thursday, 10 AM’s production isn’t possible at 11:30 AM. Nutty.

reply

upvote

by targafarian15 hours ago|

[-]

I chilled significantly on using Google for anything to do with business due to API (and offering) stability. (Still use Google for personal things.) But AI models seem orders of magnitude more fluid, so to my risk-averse eye, they're nothing I'd base my own business on.

reply

upvote

by senordevnyc8 hours ago|

[-]

Imagine having a business where you're at the mercy of the fluctuations in capacity, performance, and capability that your human employees display!

reply

upvote

by intothemild11 hours ago|

[-]

Since I started running my own inference server, I've had zero degradation that I didn't do myself. Basically the only time I see it get worse is if I drop one of the quants.

Which is what I suspect the providers are doing to fit more inference on the same amount of hardware over time.

reply

upvote

by Barbing16 hours ago|

[-]

Interesting, Claude might be doing better since I last checked:

https://marginlab.ai/trackers/claude-code-historical-perform...

There were at least a couple of these degradation trackers.

reply

upvote

by fsuts13 hours ago|

[-]

Agreed

reply

upvote

by 4fffs18 hours ago|

[-]

Correct. Anything else is pure marketing and you have fallen for it.

reply

upvote

by Aurornis16 hours ago|

[-]

> I think it's interesting that people write off open weight models because they're "a few months behind" proprietary models

I experiment a lot with the open models and I’m getting tired of this trope. I’m not yet convinced that even the best open weight models are equal to Opus from “a few months” ago.

I know what the benchmarks say. I had higher hopes. My real experience just doesn’t match the benchmarks.

I also do a lot of work that even Opus 4.8 struggles with. When even the cutting edge LLMs aren’t all the way there yet, my motivation to switch to something even further behind just isn’t there.

reply

upvote

by iot_devs15 hours ago|

[-]

I would love it if you could give some examples

reply

upvote

by CamperBob215 hours ago|

[-]

Have you found anything specific that the full-precision quant of GLM 5.2 can't do that Opus 4.8 can? I haven't, so far.

5.2 lives up to the hype. I don't find it to be the best at anything except coding. But for coding... yeah, it lives up to the hype. Not quite Opus 4.8-level, but I would feel comfortable comparing it to 4.5, at least if it had vision capabilities.

reply

upvote

by OtomotO14 hours ago|

[-]

> My real experience just doesn’t match the benchmarks.

That's exactly the problem I have... with Anthropic and "Open""AI"

reply

upvote

by dwoosley17 hours ago|

[-]

The only reason I'm on HN right now reading this post is because Anthropic's API is down... so there's another point for self-hosted.

reply

upvote

by qznc10 hours ago|

[-]

To be a little bit more precise than "a few months behind", what probably matters is before or after "Claude Opus 4.5 from Nov 24, 2025". That was the model which started the OpenClaw hype over Christmas.

reply

upvote

by itwaswatson16 hours ago|

[-]

We have a provider with Deepseek V4 flash at our work. It can handle 95% of the "actually functional" workload at a tenth of the cost. I still pull up beefier ones sometimes, but that's after some consideration.

The moat is so flat, it only gives +1 food and +1 production. +1 gold with a road.

reply

upvote

by calgoo10 hours ago|

[-]

Same, I feel that V4 Flash is great at task implementation, but I'm still looking at bigger models for design. That said, GLM 5.2 with high thinking is getting really close. I have switched for all personal projects and am quite happy with the results. I think the magic is that the big context window (1M) plus a lot of thinking gets us very close to at least Opus 4.6 level. I'm currently running directly on z.ai with a lite coding plan, and have bought API credit on DeepSeek as well. I will be looking at EU-based hosts next and then I might switch over some of the more critical flows.

reply

upvote

by taormina18 hours ago|

[-]

For that matter, the new models are shit. If I’m using Opus 4.6 anyway to get anything actually done, then great, we’re actually entirely caught up then.

reply

upvote

by 827a15 hours ago|

[-]

Intelligence is maybe a few months behind. But cost sadly is further behind. GLM-5.2 has a deceptively high cost during day-to-day usage for e.g. coding because 1) it has to think a ton more than GPT-5.5/Opus-4.8 to get to competitive results; 2) many providers are still figuring out caching; and 3) API pricing for Codex/Claude can be as high as 40x more than subscription pricing, which distorts the market.

reply

upvote

by Gigachad17 hours ago|

[-]

The reason for me is that work pays for GitHub Copilot, which doesn't have these open models.

reply

upvote

by derwiki6 hours ago|

[-]

OOC did an LLM write this? The last sentence feels very LLM

reply

upvote

by 12 hours ago|

[-]

deleted

reply

upvote

by 16 hours ago|

[-]

deleted

reply

upvote

by TacticalCoder18 hours ago|

[-]

> I think it's interesting that people write off open weight models because they're "a few months behind" proprietary models.

The really interesting thing is that it's typically those very same accounts who were explaining, a few months ago, that thanks to their commercial model they were gaining so much time and producing so much fantastic code.

A few months pass and suddenly the open-source models have caught up with the models that were gaining them so much time and that produced amazing code (in production everywhere for sure btw) but... It's impossible to work with these models.

Rinse and repeat.

The current models, according to them, are basically AGI and they can go fishing while paid subscriptions solve the world's problems.

But in six months there will be new closed, pricey models, and once the open ones have reached the level of Fable, we'll hear how it's impossible to work in late 2026 on a model that is "only at the level of Fable".

These people should have been snake-oil salesmen (and it could be what they actually are).

reply

upvote

by nemomarx17 hours ago|

[-]

My most charitable interpretation is that there's some honeymoon effect for each release, and people genuinely feel very productive and useful for 2-3 months. By the time the next big model release happens they've seen some issues or run into something that makes them feel like the new model will fix all that and improve their flow so much, etc.

Not unusual in the tech space, but this has been basically constantly happening for two years now? I can't imagine the improvements are more than incremental at this point.

reply

upvote

by windexh8er15 hours ago|

[-]

They are generally referred to as the Kool-Aid drinkers. There's always something holding them back from open models. It's no different than the argument in the article. I've been daily driving Linux for well over 20 years at this point and while things have gotten easier they haven't gotten that much easier. There's always been a distro that's focused on new users or ease of use. I used to take for granted the Linux distro ecosystem but now worry how Microsoft, Apple and others will continue to try and legislate compute into a corner. I can appreciate good engineering, but when I look at OS X and Windows they're both failing end users in different ways.

Just like the OS ecosystem I think we'll see a similar trajectory with OAI, Anthropic and Google but on a much accelerated time scale. I think the lobbying has begun to lock in their fate for revenue - because none of them give a shit about their users. I do hope, however, that Anthropic continues to over rotate and continue to gimp their models into uselessness. I just asked Opus 4.8 the other day to look at some code as an adversary and summarize areas that should be addressed. Nothing specific and it shut down the conversation. However starting a new prompt and prodding the model from a different angle yielded the results I asked for directly. Pick a lane. Or, don't and continue to lose industry respect and consideration.

reply

upvote

by tonfreed17 hours ago|

[-]

Even just one of the smaller models is good enough for the grunt work I use them for 90% of the time. Currently doing most of my home hobby projects with OpenCode Go and Qwen 3.7 Plus, it's not great at diagnosing issues in the code, but if I can clearly articulate a test suite or boilerplate refactoring it works fine.

reply

upvote

by nomel1 hours ago|

[-]

> I use them for 90% of the time

10% failure rate would drive me absolutely insane.

reply

upvote

by moomoo1115 hours ago|

[-]

ok but your competition using the latest models has an advantage

not all of us are doing noob shit lol

reply

upvote

by handoflixue9 hours ago|

[-]

You're being entirely unreasonable. 640 kilobytes of memory was enough for Bill Gates, and yet somehow your special project needs more?

reply

upvote

by moomoo1133 minutes ago|

[-]

it's 2026. my work machine is a macbook pro with 64gb ram.

i would rather spend a few hundred or thousands of dollars a month to make way more, than waste time and still lose to people who are using the latest commercial models which are 3 months ahead of the open source.

what are you even talking about?

reply

upvote

by 59nadir11 hours ago|

[-]

Heh, if you're using LLMs heavily for work I think odds are pretty good you're doing pretty trivial stuff. It might not be trivial to you, but you're probably just not very good at this.

reply

upvote

by derwiki6 hours ago|

[-]

Pretty sure the big quant shops heavily use LLMs; maybe it’s trivial stuff and they just work 100 hrs/week?

reply

upvote

by 59nadir2 hours ago|

[-]

Anyone who heavily uses LLMs for their work is pretty obviously ill-equipped for their work, and likely getting even worse at it with time, yes. You can throw around as many "things people from the US worship because they make money" positions/industries as you want. Nevermind that you're "pretty sure". People who both know what LLMs are actually capable of without major issues, plus are already capable enough to do their job don't need to use LLMs heavily, they'll use them for what little they're actually useful for. Only incompetent (or uninterested & incompetent) people lean on them very heavily.

Edit: To clarify what I mean by this:

Anyone who uses LLMs for larger-than-small-module code generation, pretend-not-vibecoding (a.k.a spec-driven development), or outright vibecoding, etc., is using an LLM "heavily", IMO.

The appropriate things to use them for is information retrieval, plus as a basic extra signal in debugging, code understanding, quality checks, and so on.

Also, it's not illegal to be incompetent. Most people were incompetent long before LLMs showed up, it's not some rarity.

reply

upvote

by moomoo1132 minutes ago|

[-]

bro go do your tickets lmfao

reply