So yeah, I'm totally fine using Kimi-2.7, GLM-5.2 or Deepseek-v4. I think we've already hit the ceiling and most improvements now seem to be from harness improvements and slightly better RL to improve reasoning/tool calling.
It’s pretty good at catching when performance is degraded. Performance was degraded for a week or so before Fable launched, for instance, probably due to A/B testing or capacity, as you noted.
Maybe the truth is the newest models aren't actually as impressive as we thought. Maybe our perception of progress is being manipulated via months of gradual, silent and unverifiable degradation.
Let’s say I’m a bad faith LLM operator, and I want to degrade my model so the next release looks better and people want to switch to the more expensive one. How would I do that?
They wouldn't even need to do this uniformly: only a subset of the requests could be routed to quantized versions of the model. They could do this to nerf the old model, or more likely just to give themselves more hardware to run the new one on by handling more requests on less hardware. Or to handle increased request volume as traffic ramps up faster than hardware can be provisioned.
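To put rough numbers on the partial-routing idea, here is a minimal sketch. Everything in it is an assumption for illustration: the pass rates, the 25% share, and the function names `blended_pass_rate` and `samples_per_group` are made up, and nothing here is a measurement of any real provider. It blends two pass rates by routing share, then uses the standard two-proportion sample-size formula to estimate how many requests you'd need before the drop becomes statistically visible.

```python
import math
from statistics import NormalDist


def blended_pass_rate(p_full: float, p_quant: float, quant_share: float) -> float:
    """Pass rate a user would observe if quant_share of requests hit the weaker variant."""
    return (1.0 - quant_share) * p_full + quant_share * p_quant


def samples_per_group(p1: float, p2: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Requests needed per group for a two-sided two-proportion z-test to tell p1 from p2."""
    if p1 == p2:
        raise ValueError("rates are identical; no sample size can separate them")
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # critical value for the test
    z_beta = NormalDist().inv_cdf(power)               # quantile for the desired power
    p_bar = (p1 + p2) / 2.0
    numerator = (
        z_alpha * math.sqrt(2.0 * p_bar * (1.0 - p_bar))
        + z_beta * math.sqrt(p1 * (1.0 - p1) + p2 * (1.0 - p2))
    ) ** 2
    return math.ceil(numerator / (p1 - p2) ** 2)


# Hypothetical numbers: full model passes 90% of tasks, quantized variant 80%,
# and a quarter of requests are silently routed to the quantized variant.
observed = blended_pass_rate(p_full=0.90, p_quant=0.80, quant_share=0.25)
needed = samples_per_group(0.90, observed)
print(f"observed pass rate: {observed:.3f}, requests per group to detect: {needed}")
```

With these made-up inputs the blended rate is 0.875, and telling that apart from 0.90 takes on the order of a couple thousand graded requests per group, which is far more than any single user's gut feeling is based on.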
Playing with local models at various quants, the degradation can be hard to spot. Sometimes it's only noticeable in aggregate. And even then, you never really know if you just got unlucky with a bad response due to RNG.
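For the "only noticeable in aggregate" part, a minimal sketch of how I'd check it, assuming you can grade each response as pass/fail on a fixed task set. The function name `two_proportion_z_test` and all the counts below are hypothetical. It's a plain pooled two-proportion z-test using only the standard library.

```python
import math


def two_proportion_z_test(pass_a: int, n_a: int, pass_b: int, n_b: int):
    """Return (z, two-sided p-value) for the difference between two pass rates."""
    if n_a <= 0 or n_b <= 0:
        raise ValueError("both samples need at least one trial")
    rate_a = pass_a / n_a
    rate_b = pass_b / n_b
    pooled = (pass_a + pass_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    if std_err == 0.0:
        # Every trial passed or every trial failed: no evidence of a difference.
        return 0.0, 1.0
    z = (rate_a - rate_b) / std_err
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Hypothetical counts: same 90% vs 80% pass rates at three sample sizes.
for pass_a, pass_b, n in [(9, 8, 10), (90, 80, 100), (900, 800, 1000)]:
    z, p = two_proportion_z_test(pass_a, n, pass_b, n)
    print(f"n={n:4d}  z={z:.2f}  p={p:.4f}")
```

The same ten-point gap is indistinguishable from bad RNG at 10 samples per variant, borderline at 100, and unmistakable at 1000. It still says nothing about whether any one bad response came from a weaker variant.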
I've had Opus 4.6 fall into some weirdly incoherent loops that I rarely see from even Sonnet, that felt like the kind of thing I got frequently with Qwen3.5 9B on local. And the above applies... Was that just bad RNG? Or was my request to Opus routed to some lower quality variant? There's no great way for me to tell for any given request, nor any way to guarantee Anthropic _didn't_ do that.
I don't seem to get any of this with GPT-5.5 or GPT-5.5-Pro (not that I use 5.5-Pro enough to know for sure, but when I do use it, it never seems nerfed).
At least it's going to be usable as a very high end gaming PC.
There is also a low probability that someone enters peace negotiations solely to threaten the negotiators with death, yet here we are. With these guys it is: Better safe than sorry.
I didn't appreciate this until I started down that road myself.
Couldn't have put it better myself. That's what all this comes down to. Owning the hardware, owning the inference. Not perpetually renting them out on a meter like in the dystopian future they're envisioning.
lol this already happened with Fable!
Long term predictability ought to far outweigh a few more cycles of performance.
The top models also seem to have inconsistent performance depending on the time of day and how far we are from the next release.
Even with minor automation I feel like I can watch OpenAI and Anthropic engineers fiddling in real-time. Tuesday’s behaviour changes by Thursday, 10AM’s production isn’t possible at 11:30AM. Nutty.
Which is what I suspect the providers are doing to fit more inference on the same amount of hardware over time.
https://marginlab.ai/trackers/claude-code-historical-perform...
There were at least a couple of these degradation trackers.
I experiment a lot with the open models and I’m getting tired of this trope. I’m not yet convinced that even the best open weight models are equal to Opus from “a few months” ago.
I know what the benchmarks say. I had higher hopes. My real experience just doesn’t match the benchmarks.
I also do a lot of work that even Opus 4.8 struggles with. When even the cutting edge LLMs aren’t all the way there yet, my motivation to switch to something even further behind just isn’t there.
5.2 lives up to the hype. I don't find it to be the best at anything except coding. But for coding... yeah, it lives up to the hype. Not quite Opus 4.8-level, but I would feel comfortable comparing it to 4.5, at least if it had vision capabilities.
That's exactly the problem I have... with Anthropic and "Open""AI"
The moat is so flat, it only gives +1 food and +1 production. +1 gold with a road.
The really interesting thing is that it's typically those very same accounts who were explaining, a few months ago, that thanks to their commercial model they were gaining so much time and producing so much fantastic code.
A few months pass and suddenly the open-source models have caught up with the models that were gaining them so much time and that produced amazing code (in production everywhere for sure btw) but... It's impossible to work with these models.
Rinse and repeat.
The current models, according to them, are basically AGI and they can go fishing while paid subscriptions solve the world's problems.
But when in six months there are new closed, pricey models and the open ones have reached the level of Fable, we'll hear how it's impossible to work in late 2026 on a model that is "only at the level of Fable".
These people should have been snake-oil salesmen (and it could be what they actually are).
Not unusual in the tech space, but this has been basically constantly happening for two years now? I can't imagine the improvements are more than incremental at this point.
Just like the OS ecosystem, I think we'll see a similar trajectory with OAI, Anthropic and Google, but on a much accelerated time scale. I think the lobbying has begun to lock in their fate for revenue, because none of them give a shit about their users. I do hope, however, that Anthropic continues to over-rotate and gimp their models into uselessness. I just asked Opus 4.8 the other day to look at some code as an adversary and summarize areas that should be addressed. Nothing specific, and it shut down the conversation. However, starting a new prompt and prodding the model from a different angle yielded the results I had asked for directly. Pick a lane. Or don't, and continue to lose industry respect and consideration.
10% failure rate would drive me absolutely insane.
not all of us are doing noob shit lol
i would rather spend a few hundred or thousands of dollars a month to make way more, than waste time and still lose to people who are using the latest commercial models which are 3 months ahead of the open source.
what are you even talking about?
Edit: To clarify what I mean by this:
Anyone who uses LLMs for larger-than-small-module code generation, pretend-not-vibecoding (a.k.a. spec-driven development), or outright vibecoding, etc., is using an LLM "heavily", IMO.
The appropriate things to use them for are information retrieval, plus as a basic extra signal in debugging, code understanding, quality checks, and so on.
Also, it's not illegal to be incompetent. Most people were incompetent long before LLMs showed up; it's not some rarity.