DeepSeek and GLM (plus Kimi) are at or above Sonnet level wrt. favorable workloads like coding. They're not close to Opus or the latest GPT yet, and Fable is even higher than that. Other workloads relying more on real-world knowledge have them even further behind, and this can't be mitigated without making the model itself bigger and harder to host locally.

by CuriouslyC 6 hours ago

Not true. Big models buy you baked-in knowledge and long-context cohesion. A model can be trained to use search and knowledge base tools more efficiently to mitigate the former, and harnesses/workflows can be designed to push models into small parallel threads to mitigate the latter.

The thing that big models will always bring to the table is the ability to YOLO weak/under-specified prompts, and spend less time in the loop making sure work gets partitioned correctly. For smaller/simpler tasks the P(success) difference isn't that big.
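To make the parallel-threads idea concrete, here is a minimal sketch, not anyone's actual harness. `call_small_model` and `run_partitioned` are hypothetical names, and the model call is a stub that just echoes its prompt; a real harness would swap in an API call and probably add retries and result checking.

```python
from concurrent.futures import ThreadPoolExecutor


def call_small_model(prompt: str) -> str:
    # Hypothetical stand-in for a real model call; it only echoes the prompt.
    return f"done: {prompt}"


def run_partitioned(task: str, subtasks: list[str], max_workers: int = 4) -> list[str]:
    # Each subtask gets its own short, self-contained prompt, so no single
    # call has to hold the whole job in context.
    prompts = [f"{task} :: {sub}" for sub in subtasks]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order, so results line up with subtasks.
        return list(pool.map(call_small_model, prompts))


results = run_partitioned("refactor module", ["parse", "validate", "render"])
```

The partitioning itself (choosing the subtasks) is the part that still needs care, which is the time-in-the-loop cost mentioned above.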

by zozbot234 3 hours ago

Knowledge-base access is not very useful in general because a model doesn't have well-defined "known unknowns" that might trigger an agentic search of the outside knowledge base. Plus surfacing knowledge you don't know much about is itself hard.

by dboreham 5 hours ago

These things sound plausible, but have they actually been demonstrated? Wouldn't anyone who succeeded in making such a small but useful LLM be raking in the money now?

by CuriouslyC 4 hours ago

Cursor's composer 2.5 is a perfect example. It's right on the heels of the frontier (for coding only) at an order of magnitude lower cost. As much as I've shit on Cursor in the past, I do think the company is well positioned to pick up people getting sticker shock on Anthropic tokens, if they can get their marketing down.

by zozbot234 3 hours ago

If that's Kimi-based it would very much be on the larger side of open-weight models (1T params).

by CuriouslyC 2 hours ago

It is, but the US labs have been pushing parameters heavily. There was a pullback from big models after GPT-4.5 in particular, but with a shift in emphasis towards post-training and the good results Google got with scaling Gemini 3, all the labs started to push scaling again, which is the reason the frontier is getting more expensive. So that 1T isn't as big as it sounds; the American frontier is probably sitting at 3-5T at least.

by thepasch 10 hours ago

> They're not close to Opus or the latest GPT yet

Disagreed. GLM-5.1 is easily as good as Opus 4.5, the model that kicked this entire hype cycle into overdrive in the first place, for all the coding purposes I could throw at it.

by Cider9986 10 hours ago

I've found GLM to be comparable to or better than Opus at writing, and at a fraction of the cost.

by zozbot234 10 hours ago

Writing does not rely on real-world knowledge all that much, other than knowledge of language itself. Even tiny models can achieve that; it's even easier than coding.

by CuriouslyC 6 hours ago

The challenge with writing is that labs collapse the distribution around "tasteful" writing when the people making decisions about training data aren't able to effectively discriminate it.

by metalspot 8 hours ago

The key thing here is that effective intelligence = model capability / cost. If you drive down the cost of inference you can have higher effective capability even with a technically less capable model. There is nothing in Anthropic's or OpenAI's general reasoning capabilities that can't be easily done much better with a purpose-built harness for a domain-specific task.
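A toy version of that ratio, assuming capability and cost can each be boiled down to a single number (a big assumption). `effective_intelligence` is a hypothetical helper, and the figures are made up for illustration, not benchmark results.

```python
def effective_intelligence(capability: float, cost: float) -> float:
    # capability / cost, as in the formula above; units are whatever you pick.
    if cost <= 0:
        raise ValueError("cost must be positive")
    return capability / cost


# Made-up numbers purely for illustration.
frontier = effective_intelligence(capability=90.0, cost=10.0)
cheaper = effective_intelligence(capability=75.0, cost=1.0)

# Under these assumed inputs the less capable model has the higher ratio.
cheaper_wins = cheaper > frontier
```

The ratio only says something useful once the cheaper model clears whatever capability floor the task actually needs.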

by kuboble 11 hours ago

I think there are at least a few question marks.

One being that extrapolating from like 3 data points is hardly science. All trends break at some point.

The other is that the measures to prevent distillation of their models (if it was a secret sauce of Chinese models) could work if nobody is allowed to use them.