They are open source and cost waaaay less per token than American models.
I’m using them right now on the $20 Ollama cloud plan and I can actually work on my side projects with them without hitting the limits too often. With the $20 Claude Pro plan my usage can barely survive one or two prompts.
And I chose Ollama cloud just because their CLI is convenient to use, but there are a lot of other providers for those models, so you aren’t even stuck with shitty conditions and usage rules.
To me that’s a pretty bad thing for the American economy.
You know, for the rest of the economy that is not big tech.
And investors pumping money into the US AI circular money flow just make innovation everywhere else slower. If not for the GPU/memory drought, running stuff locally (or just in a competing cloud) would be far cheaper.
I don't know where to begin if you're leading with that. Anything approaching reality is not good for the current administration.
I can name thousands that came out of Western universities.
I see a lot of rhetoric that only the Chinese labs are contributing to AI, while companies like Google and Microsoft are still publishing their research.
Unfortunately the domain of scientific papers is cluttered with AI slop, but the occasional serious papers I find are still from Western labs, particularly Google Research or Microsoft Research.
There is more to the American economy than big tech.
And that's precisely why this has started: https://www.wired.com/story/super-pac-backed-by-openai-and-p...
Most of the stock market valuation is big tech, and most of people's retirements are in the stock market, so... if the AI bubble bursts a lot of the US will be affected.
Which is why most of it is a bubble
The author didn't do any of that. They ran each model once on each of 13 (so far) problems and then they chose to highlight the results for the 12th problem. That's not even p-hacking, because they didn't stop to think about p-values in the first place.
LLM quality is highly variable across runs, so running each model once tells you about as much about which one is better as flipping two coins once and having one come up heads and the other tails tells you about whether one of them is more biased than the other.
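For what it's worth, a quick sketch in Python of the coin analogy, assuming an arbitrary 0.6 pass rate for both models (i.e. no real difference between them):

    import random

    # Two "models" with the SAME underlying pass rate (0.6 is an
    # arbitrary assumption). How often does one run declare a winner?
    P, TRIALS = 0.6, 10_000
    run = lambda: random.random() < P  # one pass/fail benchmark run

    spurious = sum(run() != run() for _ in range(TRIALS))
    print(f"single run picks a 'winner' {spurious / TRIALS:.0%} of the time")

With p = 0.6, a single paired run crowns a spurious winner 2*0.6*0.4 ≈ 48% of the time. Pure noise.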
I reckon we'll have similar suites comparing different aspects of models.
And, at some point, we'll be dealing with models skewing results whenever they detect they're being benchmarked, as happened before with hardware. Some say that's already happening with the pelican test.
The problem is that hardware benchmarks are harder to game. Yes, a hardware manufacturer can make driver tweaks so that, say, a particular game runs better, but the benchmark is still representative of the workload the user faces, and they can't change the most important part, the hardware itself; they can't benchmark-gimmick their way through designing hardware.
Meanwhile in LLM land the game is to tune the model for the currently popular set of benchmarks, all while user experience is only vaguely related to those results.
Most people who have computers could run inference for even the biggest LLMs, albeit very slowly when they do not fit in fast memory.
On the other hand, training or even fine-tuning requires both more capable hardware and more competent users. Moreover, the effort may not be worthwhile when diverse tasks must be performed.
Instead of attempting fine-tuning, a much simpler and more feasible strategy is to keep multiple open-weights LLMs and run them all for a given task, then choose the best solution.
This can be done at little cost with open-weights models, but it can be prohibitively expensive with proprietary models.
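A minimal sketch of that best-of-N idea, assuming Ollama's OpenAI-compatible endpoint on localhost and made-up model tags; score_solution is a hypothetical stand-in for whatever judges the outputs (unit tests, a rubric, another model):

    from openai import OpenAI

    # Sketch only: assumes these tags are pulled locally in Ollama.
    client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")
    MODELS = ["deepseek-r1", "qwen3", "glm-4"]  # hypothetical tags

    def score_solution(text: str) -> float:
        return len(text)  # placeholder: plug in your real judge here

    def best_of(task: str) -> str:
        candidates = [
            client.chat.completions.create(
                model=m, messages=[{"role": "user", "content": task}]
            ).choices[0].message.content
            for m in MODELS
        ]
        return max(candidates, key=score_solution)  # keep the best answer

The per-task cost is N local inferences instead of one, which is cheap with open weights and painful with metered APIs.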
https://ghzhang233.github.io/blog/2026/03/05/train-before-te...
It just hasn't been widely adopted yet. And it might be in each of their particular interests that it stays that way for a while. It's basically like p-hacking.
It's very difficult to justify spending on their models in a world where DeepSeek costs a fraction, Chinese open models exist, and they perform as well as what is considered state of the art, provided you adjust how you use them.
A couple of days ago I canceled ChatGPT and started to try out DeepSeek. Let's see how it goes.
We as an industry cannot determine if one software engineer is objectively better than another, on practically any dimension, so why do we think we can come to an objective ranking of models?
But I'm more optimistic about testing programming models. You can run repeated tests, and compare median performance. You can run long tests, like hundreds of hours, while getting more than a few humans to complete half-day tests is a huge project. And you can do ablation testing, where you remove some feature of the environment or tools and see how much it helps/hurts.
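Something like this, where run_benchmark is a hypothetical stand-in for one scored run of a model on a task with a given tool set:

    import random
    import statistics

    # Hypothetical scorer: one run of `model` on `task` with `tools`
    # enabled; here it just returns noise so the sketch executes.
    def run_benchmark(model: str, task: str, tools: frozenset) -> float:
        return random.random()

    def median_score(model, task, tools, n=20):
        # Repeated runs tame the run-to-run variance a single run hides.
        return statistics.median(run_benchmark(model, task, tools) for _ in range(n))

    def ablation(model, task, tools):
        # Drop one tool at a time; see how far the median falls.
        full = median_score(model, task, tools)
        return {t: full - median_score(model, task, tools - {t}) for t in tools}

None of that scales to human subjects, which is exactly the point.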
And we can judge developer performance; it just takes six months to a year of working with a team, so it's hard to turn into a metric.
Because they're non-deterministic, because of constant updates and changes, and because the models are throttled according to the number of users, releases, etc.
But I use Codex and Claude daily (work and hobby respectively). And there are days where one or the other just seems to have gotten up on the wrong side of the bed. Or is just being lazy. Or is suddenly super-powered, doing everything including what I asked it not to. (To be fair, the same thing happens with myself. :/)
I am convinced that if I were benchmarking, I would be convinced these are different models on different days.
[This conviction may say more about me than about the model.]
> The Word Gem Puzzle is a sliding-tile letter puzzle. The board is a rectangular grid (10×10, 15×15, 20×20, 25×25, or 30×30) filled with letter tiles and one blank space.
Just last week my superior asked me to implement that for a customer. /s
Maybe some real, real task would be good? Add some database, some REST, some random JS framework and let it figure out a full-stack task instead of creating some rectangles?