(hnup.date)
One thing for sure is that while Claude is currently taking the #1 spot in mentions, it carries a lot of negative sentiment due to API pricing policies and frequent server downtime. On the other hand, the runner-up, GPT-5.5, actually seems to have more positive feedback.
Personally, my experience with Codex wasn't as good as with Claude Code (Codex freezes on Windows more often than you'd expect), so this is a bit surprising. That said, the more defensive GPT is definitely better in terms of sheer code-writing capability. However, GPT actually has quite a few issues with text corruption when generating in Korean or Chinese—something English-speaking users probably don't notice. In terms of model capabilities, when given the same agent.md (CLAUDE.md) file, I think GPT is better at writing code, while Claude is better at writing text during code reviews.
Looking at the bottom right, Qwen and DeepSeek are open-source, so they are largely mentioned in the context of guarding against vendor lock-in, which drives positive sentiment. Considering that Hacker News occasionally shows negative sentiment toward China, the fact that they are viewed this positively—unlike US models—shows that being open-source is a massive advantage in itself.
Anyway, one thing for sure is that Gemini is pretty much unusable.
Gemini is not at all unusable. It is quite usable for the tasks it excels at - to the point that it is the top pick for many tasks and I spend more money there than elsewhere. On the other hand, it responds quite differently from the other major models: Claude and GPT are similar to one another, while Gemini requires a different approach. In my opinion, people who think Gemini is worthless have not learned how to prompt it correctly. Again, this is intuition built from watching concrete response differences due to small input changes, but if I had to summarize it, the model shows its Google Books / Google Scholar roots.
I have started experimenting with qwen more than deepseek, but I have not had good results yet. Given the good press I presume I will learn how to interact with it for better results.
Curious if others have similar experiences comparing models usefully, or if most don't bother with this, or do something else? I mainly use models for highly focused specialty tasks, so this fine-tuning makes the difference between usable and unusable. I don't yet have the luxury of defining my preferred workflow and then finding the tool for the task; everything just breaks almost immediately if I try to shoehorn a model into my preferred flow.
And what use cases do you think it’s best suited for?
They are cheaper! All signals point to them staying cheaper because they are built more sustainably. Also, some of the latest entries can run on a single GPU! Literally available at your desktop, where there can be no service interruptions - not even network latency. People are one- and few-shotting little games for $0 because they bought a GPU to play video games this year. To me that's unbeatable value. Once the tooling catches up and a few more model releases land, it could change everything completely.
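As a rough sketch of what "running on a single GPU at your desktop" looks like in practice: most local runners (llama.cpp's server, Ollama, etc.) expose an OpenAI-compatible endpoint, so the client side is only a few lines. The port, API key, and model name below are placeholders for whatever you run locally.

    # Minimal sketch: talk to a locally hosted model through an
    # OpenAI-compatible endpoint (llama.cpp server, Ollama, etc.).
    # base_url, api_key, and model name are placeholders for your local setup.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

    resp = client.chat.completions.create(
        model="local-model",  # whatever name your local server registers
        messages=[{"role": "user", "content": "Write a tiny Pong clone in Python."}],
    )
    print(resp.choices[0].message.content)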
It's really a cost-effective model.
Of course, when I tried it on something else it rewrote every line in the file for no good reason, applied changes directly when I told it just to plan, etc.
So maybe it has one strength.
Essentially, I use it when I truly only need an "Advanced Google" to find lots of document or website references based on only some partial understanding of "X". I don't like having it do anything with those things. Only when I need to find those things.
Claude, especially, seems to absolutely hate doing research when there are major ambiguities in your question. It's the only one of the major models that keeps playing 20 questions with me when I neither know nor care what the answers to those questions are.
If I have a task that requires parsing through swathes of irregular data that traditional ml would choke on (or require an intermediate training step ala bigquery), I have gotten much better results from Gemini than the other two.
Ha! I find that Gemini is quite useful - if only because I am forced to use it (on my personal projects) because it's the only one that has unlimited interaction for "free"
It has its limitations, yes, but so does Claude (which I am leaning on too heavily at work at the moment)
maybe cache this thing, my guy - you're just doing a bunch of reads
---
constructive suggestions
- you have a pretty cheap process here, and HN exposes historical posts by date. perhaps worth running this back the last 2 years to reconstruct a history of sentiment? (see the sketch after this list)
- introduce alternative sorts around the net positive/negative sentiments and absolute positive sentiments, similar to State of JS (https://stateofjs.com) - you'll see the GPT outperformance more clearly
- matching of Opus 4.7 and Opus Latest seems sus?
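A rough sketch of what that backfill could look like against HN's public Algolia search API; the endpoint and field names come from that API, while the query term, date window, and paging are purely illustrative.

    # Sketch: pull historical HN comments mentioning a model name via the
    # public Algolia search API, one date window at a time.
    # Only the first page of results is fetched here, for brevity.
    from datetime import datetime, timezone

    import requests

    def comments_mentioning(term, start, end):
        params = {
            "query": term,
            "tags": "comment",
            "numericFilters": (
                f"created_at_i>{int(start.timestamp())},"
                f"created_at_i<{int(end.timestamp())}"
            ),
            "hitsPerPage": 100,
        }
        r = requests.get("https://hn.algolia.com/api/v1/search_by_date", params=params)
        r.raise_for_status()
        return r.json()["hits"]

    hits = comments_mentioning(
        "Claude",
        datetime(2024, 1, 1, tzinfo=timezone.utc),
        datetime(2024, 2, 1, tzinfo=timezone.utc),
    )
    print(len(hits), "comments in that window")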
Backfilling it further is definitely in the cards, I just want to stabilize the methodology first.
If a comment just mentions Opus without being more specific and in the absence of relevant context clues, it gets mapped to Opus Latest. So it's saying more about the model family than a specific version. Tbh I'll probably remove all "-latest" data points going forward, as I mentioned in another comment.
Consider keeping this data point but instead calling it something like "Opus Unspecified". Let the user decide how to interpret it.
I am upset because now Anthropic, OpenAI, Meta, etc. will continue their smear campaigns here. But I am also happy because it will make HN less useful when they do.
Everything is a give and take, I guess. Excited to see where the equilibrium sits.
What I want is more fully open models where everything is shared: data, training algorithms, weights. That way we can figure out whether we should trust them.
I think it's also unfair to say their success is solely due to stealing data. They are contributing a lot of advances to the literature about what they are doing. The proof is in the results: we have 27B models you can vibe code with, not 1T+ ones.
It's murky, sure. But there are smear campaigns about how people can't trust China, too. There's some truth to that, but we can't trust the US either, so local models are an interesting way for China to offer us some level of sovereignty.
1.) Opus 4.7 via the API is great. Unlike 4.6, I have found the model to degrade far less beyond 120k, even 600k can be relied upon. Task Inference, Task Evaluation, Task Adherence, tool calling, all do very well on my evals. I did however for the first time in a while end my Claude Max subscription because, after their post-mortem [0] I for the first time saw true, reproducible, incredibly frustrating regressions in model output when using Claude Code.
Yes, this was after their post in the last week of April 26, and yes, I have been fortunate enough never to have been affected by regressions up to this point. The model via API with other harnesses provides consistent, useful and high quality output, but the recent changes have become an avalanche of "this requires more than two changes so we should table this for later" and "it seems the subagent finding was wrong and this is not actionable", with a healthy mix of suggestions that are clearly there to save tokens but go against clear instructions. I understand that they are compute constrained, but as someone who until recently had never maxed out their weekly limits and almost never their 5 hour limits on the Max 5x plan, these changes are not just frustrating (and make reasonable users think the model was nerfed rather than the harness) but also cost more, as I now have to prompt four times and spend thousands of tokens more for a task that previously the same harness with the same model did far more efficiently. I regularly check the numbers and yes, by trying to be more efficient, they made what I am costing them far higher, going beyond what I pay for the subscription. Ironically, and I must emphasise this, I did not have regressions before, which suggests some major luck in A/B testing at least.
2.) GPT-5.5 is amazing, a true jump of the kind I have not seen since GPT-5, and far more than even GPT-5.4 it approaches the way Anthropic models have handled task inference, which has also led to far reduced reasoning needs. I very much like it, with the exception of the reduced context window and the degradation in compaction. GPT-5.4 did compaction so consistently well that the 272k standard window before the price increase was of no concern and going beyond it was reliably possible. With GPT-5.5, the cost per token is doubled and compaction is far less reliable, leading to loss of task adherence and preventing task completion in certain cases. I am aware GPT-5.5 is a new pretrain (though how new, given frontend work is still abhorrently poor and has been since Horizon Alpha, which I maintain was worse than GPT-4.1) and am hopeful they can integrate some of the solutions they were leveraging for GPT-5.4 compaction, but until then, it remains a model great for very challenging and complex blockers, but not a GPT-5.4 drop-in replacement.
3.) Kimi K2.6 is great for the API price: efficient, fast, and it does very well on all my metrics. I like it far more than DeepSeek V4 Pro, any Qwen, Z.AI or Meta model, and I truly am impressed. Composer 2 has shown how you can take the base even further given the right data, and if I had to pay exclusively API pricing without any subscriptions, I think I'd have no problem leaning on K2.6 for most needs. It is what I'd love to see from Mistral or Apple, and it shows that an open-weight company can't just succeed in a few narrow areas (Z.AI with tool calling, DeepSeek with world knowledge, Mistral with being European, etc.) but has to provide a balanced product across all areas. I just wish they'd expose Agent Swarms via the API; there are a few experiments I'd like to try.
[0] https://www.anthropic.com/engineering/april-23-postmortem
It's actually pretty difficult to find a good comparison model, because there isn't one. Again: as a 14/28 cent in/out model (ignoring cache), it scores just below GPT 5.4 Mini-xhigh (75/450) and Gemini 3 Flash (50/300) in intelligence. It's similar to Gemma 4 31B (13/38) in some metrics, including cost, so it's not completely unheard of, but it's pretty notable that virtually everything else in the same region in most benchmarks is going to cost at least 5 times more (much, much more in very output-heavy contexts).
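To put those prices side by side - assuming they're cents per million tokens, input/output, and using a made-up 30M-in / 5M-out workload purely for comparison:

    # Rough arithmetic sketch: cost of a hypothetical job at each price point.
    # Prices are cents per million tokens (input/output) as quoted above;
    # the 30M-in / 5M-out workload is invented for illustration only.
    prices_cents = {
        "Kimi K2.6":          (14, 28),
        "GPT 5.4 Mini-xhigh": (75, 450),
        "Gemini 3 Flash":     (50, 300),
        "Gemma 4 31B":        (13, 38),
    }

    tokens_in, tokens_out = 30_000_000, 5_000_000

    for model, (cin, cout) in prices_cents.items():
        dollars = (tokens_in * cin + tokens_out * cout) / 1e6 / 100
        print(f"{model:20s} ${dollars:7.2f}")
    # Kimi lands around $5.60 for this job vs ~$45 at the Mini-xhigh pricing,
    # which is where the "at least 5 times more" gap comes from.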
I wouldn't use Gemini 3 Flash or GPT 5.4 mini for anything except the most trivial work, although both are useful for basic exploratory work.
So I'm using a heavy model for the bulk of the work, and its cost so far outweighs the light model's that the light model's cost is effectively irrelevant.
If one likes a model then it's capable of one-shotting entire apps.
Otherwise it's "only suitable for the most trivial tasks".
Never in between.
Personally my opinion in this regard is highly consistent over time.
Subjectively, it seemed like DeepSeek V4 Pro had the highest hype/performance ratio (meaning high hype for lower performance), whereas MiMo V2.5 Pro didn't get much attention despite being the top dog in the open-weights world - not even an honorable mention in your chart :( ...
Searching for it on HN shows very few results, that's why it's not showing up in the analysis yet. But it might in the future, once it gains traction.
I'll keep an eye on it, thanks for bringing it up!
Edit: Done
In the meantime, you can hover or tap the columns to see the full model names.
It's way too important a piece of information not to have it visible.
I thought I'd keep these as a rating for model families rather than specific models. But tbh it's probably better to remove them, too confusing.
And it's probably a good idea to create a list of model release dates, so older comments can't accidentally map to models that weren't released yet.
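A minimal sketch of that release-date guard; the model IDs and dates below are placeholders, not real release dates.

    # Sketch: drop model mentions that predate the model's release.
    # The release dates here are placeholders, not actual dates.
    from datetime import date

    RELEASE_DATES = {
        "opus-4.7": date(2026, 1, 1),   # placeholder
        "gpt-5.5":  date(2026, 2, 1),   # placeholder
    }

    def plausible_mention(model_id, comment_date):
        """A comment can only refer to a model released on or before its date."""
        released = RELEASE_DATES.get(model_id)
        return released is not None and comment_date >= released

    assert not plausible_mention("gpt-5.5", date(2025, 12, 31))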
The context would be really nice to have, but reading the comments myself, it often just isn't very clear what exactly users are building or which programming language they are using.
I think analyzing more comments is promising. If you get enough data, you can generalize across use cases and get more meaningful ratings. The obvious lever is including more posts, although it might hit diminishing returns. I'll play around with it.
For the context, I want to try giving Gemini a "scratch pad", where it can note down strengths and weaknesses per model that it finds in the comments. Something like "some users say that model x is good for writing tests". Then on each run, I let it update the scratch pad and publish the results as more of a qualitative analysis.
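A very rough sketch of that scratch-pad loop; call_model is a stand-in for whatever Gemini call the pipeline actually uses, and the prompt wording is illustrative.

    # Sketch of the scratch-pad idea: feed the current notes plus a new batch
    # of comments to the model and ask it to return the updated notes.
    def call_model(prompt: str) -> str:
        # Placeholder: swap in the real Gemini API call used by the pipeline.
        raise NotImplementedError

    def update_scratch_pad(scratch_pad: str, new_comments: list[str]) -> str:
        prompt = (
            "Here are qualitative notes on model strengths/weaknesses so far:\n"
            f"{scratch_pad}\n\n"
            "Here are new comments:\n" + "\n---\n".join(new_comments) + "\n\n"
            "Return the notes, updated with anything new, e.g. "
            "'some users say that model X is good for writing tests'."
        )
        return call_model(prompt)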
For the wording, I'd like to keep a certain amount of click bait, sorry ;)
I saw you're using Gemini for the sentiment rating (which I guess you picked because it's not often mentioned and thus "neutral"? lol)
But would be interesting to get more details overall
Now it seems like it's come full circle from the other direction, too. We always had fandom elements in computing nerd culture. Editor wars. Language wars. Framework wars. Now that software tooling has become nearly human-like - mercurial, unpredictable, inconsistent in performance and experience from week to week - software developers have turned into sports scouts and ESPN talking heads, going so far as to keep continually updated live power rankings, the way commentators try to predict mid-season which team looks most likely to win the championship that year. You're in the position talent evaluators were in roughly the late 90s, relying mostly on the eye test and rough proxy measures of raw potential. Simon Willison applies the pelican test the way draft combines put athletes through shuttle drills and test vertical leap to try and predict how well they'll do in real gameplay.
It leaves me wondering when we'll have the Bill James-style analytics breakthrough in software talent evaluation, or if such a thing is even possible. At least with athletes, practice can make them better and injury and age can make them worse, but you can't just silently swap out an entirely different mind and body under the same name and face. You guys are trying to assess the performance of constantly moving targets that can and do change capabilities and characteristics on a daily basis.
I've been experimenting with the 26B-A4B model with some surprisingly good results (both in inference speed and code quality — 15 tok/s, flying along!), vs my last few experiments with Devstral 24B. Not sure whether I can fit that 35B Qwen model everybody's so keen on, on my 32GB unified RAM.
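A back-of-envelope check on whether a 35B model fits in 32GB, assuming roughly 4-bit quantization; the numbers are rough estimates, not measurements.

    # Back-of-envelope: can a 35B dense model fit in 32 GB of unified RAM?
    # Assumes ~4-bit quantization; overhead for KV cache / OS is a guess.
    params = 35e9
    bytes_per_param_q4 = 0.55   # 4-bit weights plus quantization scales, roughly
    weights_gb = params * bytes_per_param_q4 / 1e9
    overhead_gb = 4             # KV cache, context, OS headroom (guess)
    print(f"~{weights_gb:.0f} GB weights + ~{overhead_gb} GB overhead")
    # ~19 GB + ~4 GB, so it plausibly fits, with less headroom for long contexts.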
However, I think I may be in the minority of HN commenters exploring models for local inference.
The technical abilities and usage are derived from the commenters' usage reflections.
kimi...?
https://github.com/raine/claude-code-proxy
https://api-docs.deepseek.com/quick_start/agent_integrations...