> I think Evans is completely wrong.

I wish I could find a case where Evans is wrong. As far as I can remember, I haven't come across a single one.

I disagree that Amazon, Meta, Microsoft, and Google are "well" behind. If anything, the frontier-model lead seems to be 6-9 months at best, and the Chinese models are all doing well.

To borrow one of Steve Jobs's lines: "It is a feature, not a product." Even if Apple were a generation, or a year, behind the frontier models, the advantage of being the default is enough to hold on to a lot of its users.

To put it simply, even if OpenAI or Anthropic were better, there is zero chance they would topple Apple in hardware sales, users, or ecosystem. On the other hand, even if Apple's AI were 6-9 months or a generation behind, most users would settle for it, and that would hurt OpenAI and Anthropic.

I use both Claude and Codex and don’t see any meaningful difference between the two. My use case is modeling semi-complex physical processes (energy and manufacturing) in code for simulations. I also do a fair amount of automation via Python or PowerShell scripts for manipulating data, as well as legacy code analysis (C, Fortran, COBOL). As long as I provide the models with the information and documentation they need, both perform very similarly. I recently did a full codebase review (for design patterns and vulnerabilities), and both Codex and Fable agreed 100% on the most critical findings. I do very little front-end development, although some of my automation scripts have TUIs, and again I’ve had no problem with either Claude or Codex generating them for me. At this point I go with the less expensive one, which seems to be Codex. With the $100 plan I rarely hit the limits; with Claude I max out my plan in about 4-6 hours of work.

Did you find much of a difference between Fable and Opus?

> I think Evans is completely wrong. There are only 2 truly frontier models. (at least for now). And Anthropic seems to be leaving OpenAI behind so there might be only 1 in the near future. (which is scary/dangerous)

Truly fascinating ecosystem and community in general, as experiences differ so wildly. Anthropic's models seem far behind OpenAI's to me, especially when you get into "Pro" territory, and there doesn't seem to be any worthy competitor to Pro Mode available at all.

And this is coming from someone who uses both platforms and spends a lot of the day interacting with agents and LLMs in various ways. The interesting part is that you probably do too, and what you share probably lines up with what you experience! Yet we come away with basically opposite takeaways :) Somehow, I don't think either of us is wrong, either.

I agree with what you're saying. I have a Claude plan for work, and I prefer using Claude over any other LLM I've tried. Having recently tried the Codex 100€ plan with GPT-5.5 on high/xhigh, I don't think it's worse than the Opus models, just different.

I've noticed that depending on how you talk to it, you get wildly different outputs. This seems to happen less with Opus: it mostly understands what I want. GPT is often a bit too literal.

Just my two cents.

> I've noticed that depending on how you talk to it, you get wildly different outputs. This seems to happen less with Opus: it mostly understands what I want. GPT is often a bit too literal.

Yeah, exact prompting matters a lot, seemingly more than people think. There are definitely tradeoffs in how literally a model takes the prompt. On one hand, it's useful for the model to ignore its own instincts when you know better, so it doesn't go off on a wild-goose chase. On the other hand, it's sometimes useful when it self-directs, for example when you misworded something and the context makes it obvious you meant something different. They're basically good at different things.

I really agree that not every model is equal, and they aren't as interchangeable as people seem to think unless you adjust how you prompt them.

People use a model as their daily driver, get very familiar with it and its behavior, and then go and use another model and have a hard time. It's very difficult to separate "the model is bad" from "the model works differently".

> It's very difficult to separate "the model is bad" from "the model works differently"

At which point it’s fair to reject the commoditization label.

Also missing from these discussions are models like Qwen, which is at least as good as OpenAI’s or Anthropic’s models from one generation behind the frontier.

When you say "Pro" territory, do you include Fable?

You mean the model that was available for a whole three days? No, I played around with it a tiny bit, but not more than that. I guess time will tell if it gets close.

Maybe I’m alone in thinking this, but I think the long-term victor will be the one that works out pricing best.

Fable might well be a better model, but it’s too expensive for everyday AI use, especially for the kind of stuff you’re going to want to do on your phone. Even for coding, I’m not going to reach for Fable (well, when I can…) for 95% of the work I do.

I don’t believe a mature AI industry is going to have a one size fits all, single winner.

Yes, and pricing is one of the hallmarks of a commodity: because users can jump back and forth between services, it becomes a race to the bottom on price. I also agree that you don’t need the best model all the time. You could have the most powerful model draft the design, requirements, guidelines, policies, or whatnot, then have the lower-tier models execute it. Then you can have the most powerful model do the testing and review and give feedback; rinse and repeat. Just like in the real world, you don’t need an entire staff of lead engineers.
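The plan/execute/review split described above can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not any vendor's API: `tiered_loop`, `Review`, and the stub models are hypothetical, and in real use each `Model` would wrap an API client for a frontier or lower-tier model.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stand-in: a "model" is any function from prompt text to output text.
# In real use this would wrap an API client for a frontier or lower-tier model.
Model = Callable[[str], str]


@dataclass
class Review:
    approved: bool
    feedback: str


def tiered_loop(
    task: str,
    planner: Model,                          # expensive model: writes the design
    worker: Model,                           # cheaper model: does the execution
    reviewer: Callable[[str, str], Review],  # expensive model: checks the result
    max_rounds: int = 3,
) -> tuple[str, int]:
    """Plan once, then alternate cheap execution and expensive review.

    Returns the last result and the number of rounds used.
    """
    plan = planner(f"Write the design, requirements and guidelines for: {task}")
    feedback = ""
    result = ""
    for round_no in range(1, max_rounds + 1):
        prompt = f"Follow this plan:\n{plan}\n"
        if feedback:
            # Feed the reviewer's notes back to the worker ("rinse and repeat").
            prompt += f"Address this review feedback:\n{feedback}\n"
        result = worker(prompt)
        review = reviewer(plan, result)
        if review.approved:
            return result, round_no
        feedback = review.feedback
    # Out of rounds: return the last attempt even though it wasn't approved.
    return result, max_rounds


if __name__ == "__main__":
    calls = {"worker": 0}

    def fake_planner(prompt: str) -> str:
        return "1. parse the CSV  2. write tests"

    def fake_worker(prompt: str) -> str:
        calls["worker"] += 1
        return f"draft v{calls['worker']}"

    def fake_reviewer(plan: str, result: str) -> Review:
        ok = result.endswith("v2")
        return Review(approved=ok, feedback="" if ok else "tests are missing")

    print(tiered_loop("CSV cleaner", fake_planner, fake_worker, fake_reviewer))
    # prints ('draft v2', 2)
```

The cost argument is visible in the call pattern: the expensive model is called once to plan and once per round to review, while the bulk of the generation goes to the cheaper worker.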