What's funny is that there is this common meme at Google: you can either use the old, unmaintained tool that's used everywhere, or the new beta tools that doesn't quite do what you want.
Not quite the same, but it did remind me of it.
Similar story with the whole networking stack. I haven’t used Unity in years now after it being my main work environment for years, but the sour taste it left in my mouth by moving everything that worked in the engine into plugins that barely worked will forever remain there.
Im sure its partly skill issue
Who knows, I might arrive before I depart.
"Ok, here is the translation:"
'we don't want to offer support'Really, the economics makes no sense, but that's what they're doing. You can't have a consistent model because it'll pin their hardware & software, and that costs money.
Sure, makes total sense guys.
I don't know, this feels unnecessarily nitpicky to me
It isn't hard to understand that 5.4 > 5.2 > 5.1. It's not hard to understand that the dash-variants have unique properties that you want to look up before selecting.
Especially for a target audience of software engineers skipping a version number is a common occurrence and never questioned.
But generally: These are not consumer facing products and I agree that someone who uses the API should be able to figure out the price point of different models.
Why are you using the same model after a month? Every month a better model comes out. They are all accessible via the same API. You can pay per-token. This is the first time in, like, all of technology history, that a useful paid service is so interoperable between providers that switching is as easy as changing a URL.
Sample size was 1000 jobs per prompt/model. We run them once per month to detect regression as well.
Sounds like someone who's responsible, on the hook, for a bunch of processes, repeatable processes (as much as LLM driven processes will be), operating at scale.
Just in the open, tools like open-webui bolts on evals so you can compare: how different models, including new ones, perform on the tasks that you in particular care about.
Indeed LLM model providers mainly don't release models that do worse on benchmarks—running evals is the same kind of testing, but outside the corporate boundary, pre-release feedback loop, and public evaluation.
https://chatgpt.com/share/69aa1972-ae84-800a-9cb1-de5d5fd7a4...
It's really nice to see Google get back to its roots by launching things only to "beta" and then leaving them there for years. Gmail was "beta" for at least five years, I think.
I guess that's true, but geared towards API users.
Personally, since "Pro Mode" became available, I've been on the plan that enables that, and it's one price point and I get access to everything, including enough usage for codex that someone who spends a lot of time programming, never manage to hit any usage limits although I've gotten close once to the new (temporary) Spark limits.
Also their pricing based on 5m/1h cache hits, cash read hits, additional charges for US inference (but only for Opus 4.6 I guess) and optional features such as more context and faster speed for some random multiplier is also complex and actually quiet similar to OpenAI's pricing scheme.
To me it looks like everybody has similar problems and solutions for the same kinds of problems and they just try their best to offer different products and services to their customers.
The version numbers are mostly irrelevant as afaik price per token doesn't change between versions.
https://en.wikipedia.org/wiki/Masterpiece
https://en.wikipedia.org/wiki/Sonnet
https://en.wikipedia.org/wiki/Haiku
They dropped the magnum from opus but you could still easily deduce the order of the models just from their names if you know the words.
The pricing is more complex but also easy, Opus > Sonnet > Haiku no matter how you tweak those variables.
naming things
cache invalidation
off by one errors
Out of tokens until end of month
I also don't believe there is any value in trying to aggregate consumers or businesses just to clean up model makers names/release schedule. Consumers just use the default, and businesses need clarity on the underlying change (e.g. why is it acting different? Oh google released 3.6)