I second this; even when switching between minor versions of a model, you need to adjust prompts: the new model is better at inferring a bunch of things on its own, so if those things are still spelled out in the prompt, it will overdo them.

Assessing quality of output is often not trivial, either. Typically, problems that are solved by offloading something to an LLM are super subjective, and you are vulnerable the moment customers “feel” that something is different.

We try to quantify output differences with many different similarity metrics. But a lot of energy still goes into subjectively evaluating whether something still works.
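To make "similarity metrics" a bit more concrete, here is a minimal sketch of the kind of comparison I mean, using only the Python standard library. The specific metrics, the function names, and the 0.6 threshold are all my own assumptions for illustration, not what we actually run; real setups often add embedding-based or task-specific scores on top.

```python
from difflib import SequenceMatcher


def token_jaccard(a: str, b: str) -> float:
    """Share of distinct lowercase word tokens the two outputs have in common."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def char_ratio(a: str, b: str) -> float:
    """difflib similarity ratio in [0, 1]; sensitive to wording and ordering."""
    return SequenceMatcher(None, a, b).ratio()


def length_ratio(a: str, b: str) -> float:
    """Shorter length over longer length; flags outputs that grew or shrank a lot."""
    if not a and not b:
        return 1.0
    return min(len(a), len(b)) / max(len(a), len(b))


def compare_outputs(old: str, new: str) -> dict[str, float]:
    """Score the new model's output against the old model's output."""
    return {
        "token_jaccard": token_jaccard(old, new),
        "char_ratio": char_ratio(old, new),
        "length_ratio": length_ratio(old, new),
    }


def needs_review(scores: dict[str, float], threshold: float = 0.6) -> bool:
    """Hypothetical rule: send the pair to a human if any metric drops below threshold."""
    return any(value < threshold for value in scores.values())
```

Scores like these only tell you that the output changed, not whether it got worse, which is why the subjective pass is still needed for anything flagged.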

reply
We’re talking about SOTA models like Fable, though.

If you’ve got a product where the budget allows for Fable-level token costs, I doubt you’d lack the budget to rerun your evals on a cheaper model if Fable were unavailable. It wouldn’t even take that much token volume for the engineering work of switching to a cheaper model to pay for itself.

Fable is primarily used for human-in-the-loop tasks like coding or office work, not in some backend app, unless the company has money to burn and doesn’t care about anything other than using the best model available at the time.

reply
Maybe OP meant switching models in a coding harness, not in an application that uses AI? I had issues similar to yours in the latter case, but in the former it's trivial.
reply
If you’re building on LLMs, you need an eval and prompt iteration pipeline, and you ought to be running evals on every model release. Your competitors will do this, and your users will want the latest and greatest (for frontier tasks) as well as the cheapest and fastest, so you should already be paying this cost anyway. I guess it depends on your team size and scale, but not building this muscle seems like not having continuous delivery for regular code, or even like not having tests and CI before merging to main.
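For what it's worth, the core of such a pipeline can start very small. Below is a rough sketch, assuming each model is wrapped as a plain callable from prompt to text. `Model`, `EvalCase`, `pass_rate`, and `gate` are hypothetical names I made up for this example, not any particular framework's API.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical wrapper type: takes a prompt, returns the model's text output.
Model = Callable[[str], str]


@dataclass
class EvalCase:
    prompt: str
    check: Callable[[str], bool]  # returns True when the output is acceptable


def pass_rate(model: Model, cases: list[EvalCase]) -> float:
    """Fraction of eval cases whose check accepts the model's output."""
    if not cases:
        raise ValueError("need at least one eval case")
    passed = sum(1 for case in cases if case.check(model(case.prompt)))
    return passed / len(cases)


def gate(candidate: Model, baseline: Model, cases: list[EvalCase],
         tolerance: float = 0.0) -> bool:
    """CI-style gate: the candidate may not fall more than `tolerance` below the baseline."""
    return pass_rate(candidate, cases) >= pass_rate(baseline, cases) - tolerance
```

Running `gate` on every new model release (and on every prompt change) is the LLM equivalent of requiring green tests before a merge.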
reply
SOTA models are typically used for interactive coding and other human-in-the-loop work.

> say GPT-4o to GPT-5.2, a transition I just finished on a not too complicated application

Neither of which is close to SOTA, because applications like these are typically built in a cost-conscious manner that tries to keep token costs in check.

I’m primarily responding to all of the commenters who are acting like nobody is going to use American SOTA models for anything because the government interfered with them for a couple weeks. It’s obviously not true, and I expect these models to be oversubscribed instead of avoided like some are claiming.

reply
Vendor diversity is a longstanding risk management principle. For it to work you need to invest in it as you build, not when the rug is pulled.
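As a rough illustration of investing in it as you build: keep every vendor behind the same thin interface from day one, so that a fallback is a config change rather than a rewrite. This is only a sketch; `Provider`, `AllProvidersFailed`, and `complete_with_fallback` are hypothetical names, and each provider here stands in for whatever wrapper you write around a real vendor SDK.

```python
from typing import Callable, Sequence

# Hypothetical wrapper type around one vendor's API: prompt in, text out.
Provider = Callable[[str], str]


class AllProvidersFailed(RuntimeError):
    """Raised when every configured provider raised an error."""


def complete_with_fallback(
    prompt: str,
    providers: Sequence[tuple[str, Provider]],
) -> tuple[str, str]:
    """Try providers in priority order; return (provider_name, output) from the first that works."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # broad on purpose: any vendor failure triggers fallback
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))
```

The fallback only helps if the backup provider has already been through your evals with its own tuned prompts, which is the part that has to happen before the rug is pulled.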
reply