upvote
I wish I had hard evidence but it is mostly an observation. I do use Codex a lot and I felt a drastic change from like one-two months ago to this day.

It used to give me precise answers, "surgical" is how I described it to my friends. Now it generates a lot of slop and plenty of "follow ups". It doesn't give me wrong answers, which is ok, but I've found that things that used to take 3-4 prompts now take 8-10. Obviously my prompting skills haven't changed much and, if anything, they've become better.

This is something that other colleagues have observed as well. Even the same GPT5.4 model feels different and more chatty recently. Btw, I think their version numbers mean nothing, no one can be certain about the model that is actually running on the backend and it is pretty evident that they're continuously "improving" it.

reply
Back in business school they used to tell the story of how makers of razor blades would put a good blade as the first and the last blade in the pack. I suspect the LLM services of doing something like that.
reply
I haven't had the time to fully hash this take out, but a big question in the back of my mind has been - is it possible that AI model improvements come partly from finding overhang in things that look hard and impressive to humans but are actually trivial consequences of the training data? If true, then the observable performance of any widely distributed model could get worse over time as it "mines out" the work that's easy for it to do.
reply