We now use Sonnet 4.6 for a number of internal use cases we wouldn't have considered otherwise.
4.7 was so bad, I locked a bunch of my machines to 4.6.
I haven’t bothered locking the 4.8 machines to 4.6. There was an HN thread a while back where they run SWE-bench a few times a day and measure success rate and latency. It showed Opus getting significantly dumber for the week before a recent launch.
It wouldn’t surprise me if they’re quantizing to improve margins or to hype models in comparative testing in order to defraud investors at IPO.
Or maybe QA is just hard. Anyway, I think they hit a performance wall sometime at or before 4.6.