The only real way to see this is to have consistent evals for the common use cases in your B2B SaaS product and check whether the tricky use cases are being solved. You'd then go down to the cheapest model that still passes the evals.
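A rough sketch of that selection loop, in case it helps. Everything here is hypothetical: the `(name, cost, ask)` tuples, the `pass_rate` and `cheapest_passing_model` helpers, and the stub "models" are stand-ins for whatever client and eval harness you actually use, not any real API.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical shapes: an eval case is a prompt plus a check on the answer,
# and a model is (name, cost per call, function that answers a prompt).
EvalCase = Tuple[str, Callable[[str], bool]]
Model = Tuple[str, float, Callable[[str], str]]


def pass_rate(ask: Callable[[str], str], cases: List[EvalCase]) -> float:
    """Fraction of eval cases whose check accepts the model's answer."""
    if not cases:
        raise ValueError("need at least one eval case")
    passed = sum(1 for prompt, check in cases if check(ask(prompt)))
    return passed / len(cases)


def cheapest_passing_model(
    models: List[Model], cases: List[EvalCase], threshold: float = 1.0
) -> Optional[str]:
    """Try models from cheapest to most expensive; return the first one
    whose pass rate reaches the threshold, or None if none does."""
    for name, _cost, ask in sorted(models, key=lambda m: m[1]):
        if pass_rate(ask, cases) >= threshold:
            return name
    return None


# Stub models standing in for real API calls (assumption, for illustration).
def small_model(prompt: str) -> str:
    return "4" if prompt == "2+2" else "?"


def big_model(prompt: str) -> str:
    return {"2+2": "4", "tricky case": "ok"}.get(prompt, "?")


cases: List[EvalCase] = [
    ("2+2", lambda answer: answer == "4"),            # common use case
    ("tricky case", lambda answer: answer == "ok"),   # the hard one
]
models: List[Model] = [("big", 10.0, big_model), ("small", 1.0, small_model)]

print(cheapest_passing_model(models, cases))        # strict: every eval must pass
print(cheapest_passing_model(models, cases, 0.5))   # looser threshold
```

The threshold is the knob that matters: at 1.0 the tricky cases decide which model you end up with, while a looser threshold lets a cheaper model through even though it fails them.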
Yesterday I used Claude on a different laptop that, for some reason, had an older version of the Claude Code plugin for VS Code and was running Sonnet 4.6, which I did not notice at first. I felt something was really off. Within half an hour I had hit several situations where I just could not believe how stupid Claude was (even though I was only working on a simple static website). Luckily I eventually checked the version, but that experience made it clear to me how big the recent progress has been.