Step 1: don't trust benchmarks you don't understand - they might measure irrelevant things
Step 2: test it on things you know Opus failed at

My day-to-day take, for the coding I do (not security related): incremental, modest improvement, if any. Not worth the 2x cost. I've calmly continued to use Opus, happy that it seems like it got an allowance upgrade.

It's a bit odd that you automatically assumed I don't understand the benchmarks.

For most single issues/bugs/tickets, the quality difference wasn't noticeable. But that's like using a sledgehammer to kill a fly. I was using Fable for much more ambitious and complex tasks that require orchestration, and it was crushing it. I described it here: https://news.ycombinator.com/item?id=48505782

So yes, the benchmarks are indeed accurate: where Opus 4.8 would start strong and eventually struggle or run into obstacles, Fable would relentlessly keep working, keep accurate track of all work threads (e.g. multiple inter-dependent issues being worked on in parallel by subagents) and would go above and beyond.
