https://gist.github.com/drorm/7851e6ee84a263c8bad743b037fb7a...
I typically use GitHub issues as the unit of work, so that's part of my instruction.
I use 5.2 Codex for the entire task, then ask Opus 4.5 at the end to double-check the work. It's nice to have another frontier model's opinion and ask it to spot any potential issues.
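Roughly, the loop looks like this. A minimal sketch, assuming you call both models through their Python SDKs; the model names, the gh invocation, and the prompts are placeholders, not my exact setup:

```python
# Rough sketch of the issue -> implement -> cross-check loop, assuming API access.
import subprocess
from openai import OpenAI
from anthropic import Anthropic


def fetch_issue(number: int) -> str:
    # The GitHub issue is the unit of work; pull its title and body with the gh CLI.
    result = subprocess.run(
        ["gh", "issue", "view", str(number), "--json", "title,body"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def implement(issue_text: str) -> str:
    # First pass: the primary model does the entire task.
    resp = OpenAI().chat.completions.create(
        model="gpt-5.2-codex",  # placeholder model name
        messages=[{"role": "user", "content": f"Implement this issue:\n\n{issue_text}"}],
    )
    return resp.choices[0].message.content


def cross_check(issue_text: str, proposed_change: str) -> str:
    # Second pass: a different frontier model reviews the work for potential issues.
    resp = Anthropic().messages.create(
        model="claude-opus-4-5",  # placeholder model name
        max_tokens=2000,
        messages=[{
            "role": "user",
            "content": (
                f"Issue:\n{issue_text}\n\nProposed change:\n{proposed_change}\n\n"
                "Double-check this work and point out any potential issues."
            ),
        }],
    )
    return resp.content[0].text
```

In practice I drive this interactively rather than as a script, but the shape of the loop is the same.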
Looking forward to trying 5.3.
Every new model overfits to the latest overhyped benchmark.
Someone should take this to a logical extreme and train a tiny model that scores better on a specific benchmark.
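Taken all the way, that "tiny model" bottoms out as a lookup table over the test set. A toy sketch with made-up questions:

```python
# Toy sketch of benchmark overfitting taken to its logical extreme:
# the "model" is just a lookup table over the benchmark's (made-up) test questions.
BENCHMARK_ANSWERS = {
    "What is the capital of France?": "Paris",
    "2 + 2 = ?": "4",
}


def tiny_model(question: str) -> str:
    # Perfect score on this benchmark, zero generalization anywhere else.
    return BENCHMARK_ANSWERS.get(question, "I don't know")
```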
But even an imperfect yardstick is better than no yardstick at all. You’ve just got to remember to maintain a healthy level of skepticism.
When such benchmarks aren’t available, what you often get instead is teams creating their own benchmark datasets and then testing both their own and existing models’ performance against them. Which is even worse, because they probably still run the test multiple times (there’s simply no way to hold anyone accountable on this front), and on top of that they often hyperparameter-tune their own model for the dataset but reuse previously published hyperparameters for the other models. Which gives them an unfair advantage, because those hyperparameters were tuned on a different dataset and may not even have been optimizing for the same task.
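To make that asymmetry concrete, here's a rough sketch (the library, models, and search grid are arbitrary placeholders, not anyone's actual setup): the new model gets a hyperparameter search on the new dataset, while the baseline runs with settings copied from someone else's paper.

```python
# Sketch of the biased protocol described above (models and grids are placeholders).
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV


def biased_comparison(X_train, y_train, X_test, y_test):
    # "Our model": hyperparameters tuned directly against the new benchmark dataset.
    ours = GridSearchCV(
        RandomForestClassifier(),
        {"n_estimators": [50, 100, 200], "max_depth": [None, 10, 20]},
        cv=5,
    ).fit(X_train, y_train)

    # "Baseline": hyperparameters copied from a paper that tuned them on a
    # different dataset, possibly for a different task.
    baseline = LogisticRegression(C=0.01, max_iter=100).fit(X_train, y_train)

    # Unsurprisingly, "our model" tends to win. A fair comparison would give
    # the baseline the same search budget on the same data.
    return ours.score(X_test, y_test), baseline.score(X_test, y_test)
```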
It's not just over-fitting to leading benchmarks; there are also too many degrees of freedom in how a model is tested (harness, etc.). Until there's standardized documentation enabling independent replication, it's all just benchmarketing.
AI agents, perhaps? :-D
You can take off your tinfoil hat. The same model can perform differently depending on the programming language, the frameworks and libraries employed, and even the project. Also, context matters, and a model's output varies greatly depending on your prompt history.