That’s a big if. In my experience, models that perform very well on benchmarks do not necessarily perform well in real life.
I’ve mostly started ignoring the benchmarks and running my own evals.