I think what they're getting at is that for a given unit of compute, this method achieves 125% performance.

If model A reaches performance level 100 with 100 units of compute using old methods, and you train model B with AttnRes aiming at that same performance level, it costs you 80 units of compute.
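That interpretation of the 125% figure reduces to a simple ratio. A minimal sketch, using only the illustrative numbers from this comment (not measured results from the paper):

```python
def efficiency_gain(baseline_compute: float, new_compute: float) -> float:
    """Performance-per-compute ratio of the new method vs. the baseline,
    assuming both runs reach the same performance level."""
    return baseline_compute / new_compute

# Model A (old method): 100 units of compute.
# Model B (AttnRes, hypothetical): 80 units, same performance.
gain = efficiency_gain(100, 80)
print(f"{gain:.0%}")  # -> 125%
```

So "125% performance per unit of compute" and "80% of the compute for equal performance" are the same claim stated two ways.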

It probably doesn't map precisely, but that's where people are diverging from the claim - it doesn't explicitly say anything about reduced inference or training time, but that's the implicit value of these sorts of results. Reaching equivalent performance with less compute can be a huge win for platforms at scale as well as for local models.

> I think what they're getting at is that for a given unit of compute, this method achieves 125% performance.

This is not what they're getting at; I explained exactly what they're getting at. I mean, your equating of "loss" (what the authors actually measured) with "performance" is just bizarre. We use benchmarks to measure performance, and the numbers there were more like 1-5% better (apart from the GPQA-Diamond outlier).

Do people even read these papers?
