(10 points on the benchmark, or a relative increase of over 20%)
https://news.ycombinator.com/item?id=44630724
TFA, on the other hand, tests two things at once: mixing models, and "fusing a model with itself", the latter being just test-time compute. E.g. Opus was able to match Fable on TFA, at the cost of twice as much money (and presumably time).
These two dimensions are orthogonal but can be combined for further gains.
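To make the two dimensions concrete, here is a rough sketch. It is not TFA's actual method: the `fuse` function, the toy models, and the majority vote as the fusion step are all my own assumptions for illustration. One axis is how many models you mix, the other is how many samples you draw per model (the test-time compute axis).

```python
from collections import Counter
from typing import Callable, Sequence

# Assumption: a "model" is just a function from (prompt, sample index) to an answer.
Model = Callable[[str, int], str]


def fuse(prompt: str, models: Sequence[Model], samples_per_model: int) -> str:
    """Draw samples_per_model answers from each model, then majority-vote."""
    candidates = [
        model(prompt, i)
        for model in models
        for i in range(samples_per_model)
    ]
    # Majority vote is a stand-in for whatever fusion step is really used.
    return Counter(candidates).most_common(1)[0][0]


# Hypothetical toy models: deterministic stand-ins, not real API calls.
def model_a(prompt: str, i: int) -> str:
    return "42" if i % 2 == 0 else "41"


def model_b(prompt: str, i: int) -> str:
    return "42"


# Dimension 1 only: one model fused with itself (more samples, more cost).
self_fusion = fuse("q", [model_a], samples_per_model=3)
# Dimension 2 only: several models, one sample each.
mixed = fuse("q", [model_a, model_b], samples_per_model=1)
# Both dimensions combined.
both = fuse("q", [model_a, model_b], samples_per_model=3)
```

The number of model calls is `len(models) * samples_per_model`, which is where the extra cost comes from when you scale either axis.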
It's not clear that every task benefits from it, though. They only benchmarked deep research, and their results are a bit weird (e.g. they have DeepSeek outranking frontier models).
More research needed!
I would love to hear why they created it: what was the business case, and what is it going to serve? As you said, this is pretty easy to replicate.