It’s just hard to keep them out of the training data. We see this a bit with BrowseComp-Plus and other deep-research datasets. Not because frontier labs are trying to cheat, but simply from training on the full web.

You need new datasets perpetually.
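For what it's worth, once a benchmark is public, about the best you can do is run a contamination check over the training corpus. A rough sketch of the usual n-gram-overlap approach (the 13-gram window, threshold, and names here are illustrative, not any lab's actual pipeline):

```python
# Flag benchmark items whose n-grams overlap heavily with a training corpus.
# Window size and threshold are illustrative; real contamination checks vary.

def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_rate(benchmark_items: list[str], training_docs: list[str],
                       n: int = 13, threshold: float = 0.2) -> float:
    """Fraction of benchmark items sharing >= threshold of their n-grams
    with the training corpus."""
    train_grams: set[tuple[str, ...]] = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)

    flagged = 0
    for item in benchmark_items:
        grams = ngrams(item, n)
        if grams and len(grams & train_grams) / len(grams) >= threshold:
            flagged += 1
    return flagged / len(benchmark_items) if benchmark_items else 0.0
```

Even with checks like this, near-duplicates and paraphrases slip through, which is why you end up needing fresh datasets anyway.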

reply
That’s true. It also depends heavily on the type of task; not everything is equally represented on the web today, and it remains to be seen whether that will change.
reply
Or hidden benchmarks, though it's then harder to get people to trust the results.
reply
How do you hide them if you aren't self-hosting the model?
reply
The trust issue might be solved by creating standardisation bodies, similar to the W3C or even TPC, although TPC didn’t end that well.
reply
Database benchmarks are another example.

I do have empirical experience, though, building classifiers whose precision can't be measured because the classifier invariably performs better than humans. They become the state-of-the-art benchmark themselves and can't be benchmarked except against themselves. These are non-trivial, complex tasks, but ones involving less logic than coding and less sustained reasoning. There may come a day, though, when there is no calibrated benchmark that is independent of the models it's measuring.

reply
Would creating new benchmarks every month solve this problem?
reply
Or create "blind" benchmarks.

Say 10 groups of 3 researchers, each with their own benchmark that they don't share (running the tests without the model authors knowing is a separate problem; maybe they only run the benchmarks once the general public has access to the models).

That's 10 different tests; aggregate the pass rates, as in the rough sketch below.
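
A toy sketch of that aggregation step, assuming each group only reports its pass rate and item count (the group names and numbers are made up):

```python
# Combine per-group pass rates with a simple item-weighted mean.

def aggregate_pass_rates(reports: list[dict]) -> float:
    """Each report: {"group": str, "n_items": int, "pass_rate": float}."""
    total_items = sum(r["n_items"] for r in reports)
    return sum(r["pass_rate"] * r["n_items"] for r in reports) / total_items

reports = [
    {"group": f"group_{i}", "n_items": 200, "pass_rate": rate}
    for i, rate in enumerate([0.61, 0.55, 0.70, 0.48, 0.66,
                              0.59, 0.63, 0.52, 0.68, 0.57])
]
print(f"aggregate pass rate: {aggregate_pass_rates(reports):.3f}")
```

Since no single group's benchmark is public, contaminating all ten at once is much harder than gaming one leaderboard.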

reply