The idea that an AI lab would pay a small army of human artists to create training data for $animal on $transport just to cheat on my stupid benchmark delights me.

by suddenlybananas4 hours ago|

When you're spending trillions on capex, paying a couple of people to make some doodles in SVGs would not be a big expense.

by simonw3 hours ago|

The embarrassment of getting caught doing that would be expensive.

by throwup2385 hours ago|

For every combination of animal and vehicle? Very unlikely.

The beauty of this benchmark is that it takes all of two seconds to come up with your own unique one. A seahorse on a unicycle. A platypus flying a glider. A man’o’war piloting a Portuguese man of war. Whatever you want.

by recursive5 hours ago|

No, not every combination. The question is about the specific combination of a pelican on a bicycle. It might be easy to come up with another test, but we're looking at the results from a particular one here.

by svara5 hours ago|

More likely you would just train for emitting svg for some description of a scene and create training data from raster images.

by recursive30 minutes ago|

None of this works if the testers are collaborating with the trainers. The tests ostensibly need to be arms-length from the training. If the trainers ever start over-fitting to the test, the tester would come up with some new test secretly.

by zarzavat5 hours ago|

You can always ask for a tyrannosaurus driving a tank.

by verdverm6 hours ago|

I've heard it posited that the reason the frontier companies are frontier is because they have custom data and evals. This is what I would do too.
