Thank you for continuing to maintain the only benchmarking system that matters!

Context for the unaware: https://simonwillison.net/tags/pelican-riding-a-bicycle/

reply
They will start to max out this benchmark as well at some point.
reply
It's not a benchmark though, right? Because there's no control group or reference.

It's just an experiment on how different models interpret a vague prompt. "Generate an SVG of a pelican riding a bicycle" is loaded with ambiguity. It's practically designed to generate 'interesting' results because the prompt is not specific.

It also happens to be an example of the least practical way to engage with an LLM. It's no more capable of reading your mind than anyone or anything else.

I argue that, in the service of AI, there is a lot of flexibility being created around the scientific method.

reply
For 2026 SOTA models I think that is fair.

For the last generation of models, and for today's flash/mini models, I think there is still a not-unreasonable binary question ("is this a pelican on a bicycle?") that you can answer by just looking at the result: https://simonwillison.net/2024/Oct/25/pelicans-on-a-bicycle/

reply
So if it can generate exactly what you had in mind, presumably based on the most subtle of cues like your personal quirks from a few sentences, that could be _terrifying_, right?
reply
It's interesting how some features, such as green grass, a blue sky, clouds, and the sun, are ubiquitous among all of these models' responses.
reply
It is odd, yeah.

I'm guessing both humans and LLMs would tend to get the "vibe" from the pelican task, that they're essentially being asked to create something like a child's crayon drawing. And that "vibe" then brings with it associations with all the types of things children might normally include in a drawing.

reply
If you were a pelican, wouldn't you want to go cycling on a sunny day?

Do electric pelicans dream of touching electric grass?

reply
This is actually a good benchmark; I used to roll my eyes at it. Then I decided to apply the same idea and asked the models to generate an SVG image of "something" (not going to put it out there). There was a strong correlation between how good the models are and the images they generated. These were also non-vision models, so I don't know if you're serious, but this is a decent benchmark.
reply
Now this is the test that matters, cheers Simon.
reply
How many pelican riding bicycle SVGs were there before this test existed? What if the training data is being polluted with all these wonky results...
reply
I'd argue that a model's ability to ignore/manage/sift through the noise added to the training set by other LLMs increases in importance and value as time goes on.
reply
You're correct. It's not as useful as it (ever?) was as a measure of performance...but it's fun and brings me joy.
reply
This Pelican benchmark has become irrelevant. SVG is already ubiquitous.

We need a new, authentic scenario.

reply
Like identifying names of skateboard tricks from the description? https://skatebench.t3.gg/
reply
I don’t care how practical it may or may not be, this is my new favorite LLM benchmark
reply
I couldn't find an about page or similar?
reply
Here's the public sample https://github.com/T3-Content/skatebench/blob/main/bench/tes...

I don't think there's a good description anywhere. https://youtube.com/@t3dotgg talks about it from time to time.

reply
o3-pro is better than 5.2 pro! And GPT 5 high is best. Really quite interesting.
reply

  1. Take the top ten searches on Google Trends
     (on the day of a new model release).
  2. Concatenate them.
  3. SHA-1 hash the result.
  4. Use the hash as a seed to perform a random noun-verb
     lookup in an agreed-upon large dictionary.
  5. Construct a sentence using an agreed-upon stable
     algorithm that generates reasonably coherent prompts
     from an immensely deep probability space.
That's the prompt. Every existing model is given that prompt and compared side-by-side.

You can generate a few such sentences for more samples.

Alternatively, take the top ten F500 stock performers: an easy signal that provides enough randomness, is easy to agree upon, and doesn't leave enough time to game.

Teams can also pre-generate candidate problems for it to attempt improvement across the board, but they won't have the exact questions on test day. A sketch of the scheme follows below.
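
A minimal sketch of steps 1-5 in Python, assuming the day's trend queries and the shared word list are supplied out of band (both are hypothetical placeholders here), with step 5's "stable algorithm" reduced to a fixed template:

  import hashlib
  import random

  # Hypothetical inputs: the day's top Google Trends queries and an
  # agreed-upon word list (in practice, a large shared dictionary).
  trends = ["query one", "query two"]  # ...through the top ten
  nouns = ["pelican", "lighthouse", "accordion"]
  verbs = ["juggling", "repairing", "painting"]

  # Steps 2-3: concatenate the queries and SHA-1 hash the result.
  digest = hashlib.sha1("".join(trends).encode("utf-8")).hexdigest()

  # Step 4: everyone derives the same seed from the public hash,
  # so the word lookups are deterministic and reproducible.
  rng = random.Random(int(digest, 16))

  # Step 5, hand-waved: a fixed sentence template stands in for a
  # real coherent-prompt generator.
  prompt = f"Generate an SVG of a {rng.choice(nouns)} {rng.choice(verbs)} a {rng.choice(nouns)}"
  print(prompt)

Because the seed is public and fixed only on release day, anyone can reproduce the prompt, but nobody can train against it in advance.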

reply
The idea at the time was that it was obviously not part of the training set; now that it's a metric, it's worthless. Try an elephant smoking a cigar on the beach.
reply
The bird not having wings, but all of us calling it a "solid bird", is one of the most telling examples of the AI expectations gap yet. We even see its own reasoning say it needs "webbed feet", which are nowhere to be found in the image.

This pattern of considering 90% accuracy (like the level we've seemingly stalled out on for MMLU and AIME) to be "solved" is really concerning to me.

AGI has to be 100% right 100% of the time to be AGI and we aren't being tough enough on these systems in our evaluations. We're moving on to new and impressive tasks toward some imagined AGI goal without even trying to find out if we can make true Artificial Niche Intelligence.

reply
MMLU performance caps out around 90% because there are tons of errors in the actual test set. There's a pretty solid post on it here: https://www.reddit.com/r/LocalLLaMA/comments/163x2wc/philip_...

As far as I can tell for AIME, pretty much every frontier model gets 100% https://llm-stats.com/benchmarks/aime-2025

reply
It has a wing. Look at the code comments in the SVG!
reply