by simonw 5 days ago

by sempron64 5 days ago

The pelican has looked very same-y across all frontier models, same color bike, same camera angle, etc. I suspect this challenge is already too embedded in the training data to be a good signal when it succeeds, and maybe even when it fails in pathological ways mirroring existing AI pelicans on the internet.

by h4ny 5 days ago

Was it ever a good test? How do you even objectively assess what a good pelican on a bike is anyway?

by fwipsy 5 days ago

SVG generation is a good test because it's extremely easy to subjectively assess with visual reasoning where humans are strong. However, pelican on a bike specifically may be overused at this point.

by Fuzzwah 5 days ago

The "big beak!" comment in the svg source makes me think it's definitely a gamed "benchmark" at this point.

by kayge 5 days ago

Do you think the models are ready for the next level? I believe that would be: Pelican feeding Spaghetti to Will Smith.

by quantumwoke 5 days ago

Variations of this comment have been posted for over a year. The pelican has now morphed into part of HN culture rather than a legitimate benchmark, but it's still valuable as a meme.

by brazukadev 5 days ago

It is more an example of gaming (the HN system) than a meme.

by stratos123 5 days ago

I'd be very surprised if this is in the training data given that most models mess it up to this day. E.g. look at the ones from Opus.

by tripleee 5 days ago

[flagged]

by yreg 5 days ago

I really don't understand what's interesting about this test and why it is always on top.

by simonw 5 days ago

It's funny.

by girvo 5 days ago

It really is lol

by mrandish 5 days ago

As often happens with random oddball things that become traditions in web communities, the replies asking what it is or complaining about it begin to gain their own humor value.

by depr 5 days ago

Same reason you would always see the same top comments on reddit during a certain era.

by yreg 5 days ago

That’s what I think too, but we should actively go against such culture here because hn is not reddit.

by gunsle 5 days ago

It basically is at this point, if you haven’t noticed. Complete with the same America bad, Elon bad, democrats good midwit progressive politics.

by clydethefrog 4 days ago

Almost all Musk-related negative news gets [flagged] and never hits the front page, so there is still a silent base on the other "team" apparently.

by anhner 5 days ago

Don't forget EU bad! Because they won't let Apple screw over consumers.

by replwoacause 5 days ago

Elon does suck. Objectively.

by ankit_mishra 5 days ago

Is this a straw man and an ad hominem?

by inglor_cz 4 days ago

It has become a funny meme, much like "My hovercraft is full of eels!"

by luqtas 5 days ago

Because you still can't ask LLMs to port DOOM to hardware X or Y.

by WithinReason 5 days ago

It's a meme, and HN loves upvoting memes. Just like Reddit!

by port11 5 days ago

The ultimate measure of an LLM is whether it can produce a capable image of a pelican riding a bicycle. All other use cases are but a distraction!

by scrollaway 5 days ago

Do you seriously have a dedicated “bad takes on AI” hn account?

by tripleee 5 days ago

yeah, although I do combine it with "replies to snarky questions" for efficiency

by jurgenaut23 5 days ago

True that

by sarreph 5 days ago

I'm beginning to wonder how much of a useful metric the pelican is because surely the frontier labs must be training their models on pelican-artistry because of how well known your test is now?

by bensyverson 5 days ago

Simon has addressed this on virtually every new model release. He also has unpublished alternate prompts. But the larger point is: this is a fun experiment, not a serious and objective benchmark.

by refulgentis 5 days ago

It's silly and a joke and a surprisingly good benchmark and don't take it seriously but don't take not taking it seriously seriously and if it's too good we use another prompt but don't actually because then it's not the pelican post and there's obvious ways to better it and it's not worth doing because it's not serious.

Only coherent move at this point: hit the minus button immediately. There's never anything about the model in the thread other than simon's post.

by stasomatic 5 days ago

But what if they are better at flamingos? Are they optimized for pelicans? How about “draw me a four headed owl”? The meme, I get it, but I’d settle for a working bash script, tbh.

by wongarsu 5 days ago

I just run my own benchmark for "draw an SVG with $animal driving $vehicle". I won't post my choice of animal and mode of transport, but there are plenty of uncommon combinations to choose from. So far it's a fun and visually intuitive benchmark that does seem to correlate with model capabilities

by modriano 5 days ago

I don't know. Just looking at the bike frames (specifically the fact that the AI generated bikes have rather unsteerable front forks), it's clear to me that frontier labs aren't spending much time tuning models to make bikes look coherent, which I assume is an easier task than making a pelican riding a bike look coherent.

by HaZeust 5 days ago

I've seen this reply to Simon's benchmark for 2 years running now, and yet you still see improvements and objectively-bad results over time from new releases, even when I'm sure every frontier AI team has/had a person at least partially dedicated to better bicycle-pelican SVG outputs. Alas.

by sarreph 5 days ago

I had intended to caveat that: I'm sure I'm not the first person to ask about this!

> you still see improvements

This is expected if they are training their models on it, right?

> objectively-bad results

Keen to learn when this has been the case, i.e. across version increments in major models.

by simonw 5 days ago

I've written about this a couple of times, most notably here: https://simonwillison.net/2025/Nov/13/training-for-pelicans-...

I've been enjoying seeing how the quality of individual models differs based on the amount of reasoning effort you give them. If they were baking in a good pelican you wouldn't expect them to differ so much.

(Google Gemini are the only lab that have very clearly paid attention to the quality of SVG animals-riding-vehicles, see their announcement for Gemini 3.1: https://twitter.com/JeffDean/status/2024525132266688757 )

by sarreph 5 days ago

Amazing, thank you Simon! Look forward to reading.

by mrandish 5 days ago

Hence it has become a meta-benchmark of relative progress in SVG image generation of a known target which has leaked into the training data and for which "every frontier AI team has/had a person at least partially dedicated to" at least checking if not optimizing.

by llm_nerd 5 days ago

I honestly assumed their comment was tongue in cheek humour, because positively no one actually cares how these models generate an SVG pelican riding a bicycle. It's some meme thing that this stuff always appears here.

by BrokenCogs 5 days ago

Yeah, this is not a real benchmark, it's just a fun tradition every time a new model is released.

by pelipost123 5 days ago

"fun" / boringly predictable meme thread with 30+ replies already

by brazukadev 5 days ago

It is telling that people need to create throwaway accounts to criticize simonw's behavior in this website.

[deleted]

by mrandish 5 days ago

It's evolved from a funny, unserious benchmark to a tradition. When a major new model is released, I now always check the HN thread for Simon's Pelican post. I'll be sad when I don't find it.

When it started, comparing the progress between models was mildly interesting but everyone (including Simon) acknowledges it certainly leaked into the training data long ago.

by notnullorvoid 5 days ago

The way I see it, the benefit of the benchmark isn't to take Simon's results at face value. It's a template for your own benchmarks that are easy to visually evaluate.

by iLoveOncall 5 days ago

It was a completely useless test even before the labs trained for it.

by mrandish 5 days ago

Yes, it's always been published as a joke. You've explained why it was (and still is) funny meta-commentary on AI benchmarks.

by ealready_value 5 days ago

This is the reply I look for in all the new model announcements. It's fun to tell people that I judge models based on pelicans.

by chorkpop 5 days ago

Now someone post the link about how it’s impossible for humans to draw a bike from memory.

by Atheros 5 days ago

https://link.springer.com/article/10.3758/BF03195929

by pixel_popping 5 days ago

This is all we need. The moment the pelican puts its leg behind the frame, we are all doomed.

by upcoming-sesame 5 days ago

I also look for this reply because I like seeing the follow-up reply saying that this is not a benchmark anymore because labs have gotten it in their training data.

That reply has never failed to come; it's basically a meme at this point.

by redox99 5 days ago

It's interesting that they still get the head tube / handle bar part wrong.

by aarjaneiro 5 days ago

Or the hands not being wings

by raffael_de 5 days ago

I find it quite interesting that while the picture looks better the more advanced the model is, apparently none so far "understands" that the pelican's legs are on both sides of the bike / top bar.

by LordDragonfang 5 days ago

If you scroll to the bottom of the Fable-5 by effort page, Max effort actually gets this correct! (Along with being the only one I've seen so far to make a bicycle frame that matches the shape of what most bikes on Google images look like)

by wasabi991011 5 days ago

And the only one linked here that includes a bicycle chain!

by ethanlipson 5 days ago

How much money do you think they spent fine-tuning on pelican SVG generation?

by tarruda 5 days ago

Not as much as Qwen, since apparently 3.6 35B surpassed Opus 4.7 https://x.com/simonw/status/2044830134885306701

by csomar 5 days ago

Probably none. They probably have much better targets to optimize for than an SVG pelican or even SVGs in general.

by Reebz 5 days ago

The Max version gets more details right. The bike frame looks good, the chain, the wings are appropriately styled instead of “arms”, and the knee is bent, etc. Obviously we’re hitting marginal returns now, but I see differences.

by csomar 5 days ago

Where is the clear improvement on Fable 5? The tail is misplaced.

by smusamashah 5 days ago

Can you please compare the code generated for similar-quality pelicans by other models? The code in your first link (Fable 5 Default) looks minimal yet very good.

by leecommamichael 5 days ago

Looks like Fable constructed the "max" "looking" pelican of the previous model for the "xhigh" output token count of the previous model.

by mer_mer 5 days ago

It's interesting that Gemini 3(.1?) Deep Think is still the best at this task and it's still not really generally available. Maybe Fable could match it at higher effort levels? https://simonwillison.net/2026/Feb/12/gemini-3-deep-think/

by XCSme 4 days ago

It also does A LOT better, for my hamster test: https://aibenchy.com/showcase/?q=claude#showcase=6efb87c28e3...

by rkuska 5 days ago

Is it possible to use the credits from subscription (https://support.claude.com/en/articles/15036540-use-the-clau...) for fable?

by 382hi 5 days ago

I'm pretty sure they're optimizing the models around these sorts of tests.

by makingstuffs 5 days ago

I could be tripping but I’m sure that is very similar to the Deepseek one from not long ago. Clearly I am too lazy to go and find it for verification.

by bergheim 5 days ago

Anyone care about these pelicans that always come up anymore?

Clearly at this point they are part of the training data.

They even all look sort of ish the same. Daytime, colors,...

by 1attice 5 days ago

Without being mean, I encourage you to go look at some of simonw's writing on this topic, which he has addressed repeatedly (and IMO satisfactorily.)

I know because I too had this initial take; however, upon analysis, it is not sound.

by bergheim 5 days ago

I know he is an AI influencer that promotes his blog any chance he gets.

I agree as well that he writes many interesting things.

[deleted]

by benatkin 5 days ago

The way they talked it up, having both legs on one side of the bike is like walking to the car wash

by jerryliu12 5 days ago

Personally feel like it could be more ambitious with what it creates.

by ceroxylon 5 days ago

Yay, max level actually put one of the legs behind the frame!

by mercacona 5 days ago

Why always sunny days?

by umeshunni 5 days ago

Pelicans hate biking in the rain (as do I).

by gavinray 5 days ago

Fable 5 xhigh actually looks the best to me.

by purple-leafy 5 days ago

Do we need a pelican every single time a model is released? Beating a very dead horse.

Fun at first, seems disingenuous now. A site funnel

by david_shi 5 days ago

that's a great looking pelican

by ge96 5 days ago

need more Alex Moulton style bikes

by lacoolj 5 days ago

dude, the max version looks like it's finally there. handle bar holding with wings, the left leg is behind the frame while the right is in front of it (correctly).

well done anthropic.

by arthurcolle 5 days ago

mediocre pelican. very disappointing

by kylehotchkiss 5 days ago

How many barrels of oil are burned per pelican at Fable levels?
