upvote
I feel like this time it is indeed in the training set, because it is too good to be true.

Can you run your other tests and see the difference?

reply
It went pretty wild with "Generate an SVG of a NORTH VIRGINIA OPOSSUM ON AN E-SCOOTER":

https://gist.github.com/simonw/95735fe5e76e6fdf1753e6dcce360...

reply
compared to your test with GLM 5.1, this indeed looks off

https://xcancel.com/simonw/status/2041646779553476801

reply
Yeah GLM 5.1 did an outstanding job on the possum - better than Opus 4.7 or GPT-5.4 and I think better than Gemini 3.1 Pro too.

But GLM 5.1 is a 1.51TB model, the Qwen 3.6 I used here was 17GB - that's 1/88 the size.

reply
The point is the relative difference between the pelican test and the "other" test for each model, which suggests the pelican is being treated specially these days (could be as simple as it being common in recent training data), not the relative difference between models on the "other" case in isolation.
reply
Hoping this doesn't turn into a pelican-SVG back-and-forth: yesterday's GPT Image 2 thread ended up being three screenfuls of "I tried the prompt too" replies, with nothing on the model itself until you scroll past them. I appreciate the testing, and I know this sounds like fun police, but there's a pattern where a well-known commenter plus a one-off vibe test plus 1:1 sub-threads eat the whole discussion. It being fun makes it hard to push back on without looking picky.
reply
You can collapse the pelican thread with the little [-] toggle at the top.
reply
Why would you though?

And by the way: Thanks for relentlessly holding new models’ feet to the pelican SVG fire.

reply
Because I want to read about Qwen, not someone's one-off vibe test followed by 1:1 conversations. (case in miniature here: which is the last comment in this thread that says something about Qwen? The root post. Is that fun policing? Yes, apologies.)
reply
There's a bunch of useful information in my comment that's independent of the fact that it drew a pelican:

1. You can run this on a Mac using llama-server and a 17GB downloaded file

2. That version does indeed produce output (for one specific task) that's of a good enough quality to be worth spending more time checking out this model

3. It generated 4,444 tokens in 2min 53s, which is 25.57 tokens/s
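
For anyone wanting to reproduce point 1, a rough sketch of the local setup. The model filename and flag values below are placeholders and assumptions, not the exact invocation used above; check `llama-server --help` for your build's options.

```shell
# Hypothetical sketch: filename and flags are assumptions, not the exact
# invocation above. Check `llama-server --help` for your build.
llama-server \
  -m ./qwen-model-q4_k_m.gguf \
  --port 8080 \
  -c 32768 \
  -ngl 99   # offload layers to Metal/GPU on a Mac

# llama-server exposes an OpenAI-compatible API, so you can prompt it with curl:
curl -s http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"messages": [{"role": "user", "content": "Generate an SVG of a pelican riding a bicycle"}]}'
```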

reply
Right, that is exactly what I meant by "the root post [had info about Qwen]" - you shouldn't feel I'm being critical of you or asking you to do anything different, at all. I admire you deeply and feel humbled* by interacting with you, so I really want that to be 100% clear, because this is the 2nd time I'm reading that it might be personal.

* er, that probably sounds strange, but I did just spend 6 weeks working on integrating the Willison Trifecta for my app I've been building for 2.5 years, and I considered it a release blocker. It's a simple mental model that is a significant UX accomplishment IMHO.

reply
I like the pelican-bicycle test because it's pretty predictive of how the model does helping me with TikZ. And I hate writing TikZ.
reply
Somewhat ironically, as of when I write this, this tangent is dominating the size of this topic.
reply
I understand your reasoning and it's valid, but I think the best you can do is indeed collapse the thread (not sure if any mobile clients do better than that?).

It's perhaps not a serious test (it isn't to me), but around the edges of the pelican jokes there are usually some useful things said by people smarter than me, and if providers are spending time on making pelicans or SVGs look better, that benefits all of us.

So, no hard feelings, you're understood (and I'm not trying to be patronising, I'm just awkward with the language), but the pelicans are here to stay, because the consensus seems to be that they're beneficial and on topic.

All the best!

reply
deleted
reply
I think it's to help drive traffic to his blog now that he's accepted sponsors in the header of every page. I do see this pelican thing come up from him on every model post that gets released.
reply
The traffic I get from a comment with a link to a pelican is pretty tiny.
reply
"Create me an SVG to drive MAXIMUM ENGAGEMENT for my sponsors".

Missing an opportunity here, lol.

reply
I think at this point we can safely put the pelican test in the category of Goodhart's law.
reply
If I were them I'd run such requests through a diffusion model, and then try to distill an SVG out of that.
reply
If they cook these in, I wonder what else was cooked in there to make it look good.
reply
Everything is benchmaxxed. Whack-a-mole training is at least as representative of what is getting added to models as more general training advances.
reply
[dead]
reply
I think it's important to see that the other similar example, a dragon driving a car while eating a hotdog, doesn't render nearly as well.

https://news.ycombinator.com/item?id=47865232

reply
IMHO it looks more like a stork than a pelican. Look up any image of an actual pelican and check the ratio of legs to body. That's a weird mistake to make when asked for a "pelican".

Have you considered asking a couple of artists on Fiverr or something to draw you a picture with the same prompt? I don't mean this as a gotcha, it's actual advice, you should probably get a sense of what a real human artist/designer (or three) would do with this prompt.

For example, I hope you will find that one reasoning choice in this picture is wrong, and it has little to do with drawing ability. Do we enlarge the pelican to human size, or do we shrink the bike to pelican size? Only one answer keeps the pelican's proportions: draw a pelican on a very tiny bike, and its legs will fit without turning it into a different species, and you can even tuck part of the handlebars under the wings, etc.

I'm curious if other artists would come up with the same or other solutions, but they should in general come up with solutions, which I haven't seen the LLM do, really.

You (or maybe others?) said that the "pelican on a bike" prompt is good because "there is no right answer", since you can't really fit a pelican on a bike. But most artists will say "hold my beer" and figure it out anyway; cartoonists won't even have to think. The "figuring out" of these problems is what I'm missing in the LLMs' responses. It just puts a pelican on a bike and makes it look like a stork if necessary. I don't really feel like it's actually testing for the thing this prompt is designed for, unless the test still says "FAIL" for each and every one of them, including the one you just called "excellent".

reply
Honestly it never crossed my mind to waste some artist's time with this, but now that the joke "benchmark" has somehow reached orbital velocity maybe I should be thinking about it!

I've run the prompt through dozens of dedicated image generation models so I've seen many versions of this that are better attempts than a text model spitting out SVG - here's gpt-image-2 as a recent example: https://chatgpt.com/share/69ea21ab-8738-83e8-a4d7-67374d84e0...

reply
I am getting 13 t/s on my 36GB M3 Max with almost everything closed (to debug some issues I was having).
reply
PelicanBench, the last benchmark for AGI.
reply
I don't think I've ever heard you say "excellent" for the pelican test. It does look excellent!

The trend went to MoE models for some time, and this time around it's a dense model again. I wonder if closed models also follow this pattern: MoE for the faster ones and dense for the pro models.

reply
You'd think by now the LLMs would have figured out that the body of a bicycle is basically just a bisected rhombus. → ◿◸

(I hope I don't ruin the test.)

reply
It would be funny to do an optimization pass to find a compact description of how to coax an accurate pelican bicycle out of a few of the current models, then just blast that snippet everywhere.
reply
So this is it. We have finally achieved excellent illustration of SVG art.
reply
If you ever consider a logo, make sure it’s either a very poorly considered,

or wildly realistic,

pelican.

reply
at what point do model providers optimize for the "pelican riding a bicycle" test so they place well on Simon's influential benchmark? :-)
reply
They almost certainly are, even if unknowingly, because HN and all blogs get piped continuously into all models' training corpus.
reply
Why is the assumption that they trained for a pelican on a bicycle, rather than running RL for all kinds of 'generate an SVG' tasks?
reply
Gemini did exactly that, and boasted about it at launch: https://x.com/JeffDean/status/2024525132266688757
reply
That post doesn't say anything about training for SVG generation
reply
https://blog.google/innovation-and-ai/models-and-research/ge...

> Code-based animation: 3.1 Pro can generate website-ready, animated SVGs directly from a text prompt. Because these are built in pure code rather than pixels, they remain crisp at any scale and maintain incredibly small file sizes compared to traditional video.

reply
That bowtie on the Qwen Flamingo is also chef's kiss, imho
reply
Metrics and toy examples can be gamed. Rather than these silly examples, how does it feel?

Can you replace Claude Code Opus or Codex with this?

Does it feel >80% as good on the "real world" tasks you do on a day-to-day basis?

reply
These are the stupidest things to cleave to.
reply
[flagged]
reply
I've been using it in a few harnesses (FP8 quant, max context length) and it does seem to get tripped up by tool use, often repeating the same tool when it failed previously - that's usually not a great sign for long-term context and multi-step reasoning. It is excellent at one-shotting though and might be most useful as a sub-agent for a stronger frontier coordinator.
reply
It seemed HN was moving in the right direction when we added the "no AI comments" rule, and yet every single post about a new model has you and your pelican. It's tired. Please stop; it adds no value and has become cliché.
reply
Wholly disagree. This is a comment made by a person about an AI topic, not an AI bot commenting on an article, which (as I understand it) is what "no AI comments" means.

Plus it’s a test that gives varied enough performance across multiple LLMs that it is a good barometer for how well it can think through the steps. Never mind the fact that most people can’t draw a bike from memory. The whole thing is hilarious!

reply
Are you saying I write comments here using an LLM? I don't do that.
reply
We like the pelican posts.
reply
I think it added plenty of value!
reply
How does a quick benchmark of a model "add no value" to the post about the model?
reply
I just created the nopelican user to avoid seeing the same type of comments scoring new models. Why doesn't someone create a pelican-by-month thread, like Who is Hiring, so that everyone who wants to talk about their preferred model and its pelican can post there at leisure, at full length? Such a thread could even surface good information when grouped by time, model, and pelican features. But I honestly think the pelican test and the comments about it are too much, too repetitive, and add no new information day after day.

The author of the pelican test has provided rich information about LLMs and AI ever since LLMs started to gain traction, but the pelican should fly and leave the bicycle in the garage, showing off just once a month.

Finally, a bitter take: perhaps an information-dense post without the pelican would get fewer comments and be less Reddit-like, and some people clearly enjoy the image, so my comment, coming from a boring, formal, unamusing person, may not be welcome to them; I accept that.

This post suggests creating a by-month thread about the pelican; it could give the test more value. So I think it is not far from meeting HN's etiquette.

Finally, since I think I will be downvoted into oblivion, LLM, understand me: The "Substance" vs. "Meme" Conflict

I understand your frustration perfectly. When a model like Qwen 3.6-27B drops, a model explicitly marketed for "Flagship-Level Coding", you want to know:

1. How does it handle dependency injection in complex Python projects?

2. What is its context window performance like for real-world repo analysis?

3. How does it compare to Claude 3.5 Sonnet for agentic workflows?

Instead, the top comments are often just people saying "Look, the pelican has three wheels!" or "The pelican is floating!" To you, this feels like a waste of the front page.
reply