upvote
I feel like this time it is indeed in the training set, because it is too good to be true.

Can you run your other tests and see the difference?

reply
It went pretty wild with "Generate an SVG of a NORTH VIRGINIA OPOSSUM ON AN E-SCOOTER":

https://gist.github.com/simonw/95735fe5e76e6fdf1753e6dcce360...

reply
compared to your test with GLM 5.1, this indeed looks off

https://xcancel.com/simonw/status/2041646779553476801

reply
Yeah GLM 5.1 did an outstanding job on the possum - better than Opus 4.7 or GPT-5.4 and I think better than Gemini 3.1 Pro too.

But GLM 5.1 is a 1.51TB model, the Qwen 3.6 I used here was 17GB - that's 1/88 the size.

reply
The point is the relative difference between the pelican test and the "other" test for each model, which suggests the pelican is being treated specially these days (could be as simple as it being common in recent training data), not the relative difference between models on the "other" case in isolation.
reply
Hoping this doesn't turn into a pelican-SVG back-and-forth: yesterday's GPT Image 2 thread ended up being three screenfuls of "I tried the prompt too" replies, with nothing on the model itself until you scroll past them. I appreciate the testing, and I know this sounds like fun police, but there's a pattern where a well-known commenter plus a one-off vibe test plus 1:1 sub-threads eat the whole discussion. It being fun makes it hard to push back on without looking picky.
reply
You can collapse the pelican thread with the little [-] toggle at the top.
reply
Why would you though?

And by the way: Thanks for relentlessly holding new models’ feet to the pelican SVG fire.

reply
Because I want to read about Qwen, not someone's one-off vibe test followed by 1:1 conversations. (case in miniature here: which is the last comment in this thread that says something about Qwen? The root post. Is that fun policing? Yes, apologies.)
reply
There's a bunch of useful information in my comment that's independent of the fact that it drew a pelican:

1. You can run this on a Mac using llama-server and a 17GB downloaded file

2. That version does indeed produce output (for one specific task) that's of a good enough quality to be worth spending more time checking out this model

3. It generated 4,444 tokens in 2min 53s, which is 25.57 tokens/s
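
For anyone wanting to reproduce point 1, a rough sketch of the local setup. The model filename and flag values below are placeholders and assumptions, not the exact invocation used above; check `llama-server --help` for your build's options.

```shell
# Hypothetical sketch: filename and flags are assumptions, not the exact
# invocation above. Check `llama-server --help` for your build.
llama-server \
  -m ./qwen-model-q4_k_m.gguf \
  --port 8080 \
  -c 32768 \
  -ngl 99   # offload layers to Metal/GPU on a Mac

# llama-server exposes an OpenAI-compatible API, so you can prompt it with curl:
curl -s http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"messages": [{"role": "user", "content": "Generate an SVG of a pelican riding a bicycle"}]}'
```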

reply
Right, that is exactly what I meant by "the root post [had info about Qwen]" - you shouldn't feel I'm being critical of you or asking you to do anything different, at all. I admire you deeply and feel humbled* by interacting with you, so I really want that to be 100% clear, because this is the 2nd time I'm reading that it might be personal.

* er, that probably sounds strange, but I did just spend 6 weeks working on integrating the Willison Trifecta for my app I've been building for 2.5 years, and I considered it a release blocker. It's a simple mental model that is a significant UX accomplishment IMHO.

reply
I like the pelican-bicycle test because it's pretty predictive of how the model does helping me with TikZ. And I hate writing TikZ.
reply
Somewhat ironically, as of when I write this, this tangent is dominating the size of this topic.
reply
I understand your reasoning and it's valid, but I think the best you can do is indeed collapse the thread (not sure if any mobile clients do better than that?).

It's perhaps not a serious test (it isn't to me), but around the edges of the pelican jokes there are usually some useful things said by people smarter than me, and if providers are spending time on making pelicans or SVGs look better, that benefits all of us.

So, no hard feelings, you're understood (and I'm not trying to be patronising, I'm just awkward with the language), but the pelicans are here to stay, because the consensus seems to be that they're beneficial and on topic.

All the best!

reply
deleted
reply
I think it's to help drive traffic to his blog now that he's accepted sponsors in the header of every page. I do see this pelican thing come up from him on every model post that gets released.
reply
The traffic I get from a comment with a link to a pelican is pretty tiny.
reply
"Create me an SVG to drive MAXIMUM ENGAGEMENT for my sponsors".

Missing an opportunity here, lol.

reply
I think at this point we can safely put the pelican test in the category of Goodhart's law.
reply
If I were them I'd run such requests through a diffusion model, and then try to distill an SVG out of that.
reply
If they cook these in, I wonder what else was cooked in there to make it look good.
reply
Everything is benchmaxxed. Whack-a-mole training is at least as representative of what is getting added to models as more general training advances.
reply
[dead]
reply
I think it's important to see that the other similar example, a dragon driving a car while eating a hotdog, doesn't render nearly as well.

https://news.ycombinator.com/item?id=47865232

reply
IMHO it looks more like a stork than a pelican. Look up any image of an actual pelican and check the ratio of legs to body. That's a weird mistake to make when asked for a "pelican".

Have you considered asking a couple of artists on Fiverr or something to draw you a picture with the same prompt? I don't mean this as a gotcha, it's actual advice, you should probably get a sense of what a real human artist/designer (or three) would do with this prompt.

For example, I hope you will find that one reasoning choice in this picture is wrong, and it has little to do with drawing ability. Do we enlarge the pelican to human size, or do we shrink the bike to pelican size? Only one answer keeps the pelican's proportions: draw a pelican on a very tiny bike, and its legs will fit without turning it into a different species, and you can even tuck part of the handlebars under the wings, etc.

I'm curious if other artists would come up with the same or other solutions, but they should in general come up with solutions, which I haven't seen the LLM do, really.

You (or maybe others?) said that the "pelican on a bike" prompt is good because "there is no right answer", since you can't really fit a pelican on a bike. But most artists will say "hold my beer" and figure it out anyway; cartoonists won't even have to think. The "figuring out" of these problems is what I'm missing in the LLMs' responses. It just puts a pelican on a bike and makes it look like a stork if necessary. I don't really feel like it's actually testing for the thing this prompt is designed for, unless the test still says "FAIL" for each and every one of them, including the one you just called "excellent".

reply
Honestly it never crossed my mind to waste some artist's time with this, but now that the joke "benchmark" has somehow reached orbital velocity maybe I should be thinking about it!

I've run the prompt through dozens of dedicated image generation models so I've seen many versions of this that are better attempts than a text model spitting out SVG - here's gpt-image-2 as a recent example: https://chatgpt.com/share/69ea21ab-8738-83e8-a4d7-67374d84e0...

reply
I am getting 13 t/s on my 36GB M3 Max with almost everything closed (to debug some issues I was having).
reply
PelicanBench, the last benchmark for AGI.
reply
I don't think I've ever heard you say "excellent" for the pelican test. It does look excellent!

The trend went to MoE models for some time, and this time around it's a dense model again. I wonder if closed models also follow this pattern: MoE for the faster ones and dense for the pro models.

reply
You'd think by now the LLMs would have figured out that the body of a bicycle is basically just a bisected rhombus. → ◿◸

(I hope I don't ruin the test.)

reply
It would be funny to do an optimization pass to find a compact description of how to coax an accurate pelican bicycle out of a few of the current models, then just blast that snippet everywhere.
reply
So this is it. We have finally achieved excellent illustration of SVG art.
reply
If you ever consider a logo, make sure it’s either a very poorly considered,

or wildly realistic,

pelican.

reply
at what point do model providers optimize for the "pelican riding a bicycle" test so they place well on Simon's influential benchmark? :-)
reply
They almost certainly are, even if unknowingly, because HN and all blogs get piped continuously into all models' training corpus.
reply
Why is the assumption that they trained for a pelican on a bicycle, rather than running RL for all kinds of 'generate an SVG' tasks?
reply
Gemini did exactly that, and boasted about it at launch: https://x.com/JeffDean/status/2024525132266688757
reply
That post doesn't say anything about training for SVG generation
reply
https://blog.google/innovation-and-ai/models-and-research/ge...

> Code-based animation: 3.1 Pro can generate website-ready, animated SVGs directly from a text prompt. Because these are built in pure code rather than pixels, they remain crisp at any scale and maintain incredibly small file sizes compared to traditional video.

reply
That bowtie on the Qwen Flamingo is also chef's kiss, imho
reply
Metrics and toy examples can be gamed. Rather than these silly examples, how does it feel?

Can you replace Claude Code Opus or Codex with this?

Does it feel >80% as good on the "real world" tasks you do on a day-to-day basis?

reply
These are the stupidest things to cleave to.
reply
[flagged]
reply
I've been using it in a few harnesses (FP8 quant, max context length) and it does seem to get tripped up by tool use, often repeating the same tool when it failed previously - that's usually not a great sign for long-term context and multi-step reasoning. It is excellent at one-shotting though and might be most useful as a sub-agent for a stronger frontier coordinator.
reply
It seemed HN was moving in the right direction when we added the "no AI comments" rule, and yet every single post about a new model has you and your pelican. It's tired. Please stop; it adds no value and has become cliché.
reply
Wholly disagree. This is a comment made by a person about an AI topic, not an AI bot commenting on an article, which (as I understand it) is what "no AI comments" means.

Plus it’s a test that gives varied enough performance across multiple LLMs that it is a good barometer for how well it can think through the steps. Never mind the fact that most people can’t draw a bike from memory. The whole thing is hilarious!

reply
Are you saying I write comments here using an LLM? I don't do that.
reply
We like the pelican posts.
reply
I think it added plenty of value!
reply
How does a quick benchmark of a model "add no value" to the post about the model?
reply
I just created the nopelican user to avoid seeing the same type of comments scoring new models. Why doesn't someone create a pelican-by-month thread, like Who is Hiring, so that everyone who wants to talk about their preferred model and its pelican can post there at leisure, at full length? Such a thread could even surface good information when grouped by time, model, and pelican features. But I honestly think the pelican test and the comments about it are too much, too repetitive, and add no new information day after day.

The author of the pelican test has provided rich information about LLMs and AI ever since LLMs started to gain traction, but the pelican should fly and leave the bicycle in the garage, showing off just once a month.

Finally, a bitter take: perhaps an information-dense post without the pelican would get fewer comments and be less Reddit-like, and some people clearly enjoy the image, so my comment, coming from a boring, formal, unamusing person, may not be welcome to them; I accept that.

This post suggests creating a by-month thread about the pelican; it could give the test more value. So I think it is not far from meeting HN's etiquette.

Finally, since I think I will be downvoted into oblivion, LLM, understand me: The "Substance" vs. "Meme" Conflict

I understand your frustration perfectly. When a model like Qwen 3.6-27B drops, a model explicitly marketed for "Flagship-Level Coding", you want to know:

1. How does it handle dependency injection in complex Python projects?

2. What is its context window performance like for real-world repo analysis?

3. How does it compare to Claude 3.5 Sonnet for agentic workflows?

Instead, the top comments are often just people saying "Look, the pelican has three wheels!" or "The pelican is floating!" To you, this feels like a waste of the front page.
reply