How much more do you know about pelicans now than when you first started doing this?
reply
Lots more, but not because of the benchmark - I live in Half Moon Bay, CA, which turns out to have the second-largest mega-roost of the California Brown Pelican (at certain times of year), and my wife and I befriended our local pelican rescue expert and helped on a few rescues.
reply
We scaled on "virtually all RL tasks and environments we could conceive." - apparently, they didn't conceive of pelican SVG RL.

I've long thought multi-modal LLMs should be strong enough to do RL for TikZ and SVG generation. Maybe Google is doing it.

reply
At this point I wouldn't be surprised if your pelican example has leaked into most training datasets.

I suggest starting to use a new SVG challenge, hopefully one that makes even Gemini 3 Deep Think fail ;D

reply
I think we’re now at the point where saying "the pelican example is in the training dataset" is part of the training dataset for all automated comment LLMs.
reply
It's quite amusing to ask LLMs what the pelican example is and watch them hallucinate a plausible-sounding answer.

---

Qwen 3.5: "A user asks an LLM a question about a fictional or obscure fact involving a pelican, often phrased confidently to test if the model will invent an answer rather than admitting ignorance." <- How meta

Opus 4.6: "Will a pelican fit inside a Honda Civic?"

GPT 5.2: "Write a limerick (or haiku) about a pelican."

Gemini 3 Pro: "A man and a pelican are flying in a plane. The plane crashes. Who survives?"

Minimax M2.5: "A pelican is 11 inches tall and has a wingspan of 6 feet. What is the area of the pelican in square inches?"

GLM 5: "A pelican has four legs. How many legs does a pelican have?"

Kimi K2.5: "A photograph of a pelican standing on the..."

---

I agree with Qwen; this seems like a very cool benchmark for hallucinations.

reply
I'm guessing it has the opposite problem of typical benchmarks, since there is no ground-truth pelican-on-a-bike SVG to overfit on. Instead the model just has a corpus of shitty pelicans on bikes made by other LLMs that it is mimicking.

So we might have an outer alignment failure.

reply
Most people seem to have this reflexive belief that "AI training" is "copy+paste data from the internet onto a massive bank of hard drives".

So if there is a single good "pelican on a bike" image on the internet, or even just one created by the lab and thrown on The Model Hard Drive, the model will make a perfect pelican-on-a-bike SVG.

The reality, of course, is that the high-water mark has risen as the models improve, and that has naturally lifted the boat of "SVG Generation" along with it.

reply
How would that work? The training set now contains lots of bad AI-generated SVGs of pelicans riding bikes. If anything, the data is being poisoned.
reply
I like the little spot colors it put on the ground
reply
How many times do you run the generation, and how do you choose which example to ultimately post and share with the public?
reply
Once. It's a dice roll for the models.

I've been loosely planning a more robust version of this where each model gets 3 tries, a panel of vision models picks the "best" of the three, and that winner then competes against the other models' picks. I built a rough version of that last June: https://simonwillison.net/2025/Jun/6/six-months-in-llms/#ai-...
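
Roughly the shape of that idea, as a hypothetical sketch - the model names, judge names and the generate_svg()/judge_pick() helpers below are placeholders, not anything that actually exists yet:

    from itertools import combinations

    MODELS = ["model-a", "model-b", "model-c"]                     # placeholder model names
    JUDGES = ["vision-judge-1", "vision-judge-2", "vision-judge-3"]  # placeholder judge names

    def generate_svg(model: str, prompt: str) -> str:
        """Placeholder: call `model` with `prompt`, return raw SVG text."""
        raise NotImplementedError

    def judge_pick(judge: str, prompt: str, candidates: list[str]) -> int:
        """Placeholder: ask a vision model which rendered candidate best
        matches the prompt; returns the index of its pick."""
        raise NotImplementedError

    def best_of_three(model: str, prompt: str) -> str:
        # Each model gets three attempts; the judge panel votes on the best one.
        attempts = [generate_svg(model, prompt) for _ in range(3)]
        votes = [judge_pick(j, prompt, attempts) for j in JUDGES]
        return attempts[max(set(votes), key=votes.count)]

    def tournament(prompt: str) -> dict[str, int]:
        # Each model's best attempt then competes head-to-head against the others.
        finalists = {m: best_of_three(m, prompt) for m in MODELS}
        wins = dict.fromkeys(MODELS, 0)
        for a, b in combinations(MODELS, 2):
            votes = [judge_pick(j, prompt, [finalists[a], finalists[b]]) for j in JUDGES]
            wins[a if votes.count(0) > votes.count(1) else b] += 1
        return wins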

reply
Axis-aligned spokes are certainly a choice
reply
What quantization were you running there, or was it the official API version?
reply
Better than frontier pelicans as of 2025
reply