I've long thought multi-modal LLMs should be strong enough to do RL for TikZ and SVG generation. Maybe Google is doing it.
I suggest starting with a new SVG challenge, hopefully one that makes even Gemini 3 Deep Think fail ;D
---
Qwen 3.5: "A user asks an LLM a question about a fictional or obscure fact involving a pelican, often phrased confidently to test if the model will invent an answer rather than admitting ignorance." <- How meta
Opus 4.6: "Will a pelican fit inside a Honda Civic?"
GPT 5.2: "Write a limerick (or haiku) about a pelican."
Gemini 3 Pro: "A man and a pelican are flying in a plane. The plane crashes. Who survives?"
Minimax M2.5: "A pelican is 11 inches tall and has a wingspan of 6 feet. What is the area of the pelican in square inches?"
GLM 5: "A pelican has four legs. How many legs does a pelican have?"
Kimi K2.5: "A photograph of a pelican standing on the..."
---
I agree with Qwen; this seems like a very cool benchmark for hallucinations.
So we might have an outer alignment failure.
So if there is a single good "pelican on a bike" image on the internet, or even one just created by the lab and thrown onto The Model Hard Drive, the model will make a perfect pelican-bike SVG.
The reality, of course, is that the high-water mark has risen as the models improve, and that has naturally lifted the "SVG generation" boat along with it.
I've been loosely planning a more robust version of this where each model gets three tries, a panel of vision models picks the "best" of the three, and the winners then compete against each other. I built a rough version of that last June: https://simonwillison.net/2025/Jun/6/six-months-in-llms/#ai-...
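The pipeline described there (N tries per model, a judge panel picks the best, winners face off) could be sketched roughly like this. Everything here is a stand-in: the string-length "judges" are placeholders for real vision-model calls on rendered SVGs, and the model names are made up.

```python
import itertools
from collections import Counter


def pick_best(candidates, judges):
    """Majority vote: each judge returns the index of its preferred
    candidate; ties are broken toward the earlier candidate."""
    votes = Counter(judge(candidates) for judge in judges)
    winner = max(votes, key=lambda i: (votes[i], -i))
    return candidates[winner]


def round_robin(entries, compare):
    """Pairwise round-robin over {model_name: svg}; compare(a, b)
    returns the winning SVG. Returns a win count per model."""
    wins = Counter({name: 0 for name in entries})
    for (na, a), (nb, b) in itertools.combinations(entries.items(), 2):
        wins[na if compare(a, b) == a else nb] += 1
    return wins


# Toy judge panel: prefers the longest SVG string. A real panel
# would render each SVG and ask vision models to rank the images.
longest = lambda cs: max(range(len(cs)), key=lambda i: len(cs[i]))
judges = [longest, longest, longest]

best = pick_best(["<svg>a</svg>", "<svg>pelican</svg>"], judges)

entries = {"model-a": "<svg>1</svg>",
           "model-b": "<svg>123</svg>",
           "model-c": "<svg>12</svg>"}
standings = round_robin(entries, lambda x, y: max(x, y, key=len))
```

The same `pick_best` helper works for both stages: first intra-model (best of three tries), then the round-robin ranks the survivors without needing an absolute quality score, only pairwise preferences.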