I'd say the example actually does (vaguely) suggest that Qwen might be overfitting to the Pelican.
But in terms of making something physically plausible, Opus certainly got a lot closer.
I think getting the models to generate realistic and proportional objects is a much harder and more important challenge (remember when the models would generate six fingers?).
This doesn't hold if some models trained on the benchmark and some didn't, but you can fix this by deliberately fine-tuning all models for the benchmark before comparing them. For more in-depth discussion of this, see https://mlbenchmarks.org/11-evaluating-language-models.html#...
For a delightful moment this morning I thought I might have finally caught a model provider cheating by training for the pelican, but the flamingo convinced me that wasn't the case.
The Qwen one looks like a 3-tailed, broken-winged, beakless (I guess? Is that offset white thing a beak? Or is it chewing on a pelican feather like it's a piece of straw?) monstrosity not sitting on the seat, with its one foot off the pedal (the other chopped off at the knee) of a malmanufactured wheel that has bonus spokes that are longer than the wheel.
But yeah, it does have a bowtie and sunglasses that you didn't ask for! Plus it says "<3 Flamingo on a Unicycle <3", which perhaps resolves all ambiguity.
It's directly stated in the post that the entire test is meant to be humorous and not taken seriously; it has only vaguely tracked model performance to date. The author also writes that this new result shows that trend has broken.
https://x.com/JeffDean/status/2024525132266688757
If anything, the disastrous Opus 4.7 pelican shows us they don't pelicanmaxx.
The amount of money you have in the bank may often "increase" or "decrease", but it also goes up and down: a spatial metaphor. Concepts can be adjacent to each other, or orthogonal. There are plenty more examples.
So, as models utilize their weights more densely, with more complex strategies learned during training, the patterns and structure of these metaphors might also be deepened. Hmmm... another thing to add to the heap of future projects: trace the geometry of activations in older/newer models of similar size, using the same prompts containing such metaphors (or these pelican prompts), to test the idea so it isn't just armchair speculation.
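If anyone wants to poke at that, a minimal sketch of the comparison step: measure cosine similarity between activation directions for metaphor-laden prompts. The vectors below are random placeholders standing in for real hidden states (which you'd pull from each model's layers); everything here is assumed, not anyone's actual experiment.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two activation vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Placeholder "activations" for e.g. "prices went up" vs "the balloon went up".
# In a real run these would be hidden-state vectors from two models at the
# same layer; here they're synthetic so the sketch is self-contained.
rng = np.random.default_rng(0)
act_metaphor = rng.normal(size=768)
act_literal = act_metaphor + 0.1 * rng.normal(size=768)  # nearby direction
act_unrelated = rng.normal(size=768)                     # independent direction

print(cosine(act_metaphor, act_literal))    # close to 1
print(cosine(act_metaphor, act_unrelated))  # near 0 in high dimensions
```

Comparing how these similarities shift between an older and a newer checkpoint of similar size would be one crude way to see whether the metaphor geometry "deepens".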
I’m not sure whether you’re a bot, but this is the stereotypical comment: overly critical of anything where OpenAI is not superior, or overly supportive (see comments on the Codex post today), while clearly not understanding the discussed topic at all.
This is not a refutation of astroturfing on HN, but in this case, I doubt it.
Illustrations with SVGs of pelicans riding bicycles will never be useful, because pelicans can't ride bicycles.
I guess initially it would have been a silly way to demonstrate the effect of model size. But the size of the largest models stopped increasing a while ago; recent improvements are driven principally by optimizing for specific tasks. If you had some secret task that you knew they weren't training for, you could use it as a benchmark for how much the models are improving versus overfitting to their training set, but this is not that.
Oh maybe it might continue to iterate on the existing drawing?
That’s so wild
Pelican: saturated!
It's pretty good at finding bugs, but not so good at writing patches to fix them.
But that Opus pelican?