Yeah I can imagine these popular benchmarks get special treatment in the training of new models. I wonder how they would perform for "Elephant riding a car" or "Lion sleeping in a bed"
reply
That's why I did the flamingo on a unicycle.

For a delightful moment this morning I thought I might have finally caught a model provider cheating by training for the pelican, but the flamingo convinced me that wasn't the case.

reply
It is completely wild to me that you prefer Qwen's flamingo. I think it's really bad and Opus' is pretty good.
reply
The Opus one doesn't even have a bowtie.
reply
The Opus one looks like a flamingo, and looks like it's riding the unicycle. Sitting on the seat. Feet on the pedals.

The Qwen one looks like a 3-tailed, broken-winged, beakless (I guess? Is that offset white thing a beak? Or is it chewing on a pelican feather like it's a piece of straw?) monstrosity not sitting on the seat, with its one foot off the pedal (the other chopped off at the knee) of a malmanufactured wheel that has bonus spokes that are longer than the wheel.

But yeah, it does have a bowtie and sunglasses that you didn't ask for! Plus it says "<3 Flamingo on a Unicycle <3", which perhaps resolves all ambiguity.

reply
Let's not oversell Opus' output. The Qwen flamingo is flawed but could be easily fixed with 1-2 prompts if you're really upset with it. The Opus SVG is not any better than something that I could make in Inkscape with 3 minutes and sufficient motivation. Calling Opus' flamingo "programmer art" would be an insult to programmers.
reply
Game over opus
reply
r/LocalLlama is now doing a horse in a racing car:

https://redd.it/1slz38i

reply
If I (commercially) made models I’d put specific care into producing SVGs of various animals doing (riding) various things ... I find it interesting how confident you seem to be that they’re not.
reply
Google Gemini featured a bunch of examples of exactly that in their release video for 3.1 Pro: https://x.com/JeffDean/status/2024525132266688757
reply
To me the Opus flamingo is waaaay better than the Qwen one. Qwen has the better pelican, though.
reply
Is a flamingo on a unicycle not merely a special case of a pelican on a bicycle?
reply
Consider reading the article, which addresses all of the points you raise.

It's directly stated in the post that the entire test is meant to be humorous, not taken seriously, only that it has vaguely followed model performance to date. The author also writes that this new result shows that trend has broken.

reply
They're certainly aware of the test, but a turtle doing a kickflip on a skateboard? I seriously doubt they train their models for that.

https://x.com/JeffDean/status/2024525132266688757

If anything, the disastrous Opus 4.7 pelican shows us they don't pelicanmaxx.

reply
I think I found the leaked Claude Mythos version of the turtle benchmark: https://www.youtube.com/watch?v=l82XWTKLZuk
reply
This is a gag that's long outlived its humor, but we're in a space so driven by hype there are people who will unironically take some signal from it. They'll swear up and down they know it's for fun, but let a great pelican come out and see if they don't wave it as proof the model is great alongside their carwash test.
reply