Yeah I can imagine these popular benchmarks get special treatment in the training of new models. I wonder how they would perform for "Elephant riding a car" or "Lion sleeping in a bed"

by simonw 11 hours ago

That's why I did the flamingo on a unicycle.

For a delightful moment this morning I thought I might have finally caught a model provider cheating by training for the pelican, but the flamingo convinced me that wasn't the case.

by furyofantares 10 hours ago

It is completely wild to me that you prefer Qwen's flamingo. I think it's really bad and Opus' is pretty good.

by simonw 10 hours ago

The Opus one doesn't even have a bowtie.

by furyofantares 10 hours ago

The Opus one looks like a flamingo, and looks like it's riding the unicycle. Sitting on the seat. Feet on the pedals.

The Qwen one looks like a 3-tailed, broken-winged, beakless (I guess? Is that offset white thing a beak? Or is it chewing on a pelican feather like it's a piece of straw?) monstrosity not sitting on the seat, with its one foot off the pedal (the other chopped off at the knee) of a malmanufactured wheel that has bonus spokes that are longer than the wheel.

But yeah, it does have a bowtie and sunglasses that you didn't ask for! Plus it says "<3 Flamingo on a Unicycle <3", which perhaps resolves all ambiguity.

by bigyabai 9 hours ago

Let's not oversell Opus' output. The Qwen flamingo is flawed but could be easily fixed with 1-2 prompts if you're really upset with it. The Opus SVG is not any better than something that I could make in Inkscape with 3 minutes and sufficient motivation. Calling Opus' flamingo "programmer art" would be an insult to programmers.

by monksy 9 hours ago

Game over opus

by akavel 10 hours ago

r/LocalLlama is now doing a horse in a racing car:

https://redd.it/1slz38i

by solarkraft 5 hours ago

If I (commercially) made models I’d put specific care into producing SVGs of various animals doing (riding) various things ... I find it interesting how confident you seem to be that they’re not.

by simonw 1 hour ago

Google Gemini featured a bunch of examples of exactly that in their release video for 3.1 Pro: https://x.com/JeffDean/status/2024525132266688757

by prodigycorp 11 hours ago

To me the opus flamingo is waaaay better than the qwen one. qwen has the better pelican, though.

by dude250711 11 hours ago

Is a flamingo on a unicycle not merely a special case of a pelican on a bicycle?

by luyu_wu 7 hours ago

Consider reading the article, which addresses all of the points you raise.

It's directly stated in the post that the entire test is meant to be humorous, not taken seriously, only that it has vaguely followed model performance to date. The author also writes that this new result shows that trend has broken.

by stephbook 9 hours ago

They're certainly aware of the test, but a turtle doing a kickflip on a skateboard? I seriously doubt they train their models for that.

https://x.com/JeffDean/status/2024525132266688757

If anything, the disastrous Opus 4.7 pelican shows us they don't pelicanmaxx.

by bitwize 9 hours ago

I think I found the leaked Claude Mythos version of the turtle benchmark: https://www.youtube.com/watch?v=l82XWTKLZuk

by BoorishBears 9 hours ago

This is a gag that's long outlived its humor, but we're in a space so driven by hype there are people who will unironically take some signal from it. They'll swear up and down they know it's for fun, but let a great pelican come out and see if they don't wave it as proof the model is great alongside their carwash test.