That pelican looks like it's in Miami for a crypto conference.
reply
That pelican wears its sunglasses at night. So it can, so it can keep track of the visions in its eyes.
reply
It looks quite funny.
reply
Pelican and I need an optometrist urgently
reply
It looks like the starting soon screen of a crypto presentation
reply
It looks like it’s been partying for 60 years based on the wrinkles on its pouch.
reply
That pelican looks like it lost 100k on NFTs and now runs a paid stock-trading group.
reply
Pelican in a white Testarossa.
reply
They're called ClawCons now
reply
Personally, I don't attend them since I figured out I can set up agents to performatively engage in AI-related discussion and events for me, freeing up tons of my time thanks to automation.

Truly: Nothing better than AI tools to brave the challenges and requirements of modern life. "Claude, ride the hype train" is the decisive prompt you need.

reply
It looks like the start of a new viral Peliwave aesthetic
reply
and somehow in 1992
reply
sorta looks like the Tron ripoff in the I/O keynote
reply
This is a perfect illustration of something I noticed with LLM progress. Ask them to improve an SVG like this, and it never fixes the missing crossbar or disconnected limbs, it just adds more stuff. In this example they have obviously improved greatly, and it contains a ridiculous amount of detail, but they still get the basic shape of the frame wrong. It's weird. And the pattern shows up everywhere: try it with a webpage and it will add more buttons and stuff. I've even experimented with feeding the broken pelican SVGs to an image model to look for flaws, and they still fail to spot the broken elements.

edit: fixed human hallucination

reply
When you say "improve an svg like this", how are you imagining setting that workflow up? Are you just feeding them the SVG to iterate on; or are you giving them access to a browser to look at the rendering of the SVG?

I ask because:

Insofar as the original pelican test is zero-shot, it effectively serves as a way to test for the presence of a kind of "visual imagination" component within the layers of the model, that the model would internally "paint" an SVG [or PostScript, etc] encoding of an image onto, to then extract effective features from, analyze for fitness as a solution to a stated request, etc.

But if you're trying to do a multi-shot pelican, then just feeding back in the SVG produced in the previous attempt, really doesn't correspond to any interesting human capability. Humans can't take an SVG of a pelican and iteratively improve upon it just based on our imagined version of how that SVG renders, either! Rather, a human, given the pelican, would simply load the pelican SVG in a browser; look at the browser's rendering of the pelican; note the things wrong with that rendering; and then edit the SVG to hopefully fix those flaws (and repeat.)

I imagine current (multi-modal and/or computer-use) LLMs would actually be very good at such an "iterative rendered pelican" test.
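
A minimal sketch of that render-and-revise loop, in Python. Both callables are hypothetical placeholders and not any real API: `render` stands for whatever rasterizes the SVG (a headless browser, say), and `revise` stands for a model call that sees the current SVG plus its rendering and returns an edited SVG.

```python
from typing import Callable


def iterate_svg(
    svg: str,
    render: Callable[[str], bytes],       # hypothetical: SVG source -> rendered image bytes
    revise: Callable[[str, bytes], str],  # hypothetical: (SVG, rendering) -> edited SVG
    max_rounds: int = 3,
) -> str:
    """Repeatedly render the SVG, show the rendering to the reviser, and keep its edits."""
    for _ in range(max_rounds):
        image = render(svg)
        new_svg = revise(svg, image)
        if new_svg == svg:
            # The reviser made no change, so treat the drawing as settled.
            break
        svg = new_svg
    return svg
```

The point of the structure is that the reviser always works from the rendered image, not from an imagined one, which is the part the zero-shot test deliberately leaves out.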

reply
I'm talking about two types of improvement: model improvement and prompt-based improvement. I am noticing that the baseline output has a lot more going on, the model has improved, yet it still makes those obvious-looking mistakes with the shape of the frame or disconnected limbs etc.

And I am saying that if you take one of these SVGs and ask an LLM to look for flaws, it rarely spots those obvious flaws and instead suggests adding a sunset and a fish in the bird's mouth.

reply
To a certain extent, it feels like a Sonnet 3.7 moment. Slightly overeager - you ask for a button color change, you see layout changes, new package dependencies, and the README rewritten from scratch - and not necessarily correctly.

When I ask for a pelican on a bike, I want the Platonic ideal of a pelican on a bike, not a vision of an alternative reality in which pelicans created bikes. Though, thinking about it again, maybe I should.

reply
What is “Sonnet 3.7 moment”?
reply
Slightly overeager - you ask for a button color change, you see layout changes, new package dependencies, and the README rewritten from scratch - and not necessarily correctly.
reply
So we have to train llms on debugging too, not just how to make things (which you easily train by feeding the outputs).
reply
This matches my experience with humans too FWIW.
reply
Why is there always an identical reply like this when anyone criticizes LLMs?
reply
It's because LLMs are fundamentally generative (creative), not truth-seeking or logic-seeking. Simple logic has always been incredibly expensive to impossible for LLMs.
reply
Their ability is best described as "spiky". To steal from aphyr: think kiki, more than bouba. What's interesting is that a lot of the models seem to have similar spikes and "troughs", though there are differences.
reply
Forgetting the chainstay is typical of asking random people to draw a bicycle.

https://www.gianlucagimini.it/portfolio-item/velocipedia/

> most ended up drawing something that was pretty far off from a regular men’s bicycle

reply
Asking random people to write SVG gives even worse results
reply
Especially without being able to look at the rendered output! (At least I'd be surprised if modern server-side tool calls regularly include an SVG renderer that can show a rasterized version to the model to iterate on it.)
reply
One of the many things Google was pitching today is that they're going to run things like google search with access to linux container environments to do things like run tool calls... which will presumably be able to rasterize SVGs and show them to the model.

But Simon says he runs these through the API without tool access specifically to prevent that sort of "cheating". I.e. it's an LLM benchmark not an LLM+Harness benchmark.

reply
Although every single render of those has pedals on the correct side, as opposed to the Gemini optical-illusion back pedal that tries to be both on the other side of the central gear and in front of the back wheel.

Not really a criticism but an interesting point that you would never expect a human to make that mistake even in a bad drawing.

reply
The fact it went for vaporwave styling on its own is very telling.
reply
I feel like it embodies Google's vibe of an uncool guy trying to stay relevant to the youth pretty well.
reply
That's grok. IMO both gemini and grok are the most overlooked models.
reply
If you sort that table by "output token price", it gets really terrifying - going from 4 cents up to $600 =8-O
reply
We've been daily-driving this model for a few weeks and let me tell you, everything it does is a lot. Fast as fuck and it's actually not bad intelligence-wise for a fast model. It basically tries to make up for any intelligence deficit by just doing a lot, checking a lot, retrying a lot.

That's not to say I don't spend my days raging at it... a lot... but it's not that bad. It does tend to ignore completion criteria but it doesn't obviously degrade when being nudged like some models do.

reply
I'm told there is a new Jeff Dean fact inside google: "Jeff Dean manually adjusts the weights in the model just to screw with Simon".
reply
I'm hoping we'll have many of these pelican cyclist pictures collected. Then when all the models can do it well, we'll stop posting about them, and when the next generations of AIs train on the data we'll have these canonical archetypes.
reply
I wonder if they added all these unrequested details as an Easter-egg or something? (Since they must be aware of your test by now).
reply
Same old issue with Gemini models trying to "enrich" everything
reply
I can’t help but think that what AI is best at is convincing management that things it creates are full featured which reads to their brains as mature
reply
I enjoy the vaporwave aesthetic it went for. Looks like the pelican has a fish in its mouth too?

https://en.wikipedia.org/wiki/Vaporwave

reply
That sun is very similar to the one from the background of this other top HN post about the OS museum: https://news.ycombinator.com/item?id=48195009
reply
Wow, what’s with all the styling? Is it a manifestation of Google’s styling bias? I like the result for sure. It’s shiny and pretty. But then it’s something I didn’t ask for.
reply
Given your pelican is very famous now, don't you think they are adding instructions to beat this benchmark these days?
reply
Well clearly it's not working lmao
reply
I've found prompts like "capybara with spotted fur and 7 octopus tentacles instead of legs, each a different color, riding a tricycle" etc. to be a better test

Last time I tried, ChatGPT's image generator got the best result.

reply
`<!-- Pelican Eye / Sunglasses (Cool Retro Aviators) -->`

wtf

`<!-- Gold Rim -->`

WTF??

reply
They are just trolling you now
reply
funny that when I try the same prompt, gemini generates an image, not an SVG. something is not right.
reply
That's likely because you're using the Gemini app which has a tool for image generation (nano banana) - I do my tests against the API to avoid any possibility of tool use.
reply
This question makes me wonder if you one shot each pelican or do you run it a few times to get the best one?
reply
I one-shot. I have a long-standing ambition to have each model generate 3x and then get the model (assuming it's a vision model) to pick the best one.
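
A rough sketch of what that 3x-then-pick setup could look like, in Python. Both callables are hypothetical stand-ins and not a real API: `generate` is one model call returning an SVG string, and `pick_best` is a vision-model judge returning the index of the candidate it prefers.

```python
from typing import Callable, List


def best_of_n(
    prompt: str,
    generate: Callable[[str], str],         # hypothetical: prompt -> one SVG candidate
    pick_best: Callable[[List[str]], int],  # hypothetical: candidates -> index of preferred one
    n: int = 3,
) -> str:
    """Generate n independent candidates and return the one the judge selects."""
    candidates = [generate(prompt) for _ in range(n)]
    choice = pick_best(candidates)
    if not 0 <= choice < len(candidates):
        raise ValueError(f"judge returned an out-of-range index: {choice}")
    return candidates[choice]
```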
reply
Beats a human by like 10$
reply
So according to Google logic, the value of the pelican is $10-eps. (They applied that reasoning to conversions via adwords)
reply
Love your pelicans, as always. And that one is... Wow.

I noticed the "Synthwave" aesthetic, which has been enjoying quite some success for quite some time now, has found its way into AI models (even when it's not in the user's query). It's not the first time I've seen the sun at sunset with color bands etc. in AI-generated pictures. Don't know why it's now catching on in AI too.

https://en.wikipedia.org/wiki/Synthwave

Hence the comments here about the 90s, Sonny Crockett's white Ferrari Testarossa in Miami, etc.

To be honest as a kid from the 80s and a teenager from the 90s who grew up with that aesthetic in posters, on VHS tape covers, magazine covers, etc. I do love that style and I love that it made a comeback and that that comeback somehow stayed.

reply
Synthwave vibe hype hit a cultural high point with the release of Far Cry 3 Blood Dragon in 2013.

So it's as relevant and baked-in to today as actual 80s synth-culture was in 2000.

reply
"Look around to look around."
reply
At the keynote today, Sundar Pichai asked Gemini to clone the Dino Game, and it added a synthwave-esque aesthetic.
reply
at a certain point you're gonna need to change your benchmark because this will end up in the model's training set
reply
Gemini were the team most likely to have this in their training set - see https://x.com/JeffDean/status/2024525132266688757 - and yet their latest model still messes up the bicycle frame!
reply
I'm sure that certain point came and went many releases ago.
reply