
Personally, I don't attend them since I figured out I can set up agents to performatively engage in AI-related discussion and events for me, freeing up tons of my time thanks to automation.

Truly: Nothing better than AI tools to brave the challenges and requirements of modern life. "Claude, ride the hype train" is the decisive prompt you need.

by brindleth6 hours ago|


It looks like the start of a new viral Peliwave aesthetic

by egillie8 hours ago|


and somehow in 1992

by verdverm8 hours ago|


sorta looks like the Tron ripoff in the I/O keynote


by irthomasthomas8 hours ago|


This is a perfect illustration of something I noticed with LLM progress. Ask them to improve an SVG like this, and it never fixes the missing crossbar or disconnected limbs; it just adds more stuff. In this example they have obviously improved greatly, and it contains a ridiculous amount of detail, but they still get the basic shape of the frame wrong. It's weird. And the pattern shows up everywhere: try it with a webpage and it will add more buttons and stuff. I've even experimented with feeding the broken pelican SVGs to an image model to look for flaws, and they still fail to spot the broken elements.

edit: fixed human hallucination

by derefr8 hours ago|


When you say "improve an svg like this", how are you imagining setting that workflow up? Are you just feeding them the SVG to iterate on; or are you giving them access to a browser to look at the rendering of the SVG?

I ask because:

Insofar as the original pelican test is zero-shot, it effectively serves as a way to test for the presence of a kind of "visual imagination" component within the layers of the model, that the model would internally "paint" an SVG [or PostScript, etc] encoding of an image onto, to then extract effective features from, analyze for fitness as a solution to a stated request, etc.

But if you're trying to do a multi-shot pelican, then just feeding back in the SVG produced in the previous attempt, really doesn't correspond to any interesting human capability. Humans can't take an SVG of a pelican and iteratively improve upon it just based on our imagined version of how that SVG renders, either! Rather, a human, given the pelican, would simply load the pelican SVG in a browser; look at the browser's rendering of the pelican; note the things wrong with that rendering; and then edit the SVG to hopefully fix those flaws (and repeat.)

I imagine current (multi-modal and/or computer-use) LLMs would actually be very good at such an "iterative rendered pelican" test.
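A minimal sketch of that render, look, edit, repeat loop, to make the workflow concrete. Everything here is an assumption for illustration: `refine_svg`, `render`, `critique`, and `revise` are hypothetical names, and the three callables stand in for whatever rasterizer and model calls you would actually plug in.

```python
from typing import Callable, List


def refine_svg(
    svg: str,
    render: Callable[[str], bytes],
    critique: Callable[[bytes], List[str]],
    revise: Callable[[str, List[str]], str],
    max_rounds: int = 3,
) -> str:
    """Render, critique, and revise an SVG until no flaws are reported.

    All three callables are hypothetical placeholders:
      render   -- rasterizes SVG source, as a browser would
      critique -- looks at the rendering and lists visible flaws
      revise   -- edits the SVG source to address those flaws
    """
    for _ in range(max_rounds):
        image = render(svg)
        flaws = critique(image)
        if not flaws:
            break  # the rendering looks fine, so stop iterating
        svg = revise(svg, flaws)
    return svg
```

The point of the design is that `critique` only ever sees the rendered image, never the SVG source, which is the difference between this and simply feeding the previous SVG back in.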

by irthomasthomas8 hours ago|


I'm talking about two types of improvement: model improvement and prompt-based improvement. I am noticing that the baseline output has a lot more going on, so the model has improved, yet it still makes those obvious-looking mistakes with the shape of the frame, disconnected limbs, etc.

And I am saying that if you take one of these SVGs and ask an LLM to look for flaws, it rarely spots those obvious flaws and instead suggests adding a sunset and a fish in the bird's mouth.

by stared5 hours ago|


To a certain extent, it feels like a Sonnet 3.7 moment. Slightly overeager - you ask for a button color change, you see layout changes, new package dependencies, and the README rewritten from scratch - and not necessarily correctly.

When I ask for a pelican on a bike, I want the Platonic ideal of a pelican on a bike, not a vision of an alternative reality in which pelicans created bikes. Though, thinking about it again, maybe I should.

by p1esk4 hours ago|


What is “Sonnet 3.7 moment”?

by stirfish3 hours ago|


Slightly overeager - you ask for a button color change, you see layout changes, new package dependencies, and the README rewritten from scratch - and not necessarily correctly.

by Araopa2 hours ago|


So we have to train LLMs on debugging too, not just on how to make things (which you can easily train by feeding the outputs).


by sosborn3 hours ago|


This matches my experience with humans too FWIW.

by emp173443 hours ago|


Why is there always an identical reply like this when anyone criticizes LLMs?

by gowld4 hours ago|


It's because LLMs are fundamentally generative (creative), not truth-seeking or logic-seeking. Simple logic has always been incredibly expensive to impossible for LLMs.

by girvo6 hours ago|


Their ability is best described as "spiky". To steal from aphyr: think kiki, more than bouba. What's interesting is that a lot of the models seem to have similar spikes and "troughs", though there are differences.

by tantalor8 hours ago|


Forgetting the chainstay is typical of asking random people to draw a bicycle.

https://www.gianlucagimini.it/portfolio-item/velocipedia/

> most ended up drawing something that was pretty far off from a regular men’s bicycle

by et13378 hours ago|


Asking random people to write SVG gives even worse results

by lxgr7 hours ago|


Especially without being able to look at the rendered output! (At least I'd be surprised if modern server-side tool calls regularly include an SVG renderer that can show a rasterized version to the model to iterate on it.)

by gpm3 hours ago|


One of the many things Google was pitching today is that they're going to run things like google search with access to linux container environments to do things like run tool calls... which will presumably be able to rasterize SVGs and show them to the model.

But Simon says he runs these through the API without tool access specifically to prevent that sort of "cheating". I.e. it's an LLM benchmark not an LLM+Harness benchmark.

by Eji17005 hours ago|


Although every single render of those has pedals on the correct side, as opposed to the Gemini optical-illusion back pedal that tries to be both on the other side of the central gear and in front of the back wheel.

Not really a criticism but an interesting point that you would never expect a human to make that mistake even in a bad drawing.

by VectorLock7 minutes ago|


The fact it went for vaporwave styling on its own is very telling.

by smcleod8 hours ago|


I feel like it embodies Google's vibe of an uncool guy trying to stay relevant to the youth pretty well.

by dzhiurgis3 hours ago|


That's grok. IMO both gemini and grok are the most overlooked models.

by tandr4 hours ago|


If you sort that table by "output token price", it gets really terrifying - going from 4 cents up to $600 =8-O

by nrds4 hours ago|


We've been daily-driving this model for a few weeks and let me tell you, everything it does is a lot. Fast as fuck and it's actually not bad intelligence-wise for a fast model. It basically tries to make up for any intelligence deficit by just doing a lot, checking a lot, retrying a lot.

That's not to say I don't spend my days raging at it... a lot... but it's not that bad. It does tend to ignore completion criteria but it doesn't obviously degrade when being nudged like some models do.

by dekhn3 hours ago|


I'm told there is a new Jeff Dean fact inside google: "Jeff Dean manually adjusts the weights in the model just to screw with Simon".

by karmakaze3 hours ago|


I'm hoping we'll have many of these pelican cyclist pictures collected. Then when all the models can do it well, we'll stop posting about them, and when the next generations of AIs train on the data, we'll have these canonical archetypes.

by bee_rider2 hours ago|


I wonder if they added all these unrequested details as an Easter-egg or something? (Since they must be aware of your test by now).

by hydra-f9 hours ago|


Same old issue with Gemini models trying to "enrich" everything

by taurath4 hours ago|


I can’t help but think that what AI is best at is convincing management that the things it creates are full-featured, which reads to their brains as mature.

by nickvec7 hours ago|


I enjoy the vaporwave aesthetic it went for. Looks like the pelican has a fish in its mouth too?

https://en.wikipedia.org/wiki/Vaporwave

by khy7 hours ago|


That sun is very similar to the one from the background of this other top HN post about the OS museum: https://news.ycombinator.com/item?id=48195009

by sbinnee6 hours ago|


Wow what’s with all the styling? Is it a manifestation of Google’s styling bias? I like the result for sure. It’s shiny and pretty. But then it’s something I didn’t ask for.

by danilocesar4 hours ago|


Given your pelican is very famous now, don't you think they are adding instructions to beat this benchmark these days?

by Culonavirus4 hours ago|


Well clearly it's not working lmao

by Razengan2 hours ago|


I've found prompts like "capybara with spotted fur and 7 octopus tentacles instead of legs, each a different color, riding a tricycle" etc. to be a better test

Last time I tried, ChatGPT's image generator got the best result.

by setgree7 hours ago|


``

wtf

``

WTF??

by __mharrison__6 hours ago|


They are just trolling you now

by gcgbarbosa8 hours ago|


funny that when I try the same prompt, gemini generates an image, not an SVG. something is not right.

by simonw8 hours ago|


That's likely because you're using the Gemini app which has a tool for image generation (nano banana) - I do my tests against the API to avoid any possibility of tool use.

by nickmccann8 hours ago|


This question makes me wonder: do you one-shot each pelican, or do you run it a few times to get the best one?

by simonw6 hours ago|


I one-shot. I have a long-standing ambition to have each model generate 3x and then get the model (assuming it's a vision model) to pick the best one.
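A rough sketch of what that "generate 3x, then let the model pick" setup could look like. This is not how the tests are run today; `best_of_n`, `generate`, and `pick_best` are hypothetical names, with `generate` standing in for an API call that returns SVG text and `pick_best` for a vision-model judge that returns the index of its favorite.

```python
from typing import Callable, List, Sequence


def best_of_n(
    prompt: str,
    generate: Callable[[str], str],
    pick_best: Callable[[Sequence[str]], int],
    n: int = 3,
) -> str:
    """Generate n candidates for a prompt and return the one the judge picks.

    `generate` and `pick_best` are hypothetical placeholders for a model
    call and a (vision) model acting as judge.
    """
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    index = pick_best(candidates)
    if not 0 <= index < len(candidates):
        # Guard against a judge that returns an out-of-range choice.
        raise ValueError(f"judge returned invalid index {index}")
    return candidates[index]
```

Keeping the judge as a separate callable means the same model, or a different one, can do the picking without changing the loop.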

by nashashmi9 hours ago|


Beats a human by like $10

by unglaublich8 hours ago|


So according to Google logic, the value of the pelican is $10-eps. (They applied that reasoning to conversions via adwords)

by TacticalCoder6 hours ago|


Love your pelicans, as always. And that one is... Wow.

I noticed the "Synthwave" aesthetic, which has been enjoying success for quite some time now, has found its way into AI models (even when it's not in the user's query). It's not the first time I've seen the sun at sunset with color bands etc. in AI-generated pictures. Don't know why it's now catching on in AI too.

https://en.wikipedia.org/wiki/Synthwave

Hence the comments here about the 90s, Sonny Crockett's white Ferrari Testarossa in Miami, etc.

To be honest as a kid from the 80s and a teenager from the 90s who grew up with that aesthetic in posters, on VHS tape covers, magazine covers, etc. I do love that style and I love that it made a comeback and that that comeback somehow stayed.

by kridsdale35 hours ago|


Synthwave vibe hype hit a cultural high point with the release of Far Cry 3 Blood Dragon in 2013.

So it's as relevant and baked-in to today as actual 80s synth-culture was in 2000.

by professoretc16 minutes ago|


"Look around to look around."

by gowld4 hours ago|


At the keynote today, Sundar Pichai asked Gemini to clone the Dino Game, and it added a synthwave-esque aesthetic.

by holtkam28 hours ago|


at a certain point you're gonna need to change your benchmark because this will end up in the model's training set

by simonw8 hours ago|


Gemini were the team most likely to have this in their training set - see https://x.com/JeffDean/status/2024525132266688757 - and yet their latest model still messes up the bicycle frame!

by recursive6 hours ago|


I'm sure that certain point came and went many releases ago.
