undefined

points

[-]

When you say "improve an svg like this", how are you imagining setting that workflow up? Are you just feeding them the SVG to iterate on; or are you giving them access to a browser to look at the rendering of the SVG?

I ask because:

Insofar as the original pelican test is zero-shot, it effectively serves as a way to test for the presence of a kind of "visual imagination" component within the layers of the model, that the model would internally "paint" an SVG [or PostScript, etc] encoding of an image onto, to then extract effective features from, analyze for fitness as a solution to a stated request, etc.

But if you're trying to do a multi-shot pelican, then just feeding back in the SVG produced in the previous attempt, really doesn't correspond to any interesting human capability. Humans can't take an SVG of a pelican and iteratively improve upon it just based on our imagined version of how that SVG renders, either! Rather, a human, given the pelican, would simply load the pelican SVG in a browser; look at the browser's rendering of the pelican; note the things wrong with that rendering; and then edit the SVG to hopefully fix those flaws (and repeat.)

I imagine current (mult-modal and/or computer-use) LLMs would actually be very good at such an "iterative rendered pelican" test.

by irthomasthomas8 hours ago|

parent|

[-]

I'm talking about two type of improvement, model improving, and prompt based improving. I am noticing that the baseline output has a lot more going on, the model has improved, yet it still makes those obvious looking mistakes with the shape of the frame or disconnected limbs etc.

And I am saying that if you take one of these SVGs and ask an LLM to look for flaws, it rarely spots those obvious flaws and instead suggests adding a sunset and fish in the birds mouth.

by stared5 hours ago|

prev|

[-]

To a certain extent, it feels like a Sonnet 3.7 moment. Slightly overeager - you ask for a button color change, you see layout changes, new package dependencies, and the README rewritten from scratch - and not necessarily correctly.

When I ask for a pelican on a bike, I want the Platonic ideal of a pelican on a bike, not a vision of an alternative reality in which pelicans created bikes. Though, thinking about it again, maybe I should.

by p1esk4 hours ago|

parent|

[-]

What is “Sonnet 3.7 moment”?

by stirfish3 hours ago|

parent|

[-]

Slightly overeager - you ask for a button color change, you see layout changes, new package dependencies, and the README rewritten from scratch - and not necessarily correctly.

by Araopa2 hours ago|

prev|

[-]

So we have to train llms on debugging too, not just how to make things (which you easily train by feeding the outputs).

by 7 hours ago|

prev|

[-]

deleted

by sosborn3 hours ago|

prev|

[-]

This matches my experience with human too FWIW.

by emp173443 hours ago|

parent|

[-]

Why is there always an identical reply like this when anyone criticizes LLMs?

by gowld4 hours ago|

prev|

[-]

It's because LLMs are fundamentally generative (creative), not truth-seeking or logic-seeking. Simple logic has always been incredibly expensive to impossible for LLMs.

by girvo6 hours ago|

prev|

[-]

Their ability is best described as "spiky". To steal from aphyr: think kiki, more than bouba. Whats interesting is that a lot of the models seem to have similar spikes and "troughs", though there are differences.