by simonw 7 hours ago

by embedding-shape 7 hours ago

It's an excellent demonstration of the main issue I have with the Gemini family of models: they always go "above and beyond" and do a lot of extra stuff, even if I explicitly prompt against it. In this case, most of the SVG ends up consisting not just of a bike and a pelican, but also clouds, a sun, a hat on the pelican and so much more.

Exactly the same thing happens when you code: it's almost impossible to get Gemini to not do "helpful" drive-by-refactors, and it keeps adding code comments no matter what I say. Very frustrating experience overall.

by mullingitover 6 hours ago

> it's almost impossible to get Gemini to not do "helpful" drive-by-refactors

Just asking "Explain what this service does?" turns into

[No response for three minutes...]

+729 -522

by cowmoo728 6 hours ago

It's also so aggressive about taking out debug log statements and in-progress code. I'll ask it to fill in a new function somewhere else and it will remove all of the half-written code from the piece I'm currently working on.

by chankstein38 6 hours ago

I ended up adding a "NEVER REMOVE LOGGING OR DEBUGGING INFO, OPT TO ADD MORE OF IT" to my user instructions and that has _somewhat_ fixed the problem but introduced a new problem where, no matter what I'm talking to it about, it tries to add logging. Even if it's not a code problem. I've had it explain that I could set up an ESP32 with a sensor so that I could get logging from it, then write me firmware for it.

by sd9 5 hours ago

If it's adding too much logging now, have you tried softening the instruction about adding more?

"NEVER REMOVE LOGGING OR DEBUGGING INFO. If unsure, bias towards introducing sensible logging."

Or just

"NEVER REMOVE LOGGING OR DEBUGGING INFO."

by bratwurst3000 6 hours ago

"I've had it explain that I could setup an ESP32 with a sensor so that I could get logging from it then write me firmware for it." lol did you try it? This is so far from anything rational.

by BartShoot 5 hours ago

If you had to ask, it obviously needs to refactor the code for clarity so the next person does not need to ask.

by quotemstr 6 hours ago

What. You don't have yours ask for edit approval?

by girvo 2 hours ago

The depressing truth is most people I know just run all these tools in /yolo mode or equivalents.

Because your coworkers definitely are, and we're stack ranked, so it's a race (literally) to the bottom. Just send it...

(All this actually seems to do is push the burden on to their coworkers as reviewers, for what it's worth)

by embedding-shape 6 hours ago

Who has time for that? This is how I run codex: `codex --sandbox danger-full-access --dangerously-bypass-approvals-and-sandbox --search exec "$PROMPT"`. Having to approve each change would effectively destroy the entire point of using an agent, at least for me.

Edit: obviously inside something so it doesn't have access to the rest of my system, but enough access to be useful.

by well_ackshually 3 hours ago

>Who has time for that?

People that don't put out slop, mostly.

by embedding-shape 2 hours ago

That's another thing entirely, I still review and manually decide the exact design and architecture of the code, with more care now than before. Doesn't mean I want the UI of the agent to need manual approval of each small change it does.

by quotemstr 5 hours ago

I wouldn't even think of letting an agent work in that mode. Even the best of them produce garbage code unless I keep them on a tight leash. And no, not a skill issue.

What I don't have time to do is debug obvious slop.

by kees99 4 hours ago

I ended up running codex with all the "danger" flags, but in a throw-away VM with copy-on-write access to code folders.

Built-in approval thing sounds like a good idea, but in practice it's unusable. Typical session for me was like:

  About to run "sed -n '1,100p' example.cpp", approve?
  About to run "sed -n '100,200p' example.cpp", approve?
  About to run "sed -n '200,300p' example.cpp", approve?

Could very well be a skill issue, but that was mighty annoying, and with no obvious fix (options "don't ask again for ...." were not helping).
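
A rough sketch of the kind of rule that would avoid prompts like the ones above: auto-approve commands that only read files and ask about everything else. This is not how Codex's approval settings actually work; the `is_read_only` helper and its allowlist are hypothetical, written in Python purely for illustration.

```python
import re
import shlex

# Hypothetical allowlist of programs treated as read-only. This is an
# assumption for illustration, not any real agent's configuration.
READ_ONLY_PROGRAMS = {"cat", "head", "tail", "ls", "wc"}

# Characters that can chain commands, redirect output or run subshells.
SHELL_METACHARACTERS = set("|;&<>`$\n")

# Matches a sed print-range script such as "1,100p" or "42p".
SED_PRINT_RANGE = re.compile(r"\d+(,\d+)?p")


def is_read_only(command: str) -> bool:
    """Return True only if the command clearly just reads files."""
    if any(ch in SHELL_METACHARACTERS for ch in command):
        return False  # chaining or redirection: ask the human
    try:
        tokens = shlex.split(command)
    except ValueError:
        return False  # unbalanced quotes: ask the human
    if not tokens:
        return False
    program, args = tokens[0], tokens[1:]
    if program in READ_ONLY_PROGRAMS:
        return True
    if program == "sed":
        # Accept only the exact shape: sed -n '<start>,<end>p' <file>
        return (
            len(args) == 3
            and args[0] == "-n"
            and SED_PRINT_RANGE.fullmatch(args[1]) is not None
        )
    return False
```

Under these assumptions the three `sed -n` calls above would pass without a prompt, while `sed -i`, pipes, redirects and anything not on the allowlist would still ask.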

by embedding-shape 2 hours ago

I keep it on a tight leash too, not sure how that's related. What gets edited on disk is very different from what gets committed.

by mullingitover 4 hours ago

Ask mode exists. I think the models work on the assumption that if you're allowing edits then of course you must want edits.

by kylec 6 hours ago

"I don't know what did it, but here's what it does now"

by moffkalast 2 hours ago

I've seen Kimi do this a ton as well, so insufferable.

by SignalStackDev 6 hours ago

[dead]

by h14h 5 hours ago

Would be really interesting to see an "Eager McBeaver" bench around this concept. When doing real work, a model's ability to stay within the bounds of a given task has almost become more important than its raw capabilities now that every frontier model is so dang good.

Every one of these models is so great at propelling the ship forward, that I increasingly care more and more about which models are the easiest to steer in the direction I actually want to go.

by cglan 5 hours ago

being TOO steerable is another issue though.

Codex is steerable to a fault, and will gladly "monkey paw" your requests.

Claude Opus will ignore your instructions and do what it thinks is "right" and just barrel forward.

Both are bad, and both paper over the actual issue, which is that these models don't really have the ability to selectively choose their behavior per issue (i.e. ask for follow-up where needed, ignore users where needed, follow instructions where needed). Behavior is largely global.

by kees99 4 hours ago

In my experience Claude gradually stops being opinionated as the task at hand becomes more arcane. I frequently add "treat the above as a suggestion, and don't hesitate to push back" to change requests, and it seems to help quite a bit.

by cglan 1 hour ago

Yeah that happens to me too. It’s hard to know where it’s going to break off and follow instructions too well vs use it as a tip. Idk it’s all tiring

by h14h 2 hours ago

For sure. I imagine it'd be pretty difficult to evaluate the "correct" amount of steerability. You'd probably just have to measure the delta in eagerness on the same task between highly specified prompts and more open-ended prompts. Probably not dissimilar from how artificialanalysis.ai does their "omniscience index".
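
One way such a delta could be computed, as a minimal sketch: score each run by the share of its diff that the task did not require, then compare a tightly specified prompt against an open-ended one on the same task. The `Run` fields, and the idea of a reviewer labelling how many changed lines were actually needed, are assumptions for illustration and not an existing benchmark.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One model run on a task (hypothetical record, for illustration)."""
    lines_changed: int  # lines the model touched in its diff
    lines_needed: int   # lines a reviewer judged necessary for the task


def eagerness(run: Run) -> float:
    """Share of the diff that was not needed; 0.0 means perfectly scoped."""
    if run.lines_changed == 0:
        return 0.0
    extra = max(run.lines_changed - run.lines_needed, 0)
    return extra / run.lines_changed


def eagerness_delta(specified: Run, open_ended: Run) -> float:
    """How much more eager the model gets when the prompt is open-ended."""
    return eagerness(open_ended) - eagerness(specified)


# Same task, two prompt styles (made-up numbers).
specified = Run(lines_changed=50, lines_needed=40)
open_ended = Run(lines_changed=200, lines_needed=40)
delta = eagerness_delta(specified, open_ended)
```

A delta near zero would suggest the model stays in scope regardless of how the task is phrased; a large positive delta would suggest it only behaves when fenced in.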

by enobrev 7 hours ago

I have the same issue. Even when I ask it to do code-reviews and very explicitly tell it not to change files, it will occasionally just start "fixing" things.

by mikepurvis 6 hours ago

I find Copilot leans the other way. It'll myopically focus its work in the exact function I point it at, even when it's clear that adding a new helper would be a logical abstraction to share behaviour with the function right beside it.

Overall, I think it's probably better that it stay focused, and allow me to prompt it with "hey, go ahead and refactor these two functions" rather than the other way around. At the same time, really the ideal would be to have it proactively ask, or even pitch the refactor as a colleague would, like "based on what I see of this function, it would make most sense to XYZ, do you think that makes sense? <sure go ahead> <no just keep it a minimal change>"

Or perhaps even better, simply pursue both changes in parallel and present them as A/B options for the human reviewer to select between.

by Yizahi 1 hour ago

Asking LLM programs to "not do the thing" often results in them tripping and generating output that includes that "thing", since those are simply the tokens that enter the input. I always try to rephrase the query so that all my instructions have only "positive" forms: "do only this", "do it only in that way", "do it only for the parameters requested", etc. Can't say if that helps much, but it is possible.

by neya 6 hours ago

> it's almost impossible to get Gemini to not do "helpful" drive-by-refactors

This has not been my experience. I do Elixir primarily and Gemini has helped build some really cool products and massive refactors along the way. And it would even pick up security issues and potential optimizations along the way.

What HAS constantly been an issue, though, is that the model will randomly not respond at all and some random error occurs, which is embarrassing for a company like Google given the infrastructure they own.

by embedding-shape 6 hours ago

Out of curiosity, do you have any public projects (with public source code) you've made exclusively with Gemini, so one could take a look? I've tried a bunch of times to use Gemini to at least finish something small but I always end up sufficiently frustrated to abort it as the instruction-following seems so bad.

by apitman 5 hours ago

This matches my experience using Gemini CLI to code. It would also frequently get stuck in loops. It was so bad compared to Codex that I feel like I must have been doing something fundamentally wrong.

by msteffen 4 hours ago

> it's almost impossible to get Gemini to not do "helpful" drive-by-refactors

Not like human programmers. I would never do this and have never struggled with it in the past, no...

by embedding-shape 4 hours ago

Fairer comparison would be against other models, which are typically better at instruction following. You say "don't change anything not explicitly mentioned" or "Don't add any new code comments" and they tend to follow that.

by tyfon 6 hours ago

I was using gemini antigravity in opencode a few weeks ago before they started banning everyone for that and I got into the habit of writing "do x, then wait for instructions".

That helped quite a bit but it would still go off on its own from time to time.

by JLCarveth 5 hours ago

Every time I have tried using `gemini-cli` it just thinks endlessly and never actually gives a response.

by gavinray 7 hours ago

Do you have Personalization Instructions set up for your LLM models?

You can make their responses fairly dry/brief.

by embedding-shape 7 hours ago

I'm mostly using them via my own harnesses, so I have full control of the system prompts and so on. And no matter what I try, Gemini keeps "helpfully" adding code comments every now and then. With every other model, "- Don't add code comments" tends to be enough, but with Gemini I'm not sure how I could stop the comments from eventually appearing.

by WarmWash 6 hours ago

I'm pretty sure it writes comments for itself, not for the user. I always let the models comment as much as they want, because I feel it makes the context more robust, especially when cycling contexts often to keep them fresh.

There is a tradeoff though, as comments do consume context. But I tend to pretty liberally dispose of instances and start with a fresh window.

by embedding-shape 6 hours ago

> I'm pretty sure it writes comments for itself, not for the user

Yeah, that sounds worse than "trying to be helpful". Read the code instead; why add indirection in that way, just to be able to understand what other models understand without comments?

by metal_am 6 hours ago

I'd love to hear some examples!

by gavinray 6 hours ago

I use LLMs outside of work primarily for research on academic topics, so mine is:

  Be a proactive research partner: challenge flawed or unproven ideas with evidence; identify inefficiencies and suggest better alternatives with reasoning; question assumptions to deepen inquiry.

by ai4prezident 6 hours ago

[dead]

by zengineer 7 hours ago

true, whenever I ask Gemini to help me with a prompt for generating an image of XYZ, it generates the image.

by jasonjmcghee 5 hours ago

What's crazy is you've influenced them to spend real effort ensuring their model is good at generating animated svgs of animals operating vehicles.

The most absurd benchmaxxing.

https://x.com/jeffdean/status/2024525132266688757?s=46&t=ZjF...

by simonw 4 hours ago

I like how they also did a frog on a penny-farthing and a giraffe driving a tiny car and an ostrich on roller skates and a turtle kickflipping a skateboard and a dachshund driving a stretch limousine.

by jasonjmcghee 4 hours ago

Ok Google what are some other examples like a pelican riding a bicycle

by simultsop 3 hours ago

reminds me of andor, luthen, positive reinforcing wasting time of emperor

by threatofrain 4 hours ago

Animated SVG is huge. People in different professions are worrying to different degrees in terms of being replaced by ML, but this one is huge with regards to digital art.

by yieldcrv 3 hours ago

Yeah, complex SVGs are so much more bandwidth-, computation- and energy-efficient than raster images, up to a point! But in general use we are not at that point, and there's so much more we can do with it.

I've been meaning to let coding agents take a stab at using the lottie library https://github.com/airbnb/lottie-web to supercharge the user experience without needing to make it a full time job

by eurekin 5 hours ago

Can't wait until they finally get to real world CAD

by tngranados 4 hours ago

There's a CAD example in that same thread: https://x.com/JeffDean/status/2024528776856817813

by tantalor 5 hours ago

He's svg-mogging

by gnatolf 5 hours ago

So let's put things we're interested in in the benchmarks.

I'm not against pelicans!

by ghurtado 4 hours ago

I think the reason the pelican example is great is because it's bizarre enough that it's unlikely to appear in the training data as one unified picture.

If we picked something more common, like say, a hot dog with toppings, then the training contamination is much harder to control.

by troymc 1 hour ago

I think it's now part of their training though, thanks to Simon constantly testing every new model against it, and sharing his results publicly.

There's a specific term for this in education and applied linguistics: the washback effect.

by rvnx 4 hours ago

It's the most common SVG test; it's the equivalent of Will Smith eating spaghetti, so obviously they benchmax toward it.

by casey2 4 hours ago

You don't have to benchmax everything, just the benchmarks in the right social circles

by UltraSane 5 hours ago

It is funny to think that Jeff Dean personally worked to optimize the pelican riding a bike benchmark.

by MrCheeze 6 hours ago

Does anyone understand why LLMs have gotten so good at this? Their ability to generate accurate SVG shapes seems to greatly outshine what I would expect, given their mediocre spatial understanding in other contexts.

by tedsanders 5 hours ago

A few thoughts:

- One thing to be aware of is that LLMs can be much smarter than their ability to articulate that intelligence in words. For example, GPT-3.5 Turbo was beastly at chess (1800 elo?) when prompted to complete PGN transcripts, but if you asked it questions in chat, its knowledge was abysmal. LLMs don't generalize as well as humans, and sometimes they can have the ability to do tasks without the ability to articulate things that feel essential to the tasks (like answering whether the bicycle is facing left or right).

- Secondly, what has made AI labs so bullish on future progress over the past few years is that they see how little work it takes to get their results. Often, if an LLM sucks at something that's because no one worked on it (not always, of course). If you directly train a skill, you can see giant leaps in ability with fairly small effort. Big leaps in SVG creation could be coming from relatively small targeted efforts, where none existed before.

by emp17344 3 hours ago

We’re literally at the point where trillions of dollars have been invested in these things and the surrounding harnesses and architecture, and they still can’t do economically useful work on their own. You’re way too bullish here.

by dbeardsl 3 hours ago

Neither do cars until very recently. A tool doesn't have to be unsupervised to be useful.

by simonw 6 hours ago

My best guess is that the labs put a lot of work into HTML and CSS spatial stuff because web frontend is such an important application of the models, and those improvements leaked through to SVG as well.

by mitkebes 4 hours ago

All models have improved, but from my understanding, Gemini is the main one that was specifically trained on photos/video/etc in addition to text. Other models like earlier chatgpt builds would use plugins to handle anything beyond text, such as using a plugin to convert an image into text so that chatgpt could "see" it.

Gemini was multimodal from the start, and is naturally better at doing tasks that involve pictures/videos/3d spatial logic/etc.

The newer chatgpt models are also now multimodal, which has probably helped with their svg art as well, but I think Gemini still has an edge here

by pknerd 5 hours ago

> Does anyone understand why LLMs have gotten so good at this?

Added more IF/THEN/ELSE conditions.

by kridsdale3 5 hours ago

More wires and jumpers on the breadboard.

by sam_1421 7 hours ago

Models are soon going to start benchmaxxing generating SVGs of pelicans on bikes

by cbsks 6 hours ago

That’s Simon’s goal. “All I’ve ever wanted from life is a genuinely great SVG vector illustration of a pelican riding a bicycle. My dastardly multi-year plan is to trick multiple AI labs into investing vast resources to cheat at my benchmark until I get one.”

https://simonwillison.net/2025/Nov/13/training-for-pelicans-...

by travisgriggs 5 hours ago

So once that's achieved, I wonder how well it deals with unexpected variations. E.g.

"Give me an illustration of a bicycle riding by a pelican"

"Give me an illustration of a bicycle riding over a pelican"

"Give me an illustration of a bicycle riding under a flying pelican"

So on and so forth. Or will it start to look like the Studio C sketch about Lobster Bisque: https://youtu.be/A2KCGQhVRTE
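
Variations like these are easy to generate mechanically. A small sketch, assuming a hypothetical `prompt_variations` helper and a hand-picked template list (the first template is an addition for the baseline case; the other three follow the prompts above):

```python
from itertools import permutations

# Hand-picked templates. The first is the baseline "X riding a Y" form;
# the remaining three mirror the example prompts in this comment.
TEMPLATES = [
    "Give me an illustration of a {a} riding a {b}",
    "Give me an illustration of a {a} riding by a {b}",
    "Give me an illustration of a {a} riding over a {b}",
    "Give me an illustration of a {a} riding under a flying {b}",
]


def prompt_variations(nouns):
    """Fill every template with every ordered pair of distinct nouns."""
    return [
        template.format(a=a, b=b)
        for a, b in permutations(nouns, 2)
        for template in TEMPLATES
    ]


prompts = prompt_variations(["pelican", "bicycle"])
for prompt in prompts:
    print(prompt)
```

Two nouns and four templates give eight prompts, half of which put the bicycle in the rider's seat. Whether a model handles those as well as the familiar ordering is the interesting part.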

by embedding-shape 7 hours ago

Soon? I'd be willing to bet it's been included in the training set for at least 6 months by now. Not so obviously that it always generates perfect pelicans on bikes, but sufficiently for the "minibench" to be less useful today than in the past.

by Rudybega 3 hours ago

If only there were some way to test it, like swapping the two nouns in the sentence. Alas.

by jsheard 7 hours ago

Simon's been doing this exact test for nearly 18 months now; if vendors want to benchmaxx it then they've had more than enough time to do so already.

by stri8ted 6 hours ago

Exactly. As far as I'm concerned, the benchmark is useless. It's way too easy and rewarding to train on it.

by bonoboTP 4 hours ago

It's just an in-joke, he doesn't intend it as a serious benchmark anymore. I think it's funny.

by Legend2440 6 hours ago

Y'all are way too skeptical, no matter what cool thing AI does you'll make up an excuse for how they must somehow be cheating.

by toraway 5 hours ago

Jeff Dean literally featured it in a tweet announcing the model. Personally it feels absurd to believe they've put absolutely no thought into optimizing this type of SVG output given the disproportionate amount of attention devoted to a specific test for 1 yr+.

I wouldn't really even call it "cheating" since it has improved models' ability to generate artistic SVG imagery more broadly but the days of this being an effective way to evaluate a model's "interdisciplinary" visual reasoning abilities have long since passed, IMO.

It's become yet another example in the ever growing list of benchmaxxed targets whose original purpose was defeated by teaching to the test.

https://x.com/jeffdean/status/2024525132266688757?s=46&t=ZjF...

by arcatech 5 hours ago

Or maybe you’re too trusting of companies who have already proven to not be trustworthy?

by pixl97 6 hours ago

I mean if you want to make your own benchmark, simply don't make it public and don't do it often. If your salamander on skis or whatever gets better with time it likely has nothing to do with being benchmaxxed.

by ks2048 4 hours ago

Forget the paperclip maximizer - AGI will turn the whole world into pelicans on bikes.

by SoKamil 6 hours ago

It seems they trained the model to output good SVGs.

In their blog post[1], the first use case they mention is SVG generation. Thus, it might not be any indicator at all anymore.

[1] https://blog.google/innovation-and-ai/models-and-research/ge...

by Arcuru 7 hours ago

Did you stop using the more detailed prompt? I think you described it here: https://simonwillison.net/2025/Nov/18/gemini-3/

by simonw 6 hours ago

It seems to be having capacity problems right now but I'll run that as soon as I can get it to work.

by simonw 4 hours ago

Pretty solid: https://gist.github.com/simonw/f5c893203621a7631ff178d9093a8...

by culi 3 hours ago

Cost per task has increased 4.2x but their ARC-AGI-2 score went from 33.6% to 77.1%

Cost per task is still significantly lower than Opus. Even Opus 4.5

https://arcprize.org/leaderboard
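
Taking the figures quoted above at face value (they are not re-checked against the leaderboard here), the cost per percentage point works out as follows. The variable names are just for illustration.

```python
# Figures as quoted in this comment, not independently verified.
cost_ratio = 4.2                   # new cost per task / old cost per task
old_score, new_score = 33.6, 77.1  # ARC-AGI-2 scores in percent

# How much the score improved, as a multiple of the old score.
score_ratio = new_score / old_score

# How much more each percentage point of score now costs.
cost_per_point_ratio = cost_ratio / score_ratio

print(f"score improved {score_ratio:.2f}x")
print(f"cost per point changed {cost_per_point_ratio:.2f}x")
```

So on these numbers the score rose about 2.29x while the cost per point rose about 1.83x: each point got more expensive, but by much less than the headline 4.2x.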

by WarmWash 7 hours ago

Less pretty and more practical, it's really good at outputting circuit designs as SVG schematics.

https://www.svgviewer.dev/s/dEdbH8Sw

by InitialLastName 6 hours ago

I don't know which part of this is the prompt and which is the output, but that's a pretty bad schematic (for both aesthetic and circuit-design reasons).

by WarmWash 6 hours ago

The prompts were doing the design, reference voltage, hysteresis, output stage, all the maths and then the SVG is from asking the model to take all that and the current BOM to make an SVG schematic of it. In the past models would just output totally incoherent messes of lines and shapes.

I did a larger circuit too that this is part of, but it's not really for sharing online.

by svnt 6 hours ago

Yes but you concede it is a schematic.

by tadfisher 5 hours ago

How far we have come!

by 0_____0 6 hours ago

that's pretty amazing for an LLM but as an EE, if my intern did this i would sigh inwardly and pull up some existing schematics for some brief guidance on symbol layout.

by brikym 2 hours ago

Another great benchmark would be to convert a raster image of a logo into SVG. I've yet to find a good tool for this that produces accurate smooth lines.

by AmazingTurtle 6 hours ago

At this point, the pelican benchmark became so widely used that there must be high quality pelicans in the dataset, I presume. What about generating an okapi on a bicycle instead?

by ascorbic 4 hours ago

Loads of examples here https://x.com/jeffdean/status/2024525132266688757

by tromp 6 hours ago

Or, even more challenging, an okapi on a recumbent ?!

by steve_adams_86 7 hours ago

Ugh, the gears and chain don't mesh and there's no sprocket on the rear hub

But seriously, I can't believe LLMs are able to one-shot a pelican on a bicycle this well. I wouldn't have guessed this was going to emerge as a capability from LLMs 6 years ago. I see why it does now, but... It still amazes me that they're so good at some things.

by emp17344 6 hours ago

Is this capability “emergent”, or do AI firms specifically target SVG generation in order to improve it? How would we be able to tell?

by steve_adams_86 5 hours ago

I asked myself the same thing as I typed that comment, and I'm not sure what the answer is. I don't think models are specifically trained on this (though of course they're trained on how to generate SVGs in general), but I'm prepared to be wrong.

I have a feeling the most 'emergent' aspect was that LLMs have generally been able to produce coherent SVG for quite a while, likely without specific training at first. Since then I suspect there has been more tailored training because improvements have been so dramatic. Of course it makes sense that text-based images using very distinct structure and properties could be manipulated reasonably well by a text-based language model, but it's still fascinating to me just how well it can work.

Perhaps what's most incredible about it is how versatile human language is, even when it lacks so many dimensions as bits on a machine. Yet it's still cool that we can resurrect those bits at rest and transmogrify them back into coherent projections of photons from a screen.

I don't think LLMs are AGI or about to completely flip the world upside down or whatever, but it seems undeniably magical when you break it down.

by simonw 6 hours ago

Google specifically boast about their SVG performance in the announcement post: https://blog.google/innovation-and-ai/models-and-research/ge...

You can try any combination of animal on vehicle to confirm that they likely didn't target pelicans directly though.

by 0_____0 6 hours ago

next time you host a party, have people try to draw a bicycle on your whiteboard (you have a whiteboard in your house right? you should, anyway...)

human adults are generally quite bad at drawing them, unless they spend a lot of time actually thinking about bicycles as objects

by 542354234235 6 hours ago

They are, and it is very funny.

https://www.behance.net/gallery/35437979/Velocipedia

by iammattmurphy 5 hours ago

Fantastic post, thanks for that.

by emp17344 6 hours ago

What’s your point? Yes, humans fail sometimes, as do AI models. Are you trying to imply that, in light of this, AI is now as capable as human beings? If so, that conclusion doesn’t follow logically.

by 0_____0 6 hours ago

it's not a loaded point, i just think it's funny that humans typically cannot one-shot this. and it will make your friends laugh

by HPsquared 6 hours ago

And the left leg is straight while the right leg is bent.

EDIT: And the chain should pass behind the seat stay.

by bredren 7 hours ago

What is that, a snack in the basket?

by sigmar 7 hours ago

"integrating a bicycle basket, complete with a fish for the pelican... also ensuring the basket is on top of the bike, and that the fish is correctly positioned with its head up... basket is orange, with a fish inside for fun."

how thoughtful of the ai to include a snack. truly a "thanks for all the fish"

by defen 6 hours ago

A pelican already has an integrated snack-holder, though. It wouldn't need to put it in the basket.

by SauntSolaire 4 hours ago

That one's full too

by troymc 1 hour ago

The number of snacks in the basket is a random variable with a Poisson distribution.

by WarmWash 7 hours ago

A fish for the road

by tarr11 6 hours ago

What do you think this particular prompt is evaluating for?

The more popular these particular evals are, the more likely the model will be trained for them.

by Gander5739 5 hours ago

See https://simonwillison.net/2025/Nov/13/training-for-pelicans-...

by TZubiri 4 hours ago

You think they are able to see their output and iterate on it? Or is it pure token generation?

by infthi 7 hours ago

Wonder when we will get something other than a side view.

by mikepurvis 6 hours ago

That would be especially challenging for vector output. I tried just now on ChatGPT 5.2 to jump straight to an image, with this prompt:

"make me a cartoon image of a pelican riding a bicycle, but make it from a front 3/4 view, that is riding toward the viewer."

The result was basically a head-on view, but I expect if you then put that back in and said, "take this image and vectorize it as an SVG" you'd have a much better time than trying to one-shot the SVG directly from a description.

... but of course, if that's so, then what's preventing the model from being smart enough to identify this workflow and follow it on its own to get the task completed?

by calny 7 hours ago

Great pelican but what’s up with that fish in the basket?

by coldtea 7 hours ago

It's a pelican. What do you expect a pelican to have in his bike's basket?

It's a pretty funny and coherent touch!

by embedding-shape 6 hours ago

> What do you expect a pelican to have in his bike's basket?

Probably stuff it cannot fit in the gullet, or doesn't want there (think trash). I wouldn't expect a pelican to stash fish there, that's for sure.

by kridsdale3 5 hours ago

You never travel with a snack fish for later on? He's going to be burning calories.

by nicr_22 5 hours ago

Yeah, why only _one_ fish?

It's obvious that pelican is riding long distance, no way a single fish is sufficiently energy dense for more than a few miles.

Can't the model do basic math???

by gavinray 7 hours ago

Where else are cycling pelicans meant to keep their fish?

by calny 6 hours ago

I get it, I just meant the fish is poorly done, when I’d have guessed it would be a relatively simple part. Maybe the black dot eye is misplaced, idk.

by mohsen1 7 hours ago

Is there something in your prompt about hats? Why is the pelican always wearing a hat recently?!

by bigfishrunning 7 hours ago

At this point, i think maybe they're training on all of the previous pelicans, and one of them decided to put a hat on it?

Disclaimer: This is an unsubstantiated claim that i made up

by xnx 7 hours ago

Not even animated? This is 2026.

by readitalready 7 hours ago

Jeff Dean just posted an animated version: https://x.com/JeffDean/status/2024525132266688757

by benbreen 6 hours ago

One underrated thing about the recent frontier models, IMO, is that they are obviating the need for image gen as a standalone thing. Opus 4.6 (and apparently 3.1 Pro as well) doesn't have the ability to generate images but it is so good at making SVG that it basically doesn't matter at this point. And the benefit of SVG is that it can be animated and interactive.

I find this fascinating because it literally just happened in the past few months. Up until ~summer of 2025, the SVG these models made was consistently buggy and crude. By December of 2026, I was able to get results like this from Opus 4.5 (Henry James: the RPG, made almost entirely with SVG): https://the-ambassadors.vercel.app

And now it looks like Gemini 3.1 Pro has vaulted past it.

by embedding-shape 6 hours ago

> doesn't have the ability to generate images but it is so good at making SVG that it basically doesn't matter at this point

Yeah, since the invention of vector images, suddenly no one cares about raster images anymore.

Obviously not true, but that's how your comment reads right now. "Image" is very different from "Image", and one doesn't automagically replace the other.

by buu700 6 hours ago

This reminds me of the time I printed a poster with a blown up version of some image for a high school history project. A classmate asked how I did it, so I started going on about how I used software to vectorize the image. Turned out he didn't care about any of that and just wanted the name of the print shop.

by Der_Einzige 5 hours ago

You have no idea how badly I want to be teleported to the alternative world where VECTOR COMPUTING was the dominant form of computers.

We had high framerate (yes it was variable), bright, beautiful displays in the 1980s with the Vectrex.

by cachius 6 hours ago

2025 that is

by bigfishrunning 7 hours ago

That Ostrich Tho

by cachius 6 hours ago

That Tires Tho

by DonHopkins 5 hours ago

How about STL files for 3d printing pelicans!

by baq 5 hours ago

Harder: the bike must work

Hardest: the pelican must work

by benatkin 6 hours ago

I used the AI studio link and tried running it with the temperature set to 1.75: https://jsbin.com/locodaqovu/edit?html,output

by saberience 7 hours ago

I hope we keep beating this dead horse some more, I'm still not tired of it.
