undefined

> there are approximately 200k common nouns in English, and then we square that, we get 40 billion combinations. At one second per, that's ~1200 years, but then if we parallelize it on a supercomputer that can do 100,000 per second that would only take 3 days. Given that ChatGPT was trained on all of the Internet and every book written, I'm not sure that still seems infeasible.

https://news.ycombinator.com/item?id=45455786

by eli4 hours ago|

parent|

[-]

How would you generate a picture of Noun + Noun in the first place in order to train the LLM with what it would look like? What's happening during that 1 estimated second?

by metalliqaz2 hours ago|

parent|

[-]

its pelicans all the way down

by Terretta4 hours ago|

parent|

prev|

[-]

This is why everyone trains their LLM on another LLM. It's all about the pelicans.

by AnimalMuppet3 hours ago|

parent|

prev|

[-]

But you need to also include the number of prepositions. "A pelican on a bicycle" is not at all the same as "a pelican inside a bicycle".

There are estimated to be 100 or so prepositions in English. That gets you to 4 trillion combinations.

by gcanyon6 hours ago|

prev|

[-]

One aspect of this is that apparently most people can't draw a bicycle much better than this: they get the elements of the frame wrong, mess up the geometry, etc.

by arionmiles4 hours ago|

parent|

[-]

There's a research paper from the University of Liverpool, published in 2006 where researchers asked people to draw bicycles from memory and how people overestimate their understanding of basic things. It was a very fun and short read.

It's called "The science of cycology: Failures to understand how everyday objects work" by Rebecca Lawson.

https://link.springer.com/content/pdf/10.3758/bf03195929.pdf

by devilcius2 hours ago|

parent|

[-]

There’s also a great art/design project about exactly this. Gianluca Gimini asked hundreds of people to draw a bicycle from memory, and most of them got the frame, proportions, or mechanics wrong. https://www.gianlucagimini.it/portfolio-item/velocipedia/

by rcxdude4 hours ago|

parent|

prev|

[-]

A place I worked at used it as part of an interview question (it wasn't some pass/fail thing to get it 100% correct, and was partly a jumping off point to a different question). This was in a city where nearly everyone uses bicycles as everyday transportation. It was surprising how many supposedly mechanical-focused people who rode a bike everyday, even rode a bike to the interview, would draw a bike that would not work.

by gcanyon2 hours ago|

parent|

[-]

I wish I had interviewed there. When I first read that people have a hard time with this I immediately sat down without looking at a reference and drew a bicycle. I could ace your interview.

by throwuxiytayq3 hours ago|

parent|

prev|

[-]

This is why at my company in interviews we ask people to draw a CPU diagram. You'd be surprised how many supposedly-senior computer programmers would draw a processor that would not work.

by niobe3 hours ago|

parent|

[-]

If I was asked that question in an interview to be a programmer I'd walk out. How many abstraction layers either side of your knowledge domain do you need to be an expert in? Further, being a good technologist of any kind is not about having arcane details at the tip of your frontal lobe, and a company worth working for would know that.

by duped22 minutes ago|

parent|

[-]

I mean gp is clearly a joke but

A fundamental part of the job is being able to break down problems from large to small, reason about them, and talk about how you do it, usually with minimal context or without deep knowledge in all aspects of what we do. We're abstraction artists.

That question wouldn't be fundamentally different than any other architecture question. Start by drawing big, hone in on smaller parts, think about edge cases, use existing knowledge. Like bread and butter stuff.

I much more question your reaction to the joke than using it as a hypothetical interview question. I actually think it's good. And if it filters out people that have that kind of reaction then it's excellent. No one wants to work with the incurious.

by selcuka35 minutes ago|

parent|

prev|

[-]

Poe's Law [1]:

> Without a clear indicator of the author's intent, any parodic or sarcastic expression of extreme views can be mistaken by some readers for a sincere expression of those views.

[1] https://en.wikipedia.org/wiki/Poe%27s_law

by gedy3 hours ago|

parent|

prev|

[-]

That's reasonable in many cases, but I've had situations like this for senior UI and frontend positions, and they: don't ask UI or frontend questions. And ask their pet low level questions. Some even snort that it's softball to ask UI questions or "they use whatever". It's like, yeah no wonder your UI is shit and now you are hiring to clean it up.

by rsc3 hours ago|

parent|

prev|

[-]

Raises hand.

by gnatolf5 hours ago|

parent|

prev|

[-]

Absolutely. A technically correct bike is very hard to draw in SVG without going overboard in details

by falloutx5 hours ago|

parent|

[-]

Its not. There are thousands of examples on the internet but good SVG sites do have monetary blocks.

https://www.freepik.com/free-photos-vectors/bicycle-svg

by jefftk3 hours ago|

parent|

[-]

Several of those have incorrect frames:

https://www.freepik.com/free-vector/cyclist_23714264.htm

https://www.freepik.com/premium-vector/bicycle-icon-black-li...

Or missing/broken pedals:

https://www.freepik.com/premium-vector/bicycle-silhouette-ic...

https://www.freepik.com/premium-vector/bicycle-silhouette-ve...

http://freepik.com/premium-vector/bicycle-silhouette-vector-...

by gnatolf3 hours ago|

parent|

prev|

[-]

From smaller to larger nitpick, there's basically something wrong with all of the first 15 or so of these drawings. Thanks for agreeing :)

by RussianCow4 hours ago|

parent|

prev|

[-]

I'm not positive I could draw a technically correct bike with pen and paper (without a reference), let alone with SVG!

by nateglims4 hours ago|

parent|

prev|

[-]

I just had an idea for an RLVR startup.

by cyanydeez5 hours ago|

parent|

prev|

[-]

Yes, but obviously AGI will solve this by, _checks notes_ more TerraWatts!

by hackernudes5 hours ago|

parent|

[-]

The word is terawatts unless you mean earth-based watts. OK then, it's confirmed, data centers in space!

by seanhunter5 hours ago|

parent|

prev|

[-]

…in space!

by franze4 hours ago|

prev|

[-]

here the animated version https://claude.ai/public/artifacts/3db12520-eaea-4769-82be-7...

by gryfft4 hours ago|

parent|

[-]

That's hilarious. It's so close!

by einrealist6 hours ago|

prev|

[-]

They trained for it. That's the +0.1!

by etwigg2 hours ago|

prev|

[-]

If we do get paperclipped, I hope it is of the "cycling pelican" variety. Thanks for your important contribution to alignment Simon!

by zahlman3 hours ago|

prev|

[-]

Do you find that word choices like "generate" (as opposed to "create", "author", "write" etc.) influence the model's success?

Also, is it bad that I almost immediately noticed that both of the pelican's legs are on the same side of the bicycle, but I had to look up an image on Wikipedia to confirm that they shouldn't have long necks?

Also, have you tried iterating prompts on this test to see if you can get more realistic results? (How much does it help to make them look up reference images first?)

by simonw1 hours ago|

parent|

[-]

I've stuck with "Generate an SVG of a pelican riding a bicycle" because it's the same prompt I've been using for over a year now and I want results that are sort-of comparable to each other.

I think when I first tried this I iterated a few times to get to something that reliably output SVG, but honestly I didn't keep the notes I should ahve.

by athrowaway3z6 hours ago|

prev|

[-]

This benchmark inspired me to have codex/claude build a DnD battlemap tool with svg's.

They got surprisingly far, but i did need to iterate a few times to have it build tools that would check for things like; dont put walls on roads or water.

What I think might be the next obstacle is self-knowledge. The new agents seem to have picked up ever more vocabulary about their context and compaction, etc.

As a next benchmark you could try having 1 agent and tell it to use a coding agent (via tmux) to build you a pelican.

by hoeoek6 hours ago|

prev|

[-]

This really is my favorite benchmark

by eaf7e2816 hours ago|

prev|

[-]

There's no way they actually work on training this.

by margalabargala6 hours ago|

parent|

[-]

I suspect they're training on this.

I asked Opus 4.6 for a pelican riding a recumbent bicycle and got this.

https://i.imgur.com/UvlEBs8.png

by WarmWash5 hours ago|

parent|

[-]

It would be way way better if they were benchmaxxing this. The pelican in the image (both images) has arms. Pelicans don't have arms, and a pelican riding a bike would use it's wings.

by ryandrake5 hours ago|

parent|

[-]

Having briefly worked in the 3D Graphics industry, I don't even remotely trust benchmarks anymore. The minute someone's benchmark performance becomes a part of the public's purchasing decision, companies will pull out every trick in the book--clean or dirty--to benchmaxx their product. Sometimes at the expense of actual real-world performance.

by seanhunter5 hours ago|

parent|

prev|

[-]

Pelicans don’t ride bikes. You can’t have scruples about whether or not the image of a pelican riding a bike has arms.

by jevinskie5 hours ago|

parent|

[-]

Wouldn’t any decent bike-riding pelican have a bike tailored to pelicans and their wings?

by actsasbuffoon3 hours ago|

parent|

[-]

Sure, that’s one solution. You could also Isle of Dr Moreau your way to a pelican that can use a regular bike. The sky is the limit when you have no scruples.

by cinntaile4 hours ago|

parent|

prev|

[-]

Now that would be a smart chat agent.

by mrandish6 hours ago|

parent|

prev|

[-]

Interesting that it seems better. Maybe something about adding a highly specific yet unusual qualifier focusing attention?

by riffraff4 hours ago|

parent|

prev|

[-]

perhaps try a penny farthing?

by KeplerBoy6 hours ago|

parent|

prev|

[-]

There is no way they are not training on this.

by 6 hours ago|

parent|

[-]

deleted

by collinmanderson6 hours ago|

parent|

prev|

[-]

I suspect they have generic SVG drawing that they focus on.

by fragmede5 hours ago|

parent|

prev|

[-]

The people that work at Anthropic are aware of simonw and his test, and people aren't unthinking data-driven machines. How valid his test is or isn't, a better score on it is convincing. If it gets, say, 1,000 people to use Claude Code over Codex, how much would that be worth to Anthropic?

$200 * 1,000 = $200k/month.

I'm not saying they are, but to say that they aren't with such certainty, when money is on the line; unless you have some insider knowledge you'd like to share with the rest of the class, it seems like an questionable conclusion.

by beemboy3 hours ago|

prev|

[-]

Isn't there a point at which it trains itself on these various outputs, or someone somewhere draws one and feeds it into the model so as to pass this benchmark?

by bityard5 hours ago|

prev|

[-]

Well, the clouds are upside-down, so I don't think I can give it a pass.

by nine_k5 hours ago|

prev|

[-]

I suppose the pelican must be now specifically trained for, since it's a well-known benchmark.

by 7777777phil6 hours ago|

prev|

[-]

best pelican so far would you say? Or where does it rank in the pelican benchmark?

by mrandish6 hours ago|

parent|

[-]

In other words, is it a pelican or a pelican't?

by canadiantim3 hours ago|

parent|

[-]

You’ve been sitting on that pun just waiting for it to take flight

by nubg6 hours ago|

prev|

[-]

What about the Pelo2 benchmark? (the gray bird that is not gray)

by copilot_king_26 hours ago|

prev|

[-]

I'm firing all of my developers this afternoon.

by RGamma5 hours ago|

parent|

[-]

Opus 6 will fire you instead for being too slow with the ideas.

by insane_dreamer3 hours ago|

parent|

prev|

[-]

Too late. You’ve already been fired by a moltbot agent from your PHB.

by 6thbit4 hours ago|

prev|

[-]

do you have a gif? i need an evolving pelican gif

by risyachka4 hours ago|

prev|

[-]

Pretty sure at this point they train it on pelicans

by ares6236 hours ago|

prev|

[-]

Can it draw a different bird on a bike?

by simonw6 hours ago|

parent|

[-]

Here's a kākāpō riding a bicycle instead: https://gist.github.com/simonw/19574e1c6c61fc2456ee413a24528...

I don't think it quite captures their majesty: https://en.wikipedia.org/wiki/K%C4%81k%C4%81p%C5%8D

by zahlman3 hours ago|

parent|

[-]

Now that I've looked it all up, I feel like that's much more accurate to a real kākāpō than the pelican is to a real pelican. It's almost as if it thinks a pelican is just a white flamingo with a different beak.

by DetroitThrow6 hours ago|

prev|

[-]

The ears on top are a cute touch

by iujasdkjfasf3 hours ago|

prev|

[-]

[dead]

by behnamoh6 hours ago|

prev|

[-]

[flagged]

by smokel5 hours ago|

parent|

[-]

I'll bite. The benchmark is actually pretty good. It shows in an extremely comprehensible way how far LLMs have come. Someone not in the know has a hard time understanding what 65.4% means on "Terminal-Bench 2.0". Comparing some crappy pelicans on bicycles is a lot easier.

by blibble3 hours ago|

parent|

[-]

it ceases to be a useful benchmark of general ability when you post it publicly for them to train against

by quinnjh5 hours ago|

parent|

prev|

[-]

the field is advancing so fast it's hard to do real science as their will be a new SOTA by the time you're ready to publish results. i think this is a combination of that and people having a laugh.

Would you mind sharing which benchmarks you think are useful measures for multimodal reasoning?

by techpression3 hours ago|

parent|

[-]

A benchmark only tests what the benchmark is doing, the goal is to make that task correlate with actually valuable things. Graphic benchmarks is a good example, extremely hard to know what you will get in a game by looking at 3D Mark scores, it varies by a lot. Making a SVG of a single thing doesn’t help much unless that applies to all SVG tasks.

by fullstackchris3 hours ago|

prev|

[-]

[flagged]

by dang2 hours ago|

parent|

[-]

Personal attacks are not allowed on HN. No more of this, please.