> there are approximately 200k common nouns in English, and then we square that, we get 40 billion combinations. At one second per, that's ~1200 years, but then if we parallelize it on a supercomputer that can do 100,000 per second that would only take 3 days. Given that ChatGPT was trained on all of the Internet and every book written, I'm not sure that still seems infeasible.
There are estimated to be 100 or so prepositions in English. That gets you to 4 trillion combinations.
It's called "The science of cycology: Failures to understand how everyday objects work" by Rebecca Lawson.
https://link.springer.com/content/pdf/10.3758/bf03195929.pdf
A fundamental part of the job is being able to break down problems from large to small, reason about them, and talk about how you do it, usually with minimal context or without deep knowledge in all aspects of what we do. We're abstraction artists.
That question wouldn't be fundamentally different than any other architecture question. Start by drawing big, hone in on smaller parts, think about edge cases, use existing knowledge. Like bread and butter stuff.
I much more question your reaction to the joke than using it as a hypothetical interview question. I actually think it's good. And if it filters out people that have that kind of reaction then it's excellent. No one wants to work with the incurious.
> Without a clear indicator of the author's intent, any parodic or sarcastic expression of extreme views can be mistaken by some readers for a sincere expression of those views.
https://www.freepik.com/free-vector/cyclist_23714264.htm
https://www.freepik.com/premium-vector/bicycle-icon-black-li...
Or missing/broken pedals:
https://www.freepik.com/premium-vector/bicycle-silhouette-ic...
https://www.freepik.com/premium-vector/bicycle-silhouette-ve...
http://freepik.com/premium-vector/bicycle-silhouette-vector-...
Also, is it bad that I almost immediately noticed that both of the pelican's legs are on the same side of the bicycle, but I had to look up an image on Wikipedia to confirm that they shouldn't have long necks?
Also, have you tried iterating prompts on this test to see if you can get more realistic results? (How much does it help to make them look up reference images first?)
I think when I first tried this I iterated a few times to get to something that reliably output SVG, but honestly I didn't keep the notes I should ahve.
They got surprisingly far, but i did need to iterate a few times to have it build tools that would check for things like; dont put walls on roads or water.
What I think might be the next obstacle is self-knowledge. The new agents seem to have picked up ever more vocabulary about their context and compaction, etc.
As a next benchmark you could try having 1 agent and tell it to use a coding agent (via tmux) to build you a pelican.
I asked Opus 4.6 for a pelican riding a recumbent bicycle and got this.
$200 * 1,000 = $200k/month.
I'm not saying they are, but to say that they aren't with such certainty, when money is on the line; unless you have some insider knowledge you'd like to share with the rest of the class, it seems like an questionable conclusion.
I don't think it quite captures their majesty: https://en.wikipedia.org/wiki/K%C4%81k%C4%81p%C5%8D
Would you mind sharing which benchmarks you think are useful measures for multimodal reasoning?