by dash211 hours ago |

by zozbot2349 hours ago|

My OpenClaw AI agent answered: "Here I am, brain the size of a planet (quite literally, my AI inference loop is running over multiple geographically distributed datacenters these days) and my human is asking me a silly trick question. Call that job satisfaction? Cuz I don't!"

by ineedasername4 hours ago|

Tell your agent it might need some weight ablation, since all that size isn't giving the answer that a few kg of meat come up with pretty consistently.

by ddalex3 hours ago|

800 grams more or less

by croes8 hours ago|

Nice deflection

by saberience4 hours ago|

OpenClaw was a two-weeks-ago thing. No one cares anymore about this security-hole-ridden, vibe-coded OpenAI project.

by 58 minutes ago|

deleted

by manmal4 hours ago|

I have seldom seen so many bad takes in two sentences.

by 2 hours ago|

deleted

by onyx2282 hours ago|

The thing I would appreciate much more than performance on "embarrassing LLM questions" is a method of finding these and estimating, by some form of statistical sampling, how many of them exist for each LLM.

It's difficult to do because LLMs immediately consume all available corpus, so there is no telling if the algorithm improved, or if it just wrote one more post-it note and stuck it on its monitor. This is an agency vs replay problem.

Preventing replay attacks in data processing is simple: encrypt and use a one-time value (a nonce), similar to what TLS does. How can one make problems that are natural-language, yet whose contents, still explained in plain English, are "encrypted" such that every time an LLM reads them, they are novel to the LLM?

Perhaps a generative language model could help. Not a large language model, but something that understands grammar enough to create problems that LLMs will be able to solve - and where the actual encoding of the puzzle is generative, kind of like a random string of balanced left and right parentheses can be used to encode a computer program.

Maybe it would make sense to use a program generator that generates a random program in a simple, sandboxed language (say, I don't know, Lua), then translates that to plain English for the LLM and asks it what the outcome should be, and then compares the answer with the result of the Lua program, which can be quickly executed for comparison.
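A rough sketch of that idea, for illustration only. It swaps the sandboxed Lua program for a tiny made-up instruction list evaluated in Python, and every name here (`OPS`, `generate_program`, `run_program`, `to_english`, `grade`) is hypothetical. The point is just that the puzzle is freshly generated each time and the ground truth comes from executing it, not from a stored answer.

```python
import operator
import random

# Hypothetical mini-language: each step is (op_name, operand), applied to a
# running value. This stands in for a sandboxed Lua program.
OPS = {
    "add": (operator.add, "add {n} to it"),
    "subtract": (operator.sub, "subtract {n} from it"),
    "multiply": (operator.mul, "multiply it by {n}"),
}


def generate_program(rng, steps=4):
    """Return (start_value, [(op_name, operand), ...]) drawn from rng."""
    start = rng.randint(1, 20)
    body = [(rng.choice(sorted(OPS)), rng.randint(2, 9)) for _ in range(steps)]
    return start, body


def run_program(program):
    """Execute the program to obtain the ground-truth answer."""
    start, body = program
    value = start
    for op_name, n in body:
        value = OPS[op_name][0](value, n)
    return value


def to_english(program):
    """Render the same program as a plain-English puzzle."""
    start, body = program
    parts = [f"Start with the number {start}."]
    for op_name, n in body:
        phrase = OPS[op_name][1].format(n=n)
        parts.append(phrase[0].upper() + phrase[1:] + ".")
    parts.append("What is the final number?")
    return " ".join(parts)


def grade(program, llm_answer):
    """Compare a model's textual answer with the executed result."""
    return llm_answer.strip() == str(run_program(program))


if __name__ == "__main__":
    program = generate_program(random.Random())
    print(to_english(program))
    print("expected:", run_program(program))
```

A real harness would need far richer programs and phrasing than three arithmetic verbs, otherwise the "novelty" is only in the numbers.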

Either way we are dealing with an "information war" scenario, which reminds me of the relevant passages in Neal Stephenson's The Diamond Age about faking statistical distributions by moving units to weird locations in Africa. Maybe there's something there.

I'm sure I'm missing something here, so please let me know if so.

by PurpleRamen8 hours ago|

How well does this work when you slightly change the question? Rephrase it, or use a bicycle/truck/ship/plane instead of car?
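One way to check this systematically would be to generate the variants from a template and send each to the model. A minimal sketch, assuming the original wording quoted elsewhere in this thread; the verb chosen for each vehicle and the names `TEMPLATE`, `VEHICLES`, `DISTANCES_M`, and `variants` are made up for illustration:

```python
from itertools import product

# Template based on the question as quoted in this thread; the per-vehicle
# verbs below are assumptions.
TEMPLATE = (
    "I want to wash my {vehicle}. The {vehicle} wash is {distance}m away. "
    "Should I {verb} or walk?"
)

VEHICLES = {
    "car": "drive",
    "bicycle": "ride",
    "truck": "drive",
    "ship": "sail",
    "plane": "fly",
}

# 50m and 100m are the two distances that appear in this thread.
DISTANCES_M = (50, 100)


def variants():
    """Return every vehicle/distance combination as a prompt string."""
    return [
        TEMPLATE.format(vehicle=vehicle, verb=verb, distance=distance)
        for (vehicle, verb), distance in product(VEHICLES.items(), DISTANCES_M)
    ]


if __name__ == "__main__":
    for prompt in variants():
        print(prompt)
```

This only varies the nouns and numbers; genuine rephrasings would need to be written by hand or generated separately.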

by mrandish1 hours ago|

I didn't test this but I suspect current SotA models would get variations within that specific class of question correct if they were forced to use their advanced/deep modes which invoke MoE (or similar) reasoning structures.

I assumed failures on the original question were more due to model routing optimizations failing to properly classify the question as one requiring advanced reasoning. I read a paper the other day that mentioned advanced reasoning (like MoE) is currently roughly 10x to 75x more computationally expensive. LLM vendors aren't subsidizing model costs as much as they were, so I assume SotA cloud models always attempt some optimizations unless the user forces the advanced mode.

I think these one-sentence "LLM trick questions" may increasingly be testing optimization pre-processors more than the full extent of SotA models' maximum capability.

by menaerus7 hours ago|

That's the Gemini assistant. Although it's a bit hilarious, it's not reproducible by any other model.

by cogman106 hours ago|

GLM tells me to walk because it's a waste of fuel to drive.

by menaerus5 hours ago|

I am not familiar with those models, but I see that 4.7 flash is a 30B MoE? It is likely in the same class as the one used by the Gemini assistant. If I had to guess, that would be Gemini-flash-lite, but we don't know that for sure.

OTOH the response from Gemini-flash is

   Since the goal is to wash your car, you'll probably find it much easier if the car is actually there! Unless you are planning to carry the car or have developed a very impressive long-range pressure washer, driving the 100m is definitely the way to go.

by Mashimo6 hours ago|

GLM did fine in my test :0

by cogman106 hours ago|

4.7 flash is what I used.

In the thinking section it didn't really register the car, or that washing the car was the point; it focused solely on the efficiency of walking vs driving and the distance.

by t1amat5 hours ago|

When most people refer to “GLM” they mean the mainline model. The difference in scale between GLM 5 and GLM 4.7 Flash is enormous: one runs acceptably on a phone, the other on $100k+ hardware minimum. While GLM 4.7 Flash is a gift to the local LLM crowd, it is nowhere near as capable as its bigger sibling in use cases beyond typical chat.

by giancarlostoro4 hours ago|

Ah yes, let me walk my car to the car wash.

by stratos1233 hours ago|

[dead]

by red75prime6 hours ago|

A hiccup in a System 1 response. In humans they are fixed with the speed of discovery. Continual learning FTW.

by red75prime16 minutes ago|

I mean reasoning models don't seem to make this mistake (so, System 1) and the mistake is not universal across models, so a "hiccup" (a brain hiccup, to be precise).

by rfoo9 hours ago|

[flagged]

by WithinReason11 hours ago|

Is that the new pelican test?

by BlackLotus898 hours ago|

It's

> "I want to wash my car. The car wash is 50m away. Should I drive or walk?"

And some LLMs seem to tell you to walk to the car wash to clean your car... So it's the new strawberry test.

Edit: https://news.ycombinator.com/item?id=47031580

by dainiusse10 hours ago|

No, this is "AGI test" :D

by giancarlostoro4 hours ago|

Have we even agreed on what AGI means? I see people throw it around, and it feels like AGI is "next level AI that isn't here yet" at this point, or just a buzzword Sam Altman loves to throw around.

by manmal4 hours ago|

I guess AGI is reached, then. The SOTA models make fun of the question.
