Collecting a bunch of "Hard questions for LLMs" in one place will invariably run into Goodhart's law (when a measure becomes a target, it ceases to be a good measure). You'll have no idea if the next round of LLMs is better because they're generally smarter, or because they were trained specifically on these questions.
How many examples does OpenAI train on now that are just variants of counting the Rs in strawberry?
I guess they have a bunch of different wine-glass images in their training set now, since the full-to-the-brim glass was a meme, but they still completely fail to draw an open book with the cover side up.
Well, that's easy: zero.
Because even a single training example would have 'solved' it, by letting the model memorize the simple, easy answer, within weeks of 'strawberry' first going viral, which was about a year and a half ago at this point, with dozens of minor and major model upgrades since. And yet the strawberry example kept working for most (all?) of that time.
So you can tell that if anything, OA probably put in extra work to filter all those variants out of the training data...
(This is, by the way, why you can't believe any LLM paper about 'forecasting' where they just do backtesting rather than actually holding out future events. There are way too many forms of leakage at this point. That logic may have worked for davinci-001 and davinci-002, or for a model whose checkpoints you downloaded yourself, but not for any of the big APIs like GPT or Claude or Gemini...)
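Concretely, the only clean protocol is a temporal holdout against the model's training cutoff. A minimal sketch of the split (the event data and cutoff date here are hypothetical, just to illustrate the idea):

```python
from datetime import date

# Hypothetical events: (resolution_date, question, outcome).
events = [
    (date(2022, 11, 1), "Will X happen by Nov 2022?", True),
    (date(2024, 6, 15), "Will Y happen by Jun 2024?", False),
]

TRAINING_CUTOFF = date(2023, 4, 1)  # assumed cutoff of the model under test

# Anything resolved before the cutoff may already be in the training data
# (leakage); only events resolved *after* it are a genuine forecasting test.
backtest = [e for e in events if e[0] <= TRAINING_CUTOFF]
holdout = [e for e in events if e[0] > TRAINING_CUTOFF]
```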
Because the word gets tokenised, the model never sees individual letters, so of course it can't just count the Rs.
But I suppose if we want these models to be capable of anything, these tokenisation quirks need to be accounted for.
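You can see the effect directly with OpenAI's tiktoken library (a quick sketch; the exact split depends on the encoding you pick):

```python
import tiktoken

# Encode "strawberry" with a GPT-style BPE encoding and show the chunks
# the model actually sees: a few multi-character tokens, not letters.
enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("strawberry")
print([enc.decode_single_token_bytes(t) for t in tokens])
# The word arrives as opaque chunks, so "how many r's?" requires the model
# to have memorized spellings rather than read them off the input.
```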
Keeping it secret because I don't want my answers trained into a model.
Think of it this way: FizzBuzz used to be a good test to weed out candidates who couldn't program at all. It's simple enough that any first-year programmer can do it, and do it quickly. But now everybody knows to prep for FizzBuzz, so you can't be sure whether your candidate knows basic programming or just memorized a solution without understanding what it does.
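For reference, the classic version (in Python for concreteness; interviewers vary the exact spec):

```python
# FizzBuzz: print 1..100, but multiples of 3 print "Fizz",
# multiples of 5 print "Buzz", and multiples of both print "FizzBuzz".
for i in range(1, 101):
    if i % 15 == 0:
        print("FizzBuzz")
    elif i % 3 == 0:
        print("Fizz")
    elif i % 5 == 0:
        print("Buzz")
    else:
        print(i)
```

The whole point was that anyone who understands modular arithmetic writes this in two minutes, and anyone who doesn't gets visibly stuck; memorized answers destroy exactly that signal.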