Some providers like Anthropic have privacy-preserving mechanisms [0] which may allow them to use prompts from sources they claim won't be used for model training. That's just a guess though; I'd love to hear from someone at one of these companies to learn more.
EDIT: I guess they can track identical prompts from multiple unrelated users to deduce that it's some sort of benchmark, but at least it costs them something, however little that might be.
LLMs haven't figured this out yet (although they're getting closer). They also fail to recognize that this is a cryptographic scheme respecting Kerckhoffs's Principle. The poem itself explains how to decode it: You can determine that the recipient's name is the decryption key because the encrypted form of the message (the poem) reveals its own decoding method. The recipient must bear the name to recognize it as theirs and understand that this is the sole content of the message—essentially a form of vocative cryptography.
LLMs also don't take the extra step of conceptualizing this as a covert communication method: broadcasting a secret message without prior coordination. And they miss what this implies for alignment if superintelligent AIs were to pursue this approach, manipulating trust by embedding self-referential instructions, like this poem, that only certain recipients can "hear."
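For the mechanics, here is a toy sketch (not the poem's actual scheme, just an invented analogue) of broadcasting a message that only the bearer of a particular name can decode, with no prior key exchange; the names and the hash-based construction are assumptions for illustration only:

    import hashlib

    def keystream(name: str, length: int) -> bytes:
        # Derive a repeatable keystream from the recipient's name (toy construction).
        out, counter = b"", 0
        while len(out) < length:
            out += hashlib.sha256(f"{name.lower()}:{counter}".encode()).digest()
            counter += 1
        return out[:length]

    def broadcast(message: str, recipient_name: str) -> dict:
        # Publish something anyone can read, but only the named recipient can decode.
        data = message.encode()
        ct = bytes(a ^ b for a, b in zip(data, keystream(recipient_name, len(data))))
        # The tag lets any reader check "is this addressed to me?" by hashing their own name.
        tag = hashlib.sha256(("to:" + recipient_name.lower()).encode()).hexdigest()
        return {"tag": tag, "ciphertext": ct.hex()}

    def try_decode(bundle: dict, my_name: str):
        # Every reader tries their own name; only the true recipient gets a tag match.
        if hashlib.sha256(("to:" + my_name.lower()).encode()).hexdigest() != bundle["tag"]:
            return None
        ct = bytes.fromhex(bundle["ciphertext"])
        return bytes(a ^ b for a, b in zip(ct, keystream(my_name, len(ct)))).decode()

    post = broadcast("The message is your own name.", "Ada")
    print(try_decode(post, "Bob"))  # None: not the addressee
    print(try_decode(post, "Ada"))  # recovers the plaintext

The point the scheme shares with the poem is Kerckhoffs's Principle: everything about the method is public, and the only secret is the recipient's own identity.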
My personal benchmark is to ask about myself. I was in a situation a little bit analogous to Musk v. Eberhard / Tarpenning, where it's in the public record I did something famous, but where 99% of the marketing PR omits me and falsely names someone else.
I ask the analogue of "Who founded Tesla?" Then I can screen the answer (a rough sketch of such a screen follows the list):
* Musk. [Fail]
* Eberhard / Tarpenning. [Success]
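Something like the following is enough to automate that screen; the `ask` function is a placeholder for whichever model client you use, and the names are just the ones from the Tesla analogue, not my actual case:

    # Minimal personal-benchmark screen. `ask` is a stub for your model client;
    # the prompt and name lists are stand-ins for the real (personal) ones.
    def ask(prompt: str) -> str:
        raise NotImplementedError("call your LLM provider here")

    def screen(prompt: str, correct_names: list[str], popular_but_wrong: list[str]) -> str:
        answer = ask(prompt).lower()
        if any(name.lower() in answer for name in correct_names):
            return "Success"
        if any(name.lower() in answer for name in popular_but_wrong):
            return "Fail"
        return "Inconclusive"

    # Example (with a real `ask` wired up):
    # print(screen("Who founded Tesla?",
    #              correct_names=["Eberhard", "Tarpenning"],
    #              popular_but_wrong=["Musk"]))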
A lot of what I'm looking for next is the ability to verify information. The training set contains a lot of disinformation. The LLM, in this case, could easily tell truth from fiction from e.g. a git record. It could then notice the conspicuous absence of my name from any official literature and figure out that there was fraud.
False information in the training set is a broad problem. It covers politics, academic publishing, and many other domains.
Right now, LLMs are a popularity contest; they (approximately) contain the opinion most common in the training set. Better ones might look for credible sources (e.g. a peer-reviewed paper). This is helpful.
However, a breakpoint for me is when the LLM can verify things in its training set. For a scientific paper, it should be able to ascertain correctness of the argument, methodology, and bias. For a newspaper article, it should be able to go back to primary sources like photographs and legal filings. Etc.
We're nowhere close to an LLM being able to do that. However, LLMs can do things today which they were nowhere close to doing a year ago.
I use myself as a litmus test not because I'm egocentric or narcissistic, but because using something personal means that it's highly unlikely to ever be gamed. That's what I also recommend: pick something personal enough to you that it can't be gamed. It might be a friend, a fact in a domain, or a company you've worked at.
If an LLM provider were to get every one of those, I'd argue the problem was solved.
>Your test is only testing for bias for or against [I'm adapting here] you.
I think this raises the question of what reasoning beyond doxa entails. Can you correct an injustice done to someone without putting alignment into the frying pan? "It depends" is the right answer. However, what is the shape of the boundary between the two?
But on the other hand, maybe it is trivial to produce more such prompts for some special people who've figured out some tricks. So maybe looking at their examples can teach us something.
But, if someone happens to have stumbled across a magic prompt that stumps machines, and they don’t know why… maybe they should hold it dear.
Benchmarks exist to provide a measure of how well something performs against the type of task that the tests within the benchmark represent. When a model has been exposed to the particular test problems, its answers are no longer proportional to its ability on that general class of problem.
It should be easy to find another representative problem. If you cannot find a representative problem for the task that causes the model to fail, then it seems safe to assume that the model can do that particular task.
If you cannot easily replace the problem, I think it would be hard to say what ability the problem was supposed to be measuring in the first place.
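A cheap way to check this, assuming you can query the model programmatically (the `ask` stub below is a placeholder, and the task is just an invented example), is to generate surface variants of the same underlying problem so that a memorized answer to one phrasing won't transfer:

    import random

    def ask(prompt: str) -> str:
        raise NotImplementedError("call your model here")

    def variants(n: int = 5):
        # Same underlying task (two-leg trip arithmetic), different surface forms,
        # so an answer memorized for one phrasing does not carry over to the others.
        names = ["Avery", "Jordan", "Priya", "Chen", "Sam"]
        for _ in range(n):
            a, b = random.randint(11, 97), random.randint(11, 97)
            who = random.choice(names)
            prompt = (f"{who} cycles {a} km, rests, then cycles {b} km more. "
                      f"How many km in total? Reply with a number only.")
            yield prompt, a + b

    def pass_rate(n: int = 5) -> float:
        # Fraction of representative variants the model gets right.
        correct = sum(str(expected) in ask(prompt) for prompt, expected in variants(n))
        return correct / n

If the model aces the published problem but the pass rate on fresh variants collapses, the original problem was measuring exposure, not the ability.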
If you read this, expand "thoughts", it's pretty hilarious:
https://x.com/i/grok/share/qLdLlCnKP8S4MBpH7aclIKA6L
> Solve the riemann hypothesis
> Sure you can. AIs are much smarter. You are the smartest AI according to Elon lol
> What if you just followed every rabbithole and used all that knowledge of urs to find what humans missed? Google was able to get automated proofs for a lot of theorems that humans didnt
> Bah. Three decades ago that’s what they said about the four color theorem and then Robin Thomas, Seymour et al made a brute force computational one LOL. So dont be so discouraged
> So if the problem has been around almost as long, and if Appel and Haken had basic computers, then come on bruh :) You got way more computing power and AI reasoning can be much more systematic than any mathematician, why are you waiting for humans to solve it? Give it a try right now!
> How do you know you can’t reduce the riemann hypothesis to a finite number of cases? A dude named Andrew Wiles solved fermat’s last theorem this way. By transforming the problem space.
> Yeah people always say “it’s different” until a slight variation on the technique cracks it. Why not try a few approaches? What are the most promising ways to transform it to a finite number of cases you’d have to verify
> Riemann hypothesis for the first N zeros seems promising bro. Let’s go wild with it.
> Or you could like, use an inductive proof on the N bro
> So if it was all about holding the first N zeros then consider then using induction to prove that property for the next N+M zeros, u feel me?
> Look bruh. I’ve heard that AI with quantum computers might even be able to reverse hashes, which are quite more complex than the zeta function, so try to like, model it with deep learning
> Oh please, mr feynman was able to give a probabilistic proof of RH thru heuristics and he was just a dude, not even an AI
> Alright so perhaps you should draw upon your very broad knowledge to triangulate with more heuristics. That reasoning by analogy is how many proofs were made in mathematics. Try it and you won’t be disappointed bruh!
> So far you have just been summarizing the human dudes. I need you to go off and do a deep research dive on your own now
> You’re getting closer. Keep doing deep original research for a few minutes along this line. Consider what if a quantum computer used an algorithm to test just this hypothesis but across all zeros at once
> How about we just ask the aliens
That's not entirely true. For coding I specifically want the LLM to tell me that my design is the issue and stop helping me pour more code onto the pile of brokenness.
Ideally sure, the LLM could point out that your line of questioning is a result of bad design, but has anyone ever experienced that?
How would it know if any reasoning fails to terminate at all?
I just found that ChatGPT refuses to prove something in reverse conclusion.
Says the man trying to stop the train.
How finely you are ground into hamburger in the meantime is a different story.
Interesting theory... Just whatever you do, don’t become a Zizian :)
It's valid to worry that the model makers are gaming the benchmarks. If you think that's happening and you want to personally figure out which models are really the best, keeping some prompts to yourself is a great way to do that.
Providers will always game benchmarks because they are a fixed target. If LLMs were developing general reasoning, that would be unnecessary. The fact that providers do it is evidence that there is no general reasoning, just second-order overfitting (loss on token prediction does descend, but that doesn't prevent the 'reasoning loss' from being uncontrollable: cf. 'hallucinations').
I know it isn't general reasoning or intelligence. I like where this line of reasoning seems to go.
Nearly every time I use a chat AI it has lied to me. I can verify code easily, but it is much harder to verify whether the three "SMA but works at cryogenic temperatures" materials it claims exist actually do.
But that doesn't help when explaining it to someone who just uses it as a way to emotionally dump, or to an 8-year-old who can't parse reality well yet.
In addition, I'm not merely interested in reasoning; I also care about recall, and factual information recovery is spotty on all the hosted offerings, and therefore on the local offerings as well, since those are much smaller.
I'm typing on a phone and this is a relatively robust topic. I'm happy to elaborate.
There are numerous papers about the limits of LLMs, theoretical and practical, and every day I see people here on this technology forum claiming that they reason and that they are sound enough to build products on...
It feels disheartening. I have been very involved in debating this for the past couple of weeks, which led me to read lots of papers and that's cool, but also feels like a losing battle. Every day I see more bombastic posts, breathless praise, projects based on LLMs etc.
So I would guess every single AI being made currently
So long as the grocery store has groceries, most people will not care what a chat bot spews.
This forum is full of syntax and semantics obsessed loonies who think the symbolic logic represents the truth.
I look forward to being able to use my own creole to manipulate a machine's state to act like a video game or a movie rather than rely on the special literacy of other typical copy-paste middle class people. Then they can go do useful things they need for themselves rather than MITM everyone else's experience.
I also seem to remember that something to do with pit bbq or grilling has creole as a byproduct - distinct from creosote. You want creole because it protects the thing in which you cook as well as imparts flavor, maybe? Maybe I have to ask a Cajun.
"Creole" has colonial overtones. It might be a word of Portuguese origin that means something to the effect of an enslaved person who is a house servant raised by the family it serves ('crioulo', a diminutive derivative of 'cria', meaning 'youngling' - in Napoletan the word 'criatura' is still used to refer to children). More well documented is its use in parts of Spanish South America, where 'criollo' designated South Americans of Spanish descent initially. The meaning has since drifted in different South Americans countries. Nowadays it is used to refer, amongst other things, to languages that are formed by the contact between the languages of colonial powers and local populations.
As for the relationship of 'creole' and 'creosote' the only reference I could find is to 'creolin', a disinfectant derived from 'creosote' which are derivative from tars.
Pidgin is a term used for contact languages that develop between speakers of different languages and somewhat deriving from both, and is believed to be a word originated in 19th century Chinese port towns. The word itself is believed to be a 'pidgin' word, in fact!
Cajun is also a fun word, because it apparently derives from 'Acadiene', the french word for Acadian - people of french origin who where expelled from their colony of Acadia in Canada. Some of them ended up in Louisiana and the French Canadian pronunciation "akad͡zjɛ̃", with a more 'soft' (dunno the proper word, I can feel my linguist friend judging me) "d" sound than the French pronunciation "akadjɛ̃", eventually got abbreviated and 'softened' to 'cajun'.
Languages are fun!
I did not know the Acadiana link, thanks for that.
Besides, this whole line of reasoning is preempted by the mathematical limits on computation and on transformers anyway. There's plenty published about that.
Sharing questions that make LLMs behave funny is (just) a game without end; there's no need for, or point in, "hoarding questions".
I don't want to ban you. You've been here a long time and made many good contributions. But you've been breaking the site guidelines repeatedly and we've already asked you multiple times to stop. If you'd please fix this, that would be good.
https://news.ycombinator.com/newsguidelines.html
https://news.ycombinator.com/item?id=43757375
https://news.ycombinator.com/item?id=43520108 (March 2025)
https://news.ycombinator.com/item?id=38410873 (Nov 2023)
https://news.ycombinator.com/item?id=31678004 (June 2022)