TBH, this is (very much my opinion btw) the least surprising thing. LLMs (and especially their emergent properties) are still black boxes. Humans have been studying the human brain for millennia, and we are barely better at predicting how humans work (or, e.g., to what extent free will is a thing). Hell, the emergent properties of traffic weren't understood or given proper attention, even though a researcher, as a driver, knows what a driver does. Right now, on the front page, is this post:
> 14. Claude Code Found a Linux Vulnerability Hidden for 23 Years (mtlynch.io)
So it's pretty cool we're learning new things about LLMs, sure, but it's barely surprising that we're still learning it.
(Sorry, mini grumpy man rant over. I just wish we knew more of the world but I know that's not realistic.)
I dare say that in some ways, we understand LLMs better than humans, or at least the interpretability tools are now superior. Awkward place to be, but an interesting one.
Are you surprised we understand them better than brains?
That's a bit of an overstatement.
The entire field of ML is aimed at problems where deterministic code would work just fine, but the number of cases it would need to cover is too large to be practical (note: this has nothing to do with the impossibility of its design), AND there's a sufficient corpus of data that allows plausible enough models to be trained. So we accept the occasionally questionable precision of ML models over the huge time and money costs of engineering these kinds of systems the traditional way. LLMs are no different.
What you are saying is fantasy nonsense.
> but the amount of cases it would need to cover is too large to be practical (note, this has nothing to do with the impossibility of its design)
So it doesn't work.
You would be sorely mistaken to think I'm utterly uninformed about LLM research, even if I would never dare to claim to be a domain expert.
LLMs draw their origins from both n-gram language models (ca. 1990s) and neural networks and deep learning (ca. 2000). So we've only had really good ones for maybe 6-8 years, but the roots of the study go back 30 years at least.
Psychiatry, psychology, and neurology, on the other hand, are only roughly 150 years old. Before that, there wasn't enough information about the human body to be able to study it, let alone the resources or biochemical knowledge necessary to understand it or do much of anything with it.
So, sure, we've studied it longer. But only 5 times longer. And, I mean, we've studied language, geometry, and reasoning for literally thousands of years. Markov chains are like 120 years old, so older than computer science, and you need those to make an LLM.
And if you think we went down some dead-end directions with language models in the last 30 years, boy, have I got some bad news for you about how badly we botched psychiatry, psychology, and neurology!
Very, monsieur Laplace.
We have tons of low-hanging fruit across all fields of science and engineering to be picked, in the form of different ways to apply and chain the models we have, different ways to interact with them, etc. - enough to fuel a good decade of continued progress in everything.
> What is (not) here to stay are the techbros who think every problem can be solved with LLMs.
LLMs are in all likelihood here to stay, but the scumbags doing business around them right now are hopefully going away eventually.
Much as Diogenes mocked Plato's definition of a man with a plucked chicken, LLMs revealed what "real" AI would require: continuous learning. That isn't to diminish the power of LLMs (they are useful), but that limitation is a fairly hard one to overcome if true AGI is your goal.
From what I understand, a living neural network learns several orders of magnitude more efficiently than an artificial one.
I'm not sure where that difference comes from, but my brain probably isn't doing backpropagation; it's probably doing something very different.
(eg different kinds of learning for long-term memory, short-term memory, languages, faces and reflexes.)
The intersection of what with physics?
Sir Roger Penrose on quantum consciousness (and there is some regret on his part here), or Jacob Barandes for much more current thinking on this sort of intersectional exploration.
> The earliest reference to the brain occurs in the Edwin Smith Surgical Papyrus, written in the 17th century BC.
I was actually thinking of the ancient Greeks when writing my comment, but I suppose the Egyptians have even older records than them.
I think that with grammar-aware sampling / constrained decoding [0][1] it is possible to sometimes skip calling the model altogether if only one token is allowed by grammar and just insert it, but I don't think that any of the current, widely used combinations of models/harnesses use it. And it only skips inference in rare edge cases.
I wonder if there is a more general solution that can make models spend more compute on making important choices, while making generation of the "obvious" tokens cheaper and faster.
[0] https://github.com/ggml-org/llama.cpp/blob/master/grammars/R...
[1] https://developers.redhat.com/articles/2025/06/03/structured...
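As a toy illustration of the skip-inference idea: when the grammar permits exactly one next token, it can be appended without ever calling the model. The grammar and sampler below are entirely made up for the sketch (a real harness like llama.cpp compiles a GBNF grammar into an automaton rather than keying on full prefixes):

```python
def generate(grammar, model_sample, max_len=10):
    """Grammar-constrained decoding sketch: only call the model when the
    grammar leaves a real choice; forced tokens are inserted for free."""
    out = []
    for _ in range(max_len):
        allowed = grammar.get(tuple(out), set())
        if not allowed:
            break  # grammar has no continuation: we're done
        if len(allowed) == 1:
            # Exactly one legal token: skip inference entirely.
            out.append(next(iter(allowed)))
        else:
            # A real choice: ask the model, restricted to legal tokens.
            out.append(model_sample(out, allowed))
    return out

# Toy JSON-ish grammar keyed by the full prefix so far.
grammar = {
    (): {"{"},
    ("{",): {'"key"'},
    ("{", '"key"'): {":"},
    ("{", '"key"', ":"): {"1", "2"},   # the only point needing a model call
    ("{", '"key"', ":", "1"): {"}"},
    ("{", '"key"', ":", "2"): {"}"},
}

# A stand-in "model" that just picks the first legal token.
print(generate(grammar, lambda ctx, allowed: sorted(allowed)[0]))
```

Here five tokens are emitted but the "model" is consulted exactly once, at the value position; everything else is forced by the grammar.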
Making coding agents spit out syntactically correct code token by token is like asking a human to code on a whiteboard.
We kinda have a little bit of it with some coding harnesses giving the model access to an LSP, but I think we can insert this knowledge at a lower level if we find a clever way to somehow utilize it during sampling.
I think that there is a lot of low hanging fruit in this area.
And in general, I think people try to use LLMs too much to solve problems that can be easily solved by cheaper (computationally) and, more importantly, deterministic tools.
For example, back when LLM-assisted coding first became a thing, people very often complained about models generating syntactically incorrect code and inventing non-existent library methods.
Well, I, an experienced human programmer, would probably also make syntax mistakes and invent non-existent methods if you stripped me of my tools and made me write code in a bare text editor without syntax highlighting.
Thankfully, my IDE would autocomplete real syntax and actually existing library methods for me and immediately give me feedback if I make a mistake anyway. And all of it is achieved using reliable deterministic code without the inherent issues of statistical models.
I think that it is really inefficient to reach for an expensive and unreliable tool when a cheap and reliable tool will do.
1. code
2. syntax check / build / format / lint (details language dependent)
3. test
and they can hop between 1 and 2 however many times they want.
I do think there is some merit in a tool that dumps all namespaces and reachable symbols so the agent can do its own autocomplete without a round-trip.
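A minimal sketch of that kind of symbol dump, using only Python's introspection (the function name and output format are hypothetical; a real tool would walk the project's whole import graph, not a single module):

```python
import inspect
import json

def dump_symbols(module):
    """List a module's public symbols with rough kind info, so an agent
    can 'autocomplete' against real names without an LSP round-trip."""
    symbols = {}
    for name, obj in vars(module).items():
        if name.startswith("_"):
            continue  # skip private/dunder names
        if inspect.isclass(obj):
            kind = "class"
        elif inspect.isfunction(obj) or inspect.isbuiltin(obj):
            kind = "function"
        elif inspect.ismodule(obj):
            kind = "module"
        else:
            kind = type(obj).__name__  # e.g. constants report their type
        symbols[name] = kind
    return symbols

import math
syms = dump_symbols(math)
print(json.dumps({k: syms[k] for k in ("sqrt", "pi", "tau")}, indent=2))
```

Dumping this once per dependency and pasting it into the context gives the model the same "only these names exist" signal an IDE gives a human.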
As a human coder you don’t summon intellisense. It’s just popped up into your visual field as extra input - contextual cues.
You could force intellisense state into the context vector the LLM receives.
I once asked an LLM if it could ingest code from an interactive session more easily if it were in appropriately typed markdown fences, and it said absolutely yes, and that the syntax highlighting fed to it that way helps it immensely. I was downright shocked that syntax highlighting was anything more than noise for them.
I think speculative decoding counts as a (perhaps crude) way of implementing this?
There's a lot of work going on in various streams towards making it possible to vary compute per-token, dynamically, e.g. universal transformers. Maybe one day it'll work well enough to beat conventional techniques.
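For the speculative-decoding connection above, here is a toy greedy-only sketch: a cheap draft model proposes a run of tokens, and the expensive target model verifies them, keeping the longest agreeing prefix (real implementations verify in one batched pass and accept/reject probabilistically; the "models" here are stand-in lambdas):

```python
def speculative_decode(target, draft, prompt, n_draft=4, steps=3):
    """Greedy speculative decoding sketch: cheap 'obvious' tokens come
    from the draft model; the target only corrects disagreements."""
    out = list(prompt)
    for _ in range(steps):
        # Draft model proposes n_draft tokens autoregressively (cheap).
        proposed = []
        for _ in range(n_draft):
            proposed.append(draft(out + proposed))
        # Target model checks each proposed position.
        accepted = []
        for i, tok in enumerate(proposed):
            correct = target(out + proposed[:i])
            accepted.append(correct)
            if correct != tok:
                break  # first disagreement: keep target's token, discard the rest
        out += accepted
    return out

# Toy "models": the target repeats the last token; the draft usually
# agrees but occasionally guesses "x".
target = lambda ctx: ctx[-1]
draft = lambda ctx: ctx[-1] if len(ctx) % 5 else "x"
print(speculative_decode(target, draft, ["a"]))
```

When draft and target agree, several tokens land per expensive step, which is exactly the "spend less on obvious tokens" effect.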
I got unstuck by randomizing the field order for each row at training time?!? And now I'm thinking I should do the same at inference time...
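A sketch of the trick described above: shuffle the key order each time a record is serialized into a training string, so the model can't lean on positional shortcuts between fields (the `k=v` format and field names are made up for the example):

```python
import random

def serialize_row(row, rng):
    """Serialize one record with its fields in a random order per call,
    preventing the model from memorizing a fixed field layout."""
    fields = list(row.items())
    rng.shuffle(fields)
    return ", ".join(f"{k}={v}" for k, v in fields)

rng = random.Random(0)
row = {"id": 7, "name": "ada", "score": 0.93}
print(serialize_row(row, rng))
print(serialize_row(row, rng))  # typically a different order each call
```

The content is identical each time; only the order varies, which is what makes the augmentation "free".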
> This is probably due to the way larger numbers are tokenised, as big numbers can be split up into arbitrary forms. Take the integer 123456789. A BPE tokenizer (e.g., GPT-style) might split it like: ‘123’ ‘456’ ‘789’ or: ‘12’ ‘345’ ‘67’ ‘89’
One of the craziest LLM hacks that doesn't get love is https://polymathic-ai.org/blog/xval/
xVal basically says "tokenizing numbers is hard: what if instead of outputting tokens that combine to represent numbers, we just output the numbers themselves, right there in the output embedding?"
It works! Imagine you're discussing math with someone. Instead of saying "x is twenty five, which is large" in words, you'd say "x is", then switch to making a whistling noise in which the pitch of your whistle, in its position within your output frequency range, communicated the concept of 25.00 +/- epsilon. Then you'd resume speech and say "which is large".
I think the sentiment is that today's models are big and well-trained enough that receiving and delivering quantities as tokens representing numbers doesn't hurt capabilities much, but I'm still fascinated by xVal's much more elegant approach.
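A toy version of the xVal idea, with made-up tokens and tiny embedding vectors: every number in the text is replaced by a single [NUM] token, and that token's embedding is the shared [NUM] vector scaled by the numeric value, so the magnitude lives in the embedding itself rather than in digit tokens (the real paper also normalizes values and uses a dedicated number head on the output side):

```python
def embed_with_xval(tokens, values, vocab_emb, num_token="[NUM]"):
    """xVal-style encoding sketch: scale the shared [NUM] embedding by
    each number's value instead of tokenizing its digits."""
    out = []
    it = iter(values)  # numeric values, in order of [NUM] occurrences
    for tok in tokens:
        vec = list(vocab_emb[tok])
        if tok == num_token:
            x = next(it)
            vec = [v * x for v in vec]  # carry the magnitude in the vector
        out.append(vec)
    return out

# Tiny made-up vocabulary with 2-d embeddings.
vocab_emb = {"x": [0.1, 0.2], "is": [0.3, 0.4], "[NUM]": [1.0, -1.0],
             "which": [0.5, 0.5], "large": [0.7, 0.1]}
# "x is 25 which is large" -> the 25 becomes a scaled [NUM] embedding.
emb = embed_with_xval(["x", "is", "[NUM]", "which", "is", "large"],
                      [25.0], vocab_emb)
print(emb[2])
```

Nearby numbers get nearby embeddings for free, which is exactly what digit tokenization fails to guarantee.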
Completely artistic creation, creating something that does not exist and that cannot produce things out of itself, means that the locking can be more diffuse, not as settled.
>I love that we're still learning the emergent properties of LLMs!
There are tons of low-hanging fruits there.
In Nemotron, the high-perplexity solutions are selected for RL; in VLM training, a few people are looking at the entropy distributions of the training set; etc.
I think you are implying a reverse causation. They used a metaphor from us.
"Simple Self-Distillation". We had an acronym for Solid-State Drive. Don't know about that technique but the naming sure sound.. Simple?