Basically, the bitter lesson: https://www.cs.utexas.edu/~eunsol/courses/data/bitter_lesson...
> Don’t be distracted by human knowledge, as AI has been historically.
> Instead focus on methods for creating knowledge that scale with computation, like search and learning.
So the lesson is to choose methods that scale with computation. It is not that blindly scaling up anything (data, params, people, whatever) works; it is that choosing the right x-axis and the right scaling laws consistently wins out in the long run, despite short-term wins from other methods.
The secret sauce though is all the datasets, RL training, knowledge of what works from doing all kinds of ablation experiments, and a massive compute moat.
Also would love to know if the same Legal team advised on Gemini...
- V3 https://arxiv.org/abs/2412.19437
- V2 https://arxiv.org/abs/2405.04434
- R1 https://arxiv.org/abs/2501.12948 (RL applied to ML models was well-known beforehand, but they show it in the open, at scale, on big models)
Then, there's the incentive analysis. If you can see that these models empirically get better with scale, why would you swap the main architecture? Those events will be pretty rare. I'm not saying there's no one cooking a new architecture, just that it is a pretty rare event. And it would have to come from some researchers that would be happy to not publish their findings, which is not really what a sizable portion of elite researchers (obviously not all) are incentivized to do.
Of course, it's a bit of a verbal compression to simply claim 'scaled up'. They are recognisably scaled-up transformers, and most new models come with a few tricks, but we're at the point where those usually are not an architectural rewrite and are added to solve an explicit problem, like hallucination, not for big new capability gains.
cf. hardware lottery https://arxiv.org/abs/2009.06489
There have been minor changes to the architecture over the years, but these are basically all efficiency tweaks such as various types of attention (some pioneered in the open by DeepSeek) that better scale to large context lengths, and the confusingly named "mixture of experts" architecture, but what's more notable really is how little the architecture has changed. The capability gains have been coming from better training and better data.
If you can make the existing model faster, you can then save your inference budget to then make your model bigger, which then makes it smarter.
A lot of how smart the models can be comes down to budget. If you can make your existing thing cheaper, you can instead make it bigger for the same price.
There are diminishing returns, and at some point making a model bigger makes it dumber.
(Not trying to flame bait or anything. I just wouldn’t describe LLMs as exhibiting intelligence. They are great at making connections based on probability but don’t have a semantic understanding of what they are doing)
ReAct loops and tool-calling are the critical development feature. They turn a model from something that generates text into something that can independently influence the world around it.
Without agent features, you have just a chatbot.
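For anyone who hasn't seen one, here is a minimal sketch of what a ReAct-style loop looks like. Everything in it is made up for illustration: `fake_model` is a scripted stand-in for a real LLM call, and `TOOLS` is a hypothetical tool registry, not any library's API.

```python
# Minimal ReAct-style loop. `fake_model` is a hypothetical stand-in for an LLM;
# a real agent would send the transcript to a model API instead.

def add(a: str, b: str) -> str:
    return str(int(a) + int(b))


TOOLS = {"add": add}  # hypothetical tool registry


def fake_model(transcript: list[str]) -> str:
    """Scripted policy: call a tool once, then answer using the observation."""
    observations = [line for line in transcript if line.startswith("Observation:")]
    if not observations:
        return "Action: add 2 3"
    return "Final: " + observations[-1].removeprefix("Observation: ")


def react_loop(question: str, max_steps: int = 5) -> str:
    transcript = ["Question: " + question]
    for _ in range(max_steps):
        reply = fake_model(transcript)
        transcript.append(reply)
        if reply.startswith("Final:"):
            return reply.removeprefix("Final: ")
        # Parse "Action: <tool> <args...>", run the tool, feed the result back.
        _, name, *args = reply.split()
        transcript.append("Observation: " + TOOLS[name](*args))
    return "gave up"


print(react_loop("What is 2 + 3?"))
```

The only "agent" part is the loop: the model's text is parsed, a tool runs, and the result goes back into the context for the next step.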
You can go another step - a FFN can be simulated on a Turing machine, thus it just exemplifies the incredible semantical power of the Turing machine model of computation. (in fact you don't even need a Turing machine, since there is no looping in one forward pass).
In theory you can run a huge FFN on the tiniest Turing machine, in practice it's much better to run a Transformer on the latest NVIDIA hardware. Or as they say "quantity (performance) has a quality all its own"
There is also the case for Markov chains being theoretically able to do these if tuned well. Or even a SAT problem.
(If I can be honest, and I am not being disparaging about anything lest it might seem so, I am looking at it from a career breakthrough/move perspective rather than an intellectual pursuit.)
If you want to be a researcher and come out with the next breakthrough, get ready to go back to school and learn some math.
If you just need to learn how to use it well and build things with it, then you probably just need to have a high level understanding.
Same as programming. I’d bet most programmers have no idea about the physics that makes computers work.
What about improving the efficiency of token consumption, etc., basically opportunities for improving cost/performance?
I keep thinking there has to be a better way to share context with models than dumping entire gigantic skill files of raw text or otherwise into them - I'm betting there's a bunch of low-hanging fruit there.
Which sums up HN these days.
I have no idea about careers at this point, I’m still doing fancy IT work as my day job and I look away from the future with dread. I also haven’t been looking for new roles on the open job market, so who knows, maybe there are multimillion pay packages for anyone who can articulate how attention works in an interview.
https://www.amazon.com/Build-Large-Language-Model-Scratch/dp...
https://www.amazon.com/Build-DeepSeek-Scratch-Abhijit-Dandek...
"Beating Nyquist with Compressed Sensing" - https://youtu.be/A8W1I3mtjp8
I have to wonder though. Is this all a human brain is? A similar thing to an LLM just scaled exponentially larger. I mean a brain is not just neurons with simple connections to each other. The neurons, axons, dendrites, <insert_unexplained_thing>, etc in a brain are all holding and processing information in different ways and doing it nearly 100% in parallel. That's a really big model.
The biological discoveries show how complex a biological brain actually is. Even the tiny brains in a bee or spider are able to solve puzzles and use tools. That's crazy.
If we look beyond written languages, which are late inventions of human civilization, oral languages are continuous and built from blocks, not words.
Chomskyan school misled the entire field of linguistics for decades by ignoring spoken languages.
Example: a programming language's capability to produce complex software does not come from some inherent quality of language. It comes from binary. 0's and 1's, representing basic logic, and that being built on top of with an abstract "tool" called a language. If the binary logic didn't work, the language wouldn't do anything.
A dolphin can make sounds, and technically has a language, but they can't manipulate or recursively compound concepts (as far as we can tell) in order to create modified ideas. If they could, they probably would have come up with vastly more advanced fishing methods than the (admittedly novel) ones they have now.
As this description is so overly abstract, an exercise for the reader is to try to work through an explanation of how, say, a river delta comes to "learn" about its environment by "reacting" to the influences at its borders, and how it "encodes" whatever it is that it learns in the substrate that it inhabits.
So you're missing a lot of the building blocks that make LLMs. It's not a matter of just having the compute.
Like the best leaps in thinking, once it is made, it is immediately obvious and intuitive.
Residual connections are so simple, so obvious and so vital. Yet nobody came up with them until 2015?
I think as time went on, and hardware got better, it seemed more reasonable to actually think about a viable implementation of what I think was a widespread intuition anyone in ML had that everything's context is everything.
It just seemed like a theoretical thing until hardware caught up. Maybe. Perhaps I'm applying a retrospective excuse to why it took so long.
I don't think it was intuitive to anyone back then, the vanishing gradient problem was a big deal since the dawn of NNs. I'm not sure what you mean by sheer computation, residuals allow you to have deep networks instead of shallow and wide ones. You can have equivalent parameter count.
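A toy way to see the vanishing gradient point, as a rough sketch only: assume each "layer" is the scalar function f(x) = w * x (my simplification, not a real network), and compare the end-to-end derivative of a plain stack against a stack with residual connections.

```python
# Toy scalar illustration (an assumption for clarity, not a real network):
# each "layer" is f(x) = w * x, so its local derivative is just w.

def plain_gradient(w: float, depth: int) -> float:
    """d(output)/d(input) for `depth` stacked layers y = f(x)."""
    grad = 1.0
    for _ in range(depth):
        grad *= w          # chain rule: multiply the local derivatives
    return grad


def residual_gradient(w: float, depth: int) -> float:
    """Same stack, but each layer is y = x + f(x), so the local derivative is 1 + w."""
    grad = 1.0
    for _ in range(depth):
        grad *= 1.0 + w    # the identity path contributes the constant 1
    return grad


w, depth = 0.01, 50
print(plain_gradient(w, depth))     # vanishingly small
print(residual_gradient(w, depth))  # stays on the order of 1
```

With a small w the plain stack's gradient collapses toward zero as depth grows, while the identity path keeps the residual stack's gradient usable. Real networks have nonlinearities and matrices, so treat this as intuition, not proof.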
No, it's not. There are many animals that have extremely complex and even learned behaviour that have literally zero neurons.
Clearly "neurons" is an oversimplification just-so story, not a scientific theory.
MoE was also pretty straightforward, just a bit surprising how well it worked (that you can get away with just 1/32 active parameters), but most researchers would have come up with it on their own probably.
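As a rough sketch of the idea (toy sizes, random weights, and a top-1 router that I picked to mirror the 1/32 ratio above; none of this is taken from a specific model), mixture-of-experts routing looks something like this:

```python
import math
import random

# Hypothetical toy sizes; real models use far larger experts and learned routers.
NUM_EXPERTS, TOP_K, DIM = 32, 1, 4
rng = random.Random(0)

# Each expert is one DIM x DIM weight matrix; the router is NUM_EXPERTS x DIM.
experts = [[[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(DIM)]
           for _ in range(NUM_EXPERTS)]
router = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def moe_forward(x):
    """Route x to the TOP_K highest-scoring experts and mix their outputs."""
    scores = matvec(router, x)
    chosen = sorted(range(NUM_EXPERTS), key=lambda i: scores[i], reverse=True)[:TOP_K]
    # Softmax over the chosen experts' scores only.
    peak = max(scores[i] for i in chosen)
    weights = [math.exp(scores[i] - peak) for i in chosen]
    total = sum(weights)
    out = [0.0] * DIM
    for i, wgt in zip(chosen, weights):
        for j, val in enumerate(matvec(experts[i], x)):
            out[j] += (wgt / total) * val
    return out, chosen


y, used = moe_forward([1.0, 0.5, -0.5, 2.0])
print(len(used), "of", NUM_EXPERTS, "experts ran")
```

All 32 experts hold parameters, but only the chosen one does any work for a given token, which is where the compute saving comes from.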
The true ground breaking papers are the first two you mentioned (transformers and gpt2), and InstructGPT was also very surprising that it worked so well.
Language Models are Few-Shot Learners https://arxiv.org/abs/2005.14165
I also enjoyed the papers for DeepSeek and GLM for an overview of all the tricks you need to make these things work
DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models https://arxiv.org/abs/2512.02556
GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models https://arxiv.org/abs/2508.06471
Tangentially related: This part always seemed fuzzy to me, especially when dealing with data scientists and how they talk about how 'ML' looks at problems. I had this issue when working at a SIEM vendor where they kept going on about use case development having to be designed a certain way to catch things. It was all very frustrating.
Did you mean to link to the video? I would be interested.
I picked it up from trying to teach myself that SLAM stuff. The papers are very short, but highly information dense and at the time there was no ChatGPT to help me. I got through them by just creeping my way through the math with a whiteboard, and something about drawing it out and having it there in my office made it all click. Trying to watch piecemeal lectures on YouTube or grind through foundational books like MVG just didn’t work for me, I used them instead as references for my drawings.
Same happened when I tried learning this GPT stuff. Karpathy’s videos were out at the time, but I couldn’t really stay focused on them or connect the math with the code. Most other descriptions I could find were focused on getting you to use their inference library or harness. Assembling the picture together on my whiteboard by focusing on drawing out the block diagram continues to be my personal favorite method for deep understanding of complex systems.
I don't think there is anything in a transformer I couldn't explain in the smallest detail now.
[0]: https://www.amazon.com/Build-Large-Language-Model-Scratch/dp...
[1]: https://www.amazon.com/Build-DeepSeek-Scratch-Abhijit-Dandek...
If you're up for it I would love to know how and why positional encodings work
A vanilla self-attention layer treats its input as just a set of token vectors. Without positional info, reordering the tokens only reorders the outputs in the same way, so attention cannot tell "dog bites man" from "man bites dog". We can "fix" this problem by using positional encodings. Text that has meaning isn't just a set of characters; the location and order of those characters is what provides meaning.
It doesn't have any impact?
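Here is a small sketch that shows the effect. It assumes single-head attention with identity Q/K/V projections and made-up one-hot "embeddings" (both simplifications of mine), plus the sinusoidal encoding from the original Transformer paper.

```python
import math


def softmax(xs):
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head attention with identity Q/K/V projections (a simplifying assumption)."""
    dim = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(dim)])
    return out


def sinusoidal(pos, dim):
    """Sinusoidal positional encoding: sin on even indices, cos on odd ones."""
    return [
        math.sin(pos / 10000 ** (i / dim)) if i % 2 == 0
        else math.cos(pos / 10000 ** ((i - 1) / dim))
        for i in range(dim)
    ]


def add_positions(tokens):
    dim = len(tokens[0])
    return [[t + p for t, p in zip(tok, sinusoidal(pos, dim))]
            for pos, tok in enumerate(tokens)]


def close(u, v):
    return all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(u, v))


# Hypothetical one-hot embeddings for three words.
dog, bites, man = [1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]
first = [dog, bites, man]    # "dog bites man"
second = [man, bites, dog]   # "man bites dog"

# Compare the output for "dog" in both sentences, without and with positions.
same_without = close(self_attention(first)[0], self_attention(second)[2])
same_with = close(self_attention(add_positions(first))[0],
                  self_attention(add_positions(second))[2])
print(same_without, same_with)  # prints: True False
```

Without positions, "dog" gets the same output no matter where it sits in the sentence. Once the position vector is added to each embedding, the same word at a different position produces a different output, which is what lets later layers use word order.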
Ah wait it does. Mh weird.
Why are you not creating a startup and getting rich?
How did you know what the steps were and that there was math involved? I am curious about your process and how you figured out exactly what to learn to unravel the mystery.
Einstein's special relativity is taught in high schools these days. Doesn't mean it wasn't the very hard part at some point in time.
As they say, shoulders of giants.
We still don’t really know why they work, we just know how to build them.
My next child took a completely different path to language, including skipping all the non-verbal imitations.
And then at some point, you just suddenly can two-way communicate with them when you couldn't before, and then after that, they can engage in reasoning.
It’s interesting to me how similar attempting to understand LLMs is to neuroscience.
“When we turn this bit off, this other thing happens… if we change these weights the Eiffel Tower is now in Rome”
We’re basically just probing around and trying to reverse engineer an emergent system.
To your point, this system may be quite different from model to model (human to human) although some similarities likely occur.
The comment I was responding to tried to belittle the OP’s understanding of transformers, by mentioning that running an LLM at scale is much harder than the simple white board diagram.
My point was simply that we don’t know why they work, and all the extra optimizations aren’t the “thing” that makes it emergent.
Simply scaling the “GPT” is good enough to see it, so the OP’s awe should stand.
(On a side note, what other architectures can we scale to find similar emergent behavior?)
Adults are expected to have their world models approximately correct in terms of physical environment so they won’t accidentally kill themselves by falling off a cliff; then there are the social norms which adults are expected to conform to so everyone is kinda predictable to everyone else so adults don’t kill each other too often over food or mates. Understanding of neither is expected from children.
I think they're right that kids (at least in the US) are generally treated as less capable than they are, and it ends up slightly delaying their development.
My son has been very worried about black holes lately, ever since he learned that anything that goes into one can't get out. He's pretty concerned astronauts could get stuck in one some day. So I explained to him that Hawking radiation does actually mean you can eventually get out; it just takes some time.
I didn't think it pertinent to mention spaghettification, the fact anywhere near a black hole will be really hot, or that cosmic censorship means whatever Hawking-radiates from a black hole wouldn't be an astronaut anymore.
It was also fun to hear Hawking speak. He wanted to know if Hawking was a robot. I said no, but he has a robot talk for him. Not quite true, but close enough.
The "bitter lesson" is that fake-it-till-you-make-it is a valid way of doing knowledge work.
(Or not make it, then people will just claim you're holding the LLM wrong and it's not the AI's fault.)
Statistically most likely in what context, given which preconditions? Because each prompt sequence is unique so the probability of any token following it is unknown.
If you’re talking about matrix multiplication, I can use mathematical rules and axioms and prove formally that the multiplication is correct. For next token prediction, I can prove that the set of tokens is finite and that the next token is always part of that set.
But things like grammar correctness, or semantic consistency over a few sentences are not hardcoded rules in the model. They’re emergent properties, mostly due to the amount and quality of data available for training. Quantization is mostly about how much we can shed without losing particular emergent properties (like dithering or psychoacoustic audio compression).
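To make the "how much can we shed" part concrete, here is a rough sketch of symmetric uniform quantization on a handful of made-up weights. The weight values and bit widths are purely illustrative; production schemes (per-channel scales, calibration, and so on) are more involved.

```python
def quantize(weights, bits=8):
    """Symmetric uniform quantization: map floats to signed integers plus one scale."""
    qmax = 2 ** (bits - 1) - 1
    # Fall back to 1.0 so an all-zero weight list does not divide by zero.
    scale = max(abs(w) for w in weights) / qmax or 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(q, scale):
    return [v * scale for v in q]


def max_error(weights, bits):
    """Largest absolute difference after a quantize/dequantize round trip."""
    q, scale = quantize(weights, bits)
    return max(abs(a - b) for a, b in zip(weights, dequantize(q, scale)))


weights = [0.12, -0.5, 0.033, 0.91, -0.27]  # hypothetical weights
for bits in (8, 4, 2):
    print(bits, "bits -> max error", max_error(weights, bits))
```

The round-trip error is bounded by half the scale, so it grows as the bit width shrinks. Whether a given error level destroys an emergent property is an empirical question that this sketch does not answer.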
You know it perfectly damn well that a typical person's idea of statistics is not some insanely high cardinality stateful prediction, but a "well a coin toss is a 50:50, and a lottery win is a 1:100000000". You also know it perfectly damn well that as a result, people will just think that all the sentences chatbots ever produced to them were then just somewhere in the massive training set, letter by letter. This insinuation is often even explicitly appealed to.
And that picture is outright false. It's a statistical process, yes, so saying that it does what it does by "just doing statistics" is gonna be a generally correct description, but that's not at all informative about how exactly it does it, nor is it the zinger you think it is. If you did the aforementioned, you'd just get milquetoast nonsense, like you can see in the countless Markov-chain primers. And while the models do have a lot of the training set lossily captured, they do also absolutely generalize (that's how they can do that lossy compression), and you can quite literally find representations of those generalizations in them, and also see them activate.
It's like summarizing how any program works by just saying "well it just manipulates ones and zeroes". Not very informative, is it? Or how programs are written by just programmers sitting in a cushy office, rhythmically pressing keys on a keyboard. Not a very fair or insightful description, which you'll know if you've done any amount of programming in your life on your own. Extends to all other white collar jobs too.
It's also not even true in the most literal sense: models can and do absolutely choose a less than maximally likely next token, that's what the various decoding parameters are for. "Maximally likely next token" further conveniently skipping over how that likelihood is established in the first place, i.e. the literal point of the question, going in a cute little circle.
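A quick sketch of what those decoding parameters do, using a made-up four-word vocabulary and made-up logits (no real model involved): greedy decoding always takes the top token, while temperature sampling regularly picks something else.

```python
import math
import random


def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy(logits):
    """Always pick the single most likely token."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sample(logits, temperature, rng):
    """Draw a token index in proportion to its temperature-scaled probability."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


vocab = ["the", "a", "cat", "dog"]   # hypothetical vocabulary
logits = [2.0, 1.5, 0.3, 0.1]        # hypothetical model scores
rng = random.Random(0)

picks = [vocab[sample(logits, 1.0, rng)] for _ in range(1000)]
print("greedy:", vocab[greedy(logits)])
print("sampled:", {tok: picks.count(tok) for tok in vocab})
```

Lowering the temperature sharpens the distribution toward the greedy choice, and raising it flattens the distribution. Top-k and top-p work on the same probabilities by truncating the tail before sampling.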
I'm so over this "stochastic parrot" bullshit.