We don't really know why language works with humans, either. If you raise a baby from birth, you can observe how it learns language, but the process is still rather mysterious. My eldest son's first "word" was actually an imitation of a cow mooing, followed by the motor noise of a tractor or truck, and then a meow. (His first complete sentence was "King Graham fell"...)

My next child took a completely different path to language, including skipping all the non-verbal imitations.

And then at some point, you can suddenly have two-way communication with them when you couldn't before, and after that, they can engage in reasoning.

reply
Completely agree!

It’s interesting to me how similar attempting to understand LLMs is to neuroscience.

“When we turn this bit off, this other thing happens… if we change these weights the Eiffel Tower is now in Rome”

We’re basically just probing around and trying to reverse engineer an emergent system.

To your point, this system may be quite different from model to model (human to human) although some similarities likely occur.

The comment I was responding to tried to belittle the OP’s understanding of transformers by mentioning that running an LLM at scale is much harder than the simple whiteboard diagram.

My point was simply that we don’t know why they work, and all the extra optimizations aren’t the “thing” that makes it emergent.

Simply scaling the “GPT” is good enough to see it, so the OP’s awe should stand.

(On a side note, what other architectures can we scale to find similar emergent behavior?)

reply
Computer vision ends up displaying emergent behaviour. It just "figures out" things.
reply
Human brain capabilities are truly amazing. Imagine if people didn’t treat their children as if they were stupid and didn’t constantly lie to them, because kids are stupid, right, they wouldn’t understand. What heights could be reached?
reply
We don’t treat children like they’re stupid, we treat children like they’re children. A stupid adult is treated very differently than any child.

Adults are expected to have their world models approximately correct in terms of physical environment so they won’t accidentally kill themselves by falling off a cliff; then there are the social norms which adults are expected to conform to so everyone is kinda predictable to everyone else so adults don’t kill each other too often over food or mates. Understanding of neither is expected from children.

reply
Another example: my parents taught me to read at about 4 years old. When I started kindergarten (the year before 1st grade in the US), the teachers and principal didn't believe I could read, and I had to prove it by reading them a book I'd never seen before.

I think they're right that kids (at least in the US) are generally treated as less capable than they are, and it ends up slightly delaying their development.

reply
You may have been raised properly, since you don’t get what I mean. I really envy kids with “Chinese parents” who had them learn math early on instead of some bullshit like "if you put your tooth under your pillow, a tooth fairy will come."
reply
I think those two are orthogonal. Math still works with Santa or the tooth fairy.
reply
Maybe math works but critical thinking doesn’t. There are people who have lived for many decades without ever questioning insane b.s. they were taught as kids.
reply
It is possible to have learned both things, you know.
reply
I had to learn maths early (not Chinese or Asian) and also a bunch of scary stories to make me behave. I would have been glad to learn about fairies.
reply
They aren't stupid, but they aren't quite ready to handle the full responsibilities of the world, and they worry about things they don't need to worry about.

My son has been very worried about black holes lately, ever since he learned that anything that goes into one can't get out. He's pretty concerned astronauts could get stuck in one some day. So I explained to him that Hawking radiation does actually mean you can eventually get out; it just takes some time.

I didn't think it pertinent to mention spaghettification, the fact anywhere near a black hole will be really hot, or that cosmic censorship means whatever Hawking-radiates from a black hole wouldn't be an astronaut anymore.

It was also fun to hear Hawking speak. My son wanted to know if Hawking was a robot. I said no, but he has a robot talk for him. Not quite true, but close enough.

reply
Because god forbid that childhood, the one time in your life when you don't have any responsibilities, should be fun.
reply
Waste 22 years of life without learning anything and then slave away at a 9-5 job you hate. Brilliant strategy. At least you had “fun”. Then blame billionaires or something.
reply
Childhood only lasts 13 to 15 years where I am. By the time you’re in high school, you can be expected to be responsible in some matters. By 22 you have 7 years of experience in making decisions for yourself.
reply
Hm, I wonder if it's more that we're shocked such a simple thing (relatively speaking) can work so well.
reply
It was precisely that for me! Another commenter captures it well; “the bitter lesson” indeed.
reply
We do know how they work. They predict the next statistically most likely token.

The "bitter lesson" is that fake-it-till-you-make-it is a valid way of doing knowledge work.

(Or not make it, then people will just claim you're holding the LLM wrong and it's not the AI's fault.)

reply
Sufficiently good iterated next-token prediction is an AI-hard problem.
reply
> statistically most likely token.

Statistically most likely in what context, given which preconditions? Each prompt sequence is unique, so the probability of any token following it is unknown.

reply
It’s not unknown because that’s what the model computes. It’s matrix multiplication just like shaders.
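
To make "that's what the model computes" a bit more concrete, here is a rough sketch, not any real model's code. It assumes the model has already reduced the prompt to a vector (`hidden`), and that a made-up matrix (`unembed`) maps that vector to one score per token. The vocabulary, the numbers, and every name here are invented for illustration. The point is only that a softmax over those scores yields a probability for every token, even for a prompt never seen before.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_distribution(hidden, unembed):
    """One matrix-vector product, then softmax: a score per vocabulary entry."""
    vocab_size = len(unembed[0])
    logits = [
        sum(h * row[j] for h, row in zip(hidden, unembed))
        for j in range(vocab_size)
    ]
    return softmax(logits)


# Toy, made-up values: a 2-dimensional hidden state and a 3-token vocabulary.
VOCAB = ["cat", "dog", "fish"]
hidden = [0.5, -1.0]
unembed = [
    [1.0, 0.0, -1.0],
    [0.0, 2.0, 1.0],
]

probs = next_token_distribution(hidden, unembed)
for token, p in zip(VOCAB, probs):
    print(f"{token}: {p:.3f}")
```

Whether those probabilities are "correct" in any deeper sense is a separate question; the sketch only shows that they are always defined.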
reply
And how do you know that the model computes it correctly?
reply
Correctness is based on axioms and rules. You need to define your axioms and rules first before you can determine correctness.

If you’re talking about matrix multiplication, I can use mathematical rules and axioms and prove formally that the multiplication is correct. For next token prediction, I can prove that the set of tokens is finite and that the next token is always part of that set.

But things like grammatical correctness or semantic consistency over a few sentences are not hardcoded rules in the model. They’re emergent properties, mostly due to the amount and quality of data available for training. Quantization is mostly about how much we can shed without losing a particular emergent property (like dithering or psychoacoustic audio compression).
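
For a feel of what "shedding" means, here is a minimal sketch of symmetric 8-bit quantization on a handful of made-up weights. It is one simple scheme chosen for illustration, not what any particular model or library does, and `quantize_int8` and `dequantize` are hypothetical helper names. Each weight is stored as an integer in [-127, 127] plus one shared scale, so every restored value is off by at most half a scale step.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]


# Made-up weights, purely for illustration.
weights = [0.12, -0.5, 0.33, 0.9]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

# The rounding error per weight is what gets "shed".
errors = [abs(w - r) for w, r in zip(weights, restored)]
print(quantized, max(errors))
```

Whether an emergent property survives that rounding is an empirical question the sketch does not answer; it only shows where the loss comes from.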

reply
This "they just predict the next statistically most likely token" is such a handwavy and willfully misleading explanation, it's unreal, and I'm so fucking tired of seeing it so incessantly repeated. It's beyond asinine.

You know perfectly damn well that a typical person's idea of statistics is not some insanely high-cardinality stateful prediction, but "well, a coin toss is 50:50, and a lottery win is 1:100000000". You also know perfectly damn well that as a result, people will just think that every sentence a chatbot ever produced for them was sitting somewhere in the massive training set, letter by letter. This insinuation is often even explicitly appealed to.

And that picture is outright false. It's a statistical process, yes, so saying that it does what it does by "just doing statistics" is gonna be a generally correct description, but that says nothing about how exactly it does it, nor is it the zinger you think it is. If you did the aforementioned, you'd just get milquetoast nonsense, like you can see in the countless Markov-chain primers. And while the models do have a lot of the training set lossily captured, they also absolutely generalize (that's how they can do that lossy compression), and you can quite literally find representations of those generalizations in them, and also see them activate.

It's like summarizing how any program works by just saying "well, it just manipulates ones and zeroes". Not very informative, is it? Or summarizing how programs are written as just programmers sitting in a cushy office, rhythmically pressing keys on a keyboard. Not a very fair or insightful description, which you'll know if you've done any amount of programming on your own. Extends to all other white collar jobs too.

It's also not even true in the most literal sense: models can and absolutely do choose a less than maximally likely next token; that's what the various decoding parameters are for. "Maximally likely next token" also conveniently skips over how that likelihood is established in the first place, i.e. the literal point of the question, going in a cute little circle.
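
A minimal sketch of what those decoding parameters do, assuming the model has already produced one raw score (logit) per token. The function name, the parameter names, and the toy scores are all made up for illustration; real inference stacks differ in the details. Temperature rescales the scores, top-k trims the candidates, and the token is then drawn at random, so the top-scoring token is not always the one picked.

```python
import math
import random


def sample_next_token(logits, temperature=1.0, top_k=None, rng=random):
    """Pick a token index from raw scores; all parameters are illustrative."""
    if temperature <= 0:
        # Treat non-positive temperature as greedy decoding: always the top score.
        return max(range(len(logits)), key=lambda i: logits[i])

    # Rank token indices by score and optionally keep only the k best.
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]

    # Unnormalized softmax weights over the kept tokens, after temperature scaling.
    scaled = [logits[i] / temperature for i in ranked]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]

    # random.choices normalizes the weights itself.
    return rng.choices(ranked, weights=weights, k=1)[0]


toy_logits = [2.0, 1.0, 0.1]  # made-up scores for a three-token vocabulary
rng = random.Random(0)  # seeded so runs are repeatable

greedy = sample_next_token(toy_logits, temperature=0)
draws = [sample_next_token(toy_logits, temperature=1.0, rng=rng) for _ in range(1000)]
print(greedy, {i: draws.count(i) for i in range(len(toy_logits))})
```

With `temperature=0` this collapses to "always the most likely token"; with anything higher, the less likely tokens get picked some of the time.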

I'm so over this "stochastic parrot" bullshit.

reply