A Theory of Deep Learning

(elonlit.com)

Reminded me strongly of the paper "Deep Learning is Not So Mysterious or Different" from a year ago: https://arxiv.org/abs/2503.02113
reply
> This exact characterization is possible because in output space, training dynamics can be understood through a locally linear differential equation along the realized path, where dominant eigenmodes of the evolving kernel equilibrate exponentially fast. Forcing an optimizer to slowly step through these solved directions is highly inefficient and suggests a path to analytically jump to the final network state.

But at what computational cost?
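For intuition on what "analytically jumping" could mean in the simplest possible setting: for a linear model, the output-space ODE the quote describes has a closed form, and the end state is one linear solve. Here's a toy NumPy check of that linearized caricature (this is the standard NTK-style picture, not the paper's actual construction):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear model: loss L(w) = 0.5 * ||X w - y||^2. In output space the
# gradient-flow dynamics are a linear ODE, df/dt = -K (f - y) with K = X X^T,
# so each eigenmode of K decays like exp(-lambda_i * t): the dominant modes
# equilibrate long before the slow ones, yet the optimizer keeps stepping
# through all of them.
X = rng.normal(size=(20, 5))
y = rng.normal(size=20)

# Plain gradient descent, slowly stepping through already-solved directions.
eta, steps = 1e-3, 20000
w = np.zeros(5)
for _ in range(steps):
    w -= eta * X.T @ (X @ w - y)

# "Jumping to the final state" analytically: for this caricature it is just
# the least-squares solution, a single linear solve instead of 20000 steps.
w_star = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose(w, w_star, atol=1e-6))
```

The cost question then becomes: how expensive is that "solve" for a real nonlinear network, where the kernel evolves along the path instead of staying fixed.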

reply
The relevant paper: "A Theory of Generalization in Deep Learning". https://arxiv.org/abs/2605.01172
reply
I interpreted the kernel K of this paper as analogous to the BRDF in the rendering equation [0] and its familiar diffusion process (from light transport simulation, or really any integro-differential equation system). Together with https://en.wikipedia.org/wiki/Neural_tangent_kernel, I hope this paper might be accessible with some study.

[0] https://en.wikipedia.org/wiki/Rendering_equation

reply
Linking to the paper: https://arxiv.org/pdf/2605.01172, which is also a fantastic read; the application to deep learning is good. It does a lot of cross-mapping, and a bunch of old ideas show up under new names in this paper, which is worth calling out for those with those backgrounds:

"Cumulative Dissipation Gramian" Ws is the observability Gramian (from control theory). For example, the spectral cutoff is exactly the Hankel singular value truncation from model reduction.

"Signal Channel" / "Reservoir" correspond to the controllable/observable vs. uncontrollable/unobservable subspaces. Applying Adamyan-Arov-Krein (AAK) theory gives the optimal nonlinear reduced model, answering the optimal-compression question.

"Drift–Diffusion Separation" is Freidlin-Wentzell large deviation theory. They can predict "grokking" time from the FW action.

"Population-Risk Gate" is the quantum weak value / postselection (Aharonov).

So for the follow-up problems:

Control theory gives the truncation error bounds for model compression. Large deviation theory gives the grokking time predictions. Quantum measurement theory gives the imaginary preconditioners. Information geometry gives the optimal continuous relaxation of the gate.

Some nice implications of new ways of doing stuff which are nice to see formalized here:

Old: pick an architecture and hope it generalizes. New: design the architecture to maximize observability Gramian rank. (Honestly, we pull a lot from control theory here.)

Old: use a validation set to detect overfitting. New: monitor the λ(Ws) spectrum during training; no validation needed.

Old: prune post hoc based on magnitude. New: prune during training based on ker(Ws) membership.

Old: fixed learning rate. New: spectral learning rate.
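For those without the control theory background, the Gramian in question is cheap to compute for a small linear system. A sketch (my own illustration, not from either paper) of how the rank and spectrum of the observability Gramian expose an unobservable direction:

```python
import numpy as np

def observability_gramian(A, C, tol=1e-12, max_iter=10_000):
    """W = sum_k (A^T)^k C^T C A^k for a stable discrete-time system
    x_{k+1} = A x_k, y_k = C x_k. rank(W) gives the observable subspace."""
    W = np.zeros((A.shape[0], A.shape[0]))
    term = C.T @ C
    for _ in range(max_iter):
        W += term
        term = A.T @ term @ A           # next term of the series
        if np.linalg.norm(term) < tol:  # series has converged
            break
    return W

A = np.diag([0.9, 0.6, 0.3])     # stable: all |eigenvalues| < 1
C = np.array([[1.0, 1.0, 0.0]])  # the output never measures state 3
W = observability_gramian(A, C)
eigvals = np.linalg.eigvalsh(W)  # ascending order
print(np.round(eigvals, 4))      # smallest is ~0: state 3 is unobservable
```

A spectral cutoff on W drops exactly those near-zero directions, which is the Hankel-style truncation mentioned above.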

reply
This is a fascinating mathematical framework, but the post title might be a bit of an overreach. I often wonder if "a theory of deep learning" could exist that could be stated succinctly and that could predict (1) scaling laws and (2) the surprising reliability of gradient descent.

Note that I said "predict" not "describe". It feels like we're still in the era of Kepler, not Newton.

reply
I dunno... gradient descent is only really reliable with a big bag of tricks. Knowing good initializations is a starting point, but residual connections and batch/layer normalization go a very long way towards making it reliable.
reply
I agree, this is the correct way to see it IMO. Instead of designing better optimizers, we designed easier parameterizations to optimize. The surprising part is that these parameterizations exist in the first place.
reply
Gradient descent is mathematically the most efficient optimization strategy in high dimensions (save for some special functions). This goes so far that people nowadays even believe it has to be used in the human brain [1], if only because every other method of updating the brain would be far too energy-inefficient. From that perspective, finding the right parameterization was all we ever needed to achieve AI.

[1] https://physoc.onlinelibrary.wiley.com/doi/full/10.1113/JP28...

reply
Even in supervised ML, pure gradient descent is not the most efficient optimization strategy. E.g., momentum is ubiquitous, and the updates it induces cannot be expressed as a gradient of some scalar loss. But the rotational non-gradient component of its updates substantially improves performance and convergence on the architectures we use.

The brain probably primarily uses something like TD learning for task learning, which is also not expressible as the gradient of any objective function. And though the paper mentions Hebbian learning, it's only for very particular network architectures (e.g. a single neuron, or symmetric connections) that its updates can be treated as the gradient of some energy function; those architectures aren't anything close to what we see in the brain.

reply
> but the post title might be a bit of an overreach.

Really? An essay that leads off with a Borges anecdote skewed grandiose? Oh my, how unprecedented!

reply
Idk, to me this is just redescribing what deep neural networks do without actually explaining why anything happens. I guess it "unifies" things, but I'm kinda over most unifying theories. Everything is Bayesian, everything is a graph or a group or some other fancy geometric structure, everything is a category. Ultimately the best framework is whatever is useful enough to explain what's happening in such a way that a practitioner can manipulate the model towards a desired outcome. In other words: where is the knob? The tool they share may be interesting, and I hope to play with it to see what happens at different levels of noise applied to the labels.
reply
We're still in the room-sized-computers-only-scientists-understand era of neural networks. Knobs and buttons for nerds are slowly coming.
reply
A real theory would predict phenomena thus far unseen. We already know about this four-part taxonomy.
reply
Did you also know about this?

Lastly, we derive an exact population-risk objective from a single training run with no validation data, for any architecture, loss, or optimizer, and prove that it measures precisely the noise in the signal channel. This objective reduces in practice to an SNR preconditioner on top of Adam, adding one state vector at no extra cost; it accelerates grokking by 5x, suppresses memorization in PINNs and implicit neural representations, and improves DPO fine-tuning under noisy preferences while staying 3x closer to the reference policy. [1]

[1] https://arxiv.org/abs/2605.01172

reply
This essay seems to be related to the paper "There Will Be a Scientific Theory of Deep Learning" [1] which was discussed here recently [2].

[1] https://arxiv.org/pdf/2604.21691

[2] https://news.ycombinator.com/item?id=47893779

reply
> That is, if the batch signal on a parameter exceeds its leave-one-out noise, update it; if not, skip it. This is a one-line change to Adam that accelerates grokking by 5x, suppresses memorization in PINNs, and improves DPO fine-tuning, eliminating the need for validation sets entirely.

Does anyone understand the formula they expressed above this sentence? Is this just the classic "skip updating parameters with high gradient/loss variance across multiple batches/samples"?

reply
What is classic about "skip updating parameters with high gradient/loss variance in multiple batches/samples"? Do you have a particular algorithm in mind that uses this heuristic?
reply
This looks like excellent work; it's reminding me of things I learned from Welch Labs videos. Given the amount of time I budget for keeping up with this stuff (regrettably too little), I'll wait until Welch Labs presents something on this.
reply
Interesting read. I remember the grokking paper when it came out, but I don't think I've ever seen that classic grokking loss curve first-hand on real data. Curious if others have seen it more often in practice.
reply
To get pure grokking, you need a model large enough to easily memorize the entire training data and keep training for a long time after memorization. In practice, you'll probably use a more realistically-sized model that might grok on some subset of the data, but not so strongly that it's extremely obvious.
reply
Looks like a typical machine learning paper to me. It cannot be understood unless you already kind of understand it. That is OK for communication with peers, but eventually I expect a "theory of" to be readable by anyone with a math degree.
reply
Does anyone happen to know what font this site is using? It looks really elegant.
reply
It is a modified version of ET_Book called ET_Bembo:

https://github.com/DavidBarts/ET_Bembo

reply
I love u. thanks!
reply
Apparently it's the font used in Edward Tufte's books. It's on GitHub: https://edwardtufte.github.io/et-book/
reply
"The Visual Display of Quantitative Information", which I just checked, uses Monotype Bembo. So still Bembo, but a different version.
reply
Font is atrocious.

Uppercase letters have different stroke width than lowercase ones — it’s like they are *B*old *L*ike this.

Not only that: tracking and kerning are basically non-existent.

Please don’t use that open-source font.

You need real paid Bembo, not that piece of shit.

reply
This landed precisely on like three weird bugs I've been hitting and solving in different stupid ways, for dealing with things like SGD collapsing too many good answers into one bad answer, and it gave me a real direction for the missing link in my own ML stuff. What timing. I have tried analytic solutions too, and they're useful for things like mapping prompts into memory geometry, but from there I've still ended up having to use SGD. Because I think what happens is: SGD teaches the neural net both the geometry and how to navigate it. If you just teleport to the answer, it doesn't learn how to walk.
reply
A very fascinating read.

As a fellow Tufte CSS enjoyer: why is user-select turned off on the sidenotes? I would quite badly like to be able to copy-paste them.

reply
What a beautifully written article. It's extremely rare that I favourite an article, but this is one.
reply
The Hidden Physics of LLMs: Retrieval as Thermodynamics

https://www.youtube.com/watch?v=ppCZfjLdSY8

I found this video illustrative as well. Simple, and anyone can understand it.

reply
Very extremely. Quite a lovely presentation. I'm definitely having a Patrick Bateman-esque appreciation for that delicate cream background.
reply
This is a beautifully written way of saying "Some parts of what the network memorizes affect test behavior, and some don't." But that's not a theory of deep learning; the grand unified theory would explain why.

We're given a signal channel and a reservoir. Signal lives in the channel, noise lives in the reservoir, and the reservoir supposedly doesn’t show up at test time.

Okay, but then we have: why would SGD put the right things in the right bucket?

If the answer is “because the reservoir is defined as the stuff that doesn’t transfer to test,” then this is close to circular.

The Borges/Lavoisier stuff is a tell. "We have unified the field" rhetoric should come after nontrivial predictions and results. Claiming to solve benign overfitting, double descent, grokking, implicit bias, estimating population risk from training alone, avoiding validation sets, and, last but not least, skipping training by analytically jumping to the end is six theory papers, three NeurIPS winners, and a $10B startup. Let's get some results before we tell everyone we unified the field. :) I hope you're right.

reply
> why would SGD put the right things in the right bucket?

Think of it as a best fit curve and exceptions to that curve. The noise is essentially this set of exceptions that move points away from where they would otherwise fall on the curve.

Gradient descent wants to be able to make the smallest change that moves the most data points towards the curve. To do this it learns an arrangement where it can change, say, one parameter and have a bunch of points move at once. What does this correspond to? The big common patterns shared by many data points.

Most of the capacity gets soaked up modelling these sorts of common patterns, and after they have been learned the model starts adding exceptions that allow individual points to deviate from the curve.

Because they’re exceptions, they must not impact neighbouring points, or at least only ones within a very short distance from them. Otherwise they’re now driving the error higher by impacting more points than they should. So you end up with very narrow ranges of features that are able to trigger different sorts of noise.

How narrow they are is shaped by the training data, they’re exactly as narrow as needed not to raise the error, so assuming the total population has the same distribution, they don’t get hit. Much.

At least, that’s what I take away from it.
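A deliberately hand-constructed toy of that picture: a smooth trend plus one very narrow bump that absorbs a single mislabeled point without disturbing its neighbours. (The bump width is chosen by hand here; the argument above is that training finds it automatically.)

```python
import numpy as np

def smooth(x):
    return np.sin(x)  # the common pattern ("signal" / best-fit curve)

x_noisy = 1.0
y_noisy = smooth(x_noisy) + 2.0  # one badly mislabeled training point
width = 0.01                     # the exception is kept extremely narrow

def model(x):
    # smooth trend plus one narrow bump that memorizes the mislabeled point
    bump = (y_noisy - smooth(x_noisy)) * np.exp(-(x - x_noisy) ** 2 / (2 * width ** 2))
    return smooth(x) + bump

print(model(x_noisy) - y_noisy)  # ~0: the noisy label is fit exactly
print(model(1.2) - smooth(1.2))  # ~0: a nearby test input still follows the trend
```

The training error on the corrupted point is zero, yet predictions a short distance away are untouched, which is exactly the "exceptions must not impact neighbouring points" behaviour described above.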

reply
Admittedly probably some aggrandized boasting here, but I think empirical verification of that Adam modification alone would be a meaningful contribution, unless that's prior work?
reply
A theory that skips parameter space and explains grokking, yet comes up with an unexplained update rule, which notably works at the per-parameter level by dropping the updates for most parameters.

I suspect there is going to be a lot of handwaving to actually go from eNTK to that new update rule.

I also doubt it helps in the non-grokking regime, given the focus of the theory, and that is where all the practical applications I have ever heard of live.

Don't get me wrong, I did enjoy reading this essay. It's well written and reasonably argued without going into details.

reply
The handwaving required is just to assume a diagonal preconditioner, and the optimal preconditioner under that constraint corresponds to the new update rule (see Section F of the paper). And of course a diagonal preconditioner works at the per-parameter level.
reply
If that's the case, a way to test the theory and understanding (assuming some parts of the reservoir and signal channel can be reliably identified) would be to prune the high-confidence reservoir, significantly reducing the model size while still getting good predictions. I don't believe the authors mention this (though I skimmed rather than reading the full paper in detail, so I may be wrong).
reply
I don't know the math, but this point was clear to me and it screamed "crank". I can't be sure of that, because I'm not learned enough to understand the math, but even I could tell the magnitude of the claim. Even just removing the need for validation sets would have epic consequences across many fields.
reply
These are the same complaints I had. It also felt like high-quality AI writing, possibly because of style choices like "Benign overfitting is noise sitting in the reservoir at interpolation. XYZ is ...", and because of its similarity to the times I've ended up with ChatGPT or Gemini producing very detailed and plausible reports about my own crackpot or vague-enough-to-be-useless ideas.
reply
> The Borges/Lavoisier stuff is a tell.

Nah, the softer stuff seems like valuable outreach / good science communication for people that aren't up for the math. Including probably lots of software engineers who are sick of dumb debates in forums, and starting to dip into the real literature and listen to better authorities. More people should do this really, since it's the only way to see past the marketing and hype from fully entrenched AI boosters or detractors. Neither of those groups is big on critical thinking, and they dominate most conversation.

Time/effort coming from experts who want to make things accessible is a gift! The paper is linked elsewhere in the thread if you want no-frills.

reply
So, this is either the paper of the year, or ... definitely not the paper of the year.

https://arxiv.org/pdf/2605.01172 is the current version. The money graphs are on page 8 and onward, where they show (some weirdly thick) line charts with loss levels reached in roughly 1/5 the number of steps that Adam takes, just what the blog post mentions.

They also claim holding back test data is not needed, also with more graphs.

I'm not an ML scientist, and I did not attempt to seriously parse the math. It reads to me as sitting precisely in that liminal space some math papers occupy, where there's enough new terminology that actually parsing through it all is going to take real, concerted effort, possibly with mild brain damage as a risk.

Their 3d graphs of "kernel eigenstructure" also do double duty for me as totally impenetrable and possibly part of an April fool's ML paper that's hilarious to insiders. Or maybe they show something really amazing; they definitely seem to converge into a shape...What does that shape mean??? Why??? What is an eigenstructure? Is it just 3D eigenvectors of some matrices? Is it natural to have a 3D shape representing these large matrices? If not, how and why were these projected down? And why are they different colors in the paper?? You get the feel for my level of understanding.

I think it would frankly just be easier to validate this claim than to parse the whole paper. If only I could understand

  > Each one-step kernel increment ηK^{MS}_t integrates into W^{MS}, so a sequence of one-step rate-maximizers is the greedy policy whose integral is the signal-channel content of the trajectory through G, exactly as plain SGD is the greedy step whose integral is empirical-risk descent through D. The diagonal cutoff μ²_k > σ²_k/(b−1) is the optimal first-order preconditioner for population risk on any diagonal base, and a streaming variance EMA ŝ_t of squared gradient deviations realizes it as a one-line change to AdamW: one extra parameter-sized state vector and a per-parameter gate that multiplies the standard moment update
well enough to implement the one-line update to Adam in Python. I have not asked Codex or Claude to assist yet.
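Here's my guess at what that gate looks like in plain NumPy. The state layout, EMA constant, and exact gate form are my reading of the quoted passage, not the paper's actual Algorithm 1:

```python
import numpy as np

def snr_gated_adam_step(w, grad, state, lr=1e-2, b=32,
                        beta1=0.9, beta2=0.999, beta_s=0.99, eps=1e-8):
    """One Adam-like step with a hypothetical per-parameter SNR gate.

    Gate: update parameter k only if its squared mean gradient ("signal")
    exceeds the estimated variance / (b - 1) ("noise"), i.e.
    mu_k^2 > sigma_k^2 / (b - 1), with sigma^2 tracked by a streaming EMA
    of squared gradient deviations (the "one extra state vector").
    """
    m, v, s, t = state
    t += 1
    s = beta_s * s + (1 - beta_s) * (grad - m) ** 2  # noise estimate
    m = beta1 * m + (1 - beta1) * grad               # first moment
    v = beta2 * v + (1 - beta2) * grad ** 2          # second moment
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    gate = (m_hat ** 2 > s / (b - 1)).astype(w.dtype)  # per-parameter gate
    w = w - lr * gate * m_hat / (np.sqrt(v_hat) + eps)
    return w, (m, v, s, t)

# Toy check on loss 0.5 * ||w||^2 (gradient is just w):
w = np.ones(4)
state = (np.zeros(4), np.zeros(4), np.zeros(4), 0)
for _ in range(500):
    w, state = snr_gated_adam_step(w, w.copy(), state)
print(np.linalg.norm(w))  # small: the gated optimizer still minimizes the loss
```

On this deterministic toy the gate mostly stays open; the interesting behaviour would be on noisy minibatch gradients, where updates get skipped wherever the noise estimate dominates.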

Also of note to me: they talk about grokking, which I found SUUUPER fascinating when it was first reported and have never heard about since. So I was really glad to read about it, and to learn that there has been a little academic work on the phenomenon.

Finally, of the three models they report results on, two are extremely tiny; the last is a DPO round on Qwen 0.5B. If the code for that is published, I imagine it would be easy to adapt and evaluate in other regimes.

reply
You don't need to understand that part of the derivation to implement it. You just need Algorithm 1 on page 33 of the paper. Or look at the author's implementation: https://github.com/elonlit/PopRiskMinimization/blob/main/pop...
reply
Thanks for the link - I did not see a GitHub.

So, your thoughts on the paper?

reply
I think it's a solid theoretical contribution, but it might nonetheless fail to have practical relevance if some of their assumptions and approximations turn out to be too unrealistic. One way this could happen, for example, would be if typical training batches get gradients with a high-enough signal-to-noise ratio that their optimizer tweak ends up not tweaking much. Their somewhat unusual selection of experiments makes me suspect that this might be the case.

I read the paper earlier when it showed up on https://news.ycombinator.com/from?site=arxiv.org and the writing style of the blog post turned me off, so I didn't bother to check how much it overhypes the results compared to the paper. But certainly a lot of people seem to have gotten the idea that this must be big if true, whereas I think it's better classified as neat, but not revolutionary.

reply
Where’s the theory of how the human brain does what it does? Maybe these high-dimensional structures don’t have a nice compact “theory”. Trying to fit these systems into a nice compact theory is a very human thing, but not everything works like that.
reply