Think of it as a best fit curve and exceptions to that curve. The noise is essentially this set of exceptions that move points away from where they would otherwise fall on the curve.
Gradient descent wants to be able to make the smallest change that moves the most data points towards the curve. To do this it learns an arrangement where it can change, say, one parameter and have a bunch of points move at once. What does this correspond to? The big common patterns shared by many data points.
Most of the capacity gets soaked up modelling these sorts of common patterns, and after they have been learned the model starts adding exceptions that allow individual points to deviate from the curve.
Because they’re exceptions, they must not impact neighbouring points, or at least only ones within a very short distance from them. Otherwise they’re now driving the error higher by impacting more points than they should. So you end up with very narrow ranges of features that are able to trigger different sorts of noise.
How narrow they are is shaped by the training data, they’re exactly as narrow as needed not to raise the error, so assuming the total population has the same distribution, they don’t get hit. Much.
At least, that’s what I take away from it.
I suspect there is going to be a lot of handwaving to actually go from eNTK to that new update rule.
I also doubt it helps in the non-grokking regime, given the focus of the theory, which is where all the practical applications I have ever heard from live.
Don't get me wrong, I did enjoy reading this essay. It's well written and reasonably argumented without going into details.
Nah, the softer stuff seems like valuable outreach / good science communication for people that aren't up for the math. Including probably lots of software engineers who are sick of dumb debates in forums, and starting to dip into the real literature and listen to better authorities. More people should do this really, since it's the only way to see past the marketing and hype from fully entrenched AI boosters or detractors. Neither of those groups is big on critical thinking, and they dominate most conversation.
Time/effort coming from experts who want to make things accessible is a gift! The paper is linked elsewhere in the thread if you want no-frills.