Any model using an infinite-dimensional Hilbert space, such as an SVM with an RBF or polynomial kernel, Gaussian process regression, gradient boosted decision trees, etc., has the same universal approximation property (though proven via a different theorem, of course).
So the universal approximation theorem tells us nothing about why we should expect neural networks to perform better than those models.
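To make the comparison concrete, here's a rough sketch (assuming NumPy and scikit-learn are available; the target function and hyperparameters are made up for illustration) of an RBF-kernel SVM fitting a nonlinear function, which is the same "can approximate anything" behavior the universal approximation theorem promises for neural networks:

```python
# Rough sketch: an RBF-kernel SVM approximating a nonlinear target,
# illustrating that universal approximation is not unique to neural nets.
# Assumes numpy and scikit-learn are installed; the target function and
# hyperparameters here are made up for illustration.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(2 * X[:, 0]) + 0.1 * rng.normal(size=500)  # noisy nonlinear target

model = SVR(kernel="rbf", C=10.0, gamma=1.0)
model.fit(X, y)

X_test = np.linspace(-3, 3, 200).reshape(-1, 1)
worst_case = np.max(np.abs(model.predict(X_test) - np.sin(2 * X_test[:, 0])))
print("worst-case error on the grid:", worst_case)
# The theorem-level guarantee is only that arbitrarily good approximation is
# possible; it says nothing about sample efficiency or which model wins.
```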
Pretty sure it's been shown that grokking requires L1 regularization, which pushes model parameters towards zero. This can be viewed as compression, in the sense of encoding the distribution in the fewest bits possible, which happens to correspond to better generalization.
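For what it's worth, here's a rough sketch of the kind of L1 penalty being described, assuming PyTorch and a made-up toy model (this is not the setup of any particular grokking paper): the penalty is added to the task loss and drives many weights towards zero, which is the compression being referred to.

```python
# Rough sketch of L1-regularized training (assumes PyTorch; the model,
# learning rate, and penalty strength are made-up toy values).
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
l1_lambda = 1e-4  # strength of the L1 penalty (hypothetical value)

def train_step(x, y):
    logits = model(x)
    task_loss = nn.functional.cross_entropy(logits, y)
    # L1 term: sum of absolute weights; its gradient pushes every
    # parameter towards zero, sparsifying the network over training.
    l1_penalty = sum(p.abs().sum() for p in model.parameters())
    loss = task_loss + l1_lambda * l1_penalty
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Toy usage with random data, just to show the call.
print(train_step(torch.randn(32, 10), torch.randint(0, 2, (32,))))
```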
Sure, that gives some relief, but it says nothing in practice, unlike, for example, knowing which side of the P/NP divide a problem is on.
Actually, the P/NP divide is a similar case, in my opinion. In practice, a quadratic algorithm is sometimes unacceptably slow, while an NP-hard problem can often be solved for all practical purposes. E.g. large SAT instances are routinely solved at scale.
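To make "routinely solved" concrete, here's a rough sketch using the python-sat package (assuming it's installed; the three clauses are a made-up toy instance). The same interface is how you'd feed far larger instances to a modern CDCL solver:

```python
# Rough sketch of calling a SAT solver (assumes the `python-sat` package,
# imported as `pysat`). Variables are positive integers, negation is the
# minus sign, and each clause is a list of literals (a disjunction).
from pysat.solvers import Glucose3

solver = Glucose3()
solver.add_clause([1, 2])    # x1 OR x2
solver.add_clause([-1, 3])   # NOT x1 OR x3
solver.add_clause([-2, -3])  # NOT x2 OR NOT x3

if solver.solve():
    print("satisfiable:", solver.get_model())  # one satisfying assignment
else:
    print("unsatisfiable")
solver.delete()
```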
It's similar to the gap between pushdown automata and Turing machines. You can check whether a pushdown automaton will terminate or not; you can't do that for Turing machines in general, but that doesn't stop you from running a pushdown-automaton algorithm, with its decidable termination, on a Turing machine.
Perhaps more important, just because it is easy to escape any local minimum does not mean that there is necessarily a trend towards a really good optimum; the optimizer can just bounce between a bunch of really bad ones for a long time. This actually happens almost all the time if you try to design your entire architecture from scratch, e.g. highly connected networks. People who are new to the field sometimes don't seem to understand why SGD doesn't just always fix everything; this is why. You need very strong inductive biases in your architecture design to ensure that the loss (which is data-dependent, so you cannot ascertain this property a priori) exhibits a global bowl-like shape (we often call this a 'funnel') that provides a general trajectory for the optimizer toward good solutions. Sometimes this only works for some optimizers and not others.
This is why architecture design is something of an art form, and why "why do neural networks work so well?" is a complex question involving a ton of parts, all of which contribute in meaningful ways. There are often plenty of counterexamples to any simpler explanation.
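To give a toy picture of the funnel-versus-bouncing distinction above, here's a rough sketch (plain NumPy, a deliberately cartoonish 1-D "loss", all constants made up) running noisy gradient descent on a rugged landscape with and without a global bowl underneath it:

```python
# Toy 1-D cartoon of the "funnel" point: noisy descent on a rugged loss
# makes overall progress when the bumps sit inside a global bowl, and just
# rattles around when they don't. NumPy only; not a claim about real
# loss landscapes.
import numpy as np

def grad(x, funnel):
    bumps = 5 * np.cos(5 * x)     # gradient of the rugged part, sin(5x)
    bowl = x if funnel else 0.0   # gradient of the global bowl, x**2 / 2
    return bumps + bowl

rng = np.random.default_rng(0)
for funnel in (False, True):
    x = 20.0
    for _ in range(2000):
        x -= 0.05 * (grad(x, funnel) + rng.normal(scale=2.0))  # noisy step
    print("with funnel" if funnel else "no funnel", "-> final x:", round(x, 2))
# With the bowl, the iterate works its way down toward the bottom;
# without it, it just wanders around near where it started.
```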
If they were all correlated with each other, that does not seem far-fetched.
E.g. you could land perfectly on a local minimum, but you won't stay there unless your step size is minute or the minimum's basin is quite substantial.
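A toy version of that point (again plain NumPy with a cartoonish 1-D loss, constants made up): noisy descent started right at a narrow local minimum only stays put when the step size is tiny relative to the basin.

```python
# Toy cartoon: SGD-like noisy descent initialized at a narrow local
# minimum. A tiny step size keeps it there; a larger one kicks it out of
# the narrow basin and it drifts to the broad minimum instead.
import numpy as np

def grad(x):
    # loss = -exp(-x**2 / 0.01) + 0.1 * (x - 5) ** 2
    # i.e. a narrow dip near x = 0 on top of a broad bowl centred at x = 5
    return (2 * x / 0.01) * np.exp(-x**2 / 0.01) + 0.2 * (x - 5)

rng = np.random.default_rng(0)
for lr in (1e-3, 1e-1):
    x = 0.0  # start essentially at the narrow local minimum
    for _ in range(1000):
        x -= lr * (grad(x) + rng.normal(scale=0.5))  # noisy (SGD-like) step
    print(f"step size {lr}: ended near x = {x:.2f}")
# Small step size: stays in the narrow dip near 0.
# Large step size: overshoots the dip and settles near the broad minimum at 5.
```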