No it isn't, and it's frustrating when the "common wisdom" tries to boil it down to this. If this were true, then models with "infinitely many" parameters would be amazing. Why not just train a gigantic two-layer network? There is a huge amount of work on engineering training procedures that actually work well.

The actual reason is due to complex biases that arise from the interaction of network architectures and the optimizers and persist in the regime where data scales proportionally to model size. The multiscale nature of the data induces neural scaling laws that enable better performance than any other class of models can hope to achieve.

reply
> The actual reason is due to complex biases that arise from the interaction of network architectures and the optimizers and persist in the regime where data scales proportionally to model size. The multiscale nature of the data induces neural scaling laws that enable better performance than any other class of models can hope to achieve.

That’s a lot of words to say that, if you encode a class of things as numbers, there’s a formula somewhere that can approximate an instance of that class. It works for linear regression, and it works just as well for neural networks. The key thing here is approximation.

reply
No, it is relatively few words to quickly touch on several different concepts that go well beyond basic approximation theory.

I can construct a Gaussian process model (essentially fancy linear regression) that will fit _all_ of my medical image data _exactly_, but it will perform like absolute rubbish for determining tumor presence compared to a convolutional neural network trained on the same data and problem that _also_ perfectly fits the data.
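To make the "fits exactly but that tells you nothing" point concrete, here is a toy sketch (my own, not the commenter's actual pipeline; the data and kernel are made up): a noise-free Gaussian process posterior mean interpolates its training labels exactly, independent of whether it generalizes.

```python
# Toy sketch: a (nearly) noise-free GP / kernel regressor interpolates
# its training data exactly. Data here is random stand-in "image features".
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))                    # 20 toy samples, 5 features each
y = rng.integers(0, 2, size=20).astype(float)   # toy binary labels

def rbf(A, B, ls=1.0):
    # Squared-exponential kernel k(a, b) = exp(-||a - b||^2 / (2 * ls^2))
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * ls**2))

K = rbf(X, X)
# Tiny jitter on the diagonal for numerical stability (stands in for zero noise)
alpha = np.linalg.solve(K + 1e-10 * np.eye(len(X)), y)

preds = rbf(X, X) @ alpha   # GP posterior mean evaluated at the training points
print(np.max(np.abs(preds - y)))   # ~0: every training label is fit exactly
```

The point is that exact training fit is cheap for kernel methods; nothing about it says the model has learned anything useful between the training points.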

I could even train a fully connected network on the same data and problem, get any degree of fit you like, and it would still be rubbish.

reply
That isn't what they are saying at all, lol.
reply
There has also been a massive amount of human work done on them that wasn't done before.

Data labeling is a pretty big industry in some countries, and I'd guess dropping 200 kilodollars on labeling is beyond the reach of most academics, even if they didn't care about the ethics of it.

reply
Normally, more parameters lead to overfitting (like fitting a high-degree polynomial to a handful of points), but neural nets are for some reason not as susceptible to that and scale well with more parameters.
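The classical polynomial picture can be sketched in a few lines (a toy illustration of my own, with made-up data, not anything from the thread): with as many coefficients as data points, training error goes to ~0 while error between the points blows up.

```python
# Toy illustration of classical overfitting: a degree-9 polynomial
# through 10 noisy points nails the training set but oscillates between points.
import numpy as np

rng = np.random.default_rng(1)
x_train = np.linspace(-1, 1, 10)
y_train = np.sin(3 * x_train) + 0.1 * rng.normal(size=10)  # noisy samples

coeffs = np.polyfit(x_train, y_train, deg=9)   # 10 coefficients, 10 points
train_err = np.abs(np.polyval(coeffs, x_train) - y_train).max()

x_test = np.linspace(-1, 1, 200)
test_err = np.abs(np.polyval(coeffs, x_test) - np.sin(3 * x_test)).max()

print(train_err)   # ~0: the polynomial passes through every training point
print(test_err)    # much larger: the fit wiggles wildly between the points
```

The surprise with neural nets is that they are routinely run in this same interpolating regime (as many or more parameters than data points) without the corresponding blow-up in test error.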

That's been my understanding of the crux of the mystery.

Would love to be corrected by someone more knowledgeable, though.

reply
This absolutely was the crux of the (first) mystery, and I would argue that "deep learning theory" really only took off once it recognized this. There are other mysteries too, like the feasibility of transfer learning, neural scaling laws, and, more recently, in-context learning.
reply