The actual reason is complex biases that arise from the interaction between network architectures and optimizers, and that persist in the regime where data scales proportionally with model size. The multiscale nature of the data induces neural scaling laws that enable better performance than any other class of models can hope to achieve.
That’s a lot of words to say that, if you encode a class of things as numbers, there’s a formula somewhere that can approximate an instance of that class. It works for linear regression and it works just as well for neural networks. The key thing here is approximation.
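To make that concrete, here’s a toy sketch (my own arbitrary choices: a sine target, a degree-9 polynomial, and a two-layer MLP, nothing canonical about any of them). Both a linear model on fixed features and a small network can approximate the same function once it’s encoded as numbers:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.neural_network import MLPRegressor

# Toy target: once the inputs are numbers, both model classes can approximate it.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(x).ravel() + 0.1 * rng.normal(size=200)

# "Linear regression" in the approximation sense: linear in polynomial features.
poly = make_pipeline(PolynomialFeatures(degree=9), LinearRegression()).fit(x, y)

# A small fully connected network approximating the same function.
mlp = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=5000, random_state=0).fit(x, y)

# Both get close to the underlying function on held-out points.
x_test = np.linspace(-3, 3, 50).reshape(-1, 1)
print("polynomial MSE:", np.mean((poly.predict(x_test) - np.sin(x_test).ravel()) ** 2))
print("MLP MSE       :", np.mean((mlp.predict(x_test) - np.sin(x_test).ravel()) ** 2))
```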
I can construct a Gaussian process model (essentially fancy linear regression) that will fit _all_ of my medical image data _exactly_, but it will perform like absolute rubbish at determining tumor presence compared to a convolutional neural network trained on the same data and problem that _also_ fits the data perfectly.
I could even train a fully connected network on the same data and problem, get any degree of fit you like, and it would still be rubbish.
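For what it’s worth, here’s a toy version of that comparison (not my actual setup: scikit-learn’s 8x8 digits stand in for medical images, an MLP stands in for the fully connected network, and the kernel and layer sizes are arbitrary). The point it illustrates is that both models can be driven to a near-exact training fit, and that fit alone tells you little about test behavior; on real image data a CNN with the same training fit typically generalizes better still, because its convolutional structure matches the data.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.neural_network import MLPClassifier

# Small stand-in for "image data": 8x8 digit images, flattened to 64 features.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Gaussian process classifier: with an RBF kernel it essentially
# interpolates the training set (near-perfect training accuracy).
gp = GaussianProcessClassifier(kernel=1.0 * RBF(length_scale=8.0), random_state=0)
gp.fit(X_train, y_train)

# Fully connected network: also driven to a (near-)perfect training fit.
mlp = MLPClassifier(hidden_layer_sizes=(256, 128), max_iter=2000, random_state=0)
mlp.fit(X_train, y_train)

for name, model in [("GP ", gp), ("MLP", mlp)]:
    print(
        f"{name}: train acc = {model.score(X_train, y_train):.3f}, "
        f"test acc = {model.score(X_test, y_test):.3f}"
    )

# Both fits are (near-)exact on the training set, yet test accuracy differs;
# which model generalizes better is decided by its inductive bias, not the fit.
# A CNN (not shown here) would add the convolutional bias that actually suits images.
```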
Data labeling is a pretty big industry in some countries, and I guess dropping 200 kilodollars on labeling is beyond the reach of most academics, even if they didn’t care about the ethics of it.
That’s been my understanding of the crux of the mystery.
Would love to be corrected by someone more knowledgeable, though.