I'm constantly surprised by how many people are critical of research into understanding neural nets, immediately telling me they are black boxes and hopeless to understand. I believe it's a consequence of neural nets being portrayed as the opposite of (classically interpretable) linear regression.

Many people additionally have little patience for research when the engineering is moving so quickly. Even many interpretability researchers give up far too soon if research doesn't yield immediately gratifying results.

reply
We’re in a strange era where the information-theoretic foundations of deep learning are solidifying. The 'why' is largely solved: it’s the efficient minimization of irreversible information loss relative to the noise floor. There is so much waste in scaling models bigger and bigger when the math points to how to do it much more efficiently. One can take a great 70B model and run it in only ~16GB with no loss in capability and with the ability to keep training, but for the last few years funding only went toward "bigger".

As you noted, the industry has moved the goalposts to agency and long-horizon persistence. The transition from building 'calculators that predict' to 'systems that endure' is a non-equilibrium thermodynamics problem. There are formulas and basic laws at play here that apply to AI just as much as they apply to other systems. Ironically, it is the same math: the same thing that results in a signal persisting in a model will result in agents persisting.

This is my specific niche. I study how things persist. It’s honestly a bit painful watching the AI field struggle to re-learn first principles that other disciplines have already learned. I have a doc I use to teach folks how the math works and how to apply it to their domain, and it is fun giving it to folks who then stop guessing and know exactly how to improve the persistence of what they are working on. The idea of "how many hours we can have a model work" is so cute compared to the right questions.

reply
> It’s honestly a bit painful watching the AI field struggle to re-learn first principles that other disciplines have already learned.

This is my fear with software development in general. There's a hundred-year-old point of view right next door that'll solve problems, and I'm too incurious to see it.

I have a relative with a focus in math education that I've been stealing ideas from, and I think we'd both appreciate a look at your doc if you don't mind.

reply
Can you share that document?
reply
"why do neural networks work better than other models?" That sounds really interesting - any references (for a non specialist)?
reply
https://en.wikipedia.org/wiki/Universal_approximation_theore...

the better question is why does gradient descent work for them

reply
The properties that the universal approximation theorem proves are not unique to neural networks.

Any model using an infinite-dimensional Hilbert space, such as SVMs with RBF kernels, Gaussian process regression, gradient-boosted decision trees, etc., has the same property (though proven via a different theorem, of course).

So the universal approximation theorem tells us nothing about why we should expect neural networks to perform better than those models.

reply
Extremely well said. Universal approximation is necessary but not sufficient for the performance we are seeing. The secret sauce is implicit regularization, which comes about analogously to enforcing compression.
reply
@hodgehog11 The grokking phenomenon (Power et al. 2022) is a puzzle for the compression view: models trained on algorithmic tasks like modular arithmetic memorize training data first (near-zero training loss, near-random test accuracy) and then, after many more gradient steps, suddenly generalize. The transition happens long after any obvious compression pressure would have fired. Do you think grokking is consistent with implicit regularization as compression, or does it require a separate mechanism - something more like a phase transition in the weight norms or the Fourier frequency structure?
reply
>Do you think grokking is consistent with implicit regularization as compression

Pretty sure it's been shown that grokking requires L1 regularization, which pushes model parameters toward zero. This can be viewed as compression in the sense of encoding the distribution in the fewest bits possible, which happens to correspond to better generalization.
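For intuition, here's a minimal numpy sketch (my own toy example, not drawn from any grokking paper): an L1 penalty, applied via proximal soft-thresholding, drives the redundant weights of an overparameterized linear model exactly to zero, leaving only the two features that actually matter.

```python
import numpy as np

# Toy sketch: y depends on only 2 of 20 features; the L1 penalty
# prunes the other 18 weights to exactly zero.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
w_true = np.zeros(20)
w_true[:2] = [3.0, -2.0]        # only 2 of 20 features matter
y = X @ w_true

w = np.zeros(20)
lr, lam = 0.01, 0.1
for _ in range(2000):
    grad = X.T @ (X @ w - y) / len(y)
    w -= lr * grad
    # proximal step for the L1 penalty (soft-thresholding)
    w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)

print(np.sum(np.abs(w) > 0.05))  # surviving weights
```

The surviving weights are a shorter description of the data — compression in exactly the bits-to-encode sense above.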

reply
Couldn't have said it better, although this is only for grokking with the modular addition task on networks with suitable architectures. L1 regularization is absolutely a clear form of compression. The modular addition example is one of the best cases to see the phenomenon in action.
reply
Whenever people bring this up I like to remind them that linear interpolation is a universal function approximator.
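A quick sketch of that claim, using `np.interp` as the piecewise-linear approximator: the max error against sin(x) shrinks as knots are added, with no learning involved.

```python
import numpy as np

# Piecewise-linear interpolation approximating sin(x) on [0, 2*pi].
# With enough knots the max error becomes arbitrarily small -- the same
# "universal approximation" property, achieved by a lookup table.
errors = []
for n_knots in (8, 64, 512):
    knots = np.linspace(0, 2 * np.pi, n_knots)
    xs = np.linspace(0, 2 * np.pi, 10_000)
    approx = np.interp(xs, knots, np.sin(knots))   # piecewise-linear fit
    errors.append(np.max(np.abs(approx - np.sin(xs))))
print(errors)  # max error shrinks as knots are added
```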
reply
Can you expand on that?
reply
Universal approximation is like saying that a problem is computable

sure, that gives some relief, but it says nothing in practice, unlike e.g. which side of the P/NP divide the problem is on

reply
> unlike e.g. which side of the P/NP divide the problem is on

Actually, the P/NP divide is a similar case in my opinion. In practice, a quadratic algorithm is sometimes unacceptably slow, and an NP-hard problem can often be effectively solved. E.g., SAT problems are routinely solved at scale.
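As a toy illustration (nothing like an industrial CDCL solver such as MiniSat), even a few lines of DPLL with unit propagation dispatch small structured instances instantly, despite SAT's exponential worst case. Clauses are lists of integer literals, with negation as sign.

```python
def simplify(clauses, lit):
    """Set lit True: drop satisfied clauses, strip falsified literals."""
    out = []
    for c in clauses:
        if lit in c:
            continue
        reduced = [l for l in c if l != -lit]
        if not reduced:
            return None  # empty clause: conflict under this assignment
        out.append(reduced)
    return out

def dpll(clauses):
    if not clauses:
        return True
    for c in clauses:          # unit propagation
        if len(c) == 1:
            rest = simplify(clauses, c[0])
            return rest is not None and dpll(rest)
    lit = clauses[0][0]        # branch on the first literal
    for choice in (lit, -lit):
        rest = simplify(clauses, choice)
        if rest is not None and dpll(rest):
            return True
    return False

# (x1 or x2) & (!x1 or x3) & (!x2 or !x3) & (x1 or x3)
print(dpll([[1, 2], [-1, 3], [-2, -3], [1, 3]]))   # satisfiable
print(dpll([[1], [-1]]))                            # contradiction
```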

reply
An NP-hard problem can contain subproblems that are far from the worst case.

It's similar to the gap between pushdown automata and Turing machines. You can check whether a pushdown automaton will terminate; you can't do that for Turing machines in general, but that doesn't stop you from running a pushdown-automaton algorithm, whose termination is decidable, on a Turing machine.

reply
I don't follow. Why wouldn't it work? It seems to me that a biased random walk down a gradient is about as universal as it gets. A bit like asking why walking uphill eventually results in you arriving at the top.
reply
It wouldn't work if your landscape has more local minima than atoms in the known universe (which it does) and only some of them are good. Neural networks can easily fail, but there's a lot of things one can do to help ensure it works.
reply
A funny thing is, in very high-dimensional space, like millions or billions of parameters, the chance that you'd get stuck in a local minima is extremely small. Think about it like this: to be stuck in a local minima in 2D, you only need 2 gradient components to be zero; in higher dimensions, you'd need every single one of them, millions upon millions, to all be zero. You'd only need 1 single gradient component to be non-zero and SGD can get you out of it. Now, SGD is a stochastic walk on that manifold, not entirely random but rather noisy, so the chance that you somehow walk into a local minima is very, very low, unless it is a "really good" local minima, in the sense that it dominates all other local minima in its neighborhood.
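One way to make the dimension argument concrete: model the Hessian at a random critical point as a random symmetric matrix and check how often every eigenvalue is positive. This is only a heuristic sketch (real loss Hessians are not random Gaussian matrices), but the trend is the point: as dimension grows, almost every critical point is a saddle with at least one escape direction.

```python
import numpy as np

# Fraction of random symmetric "Hessians" that are positive definite
# (i.e. would correspond to a local minimum) at various dimensions.
rng = np.random.default_rng(0)
fractions = {}
for d in (2, 10, 100):
    minima = 0
    for _ in range(200):
        A = rng.normal(size=(d, d))
        H = (A + A.T) / 2          # random symmetric matrix
        if np.all(np.linalg.eigvalsh(H) > 0):
            minima += 1
    fractions[d] = minima / 200
print(fractions)  # the "local minimum" fraction collapses as d grows
```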
reply
You are essentially correct, which is why stochastic gradient optimizers induce a low-sharpness bias. However, there is an awful lot more that complicates things. There are plenty of wide minima that it can get stuck in far away from where people typically initialise, so the initialisation scheme proves extremely important (but is mostly done for you).

Perhaps more important, just because it is easy to escape any local minimum does not mean that there is necessarily a trend towards a really good optimum, as it can just bounce between a bunch of really bad ones for a long time. This actually happens almost all the time if you try to design your entire architecture from scratch, e.g. highly connected networks. People who are new to the field sometimes don't seem to understand why SGD doesn't just always fix everything; this is why. You need very strong inductive biases in your architecture design to ensure that the loss (which is data-dependent so you cannot ascertain this property a priori) exhibits a global bowl-like shape (we often call this a 'funnel') to provide a general trajectory for the optimizer toward good solutions. Sometimes this only works for some optimizers and not others.

This is why architecture design is something of an art form, and explaining "why neural networks work so well" is a complex question involving a ton of parts, all of which contribute in meaningful ways. There are often plenty of counterexamples to any simpler explanation.

reply
(‘Minimum’ is the singular of ‘minima’.)
reply
>you'd need every single one of them, millions up millions of them, to be all zero

If they were all correlated with each other that does not seem far fetched.

reply
Ok but it's already known that you shouldn't initialize your network parameters to a single constant and instead initialize the parameters with random numbers.
reply
The model can converge towards such a state even if randomly initialized.
reply
Both you and the comment above are correct; initializing with iid elements ensures that correlations are not disastrous for training, but strong correlations are baked into the weights during training, so pretty much anything could potentially happen.
reply
Not a mathematician so I’m immediately out of my depth here (and butchering terminology), but it seems, intuitively, like the presence of a massive amount of local minima wouldn’t really be relevant for gradient descent. A given local minimum would need to have a “well” at least be as large as your step size to reasonably capture your descent.

E.g. you could land perfectly on a local minima but you won't stay there unless your step size was minute or the minima was quite substantial.

reply
I believe what was meant was that assuming local minima of a sufficient size to capture your probe, given a sufficiently high density of those, you become extremely likely to get stuck. A counterpoint regarding dimensionality is made by the comment adjacent to yours.
reply
Do neural networks work better than other models? They can definitely model a wider class of problems than traditional ML models (images being the canonical example). However, I thought that where a like-for-like comparison was possible, they tend to do worse than gradient boosting.
reply
Gradient boosting handles tabular data better than neural networks, often because the structure is simpler, and it becomes more of an issue to deal with the noise. You can do like-to-like comparisons between them for unstructured data like images, audio, video, text, and a well-designed NN will mop the floor with gradient boosting. This is because to handle that sort of data, you need to encode some form of bias around expected convolutional patterns in the data, or you won't get anywhere. Both CNNs and transformers do this.
reply
Would you agree/disagree with the following:

- It's not gradient boosting per se that's good on tabular data, it's trees. Other fitting methods with trees as the model are also usually superior to NNs on tabular data.

- Trees are better on tabular data because they encode a useful inductive bias that NNs currently do not. Just like CNNs or ViTs are better on images because they encode spatial locality as an inductive bias.

reply
Absolutely agree on both counts. Gradient boosting is the most commonly known and most successful variant, but it's the decision tree structure that is the underlying architecture there. Decision trees don't have the same "implicit training bias" phenomenon that neural networks have though, so all of this is just model bias in the classical statistical sense.
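A toy sketch of that inductive-bias point (synthetic data, with the depth-2 "tree" hand-coded rather than learned): a target built from axis-aligned splits is trivial for the tree structure and impossible for a linear model.

```python
import numpy as np

# Target: XOR of two axis-aligned threshold splits.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(1000, 2))
y = ((X[:, 0] > 0) ^ (X[:, 1] > 0)).astype(float)

# Linear least-squares fit, thresholded at 0.5: near chance on XOR.
A = np.c_[X, np.ones(len(X))]
w, *_ = np.linalg.lstsq(A, y, rcond=None)
lin_acc = ((A @ w > 0.5) == y.astype(bool)).mean()

# Hand-coded depth-2 "tree": split on x0, then on x1 in each branch.
tree_pred = np.where(X[:, 0] > 0, X[:, 1] <= 0, X[:, 1] > 0)
tree_acc = (tree_pred == y.astype(bool)).mean()
print(lin_acc, tree_acc)  # ~chance vs. perfect
```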
reply
Can NNs be made to be better than trees on tabular data with some further constraints, or something?
reply
In my opinion current research should focus on revisiting older concepts to figure out if they can be applied to transformers.

Transformers are superior "database" encodings, as the hype around LLMs points out, but there have been promising ML models focusing on memory for their niche use cases. These could be promising concepts if we could make them work with attention matrices and/or use the frequency-projection idea on their neuron weights.

The way RNNs evolved into LSTMs, GRUs, and eventually DNCs was pretty interesting to me. In my own implementations and use cases I wasn't able to reproduce DeepMind's claims for the DNC's memory-related parts. Back at the time the "seeking heads" idea of attention matrices wasn't there yet; maybe there's a way to build better read/write/access/etc. gates now.

[1] a fairly good implementation I found: https://github.com/joergfranke/ADNC

reply
> why do neural networks work better than other models

The only people for whom this is an open question are the academics - everyone else understands it's entirely because of the bagillions of parameters.

reply
No it isn't, and it's frustrating when the "common wisdom" tries to boil it down to this. If this was true, then the models with "infinitely many" parameters would be amazing. What about just training a gigantic two-layer network? There is a huge amount of work trying to engineer training procedures that work well.

The actual reason is due to complex biases that arise from the interaction of network architectures and the optimizers and persist in the regime where data scales proportionally to model size. The multiscale nature of the data induces neural scaling laws that enable better performance than any other class of models can hope to achieve.
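For what a neural scaling law looks like operationally, here is a sketch with hypothetical numbers (the exponent is merely in the ballpark reported for language models): losses following L(N) = a·N^(−α) trace a straight line in log-log space, and a linear fit recovers α.

```python
import numpy as np

# Hypothetical loss values following a power law L(N) = a * N**(-alpha).
N = np.array([1e6, 1e7, 1e8, 1e9, 1e10])   # parameter counts
alpha_true = 0.076                          # ballpark exponent, for illustration
L = 2.0 * N ** (-alpha_true)

# A straight-line fit in log-log space recovers the scaling exponent.
slope, intercept = np.polyfit(np.log(N), np.log(L), 1)
print(-slope)  # recovered scaling exponent
```

Measured curves are noisy and eventually bend away from the power law, but this log-log fit is the standard way the exponents get estimated.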

reply
> The actual reason is due to complex biases that arise from the interaction of network architectures and the optimizers and persist in the regime where data scales proportionally to model size. The multiscale nature of the data induces neural scaling laws that enable better performance than any other class of models can hope to achieve.

That’s a lot of words to say that, if you encode a class of things as numbers, there’s a formula somewhere that can approximate an instance of that class. It works for linear regression and works just as well for neural networks. The key thing here is approximation.

reply
No, it is relatively few words to quickly touch on several different concepts that go well beyond basic approximation theory.

I can construct a Gaussian process model (essentially fancy linear regression) that will fit _all_ of my medical image data _exactly_, but it will perform like absolute rubbish for determining tumor presence compared to if I trained a convolutional neural network on the same data and problem _and_ perfectly fit the data.

I could even train a fully connected network on the same data and problem, get any degree of fit you like, and it would still be rubbish.
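A toy numpy version of that first point (synthetic data, a plain RBF interpolator rather than a full GP): exactly fitting even pure-noise labels is easy, which is why an exact fit by itself predicts nothing about generalization.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = rng.choice([-1.0, 1.0], size=100)   # pure noise labels

def rbf(A, B):
    """Gaussian kernel matrix between the rows of A and of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2)

# Interpolate: solve K alpha = y (tiny jitter for numerical safety).
alpha = np.linalg.solve(rbf(X, X) + 1e-10 * np.eye(100), y)
train_pred = np.sign(rbf(X, X) @ alpha)
print((train_pred == y).mean())  # exact fit of noise
```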

reply
That isn't what they are saying at all, lol.
reply
Also the massive human work done on them, which wasn't done before.

Data labeling is a pretty big industry in some countries, and I guess dropping 200 kilodollars on labeling is beyond the reach of most academics, even if they didn't care about the ethics of it.

reply
Normally more parameters lead to overfitting (like fitting a polynomial to points), but neural nets are for some reason not as susceptible to that and can scale well with more parameters.

That's been my understanding of the crux of the mystery.

Would love to be corrected by someone more knowledgeable though
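The polynomial analogy can be made concrete with Runge's classic example, where an exact fit on the samples coexists with wild error between them:

```python
import numpy as np

# Runge's phenomenon: a degree-11 polynomial through 12 equispaced
# samples of f matches the samples exactly but oscillates wildly
# in between -- more parameters, worse generalization.
f = lambda x: 1.0 / (1.0 + 25.0 * x ** 2)
x_tr = np.linspace(-1, 1, 12)          # 12 equispaced samples
x_te = np.linspace(-0.99, 0.99, 500)   # held-out points in between

coef = np.polyfit(x_tr, f(x_tr), 11)   # degree-11 fit: exact interpolation
train_err = np.max(np.abs(np.polyval(coef, x_tr) - f(x_tr)))
test_err = np.max(np.abs(np.polyval(coef, x_te) - f(x_te)))
print(train_err, test_err)  # tiny on the samples, large between them
```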

reply
This absolutely was the crux of the (first) mystery, and I would argue that "deep learning theory" really only took off once it recognized this. There are other mysteries too, like the feasibility of transfer learning, neural scaling laws, and now more recently, in-context learning.
reply