It wouldn't work if your landscape has more local minima than atoms in the known universe (which it does) and only some of them are good. Neural networks can easily fail, but there are a lot of things one can do to help ensure they work.

by anvuong 9 hours ago

A funny thing is, in very high-dimensional space, like millions and billions of parameters, the chance that you'd get stuck in a local minima is extremely small. Think about it like this: to be stuck in a local minima in 2D, you only need 2 gradient components to be zero; in higher dimensions, you'd need every single one of them, millions upon millions of them, to be all zero. You'd only need 1 single gradient component to be non-zero and SGD can get you out of it. Now, SGD is a stochastic walk on that manifold, not entirely random but rather noisy, so the chance that you somehow walk into a local minima is very, very low, unless that is a "really good" local minima, in the sense that it dominates all other local minima in its neighborhood.
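The escape mechanism can be sketched on a toy 2D surface. This is only an illustration under stated assumptions: the function `f(x, y) = x**2 + (y**2 - 1)**2`, the start point, the learning rate, and the noise level are all made up for the example. The point (0, 0) has zero gradient but is a saddle, not a minimum. Plain gradient descent started on the line `y = 0` never sees a non-zero `y` gradient and settles there, while a small amount of gradient noise (standing in for SGD's minibatch noise) is enough to push it off toward a real minimum at `y = ±1`.

```python
import random


def grad(x, y):
    # Gradient of the toy surface f(x, y) = x**2 + (y**2 - 1)**2.
    # (0, 0) is a saddle; (0, 1) and (0, -1) are the minima.
    return 2.0 * x, 4.0 * y * (y * y - 1.0)


def descend(noise_std, steps=2000, lr=0.05, seed=0):
    """Gradient descent with optional Gaussian noise added to the gradient."""
    rng = random.Random(seed)
    x, y = 1.0, 0.0  # start exactly on the line y = 0
    for _ in range(steps):
        gx, gy = grad(x, y)
        gx += rng.gauss(0.0, noise_std)
        gy += rng.gauss(0.0, noise_std)
        x -= lr * gx
        y -= lr * gy
    return x, y


plain_x, plain_y = descend(noise_std=0.0)   # stays on y = 0, ends at the saddle
noisy_x, noisy_y = descend(noise_std=0.01)  # noise breaks the tie, ends near y = +1 or -1
print("plain:", plain_x, plain_y)
print("noisy:", noisy_x, noisy_y)
```

Which of the two minima the noisy run ends in depends on the seed; the only claim is that it leaves the saddle.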

by hodgehog11 6 hours ago

You are essentially correct, which is why stochastic gradient optimizers induce a low-sharpness bias. However, there is an awful lot more that complicates things. There are plenty of wide minima that it can get stuck in far away from where people typically initialise, so the initialisation scheme proves extremely important (but is mostly done for you).

Perhaps more important, just because it is easy to escape any local minimum does not mean that there is necessarily a trend towards a really good optimum, as it can just bounce between a bunch of really bad ones for a long time. This actually happens almost all the time if you try to design your entire architecture from scratch, e.g. highly connected networks. People who are new to the field sometimes don't seem to understand why SGD doesn't just always fix everything; this is why. You need very strong inductive biases in your architecture design to ensure that the loss (which is data-dependent so you cannot ascertain this property a priori) exhibits a global bowl-like shape (we often call this a 'funnel') to provide a general trajectory for the optimizer toward good solutions. Sometimes this only works for some optimizers and not others.

This is why architecture design is something of an art form, and explaining "why neural networks work so well" is a complex question involving a ton of parts, all of which contribute in meaningful ways. There are often plenty of counterexamples to any simpler explanation.

by leoc 6 hours ago

(‘Minimum’ is the singular of ‘minima’.)

by charcircuit 7 hours ago

>you'd need every single one of them, millions upon millions of them, to be all zero

If they were all correlated with each other, that does not seem far-fetched.
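A toy way to see why correlation matters, not a model of any real loss surface: pretend each coordinate independently "traps" you with probability 1/2 (here, a standard normal draw being positive). That coin-flip model and the function name `prob_all_positive` are assumptions made purely for illustration. With independent coordinates the chance that all of them trap you at once shrinks like `0.5 ** dim`; if every coordinate is a copy of one shared draw, it stays near 1/2 no matter how large `dim` is.

```python
import random


def prob_all_positive(dim, correlated, trials=20000, seed=0):
    """Estimate P(all `dim` coordinates are positive) by Monte Carlo."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        if correlated:
            # Perfectly correlated: every coordinate is the same draw.
            shared = rng.gauss(0.0, 1.0)
            values = [shared] * dim
        else:
            # Independent: one fresh draw per coordinate.
            values = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        if all(v > 0 for v in values):
            hits += 1
    return hits / trials


print("independent, dim=20:", prob_all_positive(20, correlated=False))
print("correlated,  dim=20:", prob_all_positive(20, correlated=True))
```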

by imtringued 4 hours ago

OK, but it's already known that you shouldn't initialize your network parameters to a single constant and should instead initialize them with random numbers.
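A minimal sketch of why constant initialization is avoided, using a made-up two-unit network `y = v0*tanh(w0*x) + v1*tanh(w1*x)` and a made-up four-point dataset (both are assumptions for illustration, not anything from a real training setup). When both hidden units start with the same weights, they receive identical gradients at every step and so never become different from each other; random initialization breaks that symmetry.

```python
import math
import random


def train(w, v, data, lr=0.1, epochs=200):
    """Plain per-sample gradient descent on squared error for a tiny tanh network."""
    w, v = list(w), list(v)
    for _ in range(epochs):
        for x, target in data:
            h = [math.tanh(wj * x) for wj in w]         # hidden activations
            y = sum(vj * hj for vj, hj in zip(v, h))    # network output
            err = y - target
            grad_v = [err * hj for hj in h]
            grad_w = [err * vj * (1.0 - hj * hj) * x for vj, hj in zip(v, h)]
            v = [vj - lr * g for vj, g in zip(v, grad_v)]
            w = [wj - lr * g for wj, g in zip(w, grad_w)]
    return w, v


# Hypothetical toy data: (input, target) pairs.
data = [(-1.0, -0.8), (-0.5, 0.3), (0.5, -0.3), (1.0, 0.8)]

# Constant init: both hidden units start identical and stay identical.
const_w, const_v = train([0.5, 0.5], [0.5, 0.5], data)

# Random init: the units start different and are free to specialize.
rng = random.Random(0)
rand_w, rand_v = train(
    [rng.uniform(-1.0, 1.0) for _ in range(2)],
    [rng.uniform(-1.0, 1.0) for _ in range(2)],
    data,
)

print("constant init:", const_w, const_v)
print("random init:  ", rand_w, rand_v)
```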

by charcircuit 3 hours ago

The model can converge towards such a state even if randomly initialized.

by hodgehog11 2 hours ago

Both you and the comment above are correct; initializing with iid elements ensures that correlations are not disastrous for training, but strong correlations are baked into the weights during training, so pretty much anything could potentially happen.

by appplication 9 hours ago

Not a mathematician, so I’m immediately out of my depth here (and butchering terminology), but it seems, intuitively, like the presence of a massive number of local minima wouldn’t really be relevant for gradient descent. A given local minimum would need to have a “well” at least as large as your step size to reasonably capture your descent.

E.g. you could land perfectly on a local minimum, but you won’t stay there unless your step size is minute or the minimum is quite substantial.
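That intuition can be checked on a 1D toy function: a broad bowl `0.1 * x**2` with one narrow Gaussian dip near `x = 2`. The function, the constants `CENTER`, `WIDTH`, `DEPTH`, and both learning rates are assumptions picked for illustration. With a small step, plain gradient descent walks into the dip and stays; with a step much larger than the dip's width, it jumps straight over and ends at the bottom of the bowl.

```python
import math

# Hypothetical landscape parameters: a narrow dip of width ~0.05 centered at x = 2.
CENTER, WIDTH, DEPTH = 2.0, 0.05, 0.05


def grad(x):
    # Derivative of f(x) = 0.1*x**2 - DEPTH*exp(-(x - CENTER)**2 / (2*WIDTH**2)).
    d = x - CENTER
    dip = DEPTH * d / WIDTH**2 * math.exp(-d * d / (2.0 * WIDTH**2))
    return 0.2 * x + dip


def descend(lr, x=3.0, steps=2000):
    """Plain gradient descent from x with a fixed learning rate."""
    for _ in range(steps):
        x -= lr * grad(x)
    return x


small_step = descend(lr=0.01)  # moves a few thousandths per step: captured by the dip
large_step = descend(lr=2.5)   # moves 3.0 -> 1.5 -> 0.75 ...: never lands in the dip

print("small step ends at:", small_step)
print("large step ends at:", large_step)
```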

by fc417fc802 8 hours ago

I believe what was meant was that assuming local minima of a sufficient size to capture your probe, given a sufficiently high density of those, you become extremely likely to get stuck. A counterpoint regarding dimensionality is made by the comment adjacent to yours.