undefined

points

[-]

Extremely well said. Universal approximation is necessary but not sufficient for the performance we are seeing. The secret sauce is implicit regularization, which comes about analogously to enforcing compression.

by jimmypk5 hours ago|

parent|

[-]

@hodgehog11 The grokking phenomenon (Power et al. 2022) is a puzzle for the compression view: models trained on algorithmic tasks like modular arithmetic memorize training data first (near-zero training loss, near-random test accuracy) and then, after many more gradient steps, suddenly generalize. The transition happens long after any obvious compression pressure would have fired. Do you think grokking is consistent with implicit regularization as compression, or does it require a separate mechanism - something more like a phase transition in the weight norms or the Fourier frequency structure?

by hackinthebochs4 hours ago|

parent|

[-]

>Do you think grokking is consistent with implicit regularization as compression

Pretty sure it's been shown that grokking requires L1 regularization which pushes model parameters towards zero. This can be viewed as compression in the sense of encoding the distribution in the fewest bits possible, which happens to correspond to better generalization.

by hodgehog112 hours ago|

parent|

[-]

Couldn't have said it better, although this is only for grokking with the modular addition task on networks with suitable architectures. L1 regularization is absolutely a clear form of compression. The modular addition example is one of the best cases to see the phenomenon in action.

by soVeryTired3 hours ago|

prev|

[-]

Whenever people bring this up I like to remind them that linear interpolation is a universal function approximator.

by bilsbie1 hours ago|

parent|

[-]

Can you expand on that?

by NooneAtAll37 hours ago|

prev|

[-]

Universal approximation is like saying that a problem is computable

sure, that gives some relief - but it says nothing in practice unlike f.e. which side of P/NP divide the problem is on

by ngruhn6 hours ago|

parent|

[-]

> unlike f.e. which side of P/NP divide the problem is on

Actually the P/NP divide is a similar case in my opinion. In practice a quadratic algorithm is sometimes unacceptably slow and an NP problem can be virtually solved. E.g. SAT problems are routinely solved at scale.

by imtringued5 hours ago|

parent|

[-]

An NP problem can contain subproblems that are not worst case problems.

It's similar to the gap between pushdown automata and Turing machines. You can check if pushdown automata will terminate or not. You can't do it for Turing machines, but this doesn't stop you from running a pushdown automata algorithm on the turning machine with decidable termination.