upvote
People definitely wanted to train deep networks before, but didn't know how. They evdn tried things like training layers independently.

I don't think it was intuitive to anyone back then, the vanishing gradient problem was a big deal since the dawn of NNs. I'm not sure what you mean by sheer computation, residuals allow you to have deep networks instead of shallow and wide ones. You can have equivalent parameter count.

reply