I think as time went on, and hardware got better, it seemed more reasonable to actually think about a viable implementation of what I think was a widespread intuition anyone in ML had that everything's context is everything.
It just seemed like a theoretical thing until hardware caught up. Maybe. Perhaps I'm applying a retrospective excuse to why it took so long.
I don't think it was intuitive to anyone back then, the vanishing gradient problem was a big deal since the dawn of NNs. I'm not sure what you mean by sheer computation, residuals allow you to have deep networks instead of shallow and wide ones. You can have equivalent parameter count.