In the past with smaller services those services did break all the time, but the outage was limited to a much smaller area. Also systems were typically less integrated with each other so one service being down rarely took out everything.
What leads to more failure is when you don't engineer those consolidated entities to be reliable. Tech companies have none of the legal requirements or incentives to be reliable, the way physical infrastructure companies do. I agree that the tighter integration is an issue, but the root cause is tech companies have no incentive other than profits. If they're making profits, everything's fine.
Many years ago, I started working for chip companies. It was like a breath of fresh air. Successful chip companies know the costs (both direct money and opportuity) of a failed tapeout, so the metaphorical equivalent of this cord was there.
Find a bug the morning of tapeout? It will be carefully considered and triaged, and maybe delay tapeout. And, as you point out, the cultural aspect is incredibly important, which means that the messenger won't be shot.