It's a systems engineering job. You need to provide context, define acceptable failure modes, and test at each level for validation. Identify false coupling, poor interfaces, and things that don't match the business context during the agent planning phase. Then communicate / translate to others so their decisions improve the system instead of destroying it by optimizing only for their local situation.
reply
It also seems like massive consolidation has caused issues too. Everyone is on GitHub. Everyone is on AWS. Everyone is behind Cloudflare. Whenever an issue happens there, it affects everyone and everyone sees it.

In the past, smaller services did break all the time, but each outage was limited to a much smaller area. Systems were also typically less integrated with each other, so one service being down rarely took out everything.

reply
The power company is massively consolidated, as are the water supply and telephone service. These are monolithic, monopolistic entities. But they are also very reliable (failures are usually isolated by region, or a result of natural disaster).

What leads to more failure is when you don't engineer those consolidated entities to be reliable. Tech companies have none of the legal requirements or incentives to be reliable that physical infrastructure companies do. I agree that the tighter integration is an issue, but the root cause is that tech companies have no incentive other than profits. If they're making profits, everything's fine.

reply
I mean, recommend professional software engineering licenses here on HN and it goes over like a turd in a punch bowl. Everyone knew where the search for more profit was going, but no one wanted to get off the ride.
reply
Super good take - the Andon cord is needed everywhere.
reply
> One way to deal with this in DevOps/Lean/TPS is the Andon cord.

Many years ago, I started working for chip companies. It was like a breath of fresh air. Successful chip companies know the costs (both direct money and opportunity) of a failed tapeout, so the metaphorical equivalent of this cord was there.

Find a bug the morning of tapeout? It will be carefully considered and triaged, and maybe delay tapeout. And, as you point out, the cultural aspect is incredibly important, which means that the messenger won't be shot.

reply