This is a great writeup! thank you!!

Reminds me of when I did Noogler training back in the day and one of the talks described a cascading failure at a datacenter, starting with a cat which was too curious near a power conditioner, and briefly conducted

reply
The cat incident was at a facility I worked at.

It's cold up here in the winter and, sadly, the residual heat from even totally passive components like switchgear is enough to attract them. 0.001% of 1 MW of power is still quite warm. (I have no idea how much switchgear leaks, but I know it is warm even outdoors in winter.)

And, yeah, the rest of the writeup is also an amalgamation of some panic-inducing experiences in my life.

reply
I could totally get into “Ops Thriller” genre of novels like this.
reply
I'd expect someone like AWS to just throttle machines before overloading their cooling, because they probably can do that, whereas a data center that just rents out space can't really throttle its customers nicely.
reply
Reducing clock speeds, even if they could do that -- and I'm not sure they can, given how Nitro is designed -- would be problematic since a lot of customer workloads assume homogeneous nodes.

But they did load-shed. Perhaps not soon enough, but the reason this is publicly known is because they reduced the amount of heat being produced.

reply
> But they did load-shed

Right, exactly. I highly doubt the facility went into any kind of actual uncontrolled thermal rise. This is news because they had to take such drastic actions. I'm sure it's common that they force spot prices up (probably way up) to compensate for reduced capacity due to events, and I'm sure they even sometimes fake no capacity for similar reasons. No capacity means "I don't want to turn on your node", not merely "I don't have any more physical servers I could turn up for you".

This is news because they powered off some non-preemptible customer loads, which actually makes me wonder if you saw that chain of events occur here.

spot prices rise -> new instance availability goes to 0 -> preemptible instances go dark -> normal instances go dark.
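
To make that chain concrete, here is a rough Python sketch of an escalation ladder keyed on how much of the available cooling capacity is in use. Everything in it is an assumption for illustration: the stage names, the threshold fractions, and the `shed_stage` function are hypothetical and are not how AWS or any other operator is known to do it.

```python
from enum import IntEnum


class ShedStage(IntEnum):
    """Hypothetical escalation stages, mildest first."""
    NORMAL = 0
    RAISE_SPOT_PRICE = 1
    NO_NEW_INSTANCES = 2
    STOP_PREEMPTIBLE = 3
    STOP_ON_DEMAND = 4


# Assumed thresholds (fraction of cooling capacity in use), harshest first.
# The numbers are made up for the sketch.
THRESHOLDS = [
    (1.00, ShedStage.STOP_ON_DEMAND),
    (0.95, ShedStage.STOP_PREEMPTIBLE),
    (0.90, ShedStage.NO_NEW_INSTANCES),
    (0.80, ShedStage.RAISE_SPOT_PRICE),
]


def shed_stage(heat_load_kw: float, cooling_capacity_kw: float) -> ShedStage:
    """Pick the harshest stage whose threshold the current load has reached."""
    if cooling_capacity_kw <= 0:
        # No usable cooling at all: nothing milder makes sense.
        return ShedStage.STOP_ON_DEMAND
    utilization = heat_load_kw / cooling_capacity_kw
    for threshold, stage in THRESHOLDS:
        if utilization >= threshold:
            return stage
    return ShedStage.NORMAL
```

Note that the same 850 kW heat load that only nudges spot prices at 1000 kW of cooling capacity goes straight to the last stage if capacity drops to 600 kW, which is roughly the situation being speculated about here.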

reply
It's harder and harder to throttle machines with hardware segmentation capabilities effectively passing through hardware components "intact".

A decade ago it was trivial to just tell the hypervisor to reduce the cpu fraction of all VMs by half and leave half unallocated. Now, it's much more complicated and definitely would be user visible.

reply
The cooling units don't fail just because they get to 100% duty cycle. That's pretty much "normal operation", you just get... higher efficiency because the cooling side is warmer.
reply
Of course not. They fail above 100%.

Some fail below 100% too.

reply
You can't have a duty cycle above 100%. It's impossible.
reply
Not according to POTUS math. You can have 200%, 500%, 600%, 1200%. You just have to say it enough and people will question if they really might not understand percentages enough, and just go with it.
reply
ok but cooling systems don't run on POTUS math though
reply
Nor does the rest of the world
reply
This is written beautifully. It's like a much more inconsequential variant of Chernobyl.
reply
I would have thought with all the data centers being built the parts for cooling systems would be standardized with replacements available from Grainger immediately.
reply
Shouldn't there be a feedback system here preventing the scheduling of loads when cooling is degraded?
reply
With hyperscalers for sure.

But this is the physical world, shit happens.

The algorithm didn't know that the fuse was loose: fine at 50% duty cycle, but high resistance and going to blow at 100%.

reply