Reminds me of when I did Noogler training back in the day: one of the talks described a cascading failure at a datacenter, starting with a cat that got too curious near a power conditioner and briefly conducted
It's cold up here in the winter; sadly, the residual heat from even totally passive components like switchgear is enough to warm things up and attract them. 0.001% of 1 MW is still quite warm. (I have no idea how much switchgear actually leaks, but I know it's warm even outdoors in winter.)
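For scale, using my made-up 0.001% leakage figure, that works out to a steady ten watts:

```latex
0.001\% \times 1\,\mathrm{MW} = 10^{-5} \times 10^{6}\,\mathrm{W} = 10\,\mathrm{W}
```

Ten watts running continuously is plenty to keep a small enclosure above ambient in winter.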
And, yeah, the rest of the writeup is also an amalgamation of some panic-inducing experiences in my life.
But they did load-shed. Perhaps not soon enough, but the reason this is publicly known is that they reduced the amount of heat being produced.
Right, exactly. I highly doubt the facility went into any kind of actual uncontrolled thermal rise. This is news because they had to take such drastic actions. I'm sure it's common that they force spot prices up (probably way up) to compensate for reduced capacity during events, and I'm sure they even sometimes fake "no capacity" for similar reasons. "No capacity" means "I don't want to turn on your node," not merely "I don't have any more physical servers I could turn up for you."
This is news because they powered off some non-preemptible customer loads, which actually makes me wonder if you saw that chain of events occur here.
spot prices rise -> new instance availability goes to 0 -> preemptible instances go dark -> normal instances go dark.
A decade ago it was trivial to just tell the hypervisor to reduce the CPU fraction of all VMs by half and leave half unallocated. Now it's much more complicated, and it would definitely be user-visible.
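A minimal sketch of what the crude version looks like, assuming a Linux host with cgroup v2 and libvirt/QEMU guests whose scopes live under machine.slice (paths and glob pattern are illustrative; a real fleet would go through the hypervisor's scheduler API rather than poking sysfs):

```python
#!/usr/bin/env python3
"""Sketch: halve every VM's CPU quota on a cgroup-v2 Linux host."""
from pathlib import Path

# Assumption: libvirt-managed guests get per-VM scopes here.
MACHINE_SLICE = Path("/sys/fs/cgroup/machine.slice")

for scope in sorted(MACHINE_SLICE.glob("*.scope")):
    cpu_max = scope / "cpu.max"
    if not cpu_max.exists():
        continue
    # cpu.max holds "QUOTA PERIOD" in microseconds, e.g. "200000 100000",
    # or "max PERIOD" for an uncapped guest.
    quota, period = cpu_max.read_text().split()
    if quota == "max":
        # Halving "unlimited" would need the vCPU count; skip in this sketch.
        continue
    new_quota = max(int(quota) // 2, 1000)  # kernel floor is 1 ms of quota
    cpu_max.write_text(f"{new_quota} {period}\n")
    print(f"{scope.name}: quota {quota} -> {new_quota} (period {period})")
```

Needs root, and it only throttles the scheduler; it does nothing about reconciling the change with placement, SLAs, or billing, which is roughly why the modern equivalent can't be invisible to users.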
Some fail below 100% too.
But this is the physical world, shit happens.
The algorithm didn't know that the fuse was loose: fine at 50% duty cycle, but high-resistance and going to blow at 100%.
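Rough physics for why a marginal connection can pass at half load and fail at full load (illustrative, not from any incident report): heat in a bad contact goes as I²R, so

```latex
% Average dissipation in a high-resistance contact of resistance R:
%   duty-cycle reading:  \bar{P} = D\,I^{2}R  -> 2x the heat going from D=0.5 to D=1
%   half-current reading: P = I^{2}R          -> 4x the heat going from I/2 to I
\bar{P} = D\,I^{2}R_{\text{contact}}
```

Either way, a joint that sits just under its failure temperature at 50% has no margin left at 100%.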