This is a great writeup! Thank you!!

Reminds me of when I did Noogler training back in the day and one of the talks described a cascading failure at a datacenter, starting with a cat which was too curious near a power conditioner, and briefly conducted

by AdamJacobMuller 11 hours ago

The cat incident happened at a facility I worked at.

It's cold up here in the winter and, sadly, the residual heat from even totally passive components like switchgear is enough to warm things up and attract them. 0.001% of 1 MW of power is still quite warm. (I have no idea how much switchgear leaks, but I know it's warm even in winter outdoors.)
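
For a sense of scale, here is the back-of-the-envelope arithmetic behind that figure. The 0.001% loss fraction is only the guess from this comment, not a measured switchgear number, and `leaked_watts` is a name made up for this sketch.

```python
def leaked_watts(load_watts: float, loss_percent: float) -> float:
    """Heat given off by a component that loses loss_percent of the power flowing through it."""
    return load_watts * loss_percent / 100.0


# 0.001% of 1 MW: an assumed figure taken from the comment, not a measurement.
print(leaked_watts(1_000_000, 0.001))
```

That works out to roughly 10 W of continuous heat under the stated guess.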

And, yeah, the rest of the writeup is also an amalgamation of some panic-inducing experiences in my life.

by bombcar 7 hours ago

https://x.com/visakanv/status/1678745111411212290

by dtjohnnymonkey 9 hours ago

I could totally get into an “Ops Thriller” genre of novels like this.

by fabian2k 14 hours ago

I'd expect someone like AWS to just throttle machines before overloading their cooling, because they probably can do that, whereas a data center that just rents out space can't really throttle its customers nicely.

by cperciva 13 hours ago

Reducing clock speeds, even if they could do that -- and I'm not sure they can, given how Nitro is designed -- would be problematic since a lot of customer workloads assume homogeneous nodes.

But they did load-shed. Perhaps not soon enough, but the reason this is publicly known is that they reduced the amount of heat being produced.

by AdamJacobMuller 11 hours ago

> But they did load-shed

Right, exactly. I highly doubt the facility went into any kind of actual uncontrolled thermal rise. This is news because they had to take such drastic actions. I'm sure it's common that they force spot prices up (probably way up) to compensate for reduced capacity due to events, and I'm sure they even sometimes fake no capacity for similar reasons. No capacity means "I don't want to turn on your node", not merely "I don't have any more physical servers I could turn up for you".

This is news because they powered off some non-preemptible customer loads, which actually makes me wonder if you saw that chain of events occur here.

spot prices rise -> new instance availability goes to 0 -> preemptible instances go dark -> normal instances go dark.
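
As a rough sketch of that chain, the escalation can be modeled as an ordered ladder. The stage names and the thresholds below are invented for illustration; nothing here reflects how AWS actually decides when to shed load.

```python
from enum import IntEnum


class ShedStage(IntEnum):
    """Escalation steps, ordered from least to most disruptive."""
    NORMAL = 0
    SPOT_PRICE_UP = 1
    NO_NEW_INSTANCES = 2
    PREEMPTIBLE_OFF = 3
    ON_DEMAND_OFF = 4


# Hypothetical thresholds: fraction of cooling capacity lost -> stage reached.
THRESHOLDS = [
    (0.05, ShedStage.SPOT_PRICE_UP),
    (0.15, ShedStage.NO_NEW_INSTANCES),
    (0.30, ShedStage.PREEMPTIBLE_OFF),
    (0.50, ShedStage.ON_DEMAND_OFF),
]


def stage_for(cooling_deficit: float) -> ShedStage:
    """Return the most severe stage whose threshold the deficit has reached."""
    stage = ShedStage.NORMAL
    for threshold, candidate in THRESHOLDS:
        if cooling_deficit >= threshold:
            stage = candidate
    return stage


for deficit in (0.0, 0.1, 0.2, 0.4, 0.6):
    print(deficit, stage_for(deficit).name)
```

The point of the ladder is that customers only see the last rung; the earlier ones look like ordinary pricing and capacity noise.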

by AdamJacobMuller 13 hours ago

It's harder and harder to throttle machines, with hardware segmentation capabilities effectively passing through hardware components "intact".

A decade ago it was trivial to just tell the hypervisor to reduce the CPU fraction of all VMs by half and leave half unallocated. Now it's much more complicated and would definitely be user-visible.

by PunchyHamster 13 hours ago

The cooling units don't fail just because they get to 100% duty cycle. That's pretty much "normal operation"; you just get... higher efficiency because the cooling side is warmer.

by AdamJacobMuller 13 hours ago

Of course not. They fail above 100%.

Some fail below 100% too.

by tardedmeme 12 hours ago

You can't have a duty cycle above 100%. It's impossible.
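
To spell out why: duty cycle is the fraction of a period during which the unit is on, so it is bounded by the period itself. A minimal sketch, with `duty_cycle` as a made-up helper name:

```python
def duty_cycle(on_seconds: float, period_seconds: float) -> float:
    """Fraction of the period spent running, always between 0.0 and 1.0."""
    if period_seconds <= 0:
        raise ValueError("period must be positive")
    if not 0 <= on_seconds <= period_seconds:
        # A unit cannot run for longer than the window being measured.
        raise ValueError("on-time must lie within the period")
    return on_seconds / period_seconds


print(duty_cycle(30, 60))  # half the time on
print(duty_cycle(60, 60))  # running continuously
```

Asking for more than continuous operation is a demand the unit cannot meet, which is a different thing from a duty cycle above 100%.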

by dylan604 10 hours ago

Not according to POTUS math. You can have 200%, 500%, 600%, 1200%. You just have to say it enough and people will question if they really might not understand percentages enough, and just go with it.

by tardedmeme 9 hours ago

ok but cooling systems don't run on POTUS math though

by dylan604 9 hours ago

Nor does the rest of the world

by lukeify 8 hours ago

This is written beautifully. It's like a much more inconsequential variant of Chernobyl.

by wombatpm 12 hours ago

I would have thought that, with all the data centers being built, the parts for cooling systems would be standardized, with replacements available from Grainger immediately.

by foota 13 hours ago

Shouldn't there be a feedback system here preventing the scheduling of loads when cooling is degraded?
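
Something like the following is what I have in mind: a scheduler that admits new load only while the reported cooling capacity leaves some headroom. This is a sketch under made-up assumptions (the function names, the 10% headroom, and the idea that heat load equals electrical load in watts are all mine), not a description of any real scheduler.

```python
def schedulable_watts(cooling_capacity_w: float,
                      current_load_w: float,
                      headroom: float = 0.10) -> float:
    """Watts of new load that can be admitted while keeping a safety margin."""
    usable = cooling_capacity_w * (1.0 - headroom)
    return max(0.0, usable - current_load_w)


def admit(request_w: float,
          cooling_capacity_w: float,
          current_load_w: float,
          headroom: float = 0.10) -> bool:
    """Admit a request only if it fits inside the remaining cooling budget."""
    return request_w <= schedulable_watts(cooling_capacity_w, current_load_w, headroom)


# Healthy cooling: 1 MW of capacity, 800 kW already running.
print(admit(50_000, 1_000_000, 800_000))
# Degraded cooling: capacity drops to 850 kW, same running load.
print(admit(50_000, 850_000, 800_000))
```

The catch is that the loop is only as good as the reported capacity number; it cannot react to a fault the sensors don't see.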

by AdamJacobMuller 13 hours ago

With hyperscalers for sure.

But this is the physical world, shit happens.

The algorithm didn't know that the fuse was loose: fine at 50% duty cycle, but high resistance and going to blow at 100%.