undefined

points

[-]

> They found previously unknown race conditions. Not in obscure corners of the codebase. Not in exotic edge cases. In the kind of code that every Erlang application depends on, code that had been running in production for years.

If these race conditions are in code that has been in production for years and yet the race conditions are "previously unknown", that does suggest to me that it is in practice quite hard to trigger these race conditions. Bugs that happen regularly in prod (and maybe I'm biased, but especially bugs that happen to erlang systems in prod) tend to get fixed.

by aeonfox9 hours ago|

parent|

[-]

True. And that the subtle bugs were then picked up by static analysis makes the safety proposition of Erlang even better.

> Bugs that happen regularly in prod

It depends on how regular and reproducible they are. Timing bugs are notoriously difficult to pin down. Pair that with let-it-crash philosophy, and it's maybe not worth tracking down. OTOH, Erlang has been used for critical systems for a very long time – plenty long enough for such bugs to be tracked down if they posed real problems in practice.

by thesz8 hours ago|

parent|

prev|

[-]

Erlang has "die and be restarted" philosophy towards process failures, so these "bugs that happen to erlang systems in prod" may not be fixed at all, if they are rare enough.

by toast06 hours ago|

parent|

[-]

As of now, the post you're replying to says "bugs that regularly happen ... in prod"

Now, if it crashes every 10 years, that is regular, but I think the meaning is that it happens often. Back when I operated a large dist cluster, yes, some rare crashes happened that never got noticed or the triage was 'wait and see if it happens again' and it didn't happen. But let it crash and restart from a known good state is a philosophy about structuring error checking more than an operational philosophy: always check for success and if you don't know how to handle an error fail loudly and return to a good state to continue.

Operationally, you are expected to monitor for crashes and figure out how to prevent them in the future. And, IMHO, be prepared to hot load fixes in response... although a lot of organizations don't hot load.

by dnautics5 hours ago|

prev|

[-]

not all races are bugs. here's an example that probably happens in many systems that people just don't notice: sometimes you don't care and, say, having database setup race against setup of another service that needs the database means that in 99% of cases you get a faster bootup and in 1% of cases the database setup is slow and the dependent server gets restarted by your application supervisor and connects on the second try.

by kamma44349 hours ago|

prev|

[-]

The 4th issue is a feature- it’s what allows zero downtime hot updates.