upvote
This is not a bad question!

If you flip it and instead ask "how do I write something that can't fail?" you might find some interesting ground.

The best things I know about are static type-checking, pure functions and totality. Different languages provide more or less help with these things. It's perfectly fine to do 'two things which don't fail or cause other things to fail'.

Forgive the digression, but there is an 'infectious' aspect to the above 3 things (see the function-colouring problem), e.g. you can't build pure functions which call non-pure functions. The Dependency Inversion Principle (of SOLID) gives some help in how to tackle this.

Also, the above things only work within one node (of a distributed system).

For multiple nodes, I use something like Kafka, where you write down one event, and have two systems subscribe to it, each doing one thing. Yes, there's still the obvious issue of them failing independently, but when that happens, you have an authoritative source of truth (in the form of Kafka events). This beats the craps out of developer logs.

You skip the laborious questions of "what happened in the system?" and "what should the correct state be?" Because the events are already the answer - just eyeball them.

Events also machine-readable, so if you diagnose a problem and a fix it in one case, there's a good chance you can build a detector for other cases. You don't have to wait for a support ticket to get escalated to the dev team.

You also divide the debugging space dramatically. If the Kafka log says one thing {Bob bought Minecraft for $10}, then the Ownership service is just wrong if it says Bob doesn't own Minecraft, and the Finance service is just wrong if it doesn't report the $10. Fix each independently. At no point do you need to look at Ownership and Finance together to see which one failed halfway through talking to the other, because they don't talk to each other.

Lastly, events are verifiable; they are their own audit trail. If your boss asks how much money is in the system, would you feel more confident reporting whatever the current balance is set to (i.e. the outcome of whatever code executed the last "UPDATE Balance ..." statement, or would you like to be able to sum over every transaction that you ever recorded?

reply
A programmer had a problem, so they decided to use threads.

Now they have at least two problems.

reply