I call it the "doing two things" problem.

You write imperative code that issues two commands, both of which can fail independently.

There are plenty of ways to pretend to 'deal with it'.

Firstly it will just pass all tests, so most devs can stop thinking about it right away.

A dev might think you can just catch and log the exception. Doesn't fix it.

You could run the code in prod for a while, see if it goes wrong. It will, at which point the dev will try it again, and it will probably work the second time, so they can stop thinking about it.

There was a big outbox pattern discussion a couple of days ago (split thing 1 into two halves, and do them atomically, leave thing 2 as an exercise for the reader.)

I think the reason you encounter this problem in the real world is that devs just exist in some quantum superposition of "it won't happen" and "I fixed it" and "it can't be fixed".

reply
> A dev might think you can just catch and log the exception. Doesn't fix it.

You've just succinctly made the argument against checked exceptions FWIW (which I agree with you on). Anyone who has used Java in anger (is there any other way?) will be familiar with:

    try {
      doSomething();
    } catch (CheckedException e) {
      // Logged and swallowed: the caller never learns that anything failed.
      logger.error("Didn't work", e);
    }

Fault tolerance in general is terrible in most software. One of my biggest bugbears is network latency and transient failures in network requests that would be solved with a simple retry. But no, there's an incredibly lazy "Request failed" dialog to the user. That's the equivalent of the "log and silently swallow" pattern above. It can get a lot worse than that too. I have an app on my phone that will log me out and force me into a 2FA cycle if it hits a network timeout. Like.... WHY?!?!?! Anyway, I digress...

This is largely a software issue. Control systems are built to handle these kinds of things. A traffic light can't accidentally show green in two directions. It's literally wired to make that impossible, because it's simply too important to leave to chance. You constantly have to deal with faulty sensors, so you have systems that seek a consensus from 3+ sensors and, if that fails, the system stays failed until you fix it.
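To make the sensor-consensus idea concrete, here is a minimal sketch in Python. The function name `vote` and the `tolerance` parameter are my own assumptions for illustration, not how any particular control system does it: take the median, and fail safe when a strict majority of the sensors does not sit close to it.

```python
from statistics import median

def vote(readings, tolerance):
    """Return the agreed value from redundant sensors, or None to fail safe.

    Hypothetical helper: `tolerance` is the largest distance from the median
    at which a sensor still counts as agreeing.
    """
    if len(readings) < 3:
        return None  # not enough redundancy to outvote one faulty sensor
    mid = median(readings)
    agreeing = [r for r in readings if abs(r - mid) <= tolerance]
    # Require a strict majority of sensors to sit close to the median.
    if 2 * len(agreeing) <= len(readings):
        return None
    return mid
```

With three sensors, one wild reading is outvoted by the other two; if all three disagree, the caller gets `None` and has to stop rather than guess.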

But in software the standards just seem to be much lower even though it can be critical, even lethal, e.g. [1]. Network interfaces should be fuzzed. Every IO operation should assume it can fail and be tested for when it does. Every IO operation should be assumed capable of producing unexpected output. And it's simply cost-cutting and a lack of regulation that allows this sloppiness to persist. There should certainly be strict liability for any companies that allow this to happen.

[1]: https://ethicsunwrapped.utexas.edu/case-study/therac-25

reply
As much as I agree with the spirit of your post, standards did in fact change after the Therac-25 incident. That was nearly 40 years ago, after all! There are very high quality bars for medical equipment.
reply
> It can get a lot worse than that too. I have an app on my phone that will log me out and force me into a 2FA cycle if it hits a network timeout.

I use some fairly popular (in the MSP space) backup software that thinks the network is infallible. The worst case I’ve seen is when it fails on a network request, doesn’t retry adequately, and incorrectly logs the error as data corruption.

reply
IMO a lot of these problems come down to the same root cause: we are not fully enumerating and reasoning about failure cases.

Let's say you want to retry a network request. It's... A bit more complex than it seems, right?

Firstly, you need to know exactly what type of error you ran into. Some errors aren't really recoverable. Maybe a programming issue occurred and you are constructing an invalid URL and the HTTP client is yelling at you. No sense in retrying that 20 times. Maybe it's a network error, that seems like a good candidate to retry. Maybe, the request succeeded and we have a response, but it is a 500 error, again, seems like a good candidate.

Secondly, you need to know if it is safe to retry. If the request is essentially idempotent, like a read-only GET request, then surely it is safe, right? But, what if it isn't safe? Forget about solutions like idempotency tokens; let's assume you don't control that. Now you need to figure out how you can know if the request had side effects. If a well-known 4xx error is returned you might know, but if you get a network error or a 5xx error it's much harder. Did the request fail during a buffered response after the side effects were already applied? Maybe you can check to see if the request applied with another request. Now you have two network requests, and both need error handling.

Finally, and probably most obviously, you have to make sure you don't hammer the server when it is under load. To avoid the thundering herd problem, you'll probably want to use an exponential backoff with some jitter.
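As a rough sketch of the first and last decisions above, in Python: is the failure transient, is the request safe to send twice, and how long should we wait? Everything here is an assumption for illustration. The status set, the deliberately conservative `SAFE_METHODS` list, and the names `should_retry` and `backoff_delay` are mine, and a real client would need case-specific rules on top.

```python
import random

# Assumption: these 5xx codes are treated as transient; everything else is not.
RETRYABLE_STATUS = {500, 502, 503, 504}
# Assumption: only read-only methods are considered safe to repeat blindly.
SAFE_METHODS = {"GET", "HEAD", "OPTIONS"}

def should_retry(method, status=None, network_error=False):
    """Decide whether one failed attempt is worth repeating.

    `status` is the HTTP status code if a response arrived, else None.
    """
    transient = network_error or status in RETRYABLE_STATUS
    if not transient:
        return False  # e.g. a 4xx or a malformed URL: retrying cannot help
    # A transient failure may still have applied side effects server-side,
    # so only repeat requests that are safe to send twice.
    return method.upper() in SAFE_METHODS

def backoff_delay(attempt, base=0.5, cap=30.0, rng=random):
    """Exponential backoff with full jitter.

    Returns a delay drawn uniformly from [0, min(cap, base * 2**attempt)].
    """
    return rng.uniform(0, min(cap, base * 2 ** attempt))
```

Note what the sketch cannot answer: for a POST that died with a 503, `should_retry` just says no, and the question of whether the side effect was applied is still yours to resolve.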

What sucks about all of this is that while there are reusable components here, the concerns effortlessly cut through different layers, making them a pain in the ass to deal with. It isn't that it is impossible for a library to handle all of these problems (I anticipate an excited evangelist may reply explaining how their favorite library does it all in one package if this post gets enough visibility) it's just that this is hard and these problems repeat in different forms, in a way that makes it difficult to fully eliminate the repetition. And this is just the most obvious basics, whereas in reality there are almost always case-specific complexities.

You can, for example, encapsulate a reasonable exponential backoff with deadline implementation and apply that as appropriate for different things, but you can't really cheat your way out of having to think about all of these things, especially if you don't control all of the network APIs you might have to interface with.
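For what it's worth, the reusable part might look something like this minimal Python sketch. `TransientError` is a hypothetical marker exception, and the `clock`, `sleep`, and `rng` parameters are only there so the helper can be tested without real waiting. Deciding which failures deserve to be raised as `TransientError` is exactly the thinking you can't outsource.

```python
import random
import time

class TransientError(Exception):
    """Hypothetical marker for failures the caller has judged retryable."""

def retry_with_deadline(op, deadline_s, base=0.1, cap=5.0,
                        clock=time.monotonic, sleep=time.sleep, rng=random):
    """Call op() until it succeeds or deadline_s seconds have elapsed."""
    start = clock()
    attempt = 0
    while True:
        try:
            return op()
        except TransientError:
            remaining = deadline_s - (clock() - start)
            if remaining <= 0:
                raise  # out of time: surface the last failure to the caller
            delay = rng.uniform(0, min(cap, base * 2 ** attempt))
            sleep(min(delay, remaining))  # never sleep past the deadline
            attempt += 1
```

Any exception other than `TransientError` propagates immediately, so a programming error is never retried.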

This is one part of why I don't like try/catch exceptions. They are an appropriate mechanism to use as a failure isolation boundary due to their stack unwinding capability: it would still be bad in most cases if a logic error or upstream error not being handled properly in a single network request handler were able to crash an entire network server, so being able to blanket catch everything that bubbles up and log it is good. But using this for normal error handling makes doing the wrong thing perhaps just a bit too easy. I don't think you should have to self-flagellate in order to say "just crash if this errors", but I do think that you should have to say it. Try/catch exceptions are backwards by default: just write normal-looking control flow and no errors are handled, and it's hard to tell if there even are any. Checked exceptions try to fix this but somehow this feels even worse; now you have a flattened list of exceptions that may occur at various different layers of depth; in some cases the same exception can occur at different layers of depth, and you may literally need to read source code and map out the call stack in your head to be sure. (Hope it doesn't change later.)

The Result or Expected type concept seems like the way to go in the frame of modern programming languages. Go's error passing also works OK though it has papercuts (that a linter can help you with, at least.) To me it makes more sense to make stack unwinding error handling a more niche feature used for isolating error domains, rather than use them for all error handling.

But even that! Even that doesn't solve the problem. You still have to sit there and think about the types of errors that can occur and their consequences. At best, explicit error handling with value types just encourages you to confront it and makes it visible, even in cases where you still say "OK, pass to caller".
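A tiny illustration of what "makes it visible" means, sketched in Python with plain dataclasses. `Ok`, `Err`, `parse_port`, and `describe` are made-up names for this example, not any library's API.

```python
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class Ok:
    value: object

@dataclass(frozen=True)
class Err:
    error: str

Result = Union[Ok, Err]

def parse_port(text: str) -> Result:
    """Parse a TCP port, returning every failure as a value instead of raising."""
    if not text.isdecimal():
        return Err(f"not a number: {text!r}")
    port = int(text)
    if not 1 <= port <= 65535:
        return Err(f"out of range: {port}")
    return Ok(port)

def describe(result: Result) -> str:
    # The caller has to look at both branches; ignoring the error takes effort.
    if isinstance(result, Err):
        return "error: " + result.error
    return "port " + str(result.value)
```

The type does nothing to tell you which errors exist or what they imply. Someone still had to think of "out of range" and decide what the caller should do about it.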

reply
I don't disagree with anything in particular here, but other developers might fall into a trap with:

> we are not fully enumerating and reasoning about failure cases.

This might put (or keep) a developer in the mindset that they can code a series of imperative instructions to build their minimal viable product, and then come back and tighten things up later.

I expend all my effort in avoiding 'doing two things'. It's bloody difficult, but since I've come around to thinking that recovering from 1-of-2-things-failing is probably impossible in most situations, doing it the bloody difficult way is easier.

reply
deleted
reply
Why can the two things even fail at all?
reply
This is not a bad question!

If you flip it and instead ask "how do I write something that can't fail?" you might find some interesting ground.

The best things I know about are static type-checking, pure functions and totality. Different languages provide more or less help with these things. It's perfectly fine to do 'two things which don't fail or cause other things to fail'.

Forgive the digression, but there is an 'infectious' aspect to the above 3 things (see the function-colouring problem), e.g. you can't build pure functions which call non-pure functions. The Dependency Inversion Principle (of SOLID) gives some help in how to tackle this.

Also, the above things only work within one node (of a distributed system).

For multiple nodes, I use something like Kafka, where you write down one event, and have two systems subscribe to it, each doing one thing. Yes, there's still the obvious issue of them failing independently, but when that happens, you have an authoritative source of truth (in the form of Kafka events). This beats the crap out of developer logs.

You skip the laborious questions of "what happened in the system?" and "what should the correct state be?" Because the events are already the answer - just eyeball them.

Events are also machine-readable, so if you diagnose a problem and fix it in one case, there's a good chance you can build a detector for other cases. You don't have to wait for a support ticket to get escalated to the dev team.

You also divide the debugging space dramatically. If the Kafka log says one thing {Bob bought Minecraft for $10}, then the Ownership service is just wrong if it says Bob doesn't own Minecraft, and the Finance service is just wrong if it doesn't report the $10. Fix each independently. At no point do you need to look at Ownership and Finance together to see which one failed halfway through talking to the other, because they don't talk to each other.

Lastly, events are verifiable; they are their own audit trail. If your boss asks how much money is in the system, would you feel more confident reporting whatever the current balance is set to (i.e. the outcome of whatever code executed the last "UPDATE Balance ..." statement), or would you like to be able to sum over every transaction that you ever recorded?
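Here is roughly what I mean, as a minimal Python sketch. A plain list stands in for the Kafka topic, and `Purchase`, `owned_items`, `revenue_cents`, and `audit_balance` are hypothetical names. The point is only that each service's state is a function of the log, so each can be checked against the log on its own.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Purchase:
    buyer: str
    item: str
    price_cents: int

def owned_items(events, buyer):
    """What the Ownership service should report, derived only from the log."""
    return {e.item for e in events if e.buyer == buyer}

def revenue_cents(events):
    """What the Finance service should report, derived only from the log."""
    return sum(e.price_cents for e in events)

def audit_balance(events, stored_balance_cents):
    """Return stored balance minus log total; 0 means the two are consistent."""
    return stored_balance_cents - revenue_cents(events)

log = [Purchase("Bob", "Minecraft", 1000)]
```

If Finance's stored balance is 0 while the log holds Bob's purchase, `audit_balance(log, 0)` comes back as -1000 and you know which side is wrong without looking at Ownership at all.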

reply
A programmer had a problem, so they decided to use threads.

Now they have at least two problems.

reply
> Why? How, even, have they implemented this?

This is really common because of two design features that most UI frameworks share:

- The code that changes the color of the button is an internal part of the "button" component, so that people don't have to individually implement it on every button. But this means that it's kind of disconnected from the code that actually performs the action. If the "on click" handler has some last-ditch check that aborts the action, like the "don't rotate the image if it's in the middle of the rotate animation" check from the article, often there's no way for it to tell the button to cancel the color change. (And conversely sometimes the "on click" handler can fire even if the color change animation doesn't play correctly.)

- Buttons usually change color when you press down the mouse button, but only perform the action when you release the mouse button. Sometimes this is used to intentionally give you a chance to cancel the action at the very last minute by dragging your mouse off the button while it's still held down (or, on mobile, to e.g. reinterpret your interaction as scrolling instead of clicking), other times it just creates more opportunities for something to happen that prevents the action from working after the color change has already happened.

reply
> there's no way for it to tell the button to cancel the color change

No, but what should happen in cases like that is that the on-click handler disables the button while it is unresponsive. This will communicate the fact that the button is unresponsive visually to the user and also inhibit the button-was-pressed feedback.

reply
Of course, one can fix these problems. GP was merely saying why this kind of mistake is common; it is definitely a mistake, not an inevitability.
reply
Simply disabling the button leads to people thinking something is broken, so you need to add a visual "disabled" state - which should probably be separate from the "you are currently pressing the button" state.

In most cases that is going to lead to annoying pointless flickering as most actions & animations are basically instantaneous, and with touchscreens even in the non-pointless scenarios it won't have the desired effect as the button itself will be hidden from the user by their own finger.

In principle I think you are right, but in practice buffering presses is often probably the more user-friendly option.

reply
> Simply disabling the button leads to people thinking something is broken, so you need to add a visual "disabled" state - which should probably be separate from the "you are currently pressing the button" state.

Well, yes, dropping user inputs is "being broken"

reply
> Sometimes this is used to intentionally give you a chance to cancel the action

EDIT: sometimes UI elements with mouse-held interaction allow you to use the escape key to cancel an in-progress interaction (ESC: abort, mouse-up: commit) however the reply button on this page doesn't work that way so I have to edit this message to add this. That escape-key behavior should be universal I think.

reply
I notice this pretty consistently with elevators: If you press the button for a short amount of time, it visibly lights up while pressed but doesn't actually register the button-press.
reply
I have long presumed that sort of thing to be deliberate, avoiding activation on accidental bump.
reply
Then it should light up when the request is acknowledged, and stay lit up until the elevator arrives.

But wait, there's more: when the elevator arrives until it leaves, the button should flash or change to a more prominent color. Why? Because imagine someone presses up and someone else presses down, and the elevator arrives going up. If the up button switches off at this point, now only the down button is lit which clearly signals the elevator is going down, which is wrong.

reply
Perhaps we’re not talking about the same thing. What I refer to has the button light up while you bump it, and then go dark again, whereas if you press it more deliberately, it stays lit (and takes effect). This can apply to the buttons inside or outside the lift.
reply
It does sound somewhat reasonable on paper. But some of the crosswalk buttons in Belgium have this as well, and it's really jarring. You press the button, see the light go on and look away to look at the traffic light and wait for it to turn green. Except 20 seconds later you look back and the indicator light is off again. I feel very strongly that the indicator light should only turn on when your press has been registered.

Let's say you tell someone to do something, and they say "ok". But when you ask them later whether they did, they say "oh no, I just said ok to indicate that I heard you, not that I was going to do something about it." That doesn't make any sense. The indicator light has the same function. Going on and then off again is a violation of basic communication protocol.

reply
At intersections, I usually find myself not looking at my red “DON'T WALK” signal, but I’m watching the other traffic signals which are green, and watching for them to turn yellow then red, which means that the next step in the pattern is a green signal for me.

The more complex the intersection, the more controls I can watch, to get a feel for the rhythm, patterns, and triggers that influence the cue to activate my “WALK” signal.

I’m also watching the cars to see when the flows slow down or stop. I should perhaps pay more attention to the signal that pertains to me as a pedestrian or motorist...

reply
But then you might tap the button and think it's broken, because it does nothing. The light means "this button works", not "your desired action has registered".

I guess you might want to fade it from red to green (red being "this works" and green being "it'll do what I want"), but I don't mind the holding-down behaviour. The only problem is that you can never know how long you need to hold it down for unless you stop holding it.

reply
The color change of the button shows you succeeded in pushing it. If you don't do this instantly most people are conditioned to try again. This is especially valuable for people with reduced motor control. It is completely independent of whether that push is a useful input given the current state of the software. Obviously when well written software knows it can't accept the input it should have disabled the button, and even moderately well written software needs to provide a near instant feedback that the action is processing or has been cancelled.
reply
I suppose a lack of testing and an assumption that the action will fail so rarely that it’s not worth accounting for? But yes, such patterns make it hard to trust and efficiently use an interface.
reply
I've only worked for one company that actually did proper QA testing. It's expensive and time consuming and often the main functionality is okay, so many just skip QA altogether.

It is probably daily that I encounter products and procedures where I can see that a given scenario is kind of an edge case, but not an unforeseeable one. Given the scale of many things, edge cases happen pretty frequently, and with ever more rigid organisations, a lack of customer service and human interaction, and a quest for ever more cost savings, hitting an edge case can be anything from frustrating to catastrophic for a person.

Generally I think we, as in humans, need to slow down.

reply
Not handling rare cases is the root of a lot of bad software. “Oh, this bug will only affect 1% of users. Don’t bother fixing it: we have features to cram.”
reply
This is sometimes done intentionally to hide latency and make a UI feel faster - I certainly don’t like it though.
reply
Ahh, the stochastic behavior of networked devices!
reply
Another one I see: low-end devices have a volume knob that, instead of being a potentiometer, is a rotary encoder, so you can usually only adjust the volume as fast as it's sampled, which is slower than you want when, for example, you're in traffic and turning the radio down to hear something.
reply
Bad programming. People who have experience with embedded programming knows that reading out a button usually means denouncing. At the speed a microcontroller can read out a button, it will change its state multiple times per press because of contact bounce. Meaning when a user presses a button the program sees off, on, off, off, on, on, off, on, on, on, on, on, on, etc.

Now if you just naively read out the current state of the button and act on it elsewhere in the program loop, it may be off or on at random.

It is not hard to imagine if there is some other logic (or e.g. a rate limit) on the 30 seconds and on the beep that these would see different slices in time of the button. Congrats you built a button-debounce based RNG.

Physical buttons can be surprisingly complex if you don't rely on someone else's driver. The correct solution is to debounce the button. That can be done either in hardware (too expensive, so rarely done) or in software, by e.g. averaging the last 50 reads and wait till the majority is either off or on.
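A minimal sketch of that majority idea, simulated in Python rather than written for any real microcontroller. `MajorityDebouncer` and its `window` parameter are names I made up; the 50 from above is just the default.

```python
from collections import deque

class MajorityDebouncer:
    """Report a button as pressed once most of the last `window` reads agree."""

    def __init__(self, window=50):
        self.samples = deque(maxlen=window)  # oldest reads fall off automatically
        self.state = False

    def sample(self, raw):
        """Feed one raw read (truthy = pressed); return the debounced state."""
        self.samples.append(bool(raw))
        if len(self.samples) == self.samples.maxlen:
            ones = sum(self.samples)
            if 2 * ones > len(self.samples):
                self.state = True
            elif 2 * ones < len(self.samples):
                self.state = False
            # an exact tie keeps the previous state
        return self.state
```

Nothing is reported until the window has filled, and a single stray read inside an otherwise quiet window never flips the state.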

This should be common knowledge for embedded programmers, but every now and then you will see someone who has never heard of it.

reply
>averaging the last 50 reads and wait till the majority is either off or on.

This is a bad way to do it because it adds avoidable latency. A moving average is a low-pass filter. The switch bounce is better handled by hysteresis. Change state as soon as you see an edge, then ignore further edges until a timer expires, e.g. 5 ms, which should be enough for the bouncing to settle. A 5 ms timeout limits your repetition rate to 100 presses per second, which is beyond human capabilities.
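In simulation form the "act on the first edge, then ignore edges until a timer expires" logic is only a few lines. This is a Python sketch under my own naming (`LockoutDebouncer`, `lockout_ms`), fed with timestamped reads instead of real interrupts.

```python
class LockoutDebouncer:
    """Change state on the first edge, then ignore edges for `lockout_ms`."""

    def __init__(self, lockout_ms=5):
        self.lockout_ms = lockout_ms
        self.state = False
        self.locked_until = 0  # timestamp before which edges are ignored

    def sample(self, now_ms, raw):
        """Feed one timestamped raw read; return the debounced state."""
        raw = bool(raw)
        if now_ms >= self.locked_until and raw != self.state:
            self.state = raw                              # act on the first edge
            self.locked_until = now_ms + self.lockout_ms  # then ignore the bounce
        return self.state
```

The press is reported on the very first read that differs, so the added latency is zero. A press that is released again during the lockout is still seen, and its release is picked up on the first read after the lockout ends.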

You might want a tiny bit of hardware low-pass filtering too, for EMI resistance, but that's with microsecond-scale time constant, not milliseconds.

reply
Lots of replies with good ideas here. The biggest question is that EMI resistance; do you really need to ignore brief closures? In the vast majority of situations, the answer is no.
reply
Yes, you need to deal with EMI and static bursts on your microcontroller inputs.
reply
They didn't say how often the reads are - 50 reads could be only 5ms.
reply
In practice switch bounce often lasts tens or even hundreds of milliseconds, and you need to space out the read process to cover the entire bouncing process if you want to avoid registering fake presses. Using basic averaging means your minimum input latency is going to be ~half your bounce time - which is often way too high for it to feel like real-time input.

If you want to achieve low-latency input, "act on first edge, then ignore for the switch bounce period" is a far better approach. It also conveniently solves the "press, then release within bounce period" problem where an averaging algorithm would completely ignore the button press.

reply
the cheaper the switch, the longer the bounce.
reply
Regardless, if the problem is an input that normally registers the state of a button, except for some noise as it bounces during a transition, then 48 of those 50 reads, the averaging, and the 5ms of latency it incurs are all unnecessary with respect to the problem.

An averaging filter makes sense if you have a noisy analog input. For a button input that registers whether it is pressed or not except for a known noise around transitions specifically, ignoring the transitions immediately after the first one registered is not only faster (both in terms of latency and CPU cost) but easier to implement. It's also equally practical for switches with long bounce, where the time it would take for an average to favor a transition might be impractically long.

reply
Latency is cumulative, so avoidable latency is never acceptable. Maybe the hardware will change. Maybe somebody will run your software in an emulator. That 5ms could be enough to push the total latency into the "annoying" level.

And even with no additional latency, 5ms is perceptible in some cases anyway. Microsoft Research has a video demonstration:

https://www.youtube.com/watch?v=vOvQCPLkPt4

reply
deleted
reply
Do you really think the people who programmed your microwave should have taken into consideration that someone might write a microwave emulator in the future? Dealing with that is not their job, it is the job of the emulator creators.
reply
Who says the emulator is "unauthorized"?

For example, smartphone app developers routinely run their apps in emulators first to make the development process more convenient, only running it on a physical device for confirmation when the work is basically done.

Many embedded developers would kill for something similar, and we're already seeing the start of it with platforms like Wokwi. Being able to do integration tests without the physical device itself is an absolute game changer.

reply
Their job probably isn't to invent weird, stupid ways to account for button bounce, either.
reply
Doesn't matter, their way is a terrible one that adds latency for no reason.

There are 2 things here worth paying attention to:

* the first "bounce" is the user action
* the last "bounce" is the end of the user action

You can run the action on the first bounce, then just ignore the button for whatever debounce period you deem satisfactory. But adding a delay before starting the action is always the wrong answer for debouncing.

Now the harder problem is the release of the button, especially if hold is also an action, but "be off for at least a few ms" usually handles it well, and release time is not lag the user feels.

reply
No. Act on first transition. Cool down period following. You did not spot something everyone overlooked for 100 years.

There are other situations but not for a button. There are inputs that might be continuously noisy where a sliding window / ring buffer rolling sample is the only way to tell the difference between states. But we are talking about binary input controls actuated by a person, not a thermometer or O2 sensor.

reply
> People who have experience with embedded programming knows that reading out a button usually means denouncing

I know you mean "debouncing" but I love the autocorrect. Like the button is some almighty authority that Denounces noisy signals.

reply
Checking the button state a bunch of times and computing an average wastes a ton of clock cycles that you could be doing anything else (like updating a display, polling sensors, etc).

The standard way to debounce is to attach a timer to the button. When you press the button, an ISR runs that temporarily disables the button interrupt from triggering again and starts the timer for a specific period (say 20ms). The processor is free to do whatever it wants for the next 20ms. When that timer expires, another routine checks to see if the button is still being held, sets the button's state accordingly, then re-enables the button interrupt so it can be triggered again.
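Sketched as a Python simulation, with two methods standing in for the two interrupt handlers. `TimerDebouncedButton`, `on_edge_irq`, and `on_timer_expired` are hypothetical names; on real hardware these would be the pin-change ISR and the timer ISR, and the timer itself would be a peripheral rather than a stored timestamp.

```python
class TimerDebouncedButton:
    """Simulated edge-interrupt plus one-shot-timer debounce."""

    def __init__(self, debounce_ms=20):
        self.debounce_ms = debounce_ms
        self.pressed = False
        self.irq_enabled = True
        self.timer_expires_at = None  # None means the timer is not running

    def on_edge_irq(self, now_ms):
        """Stand-in for the pin-change ISR: mask further edges, arm the timer."""
        if not self.irq_enabled:
            return  # edges during the debounce window are ignored
        self.irq_enabled = False
        self.timer_expires_at = now_ms + self.debounce_ms

    def on_timer_expired(self, pin_level):
        """Stand-in for the timer ISR: sample the settled pin, unmask edges."""
        self.pressed = bool(pin_level)
        self.timer_expires_at = None
        self.irq_enabled = True
```

Between the two handlers the main loop does no polling at all. The state only updates when the timer fires, so whatever `debounce_ms` you pick is also the delay before a press is reported.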

Averaging loops are much better for analog inputs where you may have noise that throws off the reading. You only care about a button being on or off, it doesn't matter if it's been mostly on for that period only that it's still on.

When you get into extremely fast digital inputs that need to be reacted to sooner than the debounce wait period, that's when you need hardware debouncing.

reply