I'd go with bad UI/UX.

A lot of progress has been made by acknowledging that people are idiots and that the system has to work around that. Toyota, which went from one of the worst to one of the most reliable automakers, is known for formalizing idiot-proofing.

If the reader had been able to read the card both ways, there wouldn't have been a problem and no training would have been required. The next best thing would be for the card to not fit upside down, or for the reader to show a clear message like "try flipping the card". It is not something you should train people for; it should be obvious.

I also suspect the reader was in an unusual configuration, because everyone knows how to use smart cards, and they probably did instinctively what they always do and it didn't work. In the thousands of times I've done it, I don't remember ever inserting my credit card the wrong way, and I don't remember anyone who did; it is just that instinctive. For an entire team to miss that, there must be something wrong with how the reader is set up.

reply
> In the thousands of times I've done it, I don't remember ever inserting my credit card the wrong way, and I don't remember anyone who did; it is just that instinctive.

I have done it lots of times! With machines where you just dip the tip, you're bound to put the side with the chip in, but most machines want it facing up, and some want it the other way. The iconography is only illustrative once you've messed it up at those machines enough times (around me, Walgreens has difficult machines). Readers where you insert the whole card are easier to mess up, too.

> If the reader had been able to read the card both ways, there wouldn't have been a problem and no training would have been required. The next best thing would be for the card to not fit upside down, or for the reader to show a clear message like "try flipping the card". It is not something you should train people for; it should be obvious.

I suspect the HSM was an off-the-shelf component. The real issue with training is that a system with a complex startup procedure hadn't been restarted in 5 years. You should rehearse complex procedures at least once a year; otherwise there's a good chance nobody with experience has done it. Also, maybe someone would have flagged the issue of needing the cards to start the system that grants access to the cards. (Although drill + 1 hour is a reasonable recovery procedure that was obvious and didn't need training, apparently.)

reply
Agree.

The fundamental lesson of at least half my information systems undergraduate courses was that you adapt the system to observed user behavior; you do not expect the user to adapt their behavior to the system.

reply
Of all the companies with great SRE, I would not have expected Google to be one where this process was so brutally flawed:

- Storing the safe's password, which is required for the password manager to start, ... in this very same password manager?
- Failing at trying to insert the card in multiple ways into the card reader (it's like USB, you're using it the wrong way around). I would have tried that before (while?) drilling the safe.
- Having no clue (no documentation) how to restart the service, despite it having passwords in it? If passwords are lost, all encrypted stuff is lost, forever.

If there's one thing I think is central to document (personal or corporate), it is how to get access to passwords _fast and reliably_ during disaster recovery.
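
To make the first point concrete, here is a minimal sketch of checking a recovery plan for circular dependencies. Everything in it is an assumption for illustration: the `find_cycle` helper and the `recovery_deps` map are hypothetical names, loosely modeled on the story, and not a description of Google's actual setup.

```python
def find_cycle(deps):
    """Return one dependency cycle as a list of names, or None if there is none.

    deps maps each item to the items it needs before it can be recovered.
    """
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, fully explored
    color = {}
    path = []

    def visit(node):
        color[node] = GRAY
        path.append(node)
        for dep in deps.get(node, []):
            state = color.get(dep, WHITE)
            if state == GRAY:
                # dep is already on the current path, so we have looped back to it.
                return path[path.index(dep):] + [dep]
            if state == WHITE:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        color[node] = BLACK
        return None

    for node in deps:
        if color.get(node, WHITE) == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


# Hypothetical recovery map: each item lists what you need in hand to recover it.
recovery_deps = {
    "password_manager": ["hsm_smart_card"],
    "hsm_smart_card": ["safe"],
    "safe": ["safe_combination"],
    "safe_combination": ["password_manager"],
}

print(find_cycle(recovery_deps))
```

With this made-up map, the check walks from the password manager to the smart card, the safe, and the combination, and then lands back on the password manager. A rehearsal or a review of a map like this is the kind of thing that could flag the loop before an outage does.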

reply
What part of "best effort, unsupported" do you not understand, if you've read the article?

You're underestimating the number of goodwill-run, unstaffed projects that any big corporation accrues over time, which accidentally become load-bearing without anyone realizing until something goes wrong. Such unstaffed projects are usually very stable (from not having pressure to add features or earn profit) and therefore "just work" for years until something unusual, like an accidental DDoS, happens. By that time, the original author(s) and everyone with context have left the company. This is a very hard process/human problem to solve at FAANG scale.

reply
If it's not obvious to multiple Google SREs and no instruction sticker was present, that's a bad UI.