They were always hardware failures and lasted about 45-120 minutes. Not the end of the world, but also not fun getting a lot of client complaints.
I'm sure plenty of the roughly 1,000 AWS products have no viable DO competitor, but for what they do offer, they're great.
Even if you use AWS and the like, if you aren't building your app with redundancy across multiple AZs, then you'll have some downtime occasionally.
And even if you do build redundancy across multiple AZs, some services might fail anyway, as AWS is not entirely isolated internally. So you might still have downtime.
So just accept downtime and use the best tool for you (unless it's really bad, like GitHub-level bad). If you cannot accept any downtime, you'll have to spend millions of dollars and months of work to have the confidence to expect none. Something like Netflix's Chaos Monkey and infrastructure would be enough.
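To put rough numbers on that trade-off, here is a back-of-the-envelope sketch. It assumes zone failures are independent, which (as noted above) isn't entirely true on AWS, so the combined figure is optimistic. The availability values are made up for illustration, and the function names are my own.

```python
HOURS_PER_YEAR = 365 * 24


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours of downtime per year for a given availability (0..1)."""
    return (1.0 - availability) * HOURS_PER_YEAR


def combined_availability(per_zone: float, zones: int) -> float:
    """Availability when the app is up as long as at least one zone is up.

    Assumes zones fail independently; correlated failures make the real
    number lower than this.
    """
    return 1.0 - (1.0 - per_zone) ** zones


if __name__ == "__main__":
    # Illustrative inputs only, not published SLA figures.
    single = 0.99
    double = combined_availability(single, 2)
    print(f"one zone:  {downtime_hours_per_year(single):.2f} h/year down")
    print(f"two zones: {downtime_hours_per_year(double):.2f} h/year down")
```

Going from one zone to two looks great on paper, but the last few "nines" are exactly where shared control planes and correlated failures eat the gain.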
My gut feeling is that the number of significant applications that have this capability can probably be counted on two hands. Especially since a lot of the largest footprints of software stacks running in the cloud belong to Google and Microsoft, who I'm pretty sure do not replicate their services into someone else's cloud.
As an example, I note that GCP responded within 7 minutes according to their timeline. If you’d been using Cloud Run, that would have reduced downtime by over 7 hours — and there’s a good chance that you never would have gone down in the first place if the unknown trigger event was related to other customer activity or something odd Railway did.
There’s also a complexity factor: note how much complex infrastructure they mentioned having to fix that you wouldn’t need for your own account. That code does useful things, I’m sure, but it’s also a lot of moving parts which a hosting provider needs and you don’t. This outage took everyone down, whereas individual AWS or bare metal users would’ve otherwise been unaffected. There isn’t a global optimum which is the same for everyone, but I think developers are prone to wildly over-estimating how much time they save by removing a couple of deployment steps relative to the direct costs and the less obvious costs of working within someone else’s environment.
But really any service (or even on-site hosting) can have downtime. If that's not acceptable, then I suppose building or using a tool that can be distributed across multiple hosts in different geographical areas is the best option.
For Vercel, if your Next.js site can be compiled statically, you could probably throw it up on almost anything. We've self-hosted before, which is pretty straightforward, but you lose a lot of the image optimization stuff unless you go deep into setting up OpenNext.
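For what it's worth, the simplest version of "distributed between multiple hosts" is just trying each region in order and using the first one that answers. A minimal sketch, assuming each host exposes some health URL; the URLs in the usage comment and the function names are hypothetical, and real setups usually do this at the DNS or load balancer level rather than in application code.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable, Optional


def http_probe(url: str, timeout: float = 2.0) -> bool:
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError, ValueError):
        # Covers HTTP errors, connection failures, timeouts, and bad URLs.
        return False


def first_healthy(
    endpoints: Iterable[str],
    probe: Callable[[str], bool] = http_probe,
) -> Optional[str]:
    """Return the first endpoint whose probe succeeds, or None if all fail."""
    for url in endpoints:
        if probe(url):
            return url
    return None


# Usage (hypothetical hosts in different regions):
# target = first_healthy([
#     "https://eu.example.com/health",
#     "https://us.example.com/health",
# ])
```

The probe is passed in as a parameter so the ordering logic can be tested without touching the network.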
Azure!
It’s the enterprise cloud with enterprise support. They won’t randomly pull the plug on your account, unlike companies that have a wildly different cultural background:
Google - ad tech (you’re the product)
Amazon - shop front (you’re a competitor)
Oracle - lawyers (you’re a future lawsuit for license extortion)
Etc…
No code lock-in through SDKs, and built on top of AWS with great DX for both developers and coding agents.