upvote
What struck me though is that OP did so much work to migrate the server with zero downtime. The _single_ big server. Something’s off here.
reply
Well why have downtime if you can avoid it with a bit of work?

But I do agree the poster should think about this. I don't think it's 'off' or misleading; they just haven't encountered a hardware failure before. If they had one on this single box with 30 databases and 34 Nginx sites it would probably be a bad time, so yes, perhaps they should think about that a bit more.

They describe a db follower for cutover, for example, but could also have one for backups, plus rolling backups offsite somewhere (perhaps they do and it just didn't make it into the article). That would reduce risk a lot. Then of course they could put all the services on several boxes behind a load balancer.
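For the follower-plus-offsite-backups idea, a minimal sketch might look like the following. This assumes PostgreSQL; the hostnames, user names, and paths are all placeholders, so it's an illustration rather than something to run as-is:

```shell
# On the new box: clone the primary as a streaming replica.
# -R writes standby.signal and primary_conninfo, so the clone
# follows the primary until you cut over.
pg_basebackup -h old-server.example.com -U replicator \
  -D /var/lib/postgresql/16/main -R -X stream -P

# Rolling offsite logical backups, e.g. from a nightly cron job:
pg_dumpall -h old-server.example.com -U postgres | gzip |
  ssh backup-host.example.com "cat > backups/all-$(date +%F).sql.gz"
```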

But perhaps if the services aren't really critical it's not worth spending money on that; it depends partly on what these services/apps are.

reply
Besides, "Migrated 34 websites in one go with zero downtime" looks good on a resume, and is actually a useful skill.
reply
I run internal services on DO that I've considered moving to Hetzner for cost savings.

Could I take it down for the afternoon? Sure. Or could I wait and do it after hours? Also sure. But would I rather not have to deal with complaints from users that day and still go home by 5pm? Of course!

reply
To be fair, a lot of people still run this way and just have really good backups, or have an offline / truly on-prem server where they can flip the DNS switch in case of a true outage.
reply
Yes and for many services that is totally fine. As long as you have backups of data and can redeploy easily. It's not how I personally do things usually but there is definitely a place for it.
reply
Good point. I run single big servers. But I can bring them down every weekend for the entire weekend if I need to.
reply
There is software that can help a lot.

Also, in general, you can architect your application to be more friendly to migration. It used to be a normal thing to think about and plan for.

VMware has a conversion tool (vCenter Converter) that turns bare-metal machines into VM images.

One could take an image, then do regular snapshots, and maybe centralize the database so it's accessed from one place.

Sometimes it's possible to create a migration script that you run over and over to the new environment for each additional step.
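Such a re-runnable migration script can be as simple as an idempotent rsync pass; the hostname and paths here are made up:

```shell
#!/bin/sh
# Re-runnable sync: each pass only transfers what changed since
# the last one, so the final run inside the cutover window is fast.
NEW_HOST=new-server.example.com
for site in /var/www/*; do
  rsync -az --delete "$site/" "$NEW_HOST:$site/"
done
```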

You can also put a backup server in between, so backups don't put load on the primary's drives.

Digital Ocean makes it impossible to download your disk image backups, which is a grave sin they can never be forgiven for. They used to support this to some extent.

Still, a few commands can back up the running server to an image, and stream it remotely to another server, which in turn can be updated to become bootable.
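As a sketch of that, assuming GNU coreutils and made-up hostnames (for a consistent image of a running system you'd snapshot first, e.g. with LVM):

```shell
# Stream a compressed raw image of the disk to the new machine.
dd if=/dev/sda bs=64K status=progress | gzip -c |
  ssh new-server.example.com "gzip -d > /dev/sdb"

# Or keep it as a file to inspect, loop-mount, or make bootable later:
# dd if=/dev/sda bs=64K | gzip -c |
#   ssh new-server.example.com "cat > /backups/old-server.img.gz"
```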

This is the tip of the iceberg in the number of tasks that can be done.

Someone with experience can even instruct an LLM to build it, and someone skilled with LLMs could probably use one to uncover the steps and strategies for their particular use case.

reply
A week of downtime every decade I think still works out to a higher uptime than I've been getting from parts of GitHub lately. So I'd consider that a win.
reply
Respectfully, this type of "high availability" strawman is a dated take.

This is a general response to it.

I have run hosting on bare metal for millions of users a day and tens of thousands of concurrent connections. It can scale way up by doing the same thing you do in the cloud: provision more resources.

For "downtime" you do the same thing with metal as you do with Digital Ocean: get a second server and have them fail over.

You can run hypervisors to split and manage a metal server just like Digital Ocean does, except you're not vulnerable to the shared-memory and shared-CPU exploits of shared hosting. When Intel CPU or memory flaws or kernel exploits come out, as they have, one VM user can read the memory and data of processes belonging to other users.

Both Digital Ocean and IaaS/PaaS providers are still running similar Linux technologies to do the failover. There are tools that even handle it automatically, like Proxmox. This level of production-grade failover and simplicity was point-and-click ten years ago; it's just that no one's kept up with it.
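One common way to get that two-server failover on plain metal is a floating IP managed by keepalived (VRRP). A minimal sketch, with made-up interface name and address:

```
vrrp_instance VI_1 {
    state MASTER            # BACKUP on the second server
    interface eth0
    virtual_router_id 51
    priority 100            # lower priority on the backup
    virtual_ipaddress {
        203.0.113.10        # floating IP that moves on failover
    }
}
```

If the MASTER stops sending VRRP advertisements, the BACKUP claims the address within seconds.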

The cloud is convenient. Convenience can make anyone comfortable. Comfort always costs way more.

It's relatively trivial to put the same web app on a metal server, with a hypervisor/IaaS/PaaS, behind the same Cloudflare to get "scale".

Digital Ocean and Cloud providers run on metal servers just like Hetzner.

The software to manage it all is becoming more and more trivial.

reply
While I generally agree, this is an exaggeration:

> This level of production grade fail over and simplicity was point and click, 10 years ago.

While some of the tools are _designed_ for point and click, they don't always work, mostly because of bugs.

We run Ceph clusters under our product, and have seen our fair share of non-recoveries after temporary connection loss [1], kernel crashes [2], performance degradation on many small files, and so on.

Similarly, we run HA Postgres (Stolon), and found bugs in its Go error checking that cause it to fail to recover from crashes and full-disk conditions [3] [4]. This week, we found that full-disk situations will not necessarily trigger failovers. We also found that if DB connections are exhausted, the daemon that's supposed to trigger Postgres failover cannot connect to do so (currently testing the fix).

I believe most of these things are better figured out in hosted cloud solutions.

I agree that self-hosting HA with open-source software is the way to go. This software is good, and the more people use it, the fewer bugs it will have.

But I wouldn't call it "trivial".

If you have a lot of data, it is also brutally cheaper: the difference between hosting on AWS and doing our own Hetzner HA with free software would pay for 10 full-time sysadmins, and we only need ~0.2 sysadmins. And it still has higher uptime than AWS.

It is true that Proxmox is easy to set up and operate. For many people it will probably work well for a long time. But when things aren't working, it's not so easy anymore.

[1]: "Ceph does not recover from 5 minute network outage because OSDs exit with code 0" - https://tracker.ceph.com/issues/73136

[2]: "Kernel null pointer dereference during kernel mount fsync on Linux 5.15" - https://tracker.ceph.com/issues/53819

[3]: https://github.com/sorintlab/stolon/issues/359#issuecomment-...

[4]: https://github.com/sorintlab/stolon/issues/247

reply
I'm not arguing for cloud or against bare-metal hosting, just saying there is a broad range of requirements in hosting and not everyone needs or wants load balancers etc. It will clearly cost more than this particular poster wants to pay, since they want to pay the bare minimum to host quite a large setup.
reply
[dead]
reply
I feel like 95% of the web falls into this category. Like, have you ever said "That's it, I am never gonna visit this page again!" because of temporary downtime? Unless you are Amazon and every minute costs you bazillions, you are likely going to get the better deal by not worrying about availability and scalability. That 250€/m root server is a behemoth, complete overkill for most anything. As a bonus, when someone at AWS or Cloudflare touches DNS and takes down half the internet, you'll still be up.
reply
Exactly. I've never not bought something because the website was temporarily down. I've even bought from B&H Photo!

Even if Amazon was down, if I was planning to buy, I'd wait. Heck, I've got a bunch of crap in my cart right now that I haven't checked out.

Intentional downtime lets everyone plan around it, and it reduces costs by not needing N layers of marginal utility, all of which are fragile and prone to weird failures at times you don't intend.

reply
For me at least, the only things where availability really matters are my main personal communication services. If Signal was down for an hour, I'd be a little stressed. Maybe utilities like public transportation too, but that's only because I now have to do that online.

> Intentional downtime lets everyone plan around it, reduces costs by not needing N layers of marginal utility which are all fragile and prone to weird failures at times you don't intend.

Quite frankly, I would manage if things were run "on-supply" with solar and would just go dark at night.

reply
> Like, have you ever said "That's it, I am never gonna visit this page again!", because of temporary downtime?

That's a strawman version of what happens.

There have been times when I've tried to visit a webshop to buy something but the site was broken or down, so I gave up and went to Amazon and bought an alternative.

I've also experienced multiple business situations where one of our services went down at an inconvenient time, a VP or CEO got upset, and they mandated that we migrate away from that service even if alternatives cost more.

If you think of your customers or visitors as perfectly loyal with infinite patience then downtime is not a problem.

> Unless you are Amazon and every minute costs you bazillions, you are likely gonna get the better deal not worrying about availability and scalability. That 250€/m root server is a behemoth. Complete overkill for most anything.

You don't need every minute of downtime to cost "bazillions" to justify a little redundancy. If you're spending 250 euros/month on a server, spending a little more to get a load balancer and a pair of servers isn't going to change your spend materially. Having two medium size servers behind a load balancer isn't usually much more expensive than having one oversized server handling it all.

There are additional benefits to having the load balancer set up for future migrations, or to scale up if you get an unexpected traffic spike. If you get a big traffic spike on a single server and it goes over capacity you're stuck. If you have a load balancer and a pair of servers you can easily start a 3rd or 4th to take the extra traffic.
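A minimal nginx load balancer along those lines (the server names are placeholders):

```nginx
upstream app {
    server app1.internal:8080;
    server app2.internal:8080;
    # absorbing a spike is one line plus a reload:
    # server app3.internal:8080;
}

server {
    listen 80;
    location / {
        proxy_pass http://app;
    }
}
```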

reply
> There have been times when I've tried to visit a webshop to buy something but the site was broken or down, so I gave up and went to Amazon and bought an alternative.

Great. So how much did the webshop lose in that hour of maintenance (which realistically would be in the middle of the night for their main audience), and how much would they have paid for redundancy? It's also a bit hard to believe you repeatedly ran into an item sold by both a self-hosted webshop and Amazon. Are you sure they hadn't just messed up the web development? You could totally do that with AWS too...

> If you're spending 250 euros/month on a server, spending a little more to get a load balancer and a pair of servers isn't going to change your spend materially.

Of course, but that's not the argument. The implication is that you could just double the 250€/m server for redundancy and still pay a fraction of cloud prices. But really that server doesn't need further hardware diversification. As I said, it's complete overkill; blogs and forums could easily run on a 30€/m recycled machine.

reply
> Like, have you ever said "That's it, I am never gonna visit this page again!"

Spot on! People still go to Chick-fil-A, even if they are closed on Sundays!

reply