The more you know!
>a Proxmox cluster with two nodes is fucked and why we recommend an additional "witness".
Reminds me of the three Magi from Evangelion: https://magi.kinta.ma/
need a third one to confirm which of the 2 is accurate
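To make the "third vote" point concrete, here is a minimal sketch of strict-majority quorum arithmetic. It is not how Proxmox or any particular cluster stack implements it; `has_quorum` and the vote counts are made up for illustration.

```python
def has_quorum(votes_present: int, total_votes: int) -> bool:
    """Strict majority: more than half of all configured votes must be visible."""
    return votes_present > total_votes // 2


# Two nodes, network split: each side sees only its own vote (1 of 2).
# Neither side has a majority, so neither can safely keep writing.
two_node_split = [has_quorum(1, 2), has_quorum(1, 2)]

# Two nodes plus a witness (3 votes): the side that can still reach the
# witness sees 2 of 3 and wins; the isolated side sees 1 of 3 and stops.
with_witness_split = [has_quorum(2, 3), has_quorum(1, 3)]

print(two_node_split)
print(with_witness_split)
```

The witness does not need to run any workloads; in this sketch it only contributes the tie-breaking vote.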
One reason mainframes and micros are still around is that you can change almost everything, from hardware to software, without downtime.
It is also available in the surviving commercial UNIXes, and as a paid-for feature in some Linux distros, although not to the extent that those grandparent systems are capable of.
First, you might not reload everything in memory, so it will be patched on disk but not in process.
Second, you have not tested that the system can boot to a functional system. Say you have done live patching for 5 years and never rebooted, and then you have a power loss or hardware failure/upgrade that takes the system down. When you try to bring it back up, it doesn't work. Which configuration change in the past 5 years caused that? Which backup do you use?
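For the first point (patched on disk but not in process), a rough Linux-only sketch: a process that still maps a library that has since been replaced on disk shows that mapping with a `(deleted)` suffix in `/proc/<pid>/maps`. The helper names below are hypothetical, and this is only a heuristic: deleted temp files and similar mappings will show up as well.

```python
def stale_mappings(maps_text: str) -> set:
    """Return paths of file mappings whose backing file was deleted or replaced.

    Expects text in the /proc/<pid>/maps format:
    address perms offset dev inode pathname
    """
    stale = set()
    suffix = " (deleted)"
    for line in maps_text.splitlines():
        fields = line.split(None, 5)  # the pathname (6th field) may contain spaces
        if len(fields) == 6 and fields[5].endswith(suffix):
            stale.add(fields[5][: -len(suffix)])
    return stale


def stale_mappings_for_pid(pid: int) -> set:
    """Linux only; needs permission to read the target process's maps file."""
    with open(f"/proc/{pid}/maps") as f:
        return stale_mappings(f.read())


# Made-up sample in the maps format, for illustration only.
SAMPLE = (
    "55d0c0000000-55d0c0001000 r-xp 00000000 08:01 131 /usr/bin/exampled\n"
    "7f1c2a000000-7f1c2a1c0000 r-xp 00000000 08:01 132 /usr/lib/libexample.so.1 (deleted)\n"
    "7f1c2b000000-7f1c2b021000 rw-p 00000000 00:00 0\n"
)
print(stale_mappings(SAMPLE))
```

Any process that turns up here is still running the old code until it is restarted, which is the gap the comment is describing.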
And, yeah, everything is hot swappable on VAX. Those machines also cost 6+ figures, and often require a service contract that includes a permanent on site tech.
Only the last generation or 2 of the highest end VAXen had any significant hot swap (VAX 9000/400 and later, which sold very poorly). The vast majority of VAX machines didn't. Even hot-swapping DSSI disks was at best iffy.
When someone who's been there talks about VAX 'high availability', they're usually talking about VAX/VMS clustering. Very cool and generally effective approach to the problem. That was one big issue with the end-game VAXen: clustering a couple of 6-figure mid-range machines was often considered a better solution than going all-in on one 7- to 8-figure VAX 'mainframe'.
>often require a service contract that includes a permanent on site tech.
I don't recall that being common with DEC service contracts. Most of the sites I know of that had dedicated DEC techs were either very large installs or had...other...drivers (e.g. tech had to have a TS clearance to work on the machines).
All this "we must reboot to test" is bullshit excuses by unqualified workers
How do you know the automatic failover works? How do you know the standby system works?
I’ve seen many a “qualified worker” get sent packing because they never fully tested the prod system, since they just knew everything would work, and never tested the backup systems, since qualified workers do the job right the first time and have no need for a backup.
You design for this with generational tagged objects or something similar.
When you hotpatch the system for years, you have no idea whether the system can boot up or whether it will fail somewhere in the boot process.
i.e. you can only trust what you regularly test.
There were several switch failures in the 1980s / 1990s in which systems which had been upgraded in place without a full restart failed. (IIRC, one burnt down, literally.)
Engineers were uncertain as to whether or not a cold-boot restart was even possible.
Account concerning an AT&T system upgrade sourcing Risks Digest (Vol 9, Issue 62, February 26, 1990) by the recently deceased Peter G. Neumann: <https://telephoneworld.org/landline-telephone-history/the-cr...>.
Not doubting it, only curious about some kind of postmortem.
or translated: https://danskebank-com.translate.goog/da/news-og-insights/ny...
TLDR: power supply failed completely and DB2 failed running recovery operations due to multiple old/existing software bugs.
You can build that far more cheaply with 2-3 properly clustered load balancer units, 2-3 application servers behind those, and persistent storage (databases, LDAP, files) that allows writing to multiple nodes simultaneously.
I used to work at a university where we ran a few services from 2012 until my retirement in 2025 with zero downtime. One time my manager, who had a tech background, tried to add a PBR in a hurry using the WebUI without understanding the CLI syntax and came close to forcing a reboot, but I was able to fix it from the CLI by rolling back to the previous config and rebooting one unit at a time. Upgrading the software by major versions, up to the level each unit supported, wasn't hard: upgrade a node, it rejoins the cluster, upgrade the other node, it rejoins the cluster, all done. A few times I had to manually fix the config for some less important test backend servers that I had forgotten to change before the upgrade. No big deal. No major outages happened during those 13 years. Some of the redirect policy and action syntax, like GeoIP, was hard to understand and learn at first, but I was very surprised how darn reliable and nice they were to use and maintain.
The LBs were (Citrix) Netscalers in clustering mode (all nodes process traffic concurrently), which allowed live updates one node at a time without losing any connectivity through them. That wouldn't have been possible with devices in just HA mode.
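The one-node-at-a-time procedure can be sketched roughly like this. It is a toy in-memory simulation, not Netscaler's actual API or CLI; `Node`, `rolling_upgrade` and the version strings are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    version: str
    in_cluster: bool = True


def rolling_upgrade(nodes, target_version, min_serving=1):
    """Upgrade nodes one at a time, never dropping below min_serving active nodes."""
    log = []
    for node in nodes:
        # Count the *other* nodes that would keep processing traffic.
        serving = sum(1 for n in nodes if n is not node and n.in_cluster)
        if serving < min_serving:
            raise RuntimeError(f"taking {node.name} out would leave {serving} serving node(s)")
        node.in_cluster = False          # leave the cluster; the others carry the traffic
        log.append(f"{node.name}: out of cluster")
        node.version = target_version    # stand-in for the real upgrade and reboot
        node.in_cluster = True           # a real procedure would wait for health checks here
        log.append(f"{node.name}: rejoined at {target_version}")
    return log


cluster = [Node("lb1", "v1"), Node("lb2", "v1")]
for entry in rolling_upgrade(cluster, "v2"):
    print(entry)
```

The check before each step is the important part: with every node actively serving, one can always be taken out, whereas a single unit has nothing left to carry traffic while it is upgraded.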
We had just 2 beefy units, which worked very well for us, but you can have 2-32 of them in a cluster managing thousands of servers behind them if you need that. Netscalers are FreeBSD-derived, with quite a bit of the TCP/IP stack rewritten to add support for many features, some quite odd, that standard FreeBSD doesn't have. Much of that is IP/Ethernet multicast features, PBRs, Traffic Domains (VRFs), and many service and monitoring processes that keep the cluster (or HA pair) in sync, so that if a node fails another can continue straight from there without any loss of traffic to the clients being proxied.
Though I think most people in this forum are familiar with haproxy, pound and the reverse proxying provided by web-server software.
A car analogy: if the previous ones are your fancy sport sedan, Netscaler and F5 BigIP are Formula 1 class cars, i.e. quite different beasts altogether.
e: And proper LBs are not just for HTTPS etc.; they are also very nice for proxying many other protocols, be they TCP, UDP or something else. We did VPNs and things like Cisco APs' CAPWAP (DTLS, i.e. SSL over UDP). e: typo.
Hence my second paragraph.
Thanks for sharing the story.
I think I’m gonna hafta keep waiting...
We have some Sun V880s at work and I'm fairly sure the only part you cannot change with the power on and system running is the motherboard itself.
And I would not be surprised if some ex-Sun Gandalf Beard "well akshully"s this comment.
Care to elaborate? I wanna know more.
https://www.cloudflare.com/learning/dns/dns-records/dns-mx-r...
[1] yes...I know there's a ton of caveats here...
With that said, if high availability is not a concern then 1 can be just fine.