Services where uptime matters tend to be designed so they can tolerate the reboot of a single node for other reasons besides kernel maintenance. I can't imagine a situation where I can't tolerate the downtime of a reboot but I would be willing to risk the system locking up with brain surgery gone wrong.
I've run systems with live code updates for userland, and would have considered live kernel updates if it was reasonable on our systems.
The thing is you typically build your system to tolerate reboot or unscheduled stop of a single node. Scheduled stop is nicer, but systems sometimes lock up even when you're not doing risky behaviors, so you know.
But just because the system can tolerate a reboot or restart doesn't mean it's not disruptive. A lock up / etc during hot load is also disruptive, of course. But when you can push code without having to stop anything, with limited impact on users, it makes it easier and faster to do updates. You can use whatever rollout pattern you like to contain risk too; same as you would for an upgrade with restarts.
For us, we have servers with hundreds of thousands or millions of tcp connections from mobile clients. Restarting a server would make all those clients have to reconnect and connecting is expensive. Restarting all the servers would result in many clients reconnecting several times. It was better to avoid it when possible.
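To put a rough number on "reconnecting several times", here is a back-of-the-envelope sketch. The model is my own assumption, not a description of the real system: clients start evenly spread, servers are restarted one at a time, and each displaced client reconnects to a uniformly random other server, which may itself be restarted later. The function name is made up for illustration.

```python
def expected_reconnects_per_client(num_servers):
    """Average reconnects per client over one full rolling restart.

    Assumed toy model: even initial spread, one server restarted at a time,
    displaced clients pick one of the other servers uniformly at random.
    """
    if num_servers < 2:
        raise ValueError("need at least two servers for clients to move to")

    # share[j] is the fraction of all clients currently connected to server j
    share = [1.0 / num_servers] * num_servers
    reconnects = 0.0

    for restarting in range(num_servers):
        displaced = share[restarting]
        reconnects += displaced          # every displaced client reconnects once
        share[restarting] = 0.0
        for other in range(num_servers):
            if other != restarting:
                share[other] += displaced / (num_servers - 1)

    return reconnects


print(expected_reconnects_per_client(2))  # 1.5
```

Every client reconnects at least once in this model, so any average above 1 means some clients get bounced more than once, and that is before counting the cost of each reconnect.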
Consider a hyper-converged cluster with many nodes serving distributed block storage, say at N=3 replication. This can tolerate exactly one node of outage at a time for the reboot. It would seem preferable to drain the nodes in a way that allows for more parallelism in the per-node kernel-reboot process, but draining is expensive and it's cheaper to reboot and hope the data comes back to the pool within some period of time after the reboot. This gets worse linearly as the cluster grows.
A non-trivial size cluster facing this can have a reboot rollout easily stretch from hours into days and even weeks. It is further made slower when the roll-out itself is repeatedly paused when any other production issue is detected, or some other in-cluster event is happening and distributed storage health is degraded or unavailable. If a single (additional) node goes out during the reboot roll-out, data goes unavailable and storage must wait and heal. It also simply takes time for the cluster to reconcile when the storage eventually comes back from reboot to make sure it is all still there.
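The linear growth is easy to see with some arithmetic. This is only a sketch: the node count, reboot time, and heal time below are invented numbers, the function name is hypothetical, and it ignores the pauses for other production issues, which only make things worse.

```python
import math


def rollout_hours(nodes, reboot_minutes, heal_minutes, max_parallel=1):
    """Rough duration of a rolling reboot, in hours.

    max_parallel=1 models the "only one node out at a time" constraint;
    each wave has to reboot and then wait for storage to heal.
    """
    if nodes < 0 or max_parallel < 1:
        raise ValueError("nodes must be >= 0 and max_parallel >= 1")
    waves = math.ceil(nodes / max_parallel)
    return waves * (reboot_minutes + heal_minutes) / 60


# Hypothetical cluster: 500 nodes, 10 min reboot, 20 min to heal and reconcile
hours = rollout_hours(500, 10, 20)
print(hours, hours / 24)  # 250.0 hours, a bit over 10 days
```

Doubling the node count doubles the total, which is how a rollout ends up measured in weeks.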
If your systems are large enough, rollouts go so slowly that you fall into the trap where the target release changes mid-deployment, to benefit from everything learned in the last many days or weeks: security, performance, crashes, whatever! There is a benefit, because the fixes you cared about most got onto a portion of the cluster sooner rather than later. There is also a penalty, as this resets the time it takes to deploy, elongating the perceived end-to-end deployment time. This negatively affects OKRs and similarly displaces the release of anything that was queued for upcoming releases.
So yeah, live patching is great to get priority fixes out in a matter of minutes or hours. I also think it is the best tool to get oneself out of this rollout-reset trap and onto the next release sooner. Faster than rollback or rollover.
> I can't imagine a situation where I can't tolerate the downtime of a reboot but I would be willing to risk the system locking up with brain surgery gone wrong.
Because you haven't worked at that level in an organization. In some cases, doing a restart might involve paperwork with your client and a maintenance window outside of working hours, even if the service is redundant. And some customers are fine with a little bit of downtime and don't want active/active redundancy, but still insist on maintenance windows for any work like that.
Live patching makes that a whole lot easier.
$200/year is peanuts for any commercial use worth the name. The problem, of course, is the whole non-free infrastructure it has to introduce.
I wonder when large and critical OSS projects will start to be seen as the public good they are, with large corporations willingly financing them because not doing so is bad PR.
Attacks are still possible, but if we look at the xz backdoor attack[1], it was an insanely complicated attack and it still failed. Its failure isn't exactly reassuring: the attack could have succeeded, the attacker was just unlucky. Still, it shows that success is not guaranteed.
Theoretically npm could be improved in this way, if there were a separate "distro" for packages, with dedicated maintainers who don't write the code, just pull it from upstream and review it. It is not being done because of the tragedy of the commons, not because it is impossible.
Maybe I don't have enough imagination and/or creativity, but trying to picture it, I see just a bit more oversight built into the protocols for approving changes to repositories. I mean, it doesn't seem that improved security needs a "destroy everything and build it from scratch" approach; some additions on top of existing structures would do. Am I wrong?
Are you arguing that the system may be more resilient than it seems? Like, maybe there is a conspiracy working on security, and they keep themselves secret so that attackers underestimate the real level of security and make mistakes that would inevitably get caught?
It seems like an over-stretched explanation, doesn't it? Care to explain yourself?
But I absolutely believe we should have a method for changing kernel configuration (e.g. kernel module blacklists), syscall firewalls, and the like.
Maybe all of those userspace-work-done-in-kernel-because-muh-performance features should be restricted to (the "real") CAP_NET_ADMIN, unless positively enumerated as free-for-all-containers. And then subtract from that free-for-all list every time you learn that some kernel module in its currently available version cannot be trusted to do its own memory shuffling.
I think we can learn many lessons from the recent SNAFUs before going all wild on auto-patching.
One lesson for example is that you shouldn't compile into the kernel modules that only about 0.00001% of all Linux installations out there are ever going to use.
Another lesson is that even if the modules are compiled, but not into the kernel, they should probably be blacklisted (preventing them from loading) by default and only removed from the blacklist by people who really know they'll need these rarely used modules.
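As a sketch of what "blacklisted by default" could look like in practice, here is a small generator for a modprobe.d-style config file. The helper name and the module list are purely illustrative, not a recommendation. As far as I know, a plain `blacklist` line only stops automatic alias-based loading, while an `install <module> /bin/false` line also makes an explicit `modprobe` of that module fail, so the sketch emits both.

```python
import re


def render_modprobe_blacklist(modules, hard_block=True):
    """Render config text suitable for a file under /etc/modprobe.d/.

    hard_block=True additionally emits `install <name> /bin/false`, so that
    an explicit modprobe of the module fails as well.
    """
    lines = ["# Example only: block rarely used kernel modules"]
    for name in sorted(set(modules)):
        if not re.fullmatch(r"[A-Za-z0-9_-]+", name):
            raise ValueError(f"suspicious module name: {name!r}")
        lines.append(f"blacklist {name}")
        if hard_block:
            lines.append(f"install {name} /bin/false")
    return "\n".join(lines) + "\n"


# Illustrative list of rarely needed network protocol modules
print(render_modprobe_blacklist(["sctp", "dccp", "rds", "tipc"]))
```

Taking a module off the list is then a one-line change for the people who actually need it.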
We're way past the "but it needs to work in all cases": we're now into the "users installing our distro are getting hacked left and right" territory.
In any case I think many things can be done before Linux distros reproduce the "security" practices of the NPM ecosystem.
Are we? Are users actually getting hacked, or have they theoretically been exposed to problems that could allow local privilege escalation if exploited but that nobody's seen used in the wild?
(Edit: To be clear, I'm skeptical but this isn't a completely rhetorical question. If there are actual reports of these vulns causing problems, that would strongly incentivize a stronger response.)
Perhaps we should tend toward the first.
For Gentoo, of course, "just recompile the kernel as desired" is more reasonable, though they have binary packages, including for the kernel, and I don't see why the same idea shouldn't work there.
But I don't want to know what drivers I need and will need next. Tomorrow I could buy a different wifi module and then what? Spend 3 hours googling which rtl378326973268632aahaxhabt.ko to install? Thanks but no thanks.
We can have security and convenience.
It would work for various other drivers though.
You can do blacklists easily enough if you want to: just add a few lines of text under /etc.
I'd also like an option for whitelisting. Whitelisting every single NIC driver is harmless enough because they just won't be loaded, but anything that can be loaded by a non-root userspace action should have the option to be loaded only if it is on the whitelist.
Though all that is easily doable by just changing userspace, AFAIK.
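For the whitelist idea, a userspace audit is a cheap first step. The sketch below only reports loaded modules that are not on an allowlist; it does not enforce anything. It assumes the usual `/proc/modules` layout where the module name is the first field on each line, and the function names and sample data are made up.

```python
def loaded_modules(proc_modules_text):
    """Extract module names from text in /proc/modules format."""
    return [line.split()[0]
            for line in proc_modules_text.splitlines()
            if line.strip()]


def not_on_allowlist(proc_modules_text, allowlist):
    """Return loaded modules missing from the allowlist, sorted by name."""
    # modprobe treats '-' and '_' in module names interchangeably
    allowed = {name.replace("-", "_") for name in allowlist}
    return sorted(m for m in loaded_modules(proc_modules_text)
                  if m.replace("-", "_") not in allowed)


# Made-up sample; on a real box you would read /proc/modules instead
sample = (
    "e1000e 290816 0 - Live 0x0000000000000000\n"
    "sctp 450560 2 - Live 0x0000000000000000\n"
)
print(not_on_allowlist(sample, ["e1000e"]))  # ['sctp']
```

Actually refusing the load would still need the modprobe configuration or something on the kernel side; this just tells you what slipped through.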