For those that do, your scaling example works against you. If today you can merge three services into one, then why do you need full time infrastructure staff to manage so few servers? And remember, you want 24/7 monitoring, replication for disaster recovery, etc. Most businesses do not have IT infrastructure as a core skill or differentiator, and so they want to farm it out.
This is really the core problem. Every time I’ve done the math on a sizable cloud vs on-prem deployment, there is so much money left on the table that the orgs could afford to pay FAANG-level salaries for several good SREs, but we have never been able to find people to fill the roles, or even to know whether we had found them.
The numbers are so much worse now with GPUs. The cost of reserved instances (let alone on-demand) for an 8x H100 pod, even with NVIDIA Enterprise licenses included, leaves tens of thousands per pod for the salary of employees managing it. Assuming one SRE can manage at least four racks, the hardware pays for itself, if you can find even a single qualified person.
The first is that SRE team size primarily scales with the number of applications and level of support. It does scale with hardware, but sublinearly, whereas the number of applications usually scales superlinearly. It takes a ton less effort to manage 100 instances of a single app than 1 instance each of 100 separate apps (presuming SRE has any support responsibilities for the app). Talking purely in terms of hardware would make me concerned that I’m looking at an impossible task.
The second (which you probably know, but interacts with my next point) is that you never have single person SRE teams because of oncall. Three is basically the minimum, four if you want to avoid oncall burnout.
The last is that I don’t know many SREs (maybe none at all) that are well-versed enough in all the hardware disciplines to manage a footprint the size we’re talking. If each SRE is 4 racks and a minimum team size is 4, that’s 16 racks. You’d need each SRE to be comfortable enough with networking, storage, operating system, compute scheduling (k8s, VMWare, etc) to manage each of those aspects for a 16 rack system. In reality, it’s probably 3 teams, each of them needs 4 members for oncall, so a floor of like 48 racks. Depending on how many applications you run on 48 racks, it might be more SREs that split into more specialized roles (a team for databases, a team for load balancers, etc).
Numbers obviously vary by level of application support. If support ends at the compute layer with not a ton of app-specific config/features, that’s fewer folks. If you want SRE to be able to trace why a particular endpoint is slow right now, that’s more folks.
That's vastly overstating it. You hit the nail on the head in the previous paragraphs: it's the number of apps (or, more generally speaking, environments) that you manage; everything else is secondary.
And that is especially true with modern automation tools. Doubling the rack count means a big chunk of initial time spent moving hardware, of course, but after that there is almost no difference in the time spent maintaining them.
In general, time spent per server will shrink, because the bigger you grow, the more automation you will generally use, and some tasks can be grouped together better.
Like, at a previous job, servers were installed manually, because it was rare.
At my current job it's just "boot from network, pick the install option, enter the hostname, press enter". A whole-rack (re)install would take you maybe an hour; everything else in the install is automated. You write a manifest for one type/role once, test it, and then it doesn't matter whether it's 2 or 20 servers.
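That manifest-once, reuse-for-any-count workflow can be sketched roughly like this; the role names, fields, and template structure here are all hypothetical, not a real tool:

```python
# Minimal sketch of manifest-driven installs: one role template written
# and tested once, then rendered per host. All names are illustrative.
ROLE_TEMPLATES = {
    "web": {"packages": ["nginx"], "netboot_image": "debian-12"},
    "db":  {"packages": ["postgresql"], "netboot_image": "debian-12"},
}

def render_manifest(hostname: str, role: str) -> dict:
    """Per-host manifest = the shared role template plus the hostname."""
    manifest = dict(ROLE_TEMPLATES[role])
    manifest["hostname"] = hostname
    return manifest

# Whether it's 2 servers or 20, the human input is the same: a host list.
hosts = [f"web{i:02d}" for i in range(1, 21)]
manifests = [render_manifest(h, "web") for h in hosts]
```

The point the comment makes is visible in the shape of the code: the per-server marginal effort is one line in a host list, not a repeat of the install procedure.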
If we grew the server fleet, say, 5-fold, we'd hire... one extra person for a team of 3. If the number of different applications went up 5-fold, we'd probably have to triple the team size - only triple, because there are still some things that could be made more streamlined.
Tasks like "go replace a failed drive" might be more common, but we usually do it once a week (there's enough redundancy) for all the servers that might have died. If we had 5x the number of servers, the time would be nearly the same, because getting there dominates the 30s needed to replace one.
If you're doing regular inference for a product with very flat throughput requirements (and you're doing on-prem already), on-prem GPUs can make a lot of sense.
But if you're doing a lot of training, you have very bursty requirements. And the H100s are specifically for training.
If your H100 fleet will sit under ~38% utilized over time, owning it is losing you money versus renting.
If you have batch throughput you can run on the H100s when you're not training, you're probably closer to wanting on-prem.
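The ~38% figure falls out of a simple break-even ratio. The on-prem hourly cost below is an assumption for illustration (hardware amortized over several years plus power, colo, and support), not a quoted price:

```python
# Back-of-envelope break-even utilization for owning vs renting GPUs.
ON_DEMAND_PER_HOUR = 55.0  # approximate cloud on-demand rate, 8x H100
ON_PREM_PER_HOUR = 21.0    # assumed all-in amortized cost of owning

# Owning costs you ON_PREM_PER_HOUR for every hour, used or idle.
# Renting costs ON_DEMAND_PER_HOUR only for hours actually used.
# Break-even is where (utilization x on-demand rate) equals the
# fixed hourly cost of ownership.
break_even_utilization = ON_PREM_PER_HOUR / ON_DEMAND_PER_HOUR

print(f"break-even utilization: {break_even_utilization:.0%}")
```

Below that utilization, the idle hours you are paying for on-prem outweigh the per-hour premium of renting.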
But the other thing to keep in mind is that AWS is not the only provider. It is a particularly expensive provider, and you can buy capacity from other neoclouds if you are cost-sensitive.
AWS charges $55/hour for an EC2 p5.48xlarge instance, which goes down with 1- or 3-year commitments.
With 1 year commitment, it costs ~$30/hour => $262k per year.
3-year commitment brings price down to $24/hour => $210k per year.
This price does NOT include egress, and other fees.
So, yeah, there is a $120k-$175k difference that can pay for a full-time on-site SRE, even if you only need one 8xH100 server.
Numbers get better if you need more than one server like that.
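A quick sanity check of those figures. The on-prem number below is an assumption: roughly $90k/year all-in for a comparable 8x H100 box (amortized hardware plus colo and power), picked to be consistent with the $120k-$175k range above, not a vendor quote:

```python
# Annualizing the thread's EC2 p5.48xlarge (8x H100) hourly rates.
HOURS_PER_YEAR = 24 * 365  # 8760

reserved_1yr = 30 * HOURS_PER_YEAR  # ~$30/hour with a 1-year commit
reserved_3yr = 24 * HOURS_PER_YEAR  # ~$24/hour with a 3-year commit

# Assumed all-in annual cost of owning the same box -- illustrative.
on_prem_per_year = 90_000

print(reserved_1yr - on_prem_per_year)  # savings vs 1-yr commit
print(reserved_3yr - on_prem_per_year)  # savings vs 3-yr commit
```

That lands at roughly $173k saved versus the 1-year commitment and $120k versus the 3-year one, before egress and other fees widen the gap further.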
Hiring 1 person to run the infrastructure means that 1 person is on-call 24/7 forever.
If there's an issue with the server while they're sick or on vacation, you just stop and wait.
If they take a new job, you need to find someone to take over or very quickly hire a replacement.
There's a second bus factor: What happens when that 8xH100 starts to get flakey? You can't move the jobs to another server because you only have one. You can start diagnosing things and replacing parts and hope it gets to the root issue, but that's more downtime.
Going on-prem like this is highly risky. It works well until the hardware starts developing problems or the person in charge gets a new job. The weeks and months lost to dealing with the server start to become a problem. The SRE team starts to get tired of having to do all of their work on weekends because they can't block active use during the week. Teams start complaining that they need to use cloud to keep their project moving forward.
> Hiring 1 person to run the infrastructure means that 1 person is on-call 24/7 forever.
> If there's an issue with the server while they're sick or on vacation, you just stop and wait.
Very much depends on what you're doing, of course, but "you just stop and wait" for sickness/vacation is sometimes actually good enough uptime -- especially if it keeps costs down. I've had that role before... That said, it's usually better to have two or three people who know the systems (even if they're not full-time dedicated to them) to reduce the bus factor.
You can still use cloud for excess capacity when needed. E.g. use on-prem for base load, and spin up cloud instances for peaks in load.
They come with a warranty, often with a technician guaranteed to arrive within a few hours or at most a day. Also, if SHTF, just getting cloud capacity to cover the shortfall isn't hard.
These come in a non-flakey variant?
And the other argument: every company I've ever known to do AWS has an AWS sysadmin (sorry, "devops"), same for Azure. Even for small deployments. And departments want their own person/team.
You can ask AI to troubleshoot and fix the issue.
> $120K isn't going to cover the fully loaded costs of an SRE who can set up and run that.
Literally this. I can do SRE on-prem and cloud, and my 50/30/20 budget break-even point (as in, needs and savings but no wants - so 70%) is $170k before taxes. Rent is astonishingly high right now, and the sort of mid-career professional you want to handle SRE for your single DC is going to take $150k in this market before fucking off to the first $200k job they get.
Know your market, and pay accordingly. You cannot fuck around with SREs.
> Hiring 1 person to run the infrastructure means that 1 person is on-call 24/7 forever.
This is less of an issue than you might think, but strongly dependent upon the quality of talent you’ve retained and the budget you’ve given them. Shitbox hardware or cheap-ass talent means you’ll need to double or triple up locally, but a quality candidate with discretion can easily be supported by a counterpart at another office or site, at least short-term. Ideally though, yeah, you’ll need two engineers to manage this stack, but AWS savings on even a modest (~700 VMs) estate will cover their TC inside of six months, generally.
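The six-month payback claim is easy to sanity-check. The per-VM savings and engineer comp below are assumptions for illustration, not figures from the comment:

```python
# Back-of-envelope payback period: two engineers' comp vs cloud
# savings on a ~700 VM estate. Both rates are assumed.
VMS = 700
SAVINGS_PER_VM_PER_MONTH = 100  # assumed avg cloud-vs-on-prem delta
ENGINEER_TC = 200_000           # assumed annual total comp, per engineer

monthly_savings = VMS * SAVINGS_PER_VM_PER_MONTH   # $70,000/month
payback_months = 2 * ENGINEER_TC / monthly_savings  # under 6 months
```

Under these assumptions the savings cover both engineers' comp in under six months; the claim only needs a modest per-VM delta to hold.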
> There's a second bus factor: What happens when that 8xH100 starts to get flakey? You can't move the jobs to another server because you only have one. You can start diagnosing things and replacing parts and hope it gets to the root issue, but that's more downtime.
This strikes at another workload I neglected to mention, and one I highly recommend keeping in the public cloud: GPUs.
GPUs on-prem suck. Drivers are finicky, firmware is flakey, vendor support is inconsistent, and SR-IOV is a pain in the ass to manage at scale. They suck harder than HBAs, which I didn’t think was possible.
If you’re consuming GPUs 24x7 and can afford to support them on-prem, you’re definitely not here on HN killing time. For everyone else, tune your scaling controls on your cloud provider of choice to use what you need, when you need it, and accept the reality that hyperscalers are better suited for GPU workloads - for now.
> Going on-prem like this is highly risky.
Every transaction is risky, but the risk calculus for “static” (ADDS) or “stable” (ERP, HRIS, dev/test) work makes on-prem uniquely appealing when done right. Segment out your resources (resist the urge for HPC or HCI), build sensible redundancies (on-prem or in the cloud), and lean on workhorse products over newer, fancier platforms (bulletproof hypervisors instead of fragile K8s clusters), and you can make the move successful and sensible. The more cowboy you go with GPUs, K8s, or local Terraform, the more delicate your infra becomes on-prem - and thus the riskier it is to keep there.
Keep it simple, silly.
The company needed the exact same people to manage AWS anyway. And the cost difference was so high that it could have paid for 5 more people, who weren't even needed.
Not only the cost: not needing to worry about going over the bandwidth limit, and having so much extra compute power, made a very big difference.
Imo the cloud stuff is just too full of itself if you are trying to solve a problem that requires compute like hosting databases or similar. Just renting a machine from a provider like Hetzner and starting from there is the best option by far.
That is incorrect. On AWS you need a couple of DevOps people who will string together the already existing services.
With on-prem, you need someone who will install racks, change disks, set up high-availability block storage or object storage, etc. Those are not DevOps people.
we have 7 racks and 3 people. The things you mentioned aren't even 5% of the workload.
There are things you figure out once, bake into automation, and just use.
You install a server once and remove it after 5-10 years, depending on how you want to depreciate it. Drives die rarely enough that it's like a once-every-2-months event at our size.
The biggest expense is setting up automation (if I was re-doing our core infrastructure from scratch I'd probably need a good 2 months of grind), but after that it's smooth sailing. The biggest disadvantage is "we need a bunch of compute, now", but depending on the business that might never be a problem, and you have enough savings to overbuild a little and still be ahead. Or just get the temporary compute from the cloud.
Real Devops people are competent from physical layer to software layer.
Signed,
Aerospace Devop
Before I drop 5 figures on a single server, I'd like to have some confidence in the performance numbers I'm likely to see. I'd expect folk who are experienced with on-prem have a good intuition about this - after a decade of cloud-only work, I don't.
Also, cloud networking offers a bunch of really nice primitives which I'm not clear how I'd replicate on-prem.
I've estimated our IT workload would roughly double if we were to add physically racking machines, replacing failed disks, monitoring backups/SMART errors etc. That's... not cheap in staff time.
Moving things on-prem starts making financial sense around the point your cloud bills hit the cost of one engineer's salary.
Like what?
On top of that no one really knows what the fuck they are doing in AWS anyway.
That's partially true; managing cloud also takes skill, and most people forget that, with the end result being "well, we saved on hiring sysadmins, but had to have more devops guys". Hell, I manage mostly physical infrastructure (a few racks, a few hundred VMs) and a good 80% of my work is completely unrelated to that; it's just the devops gluing stuff together and helping developers set their stuff up, which isn't all that different from what it would be in the cloud.
> And remember, you want 24/7 monitoring, replication for disaster recovery, etc.
And remember, you need that for cloud too. Plenty of cloud disaster stories out there where they copy-pasted some tutorial thinking that was enough, and then, surprise.
There is also the middle road of just getting some dedicated servers from, say, OVH and running your infra on those; you cut a bit of the hardware management out of the required skillset, and you don't have the CAPEX to deal with.
But yes, if it is less than at least a rack, it's probably not worth looking at on-prem unless you have a really specific use case that is much cheaper there (I mean less than half the usual cost).
Is it still a problem in 2026, when unemployment in IT is rising? The reasons can be argued (the end of ZIRP, or AI), but hiring should be easier than at any time during the last 10 years.
Not quite. If you hire bad talent to manage your 'cloud gear', you'll find out what the mistakes that would cost you nothing on-prem cost you in the cloud. Sometimes - a lot.
I’ve had success with this approach by keeping it to only the business process management stacks (CRMs, AD, and so on—examples just like the ones you listed). But as soon as there’s any need for bridging cloud/onprem for any data rate beyond “cronned sync” or “metadata only”, it starts to hurt a lot sooner than you’d expect, I’ve found.
Folks wanting one or the other miss savings had by effectively leveraging both.
(For various reasons, I just care about VPS/bare metal, and S3-compatiblity.)
I'm looking at those because I'm having difficulty forecasting bandwidth usage, and the pessimistic scenarios seem to have me inside the acceptable use policies of the small providers while still predicting AWS would cost 5-10x more for the same workload.
It's the hard drive and SSD space prices on the cloud that stagger me. Where renting a server CPU might only be about 2x the price of buying one for a few years of use in a small system (albeit usually with less clock speed on the cloud), the drive space is at least 10-100x the price of doing it locally. It's got a bit more potential redundancy, but for that overhead you could replicate the data many times over.
As time has gone on, the cloud deal has gotten worse as the hardware got more cores.
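The 10-100x claim is roughly consistent with list prices. The cloud figure below is in the ballpark of general-purpose SSD block storage (~$0.08/GB-month); the local figure assumes a ~$300 20 TB HDD amortized over 5 years. Both are illustrative assumptions, not quotes:

```python
# Cloud block storage vs a local drive, per TB per month.
cloud_per_tb_month = 0.08 * 1000  # ~$80 / TB / month, assumed rate

drive_cost, capacity_tb, years = 300, 20, 5  # assumed local HDD
local_per_tb_month = drive_cost / capacity_tb / (years * 12)  # ~$0.25

ratio = cloud_per_tb_month / local_per_tb_month
# Raw ratio is in the hundreds; even tripling the local cost for
# redundancy, a chassis, and power leaves a gap well above 10x.
```

Cloud storage does bundle replication and durability, but as the comment says, at that markup you could replicate the data many times over yourself.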
At my job we use HyperV, and finding someone who actually knows HyperV is difficult and expensive. Throw in Cisco networking, storage appliances, etc to make it 99.99% uptime...
Also that means you have just one person, you need at least two if you don't want gaps in staffing, more likely three.
Then you still need all the cloud folks to run that.
We have a hybrid setup like this, and you do get a bit of best of both worlds, but ultimately managing onprem or colo infra is a huge pain in the ass. We only do it due to our business environment.
All of the complexity of onprem, especially when you need to worry about failover/etc can get tricky, especially if you are in a wintel env like a lot of shops are.
i.e. lots of companies are doing sloppy 'just move the box to an EC2 instance' migrations because of how VMWare jacked their pricing up, and now suddenly EC2/EBS/etc costing is so cheap it's a no brain choice.
I think the knowledge needed to set up a minimal-cost solution is too hard to find for there to be a benefit versus all the layers (as you almost touched on: all the licensing at every layer, versus a cloud provider managing it...)
That said, rug pulls are still a risk; I try to push for 'agnostic' workloads in architecture, if nothing else because I've seen too many cases where SaaS/PaaS/etc decide to jack up the price of a service that was cheap, and sure you could have done your own thing agnostically, but now you're there, and migrating away has a new cost.
IOW, I agree; I don't think the human capital is there as far as infra folks who know how to properly set up such environments, especially hitting the 'secure+productive' side of the triangle.
> At my job we use HyperV, and finding someone who actually knows HyperV is difficult and expensive...
Try offering significantly higher pay.
If that's you, then the Granite Rapids AP platform that launched prior to this can hit similar numbers of threads (256 for the 6980P). There are a couple of caveats, though - firstly, there are "only" 128 physical cores, and if you're using VMs you probably don't want to share a physical core across VMs; secondly, it has a 500W TDP and retails north of $17000, if you can even find one for sale.
Overall once you're really comparing like to like, especially when you start trying to have 100+GbE networking and so on, it gets a lot harder to beat cloud providers - yes they have a nice fat markup but they're also paying a lot less for the hardware than you will be.
Most of the time when I see takes like this it's because the org has all these fast, modern CPUs for applications that get barely any real load, and the machines are mostly sitting idle on networks that can never handle 1/100th of the traffic the machine is capable of delivering. Solving that is largely a non-technical problem not a "cloud is bad" problem.
I'm pretty sure a box like this could run our whole startup - hosting PG, k8s, our backend APIs, etc. It would be way easier to set up, and wouldn't cost 2 devops and $40,000 a month to do it.
* sign the papers for server colo
* get a quote and order servers (which might take a few weeks to deliver!), nearly always a pair of switches
* set them up, install OSes, set up basic services inside the network (DNS, often netboot/DHCP if you want to have install over network, and often a few others like an image repository, monitoring, etc.)
It's a "we have product and cashflow, let's give someone the task to do it" thing, not a "we're a startup, barely have a PoC" thing.
On-prem wins for a stable organization every time though.
It's unfortunately not so cut and dry
>Anyway, I wasn't able to test using proxmox or vmware on there to split up cpu/memory resources; we decided instead to just buy a bunch of smaller-core-count AMD Ryzen 1Us instead, which scaled way better with my naive approach
If that was a single 384-thread (192 cores times 2 for hyperthreading) CPU, you are getting "only" 12 DDR5 channels, so one RAM channel is shared by 16c/32t.
So a plain 16-core desktop Ryzen will have double the memory bandwidth per core.
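The channels-per-core arithmetic, spelled out (the core and channel counts are from the comment; actual bandwidth also depends on DDR5 speed, which is not factored in here):

```python
# Memory channels per core: big server part vs desktop part.
server_channels, server_cores = 12, 192   # single-socket, 192c/384t
desktop_channels, desktop_cores = 2, 16   # 16-core desktop Ryzen

server_per_core = server_channels / server_cores    # 0.0625 ch/core
desktop_per_core = desktop_channels / desktop_cores  # 0.125 ch/core

ratio = desktop_per_core / server_per_core  # desktop: 2x per core
```

Which is why memory-bandwidth-bound workloads like inference see far less benefit from the core count than the spec sheet suggests.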
And 384 actual cores or 384 hyperthreading cores?
Inference is so memory bandwidth heavy that my expectations are low. An EPYC getting 12 memory channels instead of 2 only goes so far when it has 24x as many cores.
The core density is bullshit when each core is so slow that it can't do any meaningful work. The reality is that Intel is 3 times behind AMD/TSMC on performance vs power consumption ratio.
People would be better off having a look at the high-frequency models (9xx5F models like the 9575F); that was the first generation of server CPU to reach ~5 GHz and sustain it on 32+ cores.
AMD has had these sorts of densities available for a minute.
> Identify the workloads that haven't scaled in a year.
I have done this math recently, and you need to stop cherry picking and move everything. And build a redundant data center to boot.
Compute is NOT the major issue for this sort of move:
Switching and bandwidth will be major costs. 400Gb is a minimum for interconnects, and for most orgs you are going to need at least that much bandwidth at the top of rack.
Storage remains problematic. You might be able to amortize compute over this time scale, but not storage. 5 years would be pushing it (depending on use). And data center storage at scale was expensive before the recent price spike. Spinning rust is viable for some tasks (backup) but will not cut it for others.
Human capital: Figuring out how to support the hardware you own is going to be far more expensive than you think. You need to expect failures and staff accordingly, that means resources who are going to be, for the most part, idle.
I agree, but.
For one, it's not just the machines themselves. You also need to budget in power, cooling, space, the cost of providing redundant connectivity and side gear (e.g. routers, firewalls, UPS).
Then, you need a second site, no matter what. At least for backups, ideally as a full failover. Either your second site is some sort of cloud, which can be a PITA to set up without introducing security risks, or a second physical site, which means double the expenses.
If you're a publicly listed company, or live in jurisdictions like Europe, or you want to have cybersecurity insurance, you have data retention, GDPR, SOX and a whole bunch of other compliance to worry about as well. Sure, you can do that on-prem, but you'll have a much harder time explaining to auditors how your system works when it's a bunch of on-prem stuff vs. "here's our AWS Backup plans covering all servers and other data sources, here is the immutability stuff, here are plans how we prevent backup expiry aka legal hold".
Then, all of that needs to be maintained, which means additional staff on payroll, if you own the stuff outright your finance team will whine about depreciation and capex, and you need to have vendors on support contracts just to get firmware updates and timely exchanges for hardware under warranty.
Long story short, as much as I prefer on-prem hardware vs the cloud, particularly given current political tensions - unless you are a 200+ employee shop, the overhead associated with on-prem infrastructure isn't worth it.
You can technically use Backblaze's unlimited backup option, which costs around $7 for a given machine. It's more intended for Windows, but there have been people who make it work, with daily backups, and it should be GDPR-compatible (https://www.backblaze.com/company/policy/gdpr) - paired with something like Hetzner, perhaps, if you are very worried about GDPR. OVH storage boxes (36 TB for ~$55, iirc) make a good backup box, and you should try to follow a 3-2-1 strategy.
> Then, all of that needs to be maintained, which means additional staff on payroll, if you own the stuff outright your finance team will whine about depreciation and capex, and you need to have vendors on support contracts just to get firmware updates and timely exchanges for hardware under warranty.
I can't speak for certain, but iirc for companies like Dell it's possible to get products on a monthly basis too, and you can simply colocate in a decent datacenter. Plus points in that you can now get 10-50 Gb ports as well if you are bandwidth hungry, things are a lot more customizable, and the hardware is already pretty nice, as GP observed. (Yes, RAM prices are high; let's hope that is temporary, as GP noted too.)
I can't speak to firmware updates or timely exchanges for hardware under warranty.
That being said, I am not saying this is for everyone. It essentially boils down to whether they have, or can get, expertise in this field for less than their AWS bills. With many large AWS bills being in the tens of thousands of dollars, if not hundreds of thousands, I think far more companies might be better off with the above strategy than with AWS.
Sure, but it doesn't solve the issue of "the datacenter is on fire" - neither if you're fully on prem or if you use colocation. You still need to acquire a new set of hardware, rack it, reconfigure the networking hardware and then restore from backups. That's an awful lot of work, and yes, I've been there.
Not a shortage - price gouging. And it would mean an increase in the 'cloud' prices too, because they need to refresh their HW as well. So by the summer the equation would be back to where it was.