1) Senior engineer starts on AWS
2) Senior engineer leaves because our industry does not value longevity or loyalty at all whatsoever (not saying it should, just observing that it doesn't)
3) New engineer comes in and panics
4) Ends up using a "managed service" to relieve the panic
5) New engineer leaves
6) Second new engineer comes in and not only panics but outright needs help
7) Paired with some "certified AWS partner" who claims to help "reduce cost" but who actually gets a kickback from the extra spend they induce (usually 10% if I'm not mistaken)
Calling it ransomware is obviously hyperbolic, but there are definitely some parallels one could draw
On top of it all, AWS pricing is about to go up massively due to the RAM price increase. There's no way it won't, since AWS is over half of Amazon's profit while only around 15% of its revenue.
In theory with perfect documentation they’d have a good head start to learn it, but there is always a lot of unwritten knowledge involved in managing an inherited setup.
With AWS the knowledge is at least transferable and you can find people who have worked with that exact thing before.
Engineers also leave for a lot of reasons. Even highly paid engineers go off and retire, switch jobs for more novelty, or decide to try starting their own business.
Unfortunately, there are a lot of things in AWS that can also be messed up, so it might be really hard to research what is going on. For example, you could have hundreds of Lambdas running with no idea where the original sources are or how they connect to each other, or complex VPC network routing where rules and security groups are shared haphazardly between services, so a small change can degrade a completely different service (you were hired to help with service X, but after your change some service Y went down and you weren't even aware it existed).
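For what it's worth, a read-only first pass with the AWS CLI at least gets you an inventory to start from (the function name and security group ID below are placeholders):

    # List every Lambda in the account/region
    aws lambda list-functions --query 'Functions[].FunctionName' --output text

    # For a given function: what feeds it, and who is allowed to invoke it?
    aws lambda list-event-source-mappings --function-name some-function
    aws lambda get-policy --function-name some-function

    # Which security groups exist, and which network interfaces actually use a given one?
    aws ec2 describe-security-groups --query 'SecurityGroups[].{Id:GroupId,Name:GroupName}' --output table
    aws ec2 describe-network-interfaces --filters Name=group-id,Values=sg-0123456789abcdef0 \
      --query 'NetworkInterfaces[].Description'

It won't tell you why things are wired the way they are, but at least you know what exists before you touch anything.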
"Today, we are going to calculate the power requirements for this rack, rack the equipment, wire power and network up, and learn how to use PXE and iLO to get from zero to operational."
Part of what clouds are selling is experience. A "cloud admin" bootcamp graduate can be a useful "cloud engineer", but it takes some serious years of experience to become a talented on-prem SRE. So it becomes an ouroboros: moving to the cloud makes it easier to keep moving to the cloud.
That is not true. It takes a lot more than a bootcamp to be useful in this space, unless your definition is to copy-paste some CDK without knowing what it does.
If by useful you mean "useful at generating revenue for AWS or GCP" then sure, I agree.
These certificates and bootcamps are roughly equivalent to the Cisco CCNA certificate and training courses back in the '90s. That certificate existed to sell more Cisco gear - and Cisco outright admitted this at the time.
But will the market demand it? AWS just continues to grow.
The number of things that these 24x7 people from AWS will cover for you is small. If your application craps out for any reason that doesn't have anything to do with AWS, that is on you. If your app needs to run 24x7 and it is critical, then you need your own 24x7 person anyway.
Meanwhile AWS breaks once or twice a year.
I've only had one outage I could attribute to running on-prem; meanwhile it's a bit of a joke with the non-IT staff in the office that when "The Internet" (i.e. Cloudflare, Amazon) goes down, news reports and all, our own services are all running fine.
I am sure it happens a multitude of ways but I have never seen the case you are describing.
What do you think RedHat support contracts are? This situation exists in every technology stack in existence.
> 4) Ends up using a "managed service" to relieve the panic
It's not as though this is unique to cloud.
I've seen multiple managers come in and introduce some SaaS because it fills a gap in their own understanding and abilities. Then when they leave, everyone stops using it and the account is cancelled.
The difference with cloud is that it tends to be more central to the operation, so it can't just be cancelled when its advocate leaves.
I'll give you an alternative scenario, which IME is more realistic.
I'm a software developer, and I've worked at several companies, big and small and in-between, with poor to abysmal IT/operations. I've introduced and/or advocated cloud at all of them.
The idea that it's "more expensive" is nonsense in these situations. Calculate the cost of the IT/operations incompetence, and the cost of the slowness of getting anything done, and cloud is cheap.
Extremely cheap.
Not only that, it can increase shipping velocity, and enable all kinds of important capabilities that the business otherwise just wouldn't have, or would struggle to implement.
Much of the "cloud so expensive" crowd are just engineers too narrowly focused on a small part of the picture, or in denial about their ability to compete with the competence of cloud providers.
This has been my experience as well. There are legitimate points of criticism, but every time I’ve seen someone try to make that argument it’s been comparing significantly different levels of service (e.g. a storage comparison equating S3 with tape) or leaving out entire categories of cost, like the time someone claimed their bare-metal costs for a two-server database cluster were comparable to RDS despite not even including things like power or backups.
As far as I know, nothing comes close to Aurora functionality. Even in vibecoding world. No, 'apt-get install postgres' is not enough.
What you’re asking for can mostly be pieced together, but no, it doesn’t exist as-is.
Failover: this has been a thing for a long time. Set up a synchronous standby, then add a monitoring job that checks heartbeats and promotes the standby when needed. Optionally use something like heartbeat to have a floating IP that gets swapped on failover, or handle routing with pgbouncer / pgcat etc. instead. Alternatively, use pg_auto_failover, which does all of this for you.
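As a rough sketch of the pg_auto_failover route (hostnames, paths, and the SSL choice here are made-up placeholders, not a production recipe):

    # On the monitor host (then keep "pg_autoctl run" going under systemd or similar):
    pg_autoctl create monitor --pgdata /var/lib/postgres/monitor \
      --hostname monitor.example.internal --auth trust --ssl-self-signed

    # On each data node (the first becomes the primary, the second joins as a standby):
    pg_autoctl create postgres --pgdata /var/lib/postgres/data \
      --hostname db1.example.internal --auth trust --ssl-self-signed \
      --monitor 'postgres://autoctl_node@monitor.example.internal/pg_auto_failover'

    # Applications can use a multi-host libpq URI so they always land on the writable node:
    # postgres://db1.example.internal,db2.example.internal/app?target_session_attrs=read-write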
Clustering: you mean read replicas?
Volume-based snaps: assuming you mean CoW snapshots, that’s a filesystem implementation detail. Use ZFS (or btrfs, but I wouldn’t, personally). Or Ceph if you need a distributed storage solution, but I would definitely not try to run Ceph in prod unless you really, really know what you’re doing. Lightbits is another solution, but it isn’t free (as in beer).
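To make the ZFS suggestion concrete, snapshots there are instant and basically free; the dataset names below are hypothetical:

    zfs snapshot tank/pgdata@before-upgrade      # instant, copy-on-write
    zfs list -t snapshot                         # see what you have
    zfs send tank/pgdata@before-upgrade | ssh backup-host zfs receive backup/pgdata
    zfs rollback tank/pgdata@before-upgrade      # stop Postgres first if rolling back a live data dir

A snapshot of a running instance is only crash-consistent, so keep the data directory and WAL on the same dataset so recovery can replay cleanly.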
Cross-region replication: this is just replication? It doesn’t matter where the other node[s] are, as long as they’re reachable, and you’ve accepted the tradeoffs of latency (synchronous standbys) or potential data loss (async standbys).
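In practice the remote-region node is just another standby, e.g. seeded with pg_basebackup (hosts and role names below are placeholders):

    # Run on the standby in the other region:
    pg_basebackup -h primary.us-east.example.internal -U replicator \
      -D /var/lib/postgres/data -R -X stream
    # -R writes primary_conninfo and standby.signal, so the node starts as an async replica;
    # cross-region latency then shows up only as replication lag, not commit latency.
    # Making it synchronous instead means every commit pays the round trip; on the primary:
    #   ALTER SYSTEM SET synchronous_standby_names = 'region_b';   # matches the standby's application_name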
Metrics: Percona Monitoring & Management if you want a dedicated DB-first, all-in-one monitoring solution, otherwise set up your own scrapers and dashboards in whatever you’d like.
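The "own scrapers" route can be as small as the community postgres_exporter plus whatever Prometheus/Grafana you already run (credentials below are placeholders):

    docker run -d --name pg-exporter -p 9187:9187 \
      -e DATA_SOURCE_NAME="postgresql://monitor:secret@db1.example.internal:5432/postgres?sslmode=disable" \
      quay.io/prometheuscommunity/postgres-exporter
    # then add a Prometheus scrape job for db1.example.internal:9187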
What you will not get from this is Aurora’s shared cluster volume. I personally think that’s a good thing, because I think separating compute from storage is a terrible tradeoff for performance, but YMMV. What that means is you need to manage disk utilization and capacity, as well as properly designing your failure domain. For example, if you have a synchronous standby, you may decide that you don’t care if a disk dies, so no messing with any kind of RAID (though you’d then miss out on ZFS’ auto-repair from bad checksums). As long as this aligns with your failure domain model, it’s fine - you might have separate physical disks, but co-locate the Postgres instances in a single physical server (…don’t), or you might require separate servers, or separate racks, or separate data centers, etc.
tl;dr you can fairly closely replicate the experience of Aurora, but you’ll need to know what you’re doing. And frankly, if you don’t, even if someone built an OSS product that does all of this, you shouldn’t be running it in prod - how will you fix issues when they crop up?
Nobody doubts one could build something similar to Aurora given enough budget, time, and skills.
But that's not replicating the experience of Aurora. The experience of Aurora is I can have all of that, in like 30 lines of terraform and a few minutes. And then I don't need to worry about managing the zpools, I don't need to ensure the heartbeats are working fine, I don't need to worry about hardware failures (to a large extent), I don't need to drive to multiple different physical locations to set up the hardware, I don't need to worry about handling patching, etc.
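For a sense of the scale being described, the provisioning really is on the order of a screenful, here sketched with the AWS CLI rather than terraform (identifiers, passwords, and subnet group names are made up):

    aws rds create-db-cluster \
      --db-cluster-identifier app-prod \
      --engine aurora-postgresql \
      --master-username appadmin \
      --master-user-password 'example-only' \
      --db-subnet-group-name app-prod-subnets \
      --vpc-security-group-ids sg-0123456789abcdef0

    aws rds create-db-instance \
      --db-instance-identifier app-prod-writer \
      --db-cluster-identifier app-prod \
      --engine aurora-postgresql \
      --db-instance-class db.r6g.large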
You might replicate the features, but you're not replicating the experience.
Managed services have a clear value proposition. I personally think they're grossly overpriced, but I understand the appeal. Asking for that experience but also free / cheap doesn't make any sense.
If ECS is faster, then you're more satisfied with AWS and less likely to migrate. You're also open to additional services that might bring up the spend (e.g. ECS Container Insights or X-Ray)
Source: Former Amazon employee
We used EFS to solve that issue, but it was very awkward, expensive and slow; it's certainly not meant for that.
My biggest gripe with this is async tasks where the app does numerous hijinks to avoid a 10 minute Lambda processing timeout. Rather than structuring the work into many small, independent batches, or simply using a modest container to do the job in a single shot, a myriad of intermediate steps are introduced to write data to dynamo/s3/kinesis + sqs and coordinate between them.
A dynamically provisioned, serverless container with 24 cores and 64 GB of memory can happily process GBs of data transformations.
Microservices are a killer with cost. For each microservice pod you're often running a bunch of sidecars (Datadog, auth, ingress), and you pay a massive workload-separation overhead in orchestration, management, monitoring, and of course complexity.
I am just flabbergasted that this is how we operate as a norm in our industry.
If you can keep 4 "Java boxes" fed with work 80%+ of the time, then sure EC2 is a good fit.
We do a lot of batch processing and save money over having EC2 boxes always on. Sure we could probably pinch some more pennies if we managed the EC2 box uptime and figured out mechanisms for load balancing the batches... But that's engineering time we just don't really care to spend when ECS nets us most of the savings advantage and is simple to reason about and use.
You don’t need colocation to save 4x though. Bandwidth pricing is 10x. EC2 is 2-4x, especially outside the US. EBS, for the IOPS you get, is just bad.