The idea that AWS's services are fully regionalized or isolated has always been a myth.

All the identity and access services for the public cloud outside of China (aka "IAM for the aws partition" to employees) are centralized in us-east-1. This centralization is essentially necessary in order to have a cohesive view of an account, its billing, and its permissions.

And IAM is not a wholly independent software stack: they rely on DynamoDB and a few other services, which in turn have a circular dependency on IAM.

During us-east-1 outages it's sometimes possible to continue using existing auth tokens or sessions in other regions, while not possible to grant new ones. When I worked there, I remember at least one case where my team's on-calls were advised not to close ssh sessions or AWS console browser tabs, for fear that we'd be locked out until the outage was over.

reply
Anyone who thinks one cloud provider will provide them full resilience is fooling themselves. You need multicloud for true high availability.

But then you want to use the same stack across providers, and all the proprietary technologies (even those hidden from you by things like Terraform) suddenly lose their luster.

reply
IAM isn’t even really the most painful dependency. Route53 is. The control plane only runs out of us-east-1.

Better make sure the only DNS operations you run during an outage are data plane queries and health check failovers.

reply
> outside of China

[Nitpick] There are a few more AWS partitions like GovCloud:

https://jasonbutz.info/2023/07/aws-partitions/

reply
Services outside of us-east-1 don’t call us-east-1 for the IAM data plane though, right?
reply
They’re talking about the backbone and what goes on behind the scenes. There have been issues with services in other regions when us-east-1 has issues.

Folks built in other regions believing they were fully isolated only to discover later during an outage that they were not.

reply
Isn't this kind of circular dependency what led to extended downtime a while back?
reply
It reminds me of Facebook. Staff were locked out of the office due to the outage they were supposed to fix.
reply
It's basically what leads to extended downtime almost every time. There are just some things in the stack that are still single points of failure, and when they fail it's a mess.
reply
Yes, I concur.

Sometimes the circular dependencies get almost cartoonishly silly.

Like, "One of the two guys who has the physical keys to the server cage in us-east-1 is on vacation. The other one can't get into his apartment because his smart lock runs into the AWS cloud. So he hires a locksmith, but the locksmith takes an extra two hours to do the job because his reference documents for this model of lock live on an S3 bucket."

I made that example up, but only barely.

reply
We had a pair of machines, and some bright spark set them up to mount each other's NFS shares. After a power outage: "Holy mother of chicken-and-egg NFS hangs, Batman!"

That was a weird job, but fun. It was a local machine room for a warehouse that originally held the IBM mainframe. It still held its successor, the Multiprise 3000, which has the claim to fame of being the smallest mainframe IBM ever sold. But by then the room was also full of decades of artisanally crafted Unix servers running Pick databases, and the Pick dev team had done most of the system architecture. The best way to understand it is that, for them, Pick is the operating system; Unix is a necessary annoyance they put up with only because nobody has made Pick hardware for 20 years. And it was NFS mounts everywhere: somebody had figured out a trick where they could NFS-mount a remote machine and have the local Pick system reach in and scrounge through the remote system's data. But strictly read-only: Pick got grumpy when writing to NFS, to say nothing of how the other database would feel about having its data messed with. Thus the circular mount.

Still, that was not the worst thing I saw. I liked the one system with an SMB mount. "Why is this one SMB?" "Well, Pick complains when you try to write to an NFS mount, but its NFS-detection code doesn't trip on SMB mounts." ... Sigh. "Um, I'm no Pick expert, but you know why it doesn't like remote mounts, right? SMB doesn't change that. Do you happen to get a lot of corrupt indexes on this machine?" "Yes, how did you know?"

reply
Oh, yeah, re-exporting NFS mounts via SMB was very much a thing in the early 2000s; something to do with their different approaches to flock() vs fcntl() handling. If you ran into locking issues with NFS, then re-exporting via SMB was standard advice.

At some point, the behaviour changed and locks started conflicting. IIRC, we hit it when upgrading to Debian Etch, and took the time to unwind the system and make pure NFS work properly for us. Plenty of people took the opposite approach and fiddled with the config to make locking a no-op on SMB. I know of at least one web hosting company that ended up having to restore a year's worth of customer uploads from backups as a result...

reply
A real example, from Facebook's 2021 outage [1]:

> Our primary and out-of-band network access was down, so we sent engineers onsite to the data centers to have them debug the issue and restart the systems. But this took time, because these facilities are designed with high levels of physical and system security in mind. They’re hard to get into, and once you’re inside, the hardware and routers are designed to be difficult to modify even when you have physical access to them. So it took extra time to activate the secure access protocols needed to get people onsite and able to work on the servers. Only then could we confirm the issue and bring our backbone back online.

There was one (later denied) report that a 'guy with an angle grinder' was involved in gaining access to the server cage.

[1] https://news.ycombinator.com/item?id=28762611

reply
Why would such a critical server even be accessible with only one set of keys?

I’ve always thought mission-critical stuff needs two independent key holders, with keyholes placed far enough apart to make it impossible for one person to reach both.

reply
They're not actually accessible with 'only one set of keys' in my experience.

You actually have to present your photo ID at the site entry gatehouse, then again to the building entry guard (who will also check you have a work permit and a site-specific safety induction); then you swipe a badge at a turnstile to get from reception into the stairwell, swipe your badge at a door to get onto the relevant floor, swipe your badge and key in a code to enter the room with the cages, and only then do you use the key.

reply
Other than for certain nuclear missile launches[1], that only happens in the movies.

[1] https://www.nationalmuseum.af.mil/Visit/Museum-Exhibits/Fact...

reply
I don't know how it is in the datacentre industry, but certainly in other industries that is how it's done for anything truly mission-critical and also easily tampered with.

I guess it shows very few care enough to pay enough to make that a reasonable upgrade.

reply
when you have a circular dependency, one strategy employed is to have it be circular but interruptible for 18 or so hours. Call it an "oh shit" bar.

I'm glad I never had to get that deep into the failure chain.
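
One way to sketch that "circular but interruptible for 18 hours" idea (a toy model, nothing AWS-specific; all names here are made up): keep serving the last good answer from a dependency for up to 18 hours after it stops responding, so the cycle can be broken long enough to fix things.

```python
import time

STALE_BUDGET = 18 * 3600  # seconds the cycle may stay broken: the "oh shit" bar

class CachedDependency:
    """Toy sketch: call through to a dependency, but keep serving the last
    good answer for up to STALE_BUDGET seconds if the dependency is down."""

    def __init__(self, fetch, now=time.time):
        self.fetch = fetch    # callable that raises while the dependency is down
        self.now = now        # injectable clock, to make the sketch testable
        self.value = None
        self.fetched_at = None

    def get(self):
        try:
            self.value = self.fetch()
            self.fetched_at = self.now()
        except Exception:
            # Dependency is down: serve stale data while within budget.
            if self.fetched_at is None or self.now() - self.fetched_at > STALE_BUDGET:
                raise  # budget exhausted; the circular dependency truly bites
        return self.value
```

The budget only buys time; if nobody restores the dependency within the window, you're back to the hard failure.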

reply
> And IAM is not a wholly independent software stack: they rely on DynamoDB and a few other services, which in turn have a circular dependency on IAM.

When you dogfood your own Rube Goldberg machine.

reply
We should let the IAM service team know of this glaring gap the HN thread figured out /s

I’m 99% ;) certain that the dependencies of foundational services are a well-discussed topic

reply
> The idea that AWS's services are fully regionalized or isolated has always been a myth.

This is highly misleading. It's true that there's a handful of global AWS services - but only their control planes operate from a single region (e.g. us-east-1). Their data planes are regionally isolated.[1]

The only time you'd normally use a service control plane is to deploy changes, e.g. when you create, read, update or delete service resources during a change window.

Workloads should be designed for "static stability", as recommended by AWS.[2] A statically stable workload only depends upon the data planes of the services it uses at runtime. Statically stable workloads are designed to continue operating as normal even if there's a service event impairing one or more control planes (including for global services).

> During us-east-1 outages it's sometimes possible to continue using existing auth tokens or sessions in other regions, while not possible to grant new ones.

This is just plain wrong! The IAM Security Token Service (STS), which grants IAM tokens, is part of the IAM data plane and runs independently in each region.[3]

If the IAM control plane is impaired, you might not be able to create new IAM roles (a control plane operation) - but you can continue generating and using temporary credentials for existing IAM roles (data plane operations) within the region your workload is running in. This allows statically stable workloads to continue using IAM without interruption.

[1] https://docs.aws.amazon.com/whitepapers/latest/aws-fault-iso...

"Global AWS services still follow the conventional AWS design pattern of separating the control plane and data plane in order to achieve static stability. The significant difference for most global services is that their control plane is hosted in a single AWS Region, while their data plane is globally distributed."

[2] https://docs.aws.amazon.com/whitepapers/latest/aws-fault-iso...

"...eliminating dependencies on control planes (the APIs that implement changes to resources) in your recovery path helps produce more resilient workloads."

[3] https://docs.aws.amazon.com/whitepapers/latest/aws-fault-iso...

"STS is a data plane-only service that is separate from IAM, and does not depend on the IAM control plane."
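
The split the whitepaper describes can be pictured with a toy model (illustrative only; none of this is AWS code): role creation goes through a single-homed control plane, while token issuance works off state already replicated to each region's data plane, so existing roles keep working while the control plane is impaired.

```python
class ToyIAM:
    """Toy model of the control plane / data plane split (not AWS code)."""

    def __init__(self):
        self.control_up = True
        self.roles = set()  # pretend this is replicated to every region

    # Control plane: single-homed (think us-east-1).
    def create_role(self, name):
        if not self.control_up:
            raise RuntimeError("control plane impaired")
        self.roles.add(name)

    # Data plane: per-region, works on replicated state. A statically
    # stable workload only needs this path at runtime.
    def issue_token(self, role):
        if role not in self.roles:
            raise KeyError(role)
        return f"token-for-{role}"
```

Create roles during change windows; at runtime only the token path (the data plane) is on the critical path, which is the essence of static stability.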

reply
People say this, but this was just a single AZ. In the last 3 years of running my startup mostly out of use-1, we've only had one regional outage, and even that was partial, with most instances unaffected.

And honestly, everybody else's stuff is in use-1, so at least your failures are correlated with your customers lol.

reply
>And honestly, everybody else's stuff is in use-1

Yeah, but why put your eggs in that basket? I moved all our services from east to west/oregon a decade ago and haven't looked back.

reply
Not OP, but I do single-region us-east-1 for a few reasons:

1. The severity and frequency of us-east-1 outages are vastly overstated. It's fine. These us-east-1 outages almost never affect us. This one didn't; not even our instances in the affected AZ. Only that recent IAM outage affected us a little bit, and it affected every other region, too, since IAM's control plane is centrally hosted in us-east-1. Everybody's uptime depends on us-east-1.

2. We're physically close to us-east-1 and have Direct Connect. We're 1 millisecond away from us-east-1. It would be silly to connect to us-east-1 and then take a latency hit and pay cross-region data transfer cost on all traffic to hop over to another region. That would only make sense if we were in both regions, and that is not worth the cost given #1. If we only have a single region, it has to be us-east-1.

3. us-east-1 gets new features first. New AWS features are relevant to us with shocking regularity, and we get them as soon as they're announced.

4. OP is right about the safety in numbers. Our service isn't life-or-death; nobody will die if we're down, so it's just a matter of whether they're upset. When there is a us-east-1 outage, it's headline news and I can link the news report to anyone who asks. That genuinely absolves us every time. When we're down, everybody else is down, too.

reply
90% of customers are located in use-1. Latency to use-1 is more important than being up when everyone else is down.
reply
But it’s okay to be down when the whole internet is down.
reply
none of my stuff is in us-east-1. I chose that specifically 15 years ago. Been a great decision.
reply
Too many people are using it.

In fantasy magic dream land loads are distributed evenly across different cloud providers.

A single point of failure doesn't exist.

It worked out with my first girlfriend. The twins are fluent in English and Korean. They know when deploying a large scale service to not only depends on AWS.

Healthcare in the US is affordable.

All types of magical stuff exist here.

But no. It's another day, and AWS us-east-1 can take down most of the internet.

reply
Core AWS services use it too. Even if you are hosted in another region, you can still be affected by a US-East 1 outage
reply
The idea would be to actually load distribute between different cloud providers.

But even then, the load balancer needs to run somewhere, which becomes a new single point of failure.

I’m sure someone smarter than me has figured this out.

reply
yes, they have. It just costs a shit ton of money and is extremely difficult to get the suits to sign off on TWO full cloud-services bills. It generally doubles your cost and workload, and increases your uptime by a couple of hours a year, assuming you don't have bugs in your deployment stack that affect one cloud or the other.

It's basically a wash for almost all organizations for twice the cost and effort.

reply
Ok...

But where does the load balancer actually run? Does the main load balancer run on AWS, and the backup on Oracle?

reply
Short TTL DNS or BGP anycast.
reply
also these things don't go down THAT often... well, AWS doesn't; some others do. More uptime than you probably had before. Even the stock market takes a few days off every decade. Just ask W.
reply
> not some others.

Looking at Azure and GitHub in particular. ;)

reply
Not really. Your clients can random robin to connection points across providers and move write heads upon connection. If you worry about hard coding you can reduce the surface to a per-context first minimum contact point.
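
A rough sketch of that client-side approach (endpoint names and error handling here are illustrative, not anyone's actual stack): the client holds a small list of first-contact endpoints, one per provider, and tries them in random order until one answers.

```python
import random

class MultiProviderClient:
    """Toy sketch of client-side failover across cloud providers."""

    def __init__(self, endpoints, send):
        self.endpoints = list(endpoints)  # e.g. one entry point per provider
        self.send = send                  # callable(endpoint, request) -> response

    def request(self, payload):
        last_err = None
        # Shuffle so load spreads across providers rather than hammering one.
        for ep in random.sample(self.endpoints, len(self.endpoints)):
            try:
                return self.send(ep, payload)
            except ConnectionError as err:
                last_err = err  # provider unreachable; try the next one
        raise last_err
```

Short-TTL DNS, as suggested upthread, pushes the same decision into the resolver instead of into every client.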
reply
I was surprised recently, when setting up CloudFront with AWS certs, that it forced me to use us-east-1 to provision the certs.
reply
STS is only on us-east-1 I believe
reply
Yep. All of the identity and access management services for the non-China public cloud are in us-east-1. https://news.ycombinator.com/item?id=48071472
reply
All the control plane. The data plane is distributed, and roles using IAM to access resources can still do so during a control-plane outage.
reply
Bingo. This is the one most people don't know about.
reply
> It worked out with my first girlfriend. The twins are fluent in English and Korean.

You were dating twins as a form of redundancy?!

reply
anecdotally (well, more "second-hand-ly, I heard that..."), it sounds like there were some knock-on effects on us-east-2 as a result of people migrating over from us-east-1. So, yeah... it's kinda hilarious how the multiple-region/AZ thing is so plainly a façade, yet we all seem to collectively believe in it as an article of faith in the Cloud Religion... or whatever...
reply
It's no magic: given the size of us-east-1, there is no spare capacity elsewhere to absorb all its workloads
reply
One of the SRE tricks is to reserve your capacity, so that when the cloud runs out of capacity you're still covered. It's expensive, but you don't want to get stuck without a server when the on-demand pool dries up.
reply
Is it really failing more, or do we just not hear about failures happening elsewhere?

Last time I heard about an Azure outage, it wasn't even on the HN frontpage

reply
It really is failing more, and it’s well known amongst industry experts. It’s the oldest, largest, and most utilized region of AWS.

I’ve heard people say that the underlying physical infrastructure is older, but I think that’s a bit of speculation, although reasonable. The current outage is attributed to a “thermal event”, which does indeed suggest underlying physical hardware.

It’s also the most complex region for AWS themselves, as it hosts the control plane for many of their global services.

reply
What kind of reputation does ca-central-1 have? I’ve been using it and it seems quietly excellent. Knock on wood.
reply
It wasn't heavily utilized when I worked at AWS, which was up until 2024.

If your customers are clustered in Toronto and Montreal, it probably makes a lot of sense to use ca-central-1. If you've got a lot of customers in Western Canada, us-west-2 is gonna have better network latency.

Other than a couple regions that had problems with their local network infrastructure (sa-east-1 was like that), there's little or nothing to differentiate the regions in terms of physical infrastructure and architecture.

reply
Most of the other regions are fairly stable. Ohio (us-east-2) is a great choice if you're just starting out. Not sure about ca-central-1, but I've never heard anything bad about it.
reply
I've always been impressed by Amazon's ability to present the shittiest experience possible and imply the blame is with things like isolation that they don't really provide.
reply