This is good advice. Ideally, never blend your AZs; each should be an independent stack. Use 3+, not 2: it keeps you honest about having an availability strategy instead of a standby failover strategy. In front of them, use DNS geo IP or even basic round robin (with a service availability check) to get to the NLB. Behind the NLB, stay in that AZ!
If you need to call out of the AZ for other data or API sources, either figure out which AZs that service is using and configure your stack to stay in them, or make sure to go all the way back out to a well-balanced (for resilience) endpoint.
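To make the "round robin with a service availability check" idea concrete, here is a minimal sketch in Python. It is not how any DNS service implements this. The endpoint names and the `is_healthy` callback are hypothetical stand-ins for whatever health check you actually run against each per-AZ stack.

```python
from itertools import cycle
from typing import Callable, Dict, Iterator, List


def healthy_round_robin(
    endpoints: Dict[str, str],
    is_healthy: Callable[[str], bool],
) -> Iterator[str]:
    """Yield per-AZ endpoints in rotation, skipping AZs that fail the check.

    `endpoints` maps an AZ name to that AZ's front door (for example its NLB).
    `is_healthy` is a hypothetical callback; a real one would probe the stack.
    """
    azs: List[str] = sorted(endpoints)
    for az in cycle(azs):
        if is_healthy(az):
            yield endpoints[az]
        elif not any(is_healthy(other) for other in azs):
            # Every stack is failing its check: surface that instead of spinning.
            raise RuntimeError("no healthy AZ stack available")


# Hypothetical endpoints, one independent stack per AZ.
endpoints = {
    "az-a": "nlb-a.example.internal",
    "az-b": "nlb-b.example.internal",
    "az-c": "nlb-c.example.internal",
}
down = {"az-b"}  # pretend az-b is failing its availability check
picker = healthy_round_robin(endpoints, lambda az: az not in down)
picks = [next(picker) for _ in range(4)]
```

With `az-b` marked down, traffic alternates between the two remaining stacks, and nothing behind either NLB needs to know that the third AZ exists.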
Agree. From 6+ years of experience, it seems that we got fooled by the multi-AZ promise of being able to survive a datacenter outage.
You can survive a datacenter (AZ) outage IF you have separate stacks per AZ and don't mix traffic. If you have a Kafka cluster spread across 3 AZs, don't be surprised if you just LOWERED your availability, because any issue in one AZ makes your stack unstable. And issues in a single AZ are quite common.
A properly configured Kafka cluster across 3 AZs _should_ be able to survive the loss of a single AZ. Obviously you should do testing and DR exercises to make sure _your_ cluster and application work in that scenario.
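One way to reason about "properly configured" is to check, per partition, whether losing any one AZ still leaves at least `min.insync.replicas` replicas, which is what a producer using `acks=all` needs in order to keep writing. The sketch below is a deliberately simplified model in Python: the function name and inputs are made up for illustration, and it ignores leader election, controller or ZooKeeper quorum, and client behavior, all of which a real DR exercise would cover.

```python
from collections import Counter
from typing import Iterable


def survives_single_az_loss(replica_azs: Iterable[str], min_insync_replicas: int) -> bool:
    """Return True if losing any single AZ leaves enough replicas for acks=all writes.

    `replica_azs` lists the AZ of each replica of one partition (hypothetical input;
    in practice you would derive it from the partition assignment and broker.rack).
    """
    per_az = Counter(replica_azs)
    if not per_az:
        return False  # no replicas at all
    total = sum(per_az.values())
    # For each AZ, ask: if it disappears, are enough replicas left elsewhere?
    return all(total - count >= min_insync_replicas for count in per_az.values())


spread = survives_single_az_loss(["az-a", "az-b", "az-c"], min_insync_replicas=2)
stacked = survives_single_az_loss(["az-a", "az-a", "az-b"], min_insync_replicas=2)
too_strict = survives_single_az_loss(["az-a", "az-b", "az-c"], min_insync_replicas=3)
```

Under this model, three replicas spread one per AZ with `min.insync.replicas=2` pass, while two replicas in the same AZ, or requiring all three replicas in sync, do not.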
That's a really interesting point. The startup I currently work for only uses a single AZ due to financial concerns (and some performance as well), but I assume we'll have to move to more AZs for reliability.
Would you advise the same for clusters of RDS and ElastiCache?
I'm wondering how you would even go about having two separate data sources. How would this be manageable?
Before assuming that your reliability would be increased by adding more AZs, verify where your reliability problems come from in the first place. I find that, more often than not, downtime comes from people applying changes, not from leaving things running as they are. Only if the AZ or the underlying machines are causing trouble should you start thinking about expanding to other AZs.
I've found that for RDS, a writer instance and a hot standby reader instance with automatic failover work pretty well. When a failover happens, you're usually looking at about 30 seconds of downtime, which is "good enough" for most purposes.
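If roughly 30 seconds of downtime is acceptable, the application still has to ride it out rather than fail on the first dropped connection. Here is a minimal Python sketch of retrying with exponential backoff. `ConnectionError` stands in for whatever exception your database driver actually raises, and the 60-second budget and 8-second delay cap are assumptions, not recommendations.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def call_with_failover_retry(
    operation: Callable[[], T],
    max_wait_seconds: float = 60.0,  # assumed budget, comfortably above ~30 s
    initial_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `operation` with exponential backoff until it succeeds or the budget is spent."""
    waited = 0.0
    delay = initial_delay
    while True:
        try:
            return operation()
        except ConnectionError:
            # Give up once the next wait would exceed the total budget.
            if waited + delay > max_wait_seconds:
                raise
            sleep(delay)
            waited += delay
            delay = min(delay * 2, 8.0)  # cap the backoff (assumed value)


# Simulated failover: the first three attempts fail, then the new writer answers.
attempts = {"count": 0}
recorded_sleeps: List[float] = []


def flaky_query() -> str:
    attempts["count"] += 1
    if attempts["count"] <= 3:
        raise ConnectionError("writer unavailable during failover")
    return "ok"


result = call_with_failover_retry(flaky_query, sleep=recorded_sleeps.append)
```

Passing `sleep` as a parameter keeps the example testable without actually waiting. In real code you would also reconnect through the cluster endpoint on each attempt so the client picks up the new writer.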
30 seconds is pretty good. I worked on an "enterprise" system running AIX and HACMP (IBM's HA software.) A failover event would take minutes... and this was on the same local network.
> From 6+ years of experience, it seems that we got fooled by the multi-AZ promise of being able to survive a datacenter outage.
You have quite a misunderstanding ...
AWS' "multi-az promise" has always been that they will try to take only one AZ down at a time within a region.
It was never "blend your AZ usage so we can't take one down."
If you don't have a wiki page with some HA architecture diagrams for each of your systems, then you probably don't have HA. Hint: at every company that I've worked at, I drew the first diagrams. Something to think about.
This is good advice but not always easy to implement. We have some customers that insist on using IPs instead of DNS (usually because of bad/old software on their side). In our case, we have some commercial LBs that can pass the EIP to each other as needed. However, we do see quite a few resets, so I wonder if something like this is still going on.
You can use AWS Global Accelerator. AWS will assign 2 static IPs (not EIPs, but they will never change until the Global Accelerator is deleted), or you can use your own block of IPs (BYOIP).
Then point your DNS at the Global Accelerator static IPs, and forward your traffic from AWS Global Accelerator to your ALBs, EC2 instances or NLBs.