Understanding Availability Zones and Regions: A Practical Guide
Learn what Availability Zones and regions are physically, how to design for AZ redundancy, which services are zone-scoped vs region-scoped, and what SLA documents actually guarantee.

The Physical Foundation of Cloud Reliability
Availability Zones (AZs) and regions are the building blocks of every cloud architecture decision you'll make. When AWS promises 99.99% uptime for a service, that guarantee assumes you've deployed across multiple AZs. When your application goes down because "the data center had an issue," it usually means you put everything in a single AZ. Understanding what AZs and regions actually are -- physically and logically -- changes how you design infrastructure.
This isn't abstract theory. Every production outage I've investigated in the past five years had a root cause tied to AZ or region architecture: a database in one AZ with no replica, a load balancer not spanning enough zones, or a region-scoped service dependency that took everything down. This guide covers the physical reality, the design patterns, and the failure modes you need to plan for.
What Are Availability Zones?
Definition: An Availability Zone is one or more discrete data centers within a cloud region, each with independent power, cooling, and networking. AZs within a region are connected by high-bandwidth, low-latency links and are designed so that a failure in one AZ does not cascade to others.
What Are Regions?
A region is a geographic area containing multiple Availability Zones. AWS has 30+ regions worldwide. GCP calls them regions with zones. Azure calls them regions with Availability Zones (added later in Azure's evolution). Each region is completely independent -- a region-level outage in us-east-1 doesn't affect eu-west-1.
The Physical Reality
Cloud providers are deliberately vague about the physical details, but here's what we know:
- Each AZ consists of one or more data centers (AWS has confirmed some AZs have multiple buildings)
- AZs within a region are typically 10-100 km apart -- far enough for independent failure domains but close enough for low-latency replication
- Inter-AZ latency is usually 1-2ms round trip
- Each AZ has independent power feeds, often from different substations or power grids
- AZs have independent network connections to the internet and to AWS's backbone
Pro tip: AZ names are randomized per AWS account. Your `us-east-1a` is not the same physical facility as another account's `us-east-1a`. AWS does this to distribute load evenly. Use AZ IDs (like `use1-az1`) when coordinating across accounts.
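To see the name-to-ID mapping in your own account, the `describe-availability-zones` call returns both fields. A usage sketch, assuming the AWS CLI is installed and configured with credentials:

```shell
# List AZ names alongside their account-independent AZ IDs
aws ec2 describe-availability-zones \
  --region us-east-1 \
  --query 'AvailabilityZones[].[ZoneName, ZoneId]' \
  --output table
```

Two accounts comparing this output will usually see the same AZ IDs mapped to different AZ names.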
Cross-Cloud Comparison
| Concept | AWS | GCP | Azure |
|---|---|---|---|
| Geographic unit | Region | Region | Region (Geography) |
| Fault domain | Availability Zone | Zone | Availability Zone |
| AZs per region | 2-6 (typically 3) | 3 (typically) | 3 (in supported regions) |
| Total regions | 30+ | 40+ | 60+ |
| AZ-level SLA | No (region-level with multi-AZ) | No (region-level with multi-zone) | Yes (per-zone SLA for VMs) |
GCP's zones are functionally identical to AWS AZs. Azure added Availability Zones later, so some older Azure regions don't have them -- check before assuming zone redundancy is available.
Designing for AZ Redundancy: Step by Step
Step 1: Deploy Compute Across at Least Two AZs
Every Auto Scaling Group, ECS service, or Kubernetes deployment should span at least two AZs (three is better). If one AZ fails, the surviving AZs handle the load. Your ASG should have enough capacity in N-1 AZs to serve your full traffic.
```hcl
# Terraform: ASG spanning three AZs
resource "aws_autoscaling_group" "app" {
  min_size         = 3
  max_size         = 12
  desired_capacity = 6

  vpc_zone_identifier = [
    aws_subnet.private_a.id,
    aws_subnet.private_b.id,
    aws_subnet.private_c.id,
  ]
}
```
Step 2: Use Multi-AZ Databases
RDS Multi-AZ deploys a synchronous standby replica in a different AZ. If the primary fails, RDS automatically promotes the standby -- typically in 60-120 seconds. Aurora goes further with up to 15 read replicas across three AZs and failover in under 30 seconds.
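In Terraform, Multi-AZ is a single flag on the DB instance. A hedged sketch -- the identifier, engine, and sizes are illustrative, not from the original:

```hcl
# Terraform: RDS instance with a synchronous standby in another AZ
resource "aws_db_instance" "app" {
  identifier        = "app-db"
  engine            = "postgres"
  instance_class    = "db.r6g.large"
  allocated_storage = 100
  multi_az          = true  # standby in a different AZ, automatic failover
}
```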
Step 3: Distribute Load Balancers
Application Load Balancers (ALBs) automatically distribute across the AZs you enable. Enable all AZs in your VPC. Cross-zone load balancing is on by default for ALBs, meaning each AZ's targets receive traffic regardless of which AZ the client connects to.
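A Terraform sketch of an ALB enabled in three AZs; the subnet names are illustrative and mirror the ASG example above (ALBs typically sit in public subnets):

```hcl
# Terraform: ALB spanning three AZs via one subnet per zone
resource "aws_lb" "app" {
  name               = "app-alb"
  load_balancer_type = "application"
  subnets = [
    aws_subnet.public_a.id,
    aws_subnet.public_b.id,
    aws_subnet.public_c.id,
  ]
}
```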
Step 4: Replicate Data Stores
ElastiCache Redis can run in cluster mode with replicas across AZs. DynamoDB replicates across three AZs automatically (it's built in). EFS is multi-AZ by default. EBS volumes are AZ-scoped -- if you need cross-AZ data availability, use EFS, S3, or database replication.
Watch out: EBS volumes exist in a single AZ. If that AZ fails, the volume is inaccessible until the AZ recovers. Never rely on a single EBS volume for critical data without snapshots or cross-AZ replication.
Zone-Scoped vs Region-Scoped Services
Understanding which AWS services are zone-scoped vs region-scoped determines your blast radius:
| Zone-Scoped (affected by AZ failure) | Region-Scoped (survives AZ failure) |
|---|---|
| EC2 instances | S3 |
| EBS volumes | DynamoDB |
| NAT Gateways | Lambda |
| ElastiCache nodes | SQS |
| RDS single-AZ instances | ALB (multi-AZ) |
| Subnets | Route 53 |
Region-scoped services replicate across AZs internally. When you write an object to S3, it's replicated across at least three AZs before the write is acknowledged. DynamoDB does the same. These services survive AZ failures transparently.
Failure Modes and SLAs
AZ Failure
Partial or complete AZ failures happen roughly once per year across all AWS regions. They're usually partial -- a power issue affecting some racks, a network partition isolating some hosts. Designing for multi-AZ means these events cause degraded performance (fewer healthy targets) rather than full outages.
Region Failure
Full region failures are extremely rare -- a handful of incidents in AWS's history. However, region-wide service degradation (like the us-east-1 events in 2017 and 2020) affects multiple services simultaneously. If your SLA requires resilience against region failure, you need a multi-region architecture.
Understanding SLA Documents
AWS SLAs are carefully worded. Key points:
- EC2 SLA: 99.99% at the region level for instances deployed across two or more AZs. Individual instances carry only the weaker instance-level SLA (99.5%).
- RDS Multi-AZ SLA: 99.95%. Single-AZ RDS: no SLA.
- S3 SLA: 99.9% availability, 99.999999999% (11 nines) durability.
- SLA credits are billing credits, not refunds. A 10% credit on your EC2 bill doesn't cover the revenue lost from a 4-hour outage.
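The difference between these numbers is easier to feel as allowed downtime per month. A quick calculation (assuming a 30-day month):

```python
def monthly_downtime_minutes(sla_percent: float, days: int = 30) -> float:
    """Downtime per month permitted by a given availability SLA."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)

for sla in (99.99, 99.95, 99.9):
    print(f"{sla}% -> {monthly_downtime_minutes(sla):.1f} min/month")
# 99.99% -> 4.3 min/month
# 99.95% -> 21.6 min/month
# 99.9%  -> 43.2 min/month
```

A service can breach its SLA for the month with a single deploy gone wrong, which is why the credit terms matter less than the architecture.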
Cost of Multi-AZ Deployment
Multi-AZ adds cost, but it's modest compared to multi-region:
- RDS Multi-AZ: roughly 2x the cost of single-AZ (you're running two instances)
- Cross-AZ data transfer: $0.01/GB each direction -- monitor this for high-throughput workloads
- NAT Gateways per AZ: $32/month each -- three AZs means $96/month just for NAT
- Additional compute capacity: running N+1 instances across AZs for redundancy
For most workloads, multi-AZ adds 20-40% to your infrastructure cost. It's almost always worth it for production -- the cost of downtime exceeds the cost of redundancy.
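Cross-AZ transfer is the cost item most likely to surprise you, because it is billed in both directions. A back-of-the-envelope sketch using the $0.01/GB-each-way rate quoted above (check current pricing for your region):

```python
def cross_az_transfer_cost(gb_per_month: float, rate_per_gb: float = 0.01) -> float:
    """Monthly cross-AZ transfer cost: billed out of one AZ and into the other."""
    return gb_per_month * rate_per_gb * 2

# Example: 10 TB/month of replication traffic between AZs
print(f"${cross_az_transfer_cost(10_000):.2f}/month")  # $200.00/month
```

For chatty replication workloads (Kafka, Cassandra, database read replicas) this line item can rival the compute bill, which is why rack-aware or zone-aware client routing exists.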
Frequently Asked Questions
How many Availability Zones should I use?
Use at least two AZs for any production workload. Three AZs is the standard recommendation because it provides redundancy even if one AZ fails and another is under maintenance. Using three AZs also enables quorum-based systems (like etcd in Kubernetes) to tolerate a single AZ failure while maintaining consensus.
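The quorum arithmetic behind the three-AZ recommendation is simple majority math -- with one voting member per AZ:

```python
def tolerated_failures(members: int) -> int:
    """A quorum system with N members needs a majority (N // 2 + 1) to stay
    available, so it tolerates N minus that majority in failures."""
    return members - (members // 2 + 1)

# One etcd member per AZ:
print(tolerated_failures(3))  # 1 -> survives one AZ failure
print(tolerated_failures(2))  # 0 -> two AZs give no quorum tolerance
```

This is why two-AZ etcd or ZooKeeper clusters are a trap: losing either AZ loses the majority, so the second AZ adds cost without adding consensus fault tolerance.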
Do all AWS services support multi-AZ?
Most managed services either support multi-AZ explicitly (RDS, ElastiCache) or are region-scoped and handle it internally (S3, DynamoDB, Lambda, SQS). Some services like EBS and EC2 are inherently zone-scoped. Always check the service documentation for multi-AZ support before assuming it's there.
What is the latency between Availability Zones?
Inter-AZ latency within an AWS region is typically 1-2 milliseconds round trip. This is low enough for synchronous database replication and most application communication. Cross-region latency is much higher -- typically 20-200ms depending on geographic distance -- which is why multi-region adds complexity to data consistency.
Can an entire AWS region go down?
Full region outages are extremely rare but have occurred. More commonly, specific services within a region degrade -- for example, the 2020 Kinesis outage in us-east-1 cascaded to affect Cognito, CloudWatch, and Lambda. Design critical paths to avoid dependencies on multiple services in a single region if region resilience matters to you.
What is the difference between an AZ and a data center?
An AZ is a logical concept that maps to one or more physical data centers. AWS has confirmed that some AZs contain multiple data center buildings connected by dedicated fiber. The key distinction is that an AZ is designed as a single failure domain -- all buildings in one AZ share fate for power, networking, and physical events.
How do I check which AZ a resource is in?
EC2 instances show their AZ in the console and CLI. Use aws ec2 describe-instances and check the Placement.AvailabilityZone field. For cross-account coordination, use AZ IDs instead of AZ names since names are randomized per account. The aws ec2 describe-availability-zones command shows the mapping between names and IDs.
Multi-AZ Is Your Baseline
Single-AZ deployment is acceptable for development and testing. For anything that serves users or processes data you care about, deploy across at least two AZs -- preferably three. Use RDS Multi-AZ or Aurora for databases. Span your ASGs and ECS services across all available zones. Monitor cross-AZ data transfer costs but don't let them discourage multi-AZ design. The availability gains are worth the modest cost premium, and every cloud provider SLA assumes you've done this work.
Written by
Abhishek Patel
Infrastructure engineer with 10+ years building production systems on AWS, GCP, and bare metal. Writes practical guides on cloud architecture, containers, networking, and Linux for developers who want to understand how things actually work under the hood.