DevOps & Cloud Concepts
Essential DevOps and cloud computing concepts every engineer should know. HA, scaling, deployment strategies, disaster recovery, and more.
Why This Module?
Before diving into specific AWS services, it helps to understand the foundational concepts that underpin every cloud architecture. These ideas are cloud-agnostic: they apply equally to AWS, Azure, and GCP.
High Availability
Design systems that keep running even when components fail.
Load Balancing
Distribute traffic to prevent overload and improve response times.
Autoscaling
Automatically adjust capacity to match demand, up and down.
Deployment Strategies
Ship changes safely with blue/green, canary, and rolling deployments.
Disaster Recovery
Recover from catastrophic failures with defined RPO and RTO targets.
Observability
Understand what's happening inside your systems with metrics, logs, and traces.
High Availability (HA)
High availability means your system continues operating when components fail. It's measured as a percentage of uptime:
| Availability | Downtime / Year | Downtime / Month | Typical Use |
|---|---|---|---|
| 99% ("two nines") | 3.65 days | 7.3 hours | Internal tools |
| 99.9% ("three nines") | 8.76 hours | 43 min | Typical SaaS |
| 99.99% ("four nines") | 52.6 min | 4.3 min | E-commerce, banking |
| 99.999% ("five nines") | 5.26 min | 26 sec | Mission-critical |
Key HA Patterns
- Redundancy: No single point of failure. Run 2+ instances, across 2+ Availability Zones
- Health checks: Continuously monitor components and automatically replace unhealthy ones
- Failover: Automatic promotion of standby systems (e.g., RDS Multi-AZ)
- Statelessness: Store session data externally (Redis/DynamoDB) so any instance can serve any request
The HA equation: To get 99.99% availability from components that are individually 99.9% available, you need redundancy. With two independent components in an active-active setup, the system is down only when both fail at once: 1 - (0.001 × 0.001) = 0.999999, i.e. 99.9999%.
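The arithmetic above can be checked with a short script. This is a sketch of the standard series/parallel availability formulas, not tied to any AWS API:

```python
def parallel_availability(component_availability: float, copies: int) -> float:
    """Redundant copies: the system is down only if every copy is down at once."""
    downtime_prob = (1 - component_availability) ** copies
    return 1 - downtime_prob

def serial_availability(*availabilities: float) -> float:
    """A chain of dependencies: the system is up only if every link is up."""
    total = 1.0
    for a in availabilities:
        total *= a
    return total

# Two 99.9% components active-active: 1 - (0.001 * 0.001) ≈ 99.9999%
print(parallel_availability(0.999, 2))

# Chaining components *reduces* availability: EC2 (99.9%) depending on RDS (99.9%)
print(serial_availability(0.999, 0.999))  # ≈ 99.8%
```

Note the asymmetry: redundancy multiplies failure probabilities (good), while serial dependencies multiply availabilities (bad). This is why a single `User → EC2 → RDS` chain is less available than either component alone.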
HA on AWS
Single instance (no HA):
User → EC2 → RDS
Availability: ~99.5%
Basic HA (Multi-AZ):
User → ALB → [EC2 AZ-a, EC2 AZ-b] → RDS Multi-AZ
Availability: ~99.99%
Full HA (Multi-Region):
User → Route 53 (failover) → [Region A: ALB → ASG → Aurora]
                             [Region B: ALB → ASG → Aurora Replica]
Availability: ~99.999%
Load Balancing
A load balancer distributes incoming requests across multiple targets to prevent any single server from becoming a bottleneck.
Load Balancing Algorithms
Round Robin
Sends requests to each server in sequence. Simple, but doesn't account for server load. The default for many load balancers.
Least Connections
Routes to the server with the fewest active connections. Better for uneven request durations.
IP Hash
The same client IP always goes to the same server. Useful for session affinity (sticky sessions).
Weighted
Assign weights to servers. Useful during canary deployments (e.g., send 5% to the new version).
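The first three algorithms are simple enough to sketch in a few lines. The server names and connection counts here are hypothetical:

```python
from itertools import cycle

servers = ["app-1", "app-2", "app-3"]

# Round robin: hand out servers in a fixed rotation, ignoring load.
rotation = cycle(servers)
def round_robin() -> str:
    return next(rotation)

# Least connections: pick the server with the fewest in-flight requests.
active_connections = {"app-1": 12, "app-2": 3, "app-3": 7}
def least_connections() -> str:
    return min(active_connections, key=active_connections.get)

# IP hash: the same client always lands on the same server (sticky sessions).
def ip_hash(client_ip: str) -> str:
    return servers[hash(client_ip) % len(servers)]
```

Note that `ip_hash` is deterministic within a process: identical client IPs always map to the same server, which is exactly the session-affinity property described above.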
AWS Load Balancer Types
| Type | Layer | Best For | Key Feature |
|---|---|---|---|
| ALB | Layer 7 (HTTP) | Web apps, microservices | Path/host-based routing |
| NLB | Layer 4 (TCP/UDP) | Gaming, IoT, ultra-low latency | Millions of req/s, static IP |
| GWLB | Layer 3 (IP) | Firewalls, IDS/IPS | Transparent network appliances |
Rule of thumb: If your app speaks HTTP → ALB. If you need raw TCP/UDP performance → NLB. You'll use ALB 90% of the time.
Health Checks
Load balancers rely on health checks to know which targets are alive. A health check sends a request (e.g., GET /health) at regular intervals. If a target fails consecutive checks, it's removed from rotation.
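The consecutive-threshold logic can be sketched as a small state machine. This mirrors a typical 2-success/3-failure configuration; it is an illustration, not any load balancer's actual implementation:

```python
class HealthTracker:
    """Mark a target healthy after N consecutive successes,
    unhealthy after M consecutive failures (illustrative thresholds)."""

    def __init__(self, healthy_threshold: int = 2, unhealthy_threshold: int = 3):
        self.healthy_threshold = healthy_threshold
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy = True      # targets start in rotation
        self._streak = 0         # length of the current run of same-kind results
        self._last_ok = None

    def record(self, ok: bool) -> bool:
        """Record one health-check result; return current healthy state."""
        if ok is self._last_ok:
            self._streak += 1
        else:
            self._last_ok = ok
            self._streak = 1
        if ok and self._streak >= self.healthy_threshold:
            self.healthy = True
        elif not ok and self._streak >= self.unhealthy_threshold:
            self.healthy = False
        return self.healthy
```

A single failed check does nothing; only three failures in a row pull the target out of rotation, and it takes two consecutive successes to rejoin. The thresholds trade detection speed against flapping on transient errors.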
Health Check Config:
Path: /health
Interval: 30 seconds
Healthy threshold: 2 consecutive successes → mark healthy
Unhealthy threshold: 3 consecutive failures → mark unhealthy
Timeout: 5 seconds per check
Autoscaling
Autoscaling automatically adjusts the number of compute instances based on demand. It scales out (add instances) under load and scales in (remove instances) when demand drops.
Scaling Types
| Type | How It Works | Example |
|---|---|---|
| Horizontal (scale out/in) | Add/remove instances | ASG: 2 → 6 EC2s during peak |
| Vertical (scale up/down) | Resize instance | t3.micro → t3.large (requires restart) |
| Scheduled | Scale at known times | Scale up at 8am, down at 8pm |
| Predictive | ML-based forecasting | AWS learns your traffic patterns |
Scaling Policies
Target Tracking (recommended):
"Keep average CPU at 60%"
→ ASG automatically adds/removes instances to maintain the target
→ Simple, self-managing
Step Scaling:
CPU > 60% → add 1 instance
CPU > 80% → add 3 instances
CPU < 30% → remove 1 instance
→ More control, more config
Simple Scaling:
CPU > 70% → add 1 instance, then wait 300s cooldown
→ Basic, legacy; prefer Target Tracking
Cooldown period: After a scaling action, the ASG waits (default 300s) before acting again. This prevents "thrashing": rapidly scaling up and down. Target Tracking handles this automatically.
Scaling on AWS โ The Key Services
- EC2 Auto Scaling Groups: Scale EC2 instances horizontally
- ECS Service Auto Scaling: Scale container tasks based on CPU/memory or custom metrics
- K8s HPA: Scale pods on EKS based on metrics
- Lambda: Scales automatically with no config (up to account concurrency limit)
- DynamoDB: Auto-scales read/write capacity units
- Aurora: Auto-scales read replicas and storage
Deployment Strategies
How you ship changes to production determines your risk, speed, and rollback capabilities. Here are the main strategies:
1. Rolling Deployment
Time →
Instance 1: [v1] [v1] [v2] [v2] [v2]
Instance 2: [v1] [v1] [v1] [v2] [v2]
Instance 3: [v1] [v1] [v1] [v1] [v2]
✅ Low cost (no extra instances)
✅ Gradual rollout
❌ During the update, a mix of v1 and v2 serves traffic
❌ Slow rollback (must re-deploy v1)
2. Blue/Green Deployment
     ┌── Blue (v1)  ← LIVE
LB ──┤
     └── Green (v2) ← Idle, testing
After validation:
     ┌── Blue (v1)  ← Idle (keep for rollback)
LB ──┤
     └── Green (v2) ← LIVE ✅
✅ Instant rollback (switch back to blue)
✅ Zero downtime
✅ Full testing before go-live
❌ Costs 2x resources during deployment
3. Canary Deployment
Time →
Phase 1: 5% traffic → v2, 95% → v1 (test with a small %)
Phase 2: 25% traffic → v2, 75% → v1 (if metrics OK, increase)
Phase 3: 100% traffic → v2 (full rollout)
✅ Lowest risk: errors affect few users
✅ Data-driven decisions (watch error rates)
❌ Most complex to implement
❌ Needs good observability to detect issues
4. A/B Testing (Traffic Splitting)
Similar to canary but the split is for feature comparison, not just safety. Route specific user segments to different versions and measure business metrics (conversion rate, engagement).
Which to use? Start with rolling for simplicity. Graduate to blue/green for critical services. Use canary when you have mature observability and need the safest possible rollouts.
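The weighted split behind a canary phase can be sketched as a random draw per request. The percentages match the canary phases above; the version labels are hypothetical:

```python
import random

def route(canary_weight: float) -> str:
    """Send roughly canary_weight of requests to v2, the rest to v1."""
    return "v2" if random.random() < canary_weight else "v1"

# Phase 1 of the canary: about 5% of requests should hit v2.
random.seed(42)  # deterministic draw for the demo
sample = [route(0.05) for _ in range(10_000)]
v2_share = sample.count("v2") / len(sample)
print(f"{v2_share:.1%}")  # close to 5%
```

In production the split is enforced by the load balancer or service mesh (e.g., ALB weighted target groups), not per-request code, but the statistics are the same: the canary's error rate is measured on a small, controlled slice of real traffic before the weight is increased.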
AWS Implementation
| Strategy | AWS Service | How |
|---|---|---|
| Rolling | CodeDeploy (OneAtATime) | Update instances sequentially |
| Blue/Green | CodeDeploy + ASG | Create new ASG, swap ALB target |
| Canary | CodeDeploy (Canary10Percent5Min) | Route 10% first, full after 5 min |
| Canary (K8s) | Flagger + Istio on EKS | Progressive delivery with metrics |
| Blue/Green (Lambda) | Lambda Aliases + Weights | Traffic shift between versions |
Disaster Recovery (DR)
DR is your plan for recovering from catastrophic failures: region outages, data corruption, or complete infrastructure loss. Two critical metrics define DR:
RTO (Recovery Time Objective)
How fast you need to recover. "We must be back online within 1 hour."
RPO (Recovery Point Objective)
How much data loss is acceptable. "We can lose at most 15 minutes of data."
DR Strategies (by cost & speed)
| Strategy | RTO | RPO | Cost | Description |
|---|---|---|---|---|
| Backup & Restore | Hours | Hours | 💰 | Regular backups to S3. Restore when needed. |
| Pilot Light | 10-30 min | Minutes | 💰💰 | Core infra running but idle. Scale up on disaster. |
| Warm Standby | Minutes | Seconds | 💰💰💰 | Scaled-down copy running in DR region. Scale up on failover. |
| Multi-Site Active-Active | Near-zero | Near-zero | 💰💰💰💰 | Full copies in both regions. Route 53 health checks failover. |
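Choosing a strategy amounts to matching your RTO/RPO targets against the table above and taking the cheapest row that satisfies both. A sketch, with illustrative worst-case numbers (in seconds) standing in for the table's "Hours"/"Minutes" ranges:

```python
def choose_dr_strategy(rto_seconds: float, rpo_seconds: float) -> str:
    """Pick the cheapest DR strategy that meets both targets.
    The worst-case numbers are illustrative approximations, not AWS guarantees."""
    strategies = [
        # (name, worst-case RTO, worst-case RPO) -- cheapest first
        ("Backup & Restore",         24 * 3600, 24 * 3600),
        ("Pilot Light",              30 * 60,   10 * 60),
        ("Warm Standby",             10 * 60,   60),
        ("Multi-Site Active-Active", 60,        5),
    ]
    for name, worst_rto, worst_rpo in strategies:
        if worst_rto <= rto_seconds and worst_rpo <= rpo_seconds:
            return name
    raise ValueError("No strategy meets these targets")

# "Back online within 1 hour, lose at most 15 minutes of data"
print(choose_dr_strategy(3600, 900))  # Pilot Light
```

This is why the advice below says to define RTO/RPO first: the targets pick the strategy (and the cost), not the other way around.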
Example: Warm Standby
Primary Region (us-east-1):          DR Region (us-west-2):
├─ ALB → ASG (4 instances)           ├─ ALB → ASG (1 instance)
├─ RDS Primary                       ├─ RDS Read Replica
├─ ElastiCache                       └─ Route 53 health check
└─ S3 (cross-region replication)
On failure:
1. Route 53 detects the primary is unhealthy
2. DNS fails over to us-west-2
3. ASG scales from 1 → 4 instances
4. RDS replica promoted to primary
5. ~5 minute recovery time
Test your DR plan! An untested DR plan is not a plan; it's a hope. Run DR drills quarterly. AWS Fault Injection Simulator (FIS) can simulate AZ and region failures.
Observability
Observability answers the question: "What is my system doing, and why?" It's built on three pillars:
The Three Pillars
Metrics
Numeric measurements over time. CPU at 72%, response time p99 = 230ms, 5xx rate = 0.3%. Used for alerting and dashboards.
Logs
Timestamped event records. "User 123 failed login at 14:32:01." Used for debugging specific issues.
Traces
Follow a request across services. Request → API → Auth Service → DB → Cache. Used for finding bottlenecks in distributed systems.
AWS Observability Stack
| Pillar | AWS Service | Alternative |
|---|---|---|
| Metrics | CloudWatch Metrics | Datadog, Prometheus + Grafana |
| Logs | CloudWatch Logs | ELK Stack, Datadog Logs |
| Traces | X-Ray | Jaeger, Datadog APM |
| Dashboards | CloudWatch Dashboards | Grafana |
| Alerting | CloudWatch Alarms → SNS | PagerDuty, OpsGenie |
The Golden Signals
Google's SRE book defines four signals every service should monitor:
- Latency: How long requests take (track p50, p95, p99)
- Traffic: How many requests per second (demand)
- Errors: Rate of failed requests (5xx, timeouts)
- Saturation: How "full" your system is (CPU, memory, disk, connections)
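The latency percentiles mentioned above (p50, p95, p99) can be computed from raw samples with nothing but the standard library. This uses the simple nearest-rank method; the sample latencies are made up:

```python
def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the value at or below which p% of samples fall."""
    ordered = sorted(samples)
    rank = max(1, round(p / 100 * len(ordered)))  # 1-based nearest rank
    return ordered[rank - 1]

# Hypothetical response times in ms: mostly fast, with a slow tail.
latencies = [12, 15, 14, 13, 200, 16, 12, 18, 14, 950]
print(percentile(latencies, 50))  # 14  -- the typical request
print(percentile(latencies, 99))  # 950 -- dominated by the slow tail
```

This is why averages are misleading for latency: the mean of this sample is ~126ms, a number no actual request experienced. Tracking p50 alongside p99 shows both the typical case and the tail your unhappiest users see.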
Alert on symptoms, not causes. Alert on "5xx rate > 5%" (symptom), not "CPU > 80%" (cause). High CPU that doesn't affect users isn't an emergency. High error rates always are.
Immutable Infrastructure
Instead of updating servers in place (mutable), you replace them entirely with new versions. Think of servers like cattle, not pets.
Pets (Mutable)
Hand-configured servers you care for individually. SSH in, install packages, tweak configs. Unique snowflakes.
Cattle (Immutable)
Identical, disposable instances from a template. If one fails, terminate it and launch a new one. No SSH needed.
Benefits of Immutable Infrastructure
- No configuration drift: Every instance is identical, built from the same AMI/image
- Reliable rollbacks: Deploy old version = launch old AMI
- Easier debugging: "It works in staging" actually means something when staging = production image
- Security: No SSH access needed. Smaller attack surface
The Immutable Pipeline
Code Change → Build → Create AMI/Image → Deploy New Instances → Terminate Old
Developer pushes code
    │
    ▼
CodeBuild: npm install, test, build
    │
    ▼
Packer: Bake AMI with app + dependencies
    │
    ▼
Terraform/CloudFormation: Update Launch Template with new AMI
    │
    ▼
ASG: Rolling replacement of old instances with new ones
More Key Concepts
Infrastructure as Code (IaC)
Manage infrastructure through code instead of manual console clicks. Version-controlled, reviewable, repeatable, and testable. CloudFormation and Terraform are the two main tools (covered in Module 5).
GitOps
Git is the single source of truth for infrastructure and application state. Changes are made via Pull Requests. A reconciliation tool (ArgoCD, Flux) ensures the live state matches what's in git.
Developer → PR to git → Approved → Merged
    ↓
ArgoCD watches git → Detects change → Applies to K8s cluster
    ↓
Cluster state = Git state (always)
12-Factor App Principles (for Cloud-Native)
| # | Factor | In Practice |
|---|---|---|
| 1 | Codebase | One repo per service, tracked in git |
| 2 | Dependencies | Explicitly declared (package.json, requirements.txt) |
| 3 | Config | Store in environment variables, not code |
| 4 | Backing services | Treat databases, queues as attached resources |
| 5 | Build, release, run | Separate build/release/run stages (CI/CD) |
| 6 | Processes | Stateless: store state in Redis/DB, not memory |
| 7 | Port binding | Export services via port (e.g., Express on :3000) |
| 8 | Concurrency | Scale via processes (horizontal scaling) |
| 9 | Disposability | Fast startup, graceful shutdown |
| 10 | Dev/prod parity | Keep environments as similar as possible |
| 11 | Logs | Treat as event streams (stdout → CloudWatch) |
| 12 | Admin processes | Run admin tasks (migrations) as one-off processes |
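Factor 3 (config in environment variables) in practice: the same build artifact runs everywhere, and only the environment differs. The variable names and defaults here are hypothetical:

```python
import os

# Factor 3: read config from the environment, never hard-code it.
# APP_DATABASE_URL and APP_PORT are hypothetical names for this sketch.
DATABASE_URL = os.environ.get("APP_DATABASE_URL", "postgres://localhost/dev")
PORT = int(os.environ.get("APP_PORT", "3000"))

# The same image runs in every environment; only env vars change:
#   staging:    APP_DATABASE_URL=postgres://staging-db/app
#   production: APP_DATABASE_URL=postgres://prod-db/app
print(f"Listening on :{PORT}, db={DATABASE_URL}")
```

Because credentials and endpoints live outside the codebase, the build/release/run separation of Factor 5 follows naturally: one build, many releases, each release being build + environment config.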
Idempotency
An operation is idempotent if running it multiple times produces the same result as running it once. This is critical for:
- IaC: terraform apply can be run repeatedly without side effects
- APIs: Retrying a PUT request doesn't create duplicate records
- Deployments: Re-running a deployment script doesn't break the existing setup
- Ansible playbooks: Run 10 times, same result every time
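The difference shows up clearly in a toy example: a PUT keyed by ID versus an append that mints a new key on every call. The record shapes are made up:

```python
records: dict[str, dict] = {}

def put_user(user_id: str, name: str) -> None:
    """Idempotent: retrying the same PUT leaves exactly one record."""
    records[user_id] = {"name": name}

def append_user(name: str) -> None:
    """NOT idempotent: every retry creates a duplicate with a fresh key."""
    records[f"user-{len(records)}"] = {"name": name}

put_user("u1", "Ada")
put_user("u1", "Ada")  # a retried request changes nothing
assert len(records) == 1
```

This is why retry logic (and tools like Terraform and Ansible) leans on idempotent operations: a timeout leaves you unsure whether the request landed, and with an idempotent operation the safe answer is always "send it again".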
Principle of Least Privilege
Grant only the minimum permissions needed for a task. Every IAM role, security group, and NACL should follow this principle. If a service only reads from S3, don't give it write access.
Exercise: Map Concepts to the Sandbox Architecture
Look at the AWS Sandbox architecture and identify where each concept applies:
Architecture:
Route 53 → CloudFront → ALB → ASG (2-4 EC2s) → RDS Multi-AZ
✅ High Availability: ALB across 2 AZs + RDS Multi-AZ failover
✅ Load Balancing: ALB distributes to EC2 Target Group
✅ Autoscaling: ASG Target Tracking (CPU 70%)
✅ Health Checks: ALB → /health endpoint every 30s
✅ Immutable Infra: Launch Template + AMI, new instances replace old
✅ Deployment Strategy: CodeDeploy OneAtATime (rolling)
✅ IaC: CloudFormation 7 stacks / Terraform modules
✅ Observability: CloudWatch metrics + ALB access logs
✅ Least Privilege: SG chain (ALB → EC2 → RDS)
✅ Statelessness: App stores nothing in memory, all in RDS
Key Takeaways
- HA = redundancy + health checks + auto-failover across multiple AZs
- Autoscaling = use Target Tracking for simplicity, always include scale-in
- Deployment strategies = rolling for simplicity, blue/green for safety, canary for precision
- DR = define RTO/RPO first, then choose strategy. Test your DR plan!
- Observability = metrics + logs + traces. Alert on symptoms, not causes
- Immutable infra = cattle not pets. Replace, don't repair
- These concepts are cloud-agnostic: they apply to AWS, Azure, GCP, and on-prem