DevOps & Cloud Concepts
Essential DevOps and cloud computing concepts every engineer should know. HA, scaling, deployment strategies, disaster recovery, and more.
Why This Module?
Before diving into specific AWS services, it helps to understand the foundational concepts that underpin every cloud architecture. These ideas are cloud-agnostic: they apply equally to AWS, Azure, and GCP.
High Availability
Design systems that keep running even when components fail.
Load Balancing
Distribute traffic to prevent overload and improve response times.
Autoscaling
Automatically adjust capacity to match demand, up and down.
Deployment Strategies
Ship changes safely with blue/green, canary, and rolling deployments.
Disaster Recovery
Recover from catastrophic failures with defined RPO and RTO targets.
Observability
Understand what's happening inside your systems with metrics, logs, and traces.
High Availability (HA)
High availability means your system continues operating when components fail. It's measured as a percentage of uptime:
| Availability | Downtime / Year | Downtime / Month | Typical Use |
|---|---|---|---|
| 99% ("two nines") | 3.65 days | 7.3 hours | Internal tools |
| 99.9% ("three nines") | 8.76 hours | 43 min | Typical SaaS |
| 99.99% ("four nines") | 52.6 min | 4.3 min | E-commerce, banking |
| 99.999% ("five nines") | 5.26 min | 26 sec | Mission-critical |
Key HA Patterns
- Redundancy: No single point of failure. Run 2+ instances, across 2+ Availability Zones
- Health checks: Continuously monitor components and automatically replace unhealthy ones
- Failover: Automatic promotion of standby systems (e.g., RDS Multi-AZ)
- Statelessness: Store session data externally (Redis/DynamoDB) so any instance can serve any request
The HA equation: To get 99.99% availability from components that are individually 99.9% available, you need redundancy. With two independent components in an active-active setup, the system is down only when both fail at once: 1 - (0.001 × 0.001) = 0.999999, i.e. 99.9999%.
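The arithmetic above can be checked with a short script. This is a sketch of the standard series/parallel availability formulas, not tied to any AWS API:

```python
def parallel_availability(component_availability: float, copies: int) -> float:
    """Redundant copies: the system is down only if every copy is down at once."""
    downtime_prob = (1 - component_availability) ** copies
    return 1 - downtime_prob

def serial_availability(*availabilities: float) -> float:
    """A chain of dependencies: the system is up only if every link is up."""
    total = 1.0
    for a in availabilities:
        total *= a
    return total

# Two 99.9% components active-active: 1 - (0.001 * 0.001) ≈ 99.9999%
print(parallel_availability(0.999, 2))

# Chaining components *reduces* availability: EC2 (99.9%) depending on RDS (99.9%)
print(serial_availability(0.999, 0.999))  # ≈ 99.8%
```

Note the asymmetry: redundancy multiplies failure probabilities (good), while serial dependencies multiply availabilities (bad). This is why a single `User → EC2 → RDS` chain is less available than either component alone.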
HA on AWS
Single instance (no HA):
User → EC2 → RDS
Availability: ~99.5%
Basic HA (Multi-AZ):
User → ALB → [EC2 AZ-a, EC2 AZ-b] → RDS Multi-AZ
Availability: ~99.99%
Full HA (Multi-Region):
User → Route 53 (failover) → [Region A: ALB → ASG → Aurora]
                             [Region B: ALB → ASG → Aurora Replica]
Availability: ~99.999%
Load Balancing
A load balancer distributes incoming requests across multiple targets to prevent any single server from becoming a bottleneck.
Load Balancing Algorithms
Round Robin
Sends requests to each server in sequence. Simple, but doesn't account for server load. The default for many load balancers.
Least Connections
Routes to the server with the fewest active connections. Better for uneven request durations.
IP Hash
The same client IP always goes to the same server. Useful for session affinity (sticky sessions).
Weighted
Assign weights to servers. Useful during canary deployments (e.g., send 5% to the new version).
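The first three algorithms are simple enough to sketch in a few lines. The server names and connection counts here are hypothetical:

```python
from itertools import cycle

servers = ["app-1", "app-2", "app-3"]

# Round robin: hand out servers in a fixed rotation, ignoring load.
rotation = cycle(servers)
def round_robin() -> str:
    return next(rotation)

# Least connections: pick the server with the fewest in-flight requests.
active_connections = {"app-1": 12, "app-2": 3, "app-3": 7}
def least_connections() -> str:
    return min(active_connections, key=active_connections.get)

# IP hash: the same client always lands on the same server (sticky sessions).
def ip_hash(client_ip: str) -> str:
    return servers[hash(client_ip) % len(servers)]
```

Note that `ip_hash` is deterministic within a process: identical client IPs always map to the same server, which is exactly the session-affinity property described above.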
AWS Load Balancer Types
| Type | Layer | Best For | Key Feature |
|---|---|---|---|
| ALB | Layer 7 (HTTP) | Web apps, microservices | Path/host-based routing |
| NLB | Layer 4 (TCP/UDP) | Gaming, IoT, ultra-low latency | Millions of req/s, static IP |
| GWLB | Layer 3 (IP) | Firewalls, IDS/IPS | Transparent network appliances |
Rule of thumb: If your app speaks HTTP → ALB. If you need raw TCP/UDP performance → NLB. You'll use ALB 90% of the time.
Health Checks
Load balancers rely on health checks to know which targets are alive. A health check sends a request (e.g., GET /health) at regular intervals. If a target fails consecutive checks, it's removed from rotation.
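The consecutive-threshold logic can be sketched as a small state machine. This mirrors a typical 2-success/3-failure configuration; it is an illustration, not any load balancer's actual implementation:

```python
class HealthTracker:
    """Mark a target healthy after N consecutive successes,
    unhealthy after M consecutive failures (illustrative thresholds)."""

    def __init__(self, healthy_threshold: int = 2, unhealthy_threshold: int = 3):
        self.healthy_threshold = healthy_threshold
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy = True      # targets start in rotation
        self._streak = 0         # length of the current run of same-kind results
        self._last_ok = None

    def record(self, ok: bool) -> bool:
        """Record one health-check result; return current healthy state."""
        if ok is self._last_ok:
            self._streak += 1
        else:
            self._last_ok = ok
            self._streak = 1
        if ok and self._streak >= self.healthy_threshold:
            self.healthy = True
        elif not ok and self._streak >= self.unhealthy_threshold:
            self.healthy = False
        return self.healthy
```

A single failed check does nothing; only three failures in a row pull the target out of rotation, and it takes two consecutive successes to rejoin. The thresholds trade detection speed against flapping on transient errors.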
Health Check Config:
Path: /health
Interval: 30 seconds
Healthy threshold: 2 consecutive successes → mark healthy
Unhealthy threshold: 3 consecutive failures → mark unhealthy
Timeout: 5 seconds per check
Autoscaling
Autoscaling automatically adjusts the number of compute instances based on demand. It scales out (add instances) under load and scales in (remove instances) when demand drops.
Scaling Types
| Type | How It Works | Example |
|---|---|---|
| Horizontal (scale out/in) | Add/remove instances | ASG: 2 → 6 EC2s during peak |
| Vertical (scale up/down) | Resize instance | t3.micro → t3.large (requires restart) |
| Scheduled | Scale at known times | Scale up at 8am, down at 8pm |
| Predictive | ML-based forecasting | AWS learns your traffic patterns |
Scaling Policies
Target Tracking (recommended):
"Keep average CPU at 60%"
→ ASG automatically adds/removes instances to maintain the target
→ Simple, self-managing
Step Scaling:
CPU > 60% → add 1 instance
CPU > 80% → add 3 instances
CPU < 30% → remove 1 instance
→ More control, more config
Simple Scaling:
CPU > 70% → add 1 instance, then wait 300s cooldown
→ Basic, legacy; prefer Target Tracking
Cooldown period: After a scaling action, the ASG waits (default 300s) before acting again. This prevents "thrashing": rapidly scaling up and down. Target Tracking handles this automatically.
Scaling on AWS โ The Key Services
- EC2 Auto Scaling Groups: Scale EC2 instances horizontally
- ECS Service Auto Scaling: Scale container tasks based on CPU/memory or custom metrics
- K8s HPA: Scale pods on EKS based on metrics
- Lambda: Scales automatically with no config (up to account concurrency limit)
- DynamoDB: Auto-scales read/write capacity units
- Aurora: Auto-scales read replicas and storage
Deployment Strategies
How you ship changes to production determines your risk, speed, and rollback capabilities. Here are the main strategies:
1. Rolling Deployment
Time →
Instance 1: [v1] [v1] [v2] [v2] [v2]
Instance 2: [v1] [v1] [v1] [v2] [v2]
Instance 3: [v1] [v1] [v1] [v1] [v2]
✅ Low cost (no extra instances)
✅ Gradual rollout
❌ During the update, a mix of v1 and v2 serves traffic
❌ Slow rollback (must re-deploy v1)
2. Blue/Green Deployment
     ┌── Blue (v1)  ← LIVE
LB ──┤
     └── Green (v2) ← Idle, testing
After validation:
     ┌── Blue (v1)  ← Idle (keep for rollback)
LB ──┤
     └── Green (v2) ← LIVE ✅
✅ Instant rollback (switch back to blue)
✅ Zero downtime
✅ Full testing before go-live
❌ Costs 2x resources during deployment
3. Canary Deployment
Time →
Phase 1: 5% traffic → v2, 95% → v1 (test with a small %)
Phase 2: 25% traffic → v2, 75% → v1 (if metrics OK, increase)
Phase 3: 100% traffic → v2 (full rollout)
✅ Lowest risk: errors affect few users
✅ Data-driven decisions (watch error rates)
❌ Most complex to implement
❌ Needs good observability to detect issues
4. A/B Testing (Traffic Splitting)
Similar to canary but the split is for feature comparison, not just safety. Route specific user segments to different versions and measure business metrics (conversion rate, engagement).
Which to use? Start with rolling for simplicity. Graduate to blue/green for critical services. Use canary when you have mature observability and need the safest possible rollouts.
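The weighted split behind a canary phase can be sketched as a random draw per request. The percentages match the canary phases above; the version labels are hypothetical:

```python
import random

def route(canary_weight: float) -> str:
    """Send roughly canary_weight of requests to v2, the rest to v1."""
    return "v2" if random.random() < canary_weight else "v1"

# Phase 1 of the canary: about 5% of requests should hit v2.
random.seed(42)  # deterministic draw for the demo
sample = [route(0.05) for _ in range(10_000)]
v2_share = sample.count("v2") / len(sample)
print(f"{v2_share:.1%}")  # close to 5%
```

In production the split is enforced by the load balancer or service mesh (e.g., ALB weighted target groups), not per-request code, but the statistics are the same: the canary's error rate is measured on a small, controlled slice of real traffic before the weight is increased.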
AWS Implementation
| Strategy | AWS Service | How |
|---|---|---|
| Rolling | CodeDeploy (OneAtATime) | Update instances sequentially |
| Blue/Green | CodeDeploy + ASG | Create new ASG, swap ALB target |
| Canary | CodeDeploy (Canary10Percent5Min) | Route 10% first, full after 5 min |
| Canary (K8s) | Flagger + Istio on EKS | Progressive delivery with metrics |
| Blue/Green (Lambda) | Lambda Aliases + Weights | Traffic shift between versions |
Disaster Recovery (DR)
DR is your plan for recovering from catastrophic failures: region outages, data corruption, or complete infrastructure loss. Two critical metrics define DR:
RTO (Recovery Time Objective)
How fast you need to recover. "We must be back online within 1 hour."
RPO (Recovery Point Objective)
How much data loss is acceptable. "We can lose at most 15 minutes of data."
DR Strategies (by cost & speed)
| Strategy | RTO | RPO | Cost | Description |
|---|---|---|---|---|
| Backup & Restore | Hours | Hours | 💰 | Regular backups to S3. Restore when needed. |
| Pilot Light | 10-30 min | Minutes | 💰💰 | Core infra running but idle. Scale up on disaster. |
| Warm Standby | Minutes | Seconds | 💰💰💰 | Scaled-down copy running in DR region. Scale up on failover. |
| Multi-Site Active-Active | Near-zero | Near-zero | 💰💰💰💰 | Full copies in both regions. Route 53 health checks failover. |
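Choosing a strategy amounts to matching your RTO/RPO targets against the table above and taking the cheapest row that satisfies both. A sketch, with illustrative worst-case numbers (in seconds) standing in for the table's "Hours"/"Minutes" ranges:

```python
def choose_dr_strategy(rto_seconds: float, rpo_seconds: float) -> str:
    """Pick the cheapest DR strategy that meets both targets.
    The worst-case numbers are illustrative approximations, not AWS guarantees."""
    strategies = [
        # (name, worst-case RTO, worst-case RPO) -- cheapest first
        ("Backup & Restore",         24 * 3600, 24 * 3600),
        ("Pilot Light",              30 * 60,   10 * 60),
        ("Warm Standby",             10 * 60,   60),
        ("Multi-Site Active-Active", 60,        5),
    ]
    for name, worst_rto, worst_rpo in strategies:
        if worst_rto <= rto_seconds and worst_rpo <= rpo_seconds:
            return name
    raise ValueError("No strategy meets these targets")

# "Back online within 1 hour, lose at most 15 minutes of data"
print(choose_dr_strategy(3600, 900))  # Pilot Light
```

This is why the advice below says to define RTO/RPO first: the targets pick the strategy (and the cost), not the other way around.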
Example: Warm Standby
Primary Region (us-east-1):          DR Region (us-west-2):
├─ ALB → ASG (4 instances)           ├─ ALB → ASG (1 instance)
├─ RDS Primary                       ├─ RDS Read Replica
├─ ElastiCache                       └─ Route 53 health check
└─ S3 (cross-region replication)
On failure:
1. Route 53 detects the primary is unhealthy
2. DNS fails over to us-west-2
3. ASG scales from 1 → 4 instances
4. RDS replica promoted to primary
5. ~5 minute recovery time
Test your DR plan! An untested DR plan is not a plan; it's a hope. Run DR drills quarterly. AWS Fault Injection Simulator (FIS) can simulate AZ and region failures.
Observability
Observability answers the question: "What is my system doing, and why?" It's built on three pillars:
The Three Pillars
Metrics
Numeric measurements over time. CPU at 72%, response time p99 = 230ms, 5xx rate = 0.3%. Used for alerting and dashboards.
Logs
Timestamped event records. "User 123 failed login at 14:32:01." Used for debugging specific issues.
Traces
Follow a request across services. Request → API → Auth Service → DB → Cache. Used for finding bottlenecks in distributed systems.
AWS Observability Stack
| Pillar | AWS Service | Alternative |
|---|---|---|
| Metrics | CloudWatch Metrics | Datadog, Prometheus + Grafana |
| Logs | CloudWatch Logs | ELK Stack, Datadog Logs |
| Traces | X-Ray | Jaeger, Datadog APM |
| Dashboards | CloudWatch Dashboards | Grafana |
| Alerting | CloudWatch Alarms → SNS | PagerDuty, OpsGenie |
The Golden Signals
Google's SRE book defines four signals every service should monitor:
- Latency: How long requests take (track p50, p95, p99)
- Traffic: How many requests per second (demand)
- Errors: Rate of failed requests (5xx, timeouts)
- Saturation: How "full" your system is (CPU, memory, disk, connections)
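The latency percentiles mentioned above (p50, p95, p99) can be computed from raw samples with nothing but the standard library. This uses the simple nearest-rank method; the sample latencies are made up:

```python
def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the value at or below which p% of samples fall."""
    ordered = sorted(samples)
    rank = max(1, round(p / 100 * len(ordered)))  # 1-based nearest rank
    return ordered[rank - 1]

# Hypothetical response times in ms: mostly fast, with a slow tail.
latencies = [12, 15, 14, 13, 200, 16, 12, 18, 14, 950]
print(percentile(latencies, 50))  # 14  -- the typical request
print(percentile(latencies, 99))  # 950 -- dominated by the slow tail
```

This is why averages are misleading for latency: the mean of this sample is ~126ms, a number no actual request experienced. Tracking p50 alongside p99 shows both the typical case and the tail your unhappiest users see.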
Alert on symptoms, not causes. Alert on "5xx rate > 5%" (symptom), not "CPU > 80%" (cause). High CPU that doesn't affect users isn't an emergency. High error rates always are.
Immutable Infrastructure
Instead of updating servers in place (mutable), you replace them entirely with new versions. Think of servers like cattle, not pets.
Pets (Mutable)
Hand-configured servers you care for individually. SSH in, install packages, tweak configs. Unique snowflakes.
Cattle (Immutable)
Identical, disposable instances from a template. If one fails, terminate it and launch a new one. No SSH needed.
Benefits of Immutable Infrastructure
- No configuration drift: Every instance is identical, built from the same AMI/image
- Reliable rollbacks: Deploy old version = launch old AMI
- Easier debugging: "It works in staging" actually means something when staging = production image
- Security: No SSH access needed. Smaller attack surface
The Immutable Pipeline
Code Change → Build → Create AMI/Image → Deploy New Instances → Terminate Old
Developer pushes code
    │
    ▼
CodeBuild: npm install, test, build
    │
    ▼
Packer: Bake AMI with app + dependencies
    │
    ▼
Terraform/CloudFormation: Update Launch Template with new AMI
    │
    ▼
ASG: Rolling replacement of old instances with new ones
More Key Concepts
Infrastructure as Code (IaC)
Manage infrastructure through code instead of manual console clicks. Version-controlled, reviewable, repeatable, and testable. CloudFormation and Terraform are the two main tools (covered in Module 5).
GitOps
Git is the single source of truth for infrastructure and application state. Changes are made via Pull Requests. A reconciliation tool (ArgoCD, Flux) ensures the live state matches what's in git.
Developer → PR to git → Approved → Merged
    ↓
ArgoCD watches git → Detects change → Applies to K8s cluster
    ↓
Cluster state = Git state (always)
12-Factor App Principles (for Cloud-Native)
| # | Factor | In Practice |
|---|---|---|
| 1 | Codebase | One repo per service, tracked in git |
| 2 | Dependencies | Explicitly declared (package.json, requirements.txt) |
| 3 | Config | Store in environment variables, not code |
| 4 | Backing services | Treat databases, queues as attached resources |
| 5 | Build, release, run | Separate build/release/run stages (CI/CD) |
| 6 | Processes | Stateless: store state in Redis/DB, not memory |
| 7 | Port binding | Export services via port (e.g., Express on :3000) |
| 8 | Concurrency | Scale via processes (horizontal scaling) |
| 9 | Disposability | Fast startup, graceful shutdown |
| 10 | Dev/prod parity | Keep environments as similar as possible |
| 11 | Logs | Treat as event streams (stdout → CloudWatch) |
| 12 | Admin processes | Run admin tasks (migrations) as one-off processes |
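Factor 3 (config in environment variables) in practice: the same build artifact runs everywhere, and only the environment differs. The variable names and defaults here are hypothetical:

```python
import os

# Factor 3: read config from the environment, never hard-code it.
# APP_DATABASE_URL and APP_PORT are hypothetical names for this sketch.
DATABASE_URL = os.environ.get("APP_DATABASE_URL", "postgres://localhost/dev")
PORT = int(os.environ.get("APP_PORT", "3000"))

# The same image runs in every environment; only env vars change:
#   staging:    APP_DATABASE_URL=postgres://staging-db/app
#   production: APP_DATABASE_URL=postgres://prod-db/app
print(f"Listening on :{PORT}, db={DATABASE_URL}")
```

Because credentials and endpoints live outside the codebase, the build/release/run separation of Factor 5 follows naturally: one build, many releases, each release being build + environment config.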
Idempotency
An operation is idempotent if running it multiple times produces the same result as running it once. This is critical for:
- IaC: terraform apply can be run repeatedly without side effects
- APIs: Retrying a PUT request doesn't create duplicate records
- Deployments: Re-running a deployment script doesn't break the existing setup
- Ansible playbooks: Run 10 times, same result every time
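The difference shows up clearly in a toy example: a PUT keyed by ID versus an append that mints a new key on every call. The record shapes are made up:

```python
records: dict[str, dict] = {}

def put_user(user_id: str, name: str) -> None:
    """Idempotent: retrying the same PUT leaves exactly one record."""
    records[user_id] = {"name": name}

def append_user(name: str) -> None:
    """NOT idempotent: every retry creates a duplicate with a fresh key."""
    records[f"user-{len(records)}"] = {"name": name}

put_user("u1", "Ada")
put_user("u1", "Ada")  # a retried request changes nothing
assert len(records) == 1
```

This is why retry logic (and tools like Terraform and Ansible) leans on idempotent operations: a timeout leaves you unsure whether the request landed, and with an idempotent operation the safe answer is always "send it again".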
Principle of Least Privilege
Grant only the minimum permissions needed for a task. Every IAM role, security group, and NACL should follow this principle. If a service only reads from S3, don't give it write access.
Exercise: Map Concepts to the Sandbox Architecture
Look at the AWS Sandbox architecture and identify where each concept applies:
Architecture:
Route 53 → CloudFront → ALB → ASG (2-4 EC2s) → RDS Multi-AZ
✅ High Availability: ALB across 2 AZs + RDS Multi-AZ failover
✅ Load Balancing: ALB distributes to EC2 Target Group
✅ Autoscaling: ASG Target Tracking (CPU 70%)
✅ Health Checks: ALB → /health endpoint every 30s
✅ Immutable Infra: Launch Template + AMI, new instances replace old
✅ Deployment Strategy: CodeDeploy OneAtATime (rolling)
✅ IaC: CloudFormation 7 stacks / Terraform modules
✅ Observability: CloudWatch metrics + ALB access logs
✅ Least Privilege: SG chain (ALB → EC2 → RDS)
✅ Statelessness: App stores nothing in memory, all in RDS
Key Takeaways
- HA = redundancy + health checks + auto-failover across multiple AZs
- Autoscaling = use Target Tracking for simplicity, always include scale-in
- Deployment strategies = rolling for simplicity, blue/green for safety, canary for precision
- DR = define RTO/RPO first, then choose strategy. Test your DR plan!
- Observability = metrics + logs + traces. Alert on symptoms, not causes
- Immutable infra = cattle not pets. Replace, don't repair
- These concepts are cloud-agnostic: they apply to AWS, Azure, GCP, and on-prem