Module 10

DevOps & Cloud Concepts

Essential DevOps and cloud computing concepts every engineer should know. HA, scaling, deployment strategies, disaster recovery, and more.

High Availability · Autoscaling · Blue/Green · DR · Observability

Why This Module?

Before diving into specific AWS services, it helps to understand the foundational concepts that underpin every cloud architecture. These ideas are cloud-agnostic — they apply equally to AWS, Azure, and GCP.

๐Ÿ—๏ธ High Availability

Design systems that keep running even when components fail.

โš–๏ธ Load Balancing

Distribute traffic to prevent overload and improve response times.

๐Ÿ“ˆ Autoscaling

Automatically adjust capacity to match demand โ€” up and down.

๐Ÿš€ Deployment Strategies

Ship changes safely with blue/green, canary, and rolling deployments.

๐Ÿ›ก๏ธ Disaster Recovery

Recover from catastrophic failures with defined RPO and RTO targets.

๐Ÿ‘๏ธ Observability

Understand what's happening inside your systems with metrics, logs, and traces.


High Availability (HA)

High availability means your system continues operating when components fail. It's measured as a percentage of uptime:

| Availability | Downtime / Year | Downtime / Month | Typical Use |
|---|---|---|---|
| 99% ("two nines") | 3.65 days | 7.3 hours | Internal tools |
| 99.9% ("three nines") | 8.76 hours | 43 min | Typical SaaS |
| 99.99% ("four nines") | 52.6 min | 4.3 min | E-commerce, banking |
| 99.999% ("five nines") | 5.26 min | 26 sec | Mission-critical |

Key HA Patterns

  • Redundancy: No single point of failure. Run 2+ instances, across 2+ Availability Zones
  • Health checks: Continuously monitor components and automatically replace unhealthy ones
  • Failover: Automatic promotion of standby systems (e.g., RDS Multi-AZ)
  • Statelessness: Store session data externally (Redis/DynamoDB) so any instance can serve any request
📘 Key Concept

The HA equation: To get 99.99% availability from components that are individually 99.9% available, you need redundancy. Two independent components in an active-active setup: 1 - (0.001 × 0.001) = 99.9999%.
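
The arithmetic generalizes: dependencies chained in series multiply their availabilities down, while redundant copies in parallel multiply their *failure* probabilities down. A minimal sketch (helper names are mine, not an AWS API):

```python
# Composite availability of components in series vs. in parallel.
# Availabilities are fractions: 0.999 = "three nines".

def series(*components):
    """All components must be up: multiply availabilities."""
    result = 1.0
    for a in components:
        result *= a
    return result

def parallel(*components):
    """Any one copy being up is enough: 1 - product of failure probabilities."""
    failure = 1.0
    for a in components:
        failure *= (1.0 - a)
    return 1.0 - failure

# Two 99.9% components in series (e.g., LB -> app) drop BELOW three nines:
assert round(series(0.999, 0.999), 6) == 0.998001

# The same two components active-active (parallel) give six nines:
assert round(parallel(0.999, 0.999), 6) == 0.999999
```

This is why adding tiers to an architecture hurts availability unless each tier is itself redundant.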

HA on AWS

text
Single instance (no HA):
  User → EC2 → RDS
  Availability: ~99.5%

Basic HA (Multi-AZ):
  User → ALB → [EC2 AZ-a, EC2 AZ-b] → RDS Multi-AZ
  Availability: ~99.99%

Full HA (Multi-Region):
  User → Route 53 (failover) → [Region A: ALB → ASG → Aurora]
                               [Region B: ALB → ASG → Aurora Replica]
  Availability: ~99.999%

Load Balancing

A load balancer distributes incoming requests across multiple targets to prevent any single server from becoming a bottleneck.

Load Balancing Algorithms

🔄 Round Robin

Sends requests to each server in sequence. Simple but doesn't account for server load. Default for many LBs.

⚡ Least Connections

Routes to the server with fewest active connections. Better for uneven request durations.

📌 IP Hash

Same client IP always goes to the same server. Useful for session affinity (sticky sessions).

⚖️ Weighted

Assign weights to servers. Useful during canary deployments (send 5% to new version).
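
The selection logic behind these algorithms fits in a few lines each. A sketch with made-up server names and connection counts:

```python
import itertools
import random

servers = ["app-1", "app-2", "app-3"]

# Round robin: cycle through servers in order, ignoring their load.
rr = itertools.cycle(servers)
def pick_round_robin():
    return next(rr)

# Least connections: route to whichever server has the fewest in-flight requests.
active = {"app-1": 12, "app-2": 3, "app-3": 7}
def pick_least_connections():
    return min(active, key=active.get)

# Weighted: a canary receives 5% of traffic, the stable version 95%.
def pick_weighted():
    return random.choices(["v1", "v2-canary"], weights=[95, 5])[0]
```

IP hash would be the same idea with `hash(client_ip) % len(servers)` replacing the cycle.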

AWS Load Balancer Types

| Type | Layer | Best For | Key Feature |
|---|---|---|---|
| ALB | Layer 7 (HTTP) | Web apps, microservices | Path/host-based routing |
| NLB | Layer 4 (TCP/UDP) | Gaming, IoT, ultra-low latency | Millions of req/s, static IP |
| GWLB | Layer 3 (IP) | Firewalls, IDS/IPS | Transparent network appliances |
💡 Tip

Rule of thumb: If your app speaks HTTP → ALB. If you need raw TCP/UDP performance → NLB. You'll use ALB 90% of the time.

Health Checks

Load balancers rely on health checks to know which targets are alive. A health check sends a request (e.g., GET /health) at regular intervals. If a target fails consecutive checks, it's removed from rotation.

text
Health Check Config:
  Path:              /health
  Interval:          30 seconds
  Healthy threshold: 2 consecutive successes → mark healthy
  Unhealthy threshold: 3 consecutive failures → mark unhealthy
  Timeout:           5 seconds per check
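
The threshold logic above amounts to a tiny state machine: a target flips state only after a run of consecutive results in the opposite direction. A sketch (class and method names are mine, not an AWS API):

```python
class TargetHealth:
    """Track one target's health using consecutive-result thresholds,
    mirroring the config above: 2 successes -> healthy, 3 failures -> unhealthy."""

    def __init__(self, healthy_threshold=2, unhealthy_threshold=3):
        self.healthy_threshold = healthy_threshold
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy = True
        self.streak = 0  # consecutive results contradicting the current state

    def record(self, success: bool):
        if success == self.healthy:
            self.streak = 0  # result agrees with current state; reset the run
            return
        self.streak += 1
        needed = self.healthy_threshold if not self.healthy else self.unhealthy_threshold
        if self.streak >= needed:
            self.healthy = not self.healthy
            self.streak = 0

t = TargetHealth()
t.record(False); t.record(False)   # two failures: still in rotation
assert t.healthy
t.record(False)                    # third consecutive failure: removed
assert not t.healthy
t.record(True); t.record(True)     # two successes: back in rotation
assert t.healthy
```

The asymmetric thresholds are deliberate: be slow to eject a target (avoid flapping on one bad check) but reasonably quick to readmit it.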

Autoscaling

Autoscaling automatically adjusts the number of compute instances based on demand. It scales out (add instances) under load and scales in (remove instances) when demand drops.

Scaling Types

| Type | How It Works | Example |
|---|---|---|
| Horizontal (scale out/in) | Add/remove instances | ASG: 2 → 6 EC2s during peak |
| Vertical (scale up/down) | Resize instance | t3.micro → t3.large (requires restart) |
| Scheduled | Scale at known times | Scale up at 8am, down at 8pm |
| Predictive | ML-based forecasting | AWS learns your traffic patterns |

Scaling Policies

text
Target Tracking (recommended):
  "Keep average CPU at 60%"
  → ASG automatically adds/removes instances to maintain target
  → Simple, self-managing

Step Scaling:
  CPU > 60% → add 1 instance
  CPU > 80% → add 3 instances
  CPU < 30% → remove 1 instance
  → More control, more config

Simple Scaling:
  CPU > 70% → add 1 instance, then wait 300s cooldown
  → Basic, legacy — prefer Target Tracking
📘 Key Concept

Cooldown period: After a scaling action, the ASG waits (default 300s) before acting again. This prevents "thrashing" — rapidly scaling up and down. Target Tracking handles this automatically.
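
The core idea of target tracking is to size the fleet in proportion to how far the metric sits from its target. A simplified sketch of that proportional calculation (the real AWS algorithm is more involved; the function name and bounds here are illustrative):

```python
import math

def desired_capacity(current, metric_value, target, min_size=2, max_size=10):
    """Proportional sizing: if CPU is 1.5x the target, run ~1.5x the instances,
    rounded up and clamped to the ASG's min/max."""
    desired = math.ceil(current * metric_value / target)
    return max(min_size, min(max_size, desired))

# 4 instances averaging 90% CPU against a 60% target -> scale out to 6:
assert desired_capacity(4, 90, 60) == 6

# 6 instances at 20% CPU -> scale in, but never below min_size:
assert desired_capacity(6, 20, 60) == 2
```

Rounding up biases toward spare capacity, and the min/max clamp is the safety net that keeps a bad metric from emptying or exploding the fleet.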

Scaling on AWS โ€” The Key Services

  • EC2 Auto Scaling Groups: Scale EC2 instances horizontally
  • ECS Service Auto Scaling: Scale container tasks based on CPU/memory or custom metrics
  • K8s HPA: Scale pods on EKS based on metrics
  • Lambda: Scales automatically with no config (up to account concurrency limit)
  • DynamoDB: Auto-scales read/write capacity units
  • Aurora: Auto-scales read replicas and storage

Deployment Strategies

How you ship changes to production determines your risk, speed, and rollback capabilities. Here are the main strategies:

1. Rolling Deployment

text
Time →
Instance 1:  [v1] [v1] [v2] [v2] [v2]
Instance 2:  [v1] [v1] [v1] [v2] [v2]
Instance 3:  [v1] [v1] [v1] [v1] [v2]

✅ Low cost (no extra instances)
✅ Gradual rollout
❌ During update, mix of v1 and v2 serving traffic
❌ Slow rollback (must re-deploy v1)

2. Blue/Green Deployment

text
     ┌── Blue (v1)  ← LIVE
LB ──┤
     └── Green (v2) ← Idle, testing

After validation:

     ┌── Blue (v1)  ← Idle (keep for rollback)
LB ──┤
     └── Green (v2) ← LIVE ✓

✅ Instant rollback (switch back to blue)
✅ Zero-downtime
✅ Full testing before go-live
❌ Costs 2x resources during deployment

3. Canary Deployment

text
Time →
Phase 1:  5% traffic → v2, 95% → v1     (test with small %)
Phase 2:  25% traffic → v2, 75% → v1    (if metrics OK, increase)
Phase 3:  100% traffic → v2             (full rollout)

✅ Lowest risk — errors affect few users
✅ Data-driven decision (watch error rates)
❌ Most complex to implement
❌ Need good observability to detect issues

4. A/B Testing (Traffic Splitting)

Similar to canary but the split is for feature comparison, not just safety. Route specific user segments to different versions and measure business metrics (conversion rate, engagement).

💡 Tip

Which to use? Start with rolling for simplicity. Graduate to blue/green for critical services. Use canary when you have mature observability and need the safest possible rollouts.

AWS Implementation

| Strategy | AWS Service | How |
|---|---|---|
| Rolling | CodeDeploy (OneAtATime) | Update instances sequentially |
| Blue/Green | CodeDeploy + ASG | Create new ASG, swap ALB target |
| Canary | CodeDeploy (Canary10Percent5Min) | Route 10% first, full after 5 min |
| Canary (K8s) | Flagger + Istio on EKS | Progressive delivery with metrics |
| Blue/Green (Lambda) | Lambda Aliases + Weights | Traffic shift between versions |

Disaster Recovery (DR)

DR is your plan for recovering from catastrophic failures — region outages, data corruption, or complete infrastructure loss. Two critical metrics define DR:

โฑ๏ธ RTO (Recovery Time Objective)

How fast you need to recover. "We must be back online within 1 hour."

๐Ÿ’พ RPO (Recovery Point Objective)

How much data loss is acceptable. "We can lose at most 15 minutes of data."

DR Strategies (by cost & speed)

| Strategy | RTO | RPO | Cost | Description |
|---|---|---|---|---|
| Backup & Restore | Hours | Hours | 💰 | Regular backups to S3. Restore when needed. |
| Pilot Light | 10-30 min | Minutes | 💰💰 | Core infra running but idle. Scale up on disaster. |
| Warm Standby | Minutes | Seconds | 💰💰💰 | Scaled-down copy running in DR region. Scale up on failover. |
| Multi-Site Active-Active | Near-zero | Near-zero | 💰💰💰💰 | Full copies in both regions. Route 53 health checks failover. |
text
Example: Warm Standby

Primary Region (us-east-1):         DR Region (us-west-2):
┌─ ALB → ASG (4 instances)          ┌─ ALB → ASG (1 instance)
├─ RDS Primary                      ├─ RDS Read Replica
├─ ElastiCache                      └─ Route 53 health check
└─ S3 (cross-region replication)

On failure:
1. Route 53 detects primary unhealthy
2. DNS fails over to us-west-2
3. ASG scales from 1 → 4 instances
4. RDS replica promoted to primary
5. ~5 minute recovery time
โš ๏ธ Warning

Test your DR plan! An untested DR plan is not a plan โ€” it's a hope. Run DR drills quarterly. AWS Fault Injection Simulator (FIS) can simulate AZ and region failures.


Observability

Observability answers: "What is my system doing and why?" It's built on three pillars:

The Three Pillars

📊 Metrics

Numeric measurements over time. CPU at 72%, response time p99 = 230ms, 5xx rate = 0.3%. Used for alerting and dashboards.

📝 Logs

Timestamped event records. "User 123 failed login at 14:32:01." Used for debugging specific issues.

🔗 Traces

Follow a request across services. Request → API → Auth Service → DB → Cache. Used for finding bottlenecks in distributed systems.

AWS Observability Stack

| Pillar | AWS Service | Alternative |
|---|---|---|
| Metrics | CloudWatch Metrics | Datadog, Prometheus + Grafana |
| Logs | CloudWatch Logs | ELK Stack, Datadog Logs |
| Traces | X-Ray | Jaeger, Datadog APM |
| Dashboards | CloudWatch Dashboards | Grafana |
| Alerting | CloudWatch Alarms → SNS | PagerDuty, OpsGenie |

The Golden Signals

Google's SRE book defines four signals every service should monitor:

  • Latency: How long requests take (track p50, p95, p99)
  • Traffic: How many requests per second (demand)
  • Errors: Rate of failed requests (5xx, timeouts)
  • Saturation: How "full" your system is (CPU, memory, disk, connections)
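
Percentile latencies like p95 and p99 matter because averages hide slow outliers. One common way to compute them is the nearest-rank method; a minimal sketch (the sample values are made up):

```python
def percentile(samples, p):
    """Nearest-rank percentile: the smallest value covering p% of samples."""
    ordered = sorted(samples)
    rank = max(1, round(p / 100 * len(ordered)))
    return ordered[rank - 1]

latencies_ms = [12, 15, 14, 18, 220, 16, 13, 17, 15, 950]  # one slow outlier

assert percentile(latencies_ms, 50) == 15   # median: typical user experience
assert percentile(latencies_ms, 99) == 950  # tail: your unluckiest users
```

Here the average (~129ms) describes nobody: most requests finish in under 20ms while the p99 user waits nearly a second, which is exactly why the Latency signal tracks p50/p95/p99 rather than the mean.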
📘 Key Concept

Alert on symptoms, not causes. Alert on "5xx rate > 5%" (symptom), not "CPU > 80%" (cause). High CPU that doesn't affect users isn't an emergency. High error rates always are.
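
A symptom-based alert condition is just a rate check on user-visible failures. A minimal sketch (the 5% threshold is illustrative):

```python
def should_alert(total_requests, errors_5xx, threshold=0.05):
    """Page only when the user-facing error rate exceeds the threshold."""
    if total_requests == 0:
        return False  # no traffic, nothing user-visible to alert on
    return errors_5xx / total_requests > threshold

# CPU might be at 95%, but with 0.1% errors users are fine -> no page:
assert not should_alert(total_requests=10_000, errors_5xx=10)

# 8% of requests failing -> page someone:
assert should_alert(total_requests=10_000, errors_5xx=800)
```

Note the zero-traffic guard: a naive rate calculation divides by zero exactly when the service is quietest.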


Immutable Infrastructure

Instead of updating servers in place (mutable), you replace them entirely with new versions. Think of servers like cattle, not pets.

๐Ÿ• Pets (Mutable)

Hand-configured servers you care for individually. SSH in, install packages, tweak configs. Unique snowflakes.

๐Ÿ„ Cattle (Immutable)

Identical, disposable instances from a template. If one fails, terminate and launch a new one. No SSH needed.

Benefits of Immutable Infrastructure

  • No configuration drift: Every instance is identical — built from the same AMI/image
  • Reliable rollbacks: Deploy old version = launch old AMI
  • Easier debugging: "It works in staging" actually means something when staging = production image
  • Security: No SSH access needed. Smaller attack surface

The Immutable Pipeline

text
Code Change → Build → Create AMI/Image → Deploy New Instances → Terminate Old

Developer pushes code
     │
     ▼
CodeBuild: npm install, test, build
     │
     ▼
Packer: Bake AMI with app + dependencies
     │
     ▼
Terraform/CloudFormation: Update Launch Template with new AMI
     │
     ▼
ASG: Rolling replacement of old instances with new ones

More Key Concepts

Infrastructure as Code (IaC)

Manage infrastructure through code instead of manual console clicks. Version-controlled, reviewable, repeatable, and testable. CloudFormation and Terraform are the two main tools (covered in Module 5).

GitOps

Git is the single source of truth for infrastructure and application state. Changes are made via Pull Requests. A reconciliation tool (ArgoCD, Flux) ensures the live state matches what's in git.

text
Developer → PR to git → Approved → Merged
                                       ↓
ArgoCD watches git → Detects change → Applies to K8s cluster
                                       ↓
Cluster state = Git state (always)

12-Factor App Principles (for Cloud-Native)

| # | Factor | In Practice |
|---|---|---|
| 1 | Codebase | One repo per service, tracked in git |
| 2 | Dependencies | Explicitly declared (package.json, requirements.txt) |
| 3 | Config | Store in environment variables, not code |
| 4 | Backing services | Treat databases, queues as attached resources |
| 5 | Build, release, run | Separate build/release/run stages (CI/CD) |
| 6 | Processes | Stateless — store state in Redis/DB, not memory |
| 7 | Port binding | Export services via port (e.g., Express on :3000) |
| 8 | Concurrency | Scale via processes (horizontal scaling) |
| 9 | Disposability | Fast startup, graceful shutdown |
| 10 | Dev/prod parity | Keep environments as similar as possible |
| 11 | Logs | Treat as event streams (stdout → CloudWatch) |
| 12 | Admin processes | Run admin tasks (migrations) as one-off processes |
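
Factor 3 (Config) is the one teams most often get wrong, so here is what it looks like in practice — settings read from the environment with sane defaults, and nothing environment-specific in the code (variable names and defaults below are illustrative):

```python
import os

# All environment-specific config comes from env vars, never from code.
DATABASE_URL = os.environ.get("DATABASE_URL", "postgresql://localhost/dev")
PORT = int(os.environ.get("PORT", "3000"))  # factor 7: export via port binding
DEBUG = os.environ.get("DEBUG", "false").lower() == "true"
```

The same image then runs unmodified in dev, staging, and prod (factor 10); only the environment differs.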

Idempotency

An operation is idempotent if running it multiple times produces the same result as running it once. This is critical for:

  • IaC: terraform apply can be run repeatedly without side effects
  • APIs: Retrying a PUT request doesn't create duplicate records
  • Deployments: Re-running a deployment script doesn't break the existing setup
  • Ansible playbooks: Run 10 times, same result every time
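
The property is easy to demonstrate: a "set this record to this value" operation converges to the same state no matter how many times it runs, which is exactly what makes retries and re-runs safe. A minimal sketch (function and data names are mine):

```python
records = {}

def put_user(user_id, name):
    """Idempotent, PUT-style: keyed assignment, not appending.
    Running it N times leaves the same state as running it once."""
    records[user_id] = {"id": user_id, "name": name}

# Re-running (a retry, a re-deploy, a replayed message) changes nothing:
for _ in range(10):
    put_user(123, "alice")

assert records == {123: {"id": 123, "name": "alice"}}  # one record, not ten
```

The non-idempotent version would be `records_list.append(...)`: ten runs, ten duplicate rows, and every retry makes things worse.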

Principle of Least Privilege

Grant only the minimum permissions needed for a task. Every IAM role, security group, and NACL should follow this principle. If a service only reads from S3, don't give it write access.


🧪

Exercise: Map Concepts to the Sandbox Architecture

Look at the AWS Sandbox architecture and identify where each concept applies:

text
Architecture:
  Route 53 → CloudFront → ALB → ASG (2-4 EC2s) → RDS Multi-AZ

✓ High Availability:     ALB across 2 AZs + RDS Multi-AZ failover
✓ Load Balancing:        ALB distributes to EC2 Target Group
✓ Autoscaling:           ASG Target Tracking (CPU 70%)
✓ Health Checks:         ALB → /health endpoint every 30s
✓ Immutable Infra:       Launch Template + AMI, new instances replace old
✓ Deployment Strategy:   CodeDeploy OneAtATime (rolling)
✓ IaC:                   CloudFormation 7 stacks / Terraform modules
✓ Observability:         CloudWatch metrics + ALB access logs
✓ Least Privilege:       SG chain (ALB → EC2 → RDS)
✓ Statelessness:         App stores nothing in memory, all in RDS

Key Takeaways

  • HA = redundancy + health checks + auto-failover across multiple AZs
  • Autoscaling = use Target Tracking for simplicity, always include scale-in
  • Deployment strategies = rolling for simplicity, blue/green for safety, canary for precision
  • DR = define RTO/RPO first, then choose strategy. Test your DR plan!
  • Observability = metrics + logs + traces. Alert on symptoms, not causes
  • Immutable infra = cattle not pets. Replace, don't repair
  • These concepts are cloud-agnostic — they apply to AWS, Azure, GCP, and on-prem