Module 9

AWS Interview Prep

30+ curated interview questions for mid and senior DevOps engineers. Model answers with explanations of what interviewers look for.

Mid-Level · Senior-Level · Scenario-Based · Architecture

How to Use This Module

30+ real interview questions organised by level. Each question includes a model answer and an explanation of what the interviewer is actually evaluating. Practice by covering the answers and answering out loud first.

🟡 Mid-Level

Solid fundamentals, practical experience, can implement and troubleshoot confidently.

🔴 Senior-Level

Architecture decisions, trade-offs, scaling, mentoring, and cross-team influence.


Networking & VPC

🟡

What is the difference between a public and private subnet?

A public subnet has a route to an Internet Gateway (IGW) in its route table, and instances in it can have public IPs. A private subnet routes outbound traffic through a NAT Gateway (for updates, API calls, etc.) but has no inbound path from the internet.

The key distinction is the route table, not the subnet itself. Any subnet becomes "public" once you add a 0.0.0.0/0 → IGW route.
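A minimal sketch of that route-table change with the AWS CLI (the route table and gateway IDs are placeholders):

```shell
# Adding a default route to an Internet Gateway makes every subnet
# associated with this route table "public".
aws ec2 create-route \
  --route-table-id rtb-0123456789abcdef0 \
  --destination-cidr-block 0.0.0.0/0 \
  --gateway-id igw-0123456789abcdef0
```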

🎯 What the interviewer is looking for

Understanding that 'public' vs 'private' isn't a subnet property but a route table configuration. Bonus: mentioning that EC2s in private subnets should still get outbound access via NAT GW.

🟡

How do Security Groups differ from NACLs?

| Feature | Security Group | NACL |
|---|---|---|
| Level | Instance (ENI) | Subnet |
| State | Stateful | Stateless |
| Rules | Allow only | Allow + Deny |
| Evaluation | All rules evaluated | Rules evaluated in order |
| Default | Deny all inbound | Allow all |

In practice, Security Groups are your primary control. NACLs are a secondary defense layer, typically left at defaults unless you need explicit deny rules (e.g., blocking a specific IP range).

🎯 What the interviewer is looking for

Stateful vs stateless distinction, and knowing that SGs are the primary tool while NACLs are defense-in-depth. Seniors should mention that NACLs can block IP ranges that SGs can't deny.

🔴

Design a VPC architecture for a multi-tier application that needs to be highly available and secure.

I'd design a 3-tier architecture across at least 2 AZs:

```text
VPC: 10.0.0.0/16

Public Subnets (2 AZs):
  ├── ALB
  ├── NAT Gateways (one per AZ for HA)
  └── Bastion host (if needed, prefer SSM)

Private App Subnets (2 AZs):
  ├── EC2 ASG / ECS / EKS worker nodes
  └── No direct internet access

Private Data Subnets (2 AZs):
  ├── RDS Multi-AZ
  └── ElastiCache

Security Group Chain:
  ALB SG → App SG → DB SG (reference by SG ID, not CIDR)
```

Key decisions: NAT GW per AZ avoids cross-AZ dependency. Security groups reference each other (not CIDRs) so they stay valid if IPs change. VPC Flow Logs enabled for audit.
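The SG-to-SG reference in that chain looks like this with the AWS CLI (group IDs are placeholders; sg-11111111 is the DB-tier SG, sg-22222222 the app-tier SG):

```shell
# Allow the app tier into Postgres by security-group reference,
# not by CIDR, so the rule stays valid when instance IPs change.
aws ec2 authorize-security-group-ingress \
  --group-id sg-11111111 \
  --protocol tcp \
  --port 5432 \
  --source-group sg-22222222
```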

🎯 What the interviewer is looking for

Multi-AZ design, proper subnet tiering, SG chaining by reference (not CIDR), NAT GW per AZ for HA, and awareness of cost implications (NAT GW is ~$32/mo each). Mentioning VPC Flow Logs and SSM over bastion shows maturity.


Compute & Scaling

🟡

What happens when an EC2 instance in an ASG fails a health check?

The ASG marks the instance as unhealthy and begins the replacement process: it terminates the unhealthy instance and launches a new one from the Launch Template. The new instance is registered with the Target Group and must pass ALB health checks before receiving traffic.

There are two health check types: EC2 (system-level — host unreachable) and ELB (application-level — HTTP check fails). You should use ELB health checks for production ASGs.
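Switching an ASG to ELB health checks is a one-line change; a sketch (the ASG name is an example):

```shell
# Use application-level (ELB) health checks, with a grace period so
# instances aren't replaced while they are still booting.
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name web-asg \
  --health-check-type ELB \
  --health-check-grace-period 300
```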

🎯 What the interviewer is looking for

Knowledge of the terminate-and-replace cycle, the difference between EC2 and ELB health checks, and that ELB checks are preferred because they catch application-level failures, not just VM crashes.

🔴

How would you choose between EC2, ECS, EKS, and Lambda for a new microservice?

The choice depends on several factors:

| Factor | EC2 | ECS | EKS | Lambda |
|---|---|---|---|---|
| Team K8s skill | — | — | Required | — |
| Startup time | Minutes | Seconds | Seconds | Milliseconds |
| Max duration | Unlimited | Unlimited | Unlimited | 15 min |
| Ops overhead | High | Medium | High | Minimal |
| Cost model | Per hour | Per task | Per pod | Per request |

My decision framework: Lambda for event-driven, sub-15min workloads. ECS Fargate if the team doesn't know K8s and wants containers. EKS if K8s skills exist and you need portability. EC2 only for long-running stateful workloads or GPU.

🎯 What the interviewer is looking for

A structured decision framework (not just 'it depends'). Understanding trade-offs: operational overhead, team skills, cost models, and runtime constraints. Seniors should have opinions backed by reasoning.

🟡

Explain User Data vs. AMI baking. When would you use each?

User Data runs a script on first boot — great for small config changes and injecting variables. AMI baking (e.g., with Packer) pre-installs all software into a golden image.

Use User Data for: env-specific config, pulling secrets, dynamic values. Use baked AMIs for: consistent, fast boot (no downloads), production deployments where ASG needs to scale quickly.

Ideal approach: baked AMI + minimal User Data. The AMI has all software installed; User Data only sets environment-specific config.
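A sketch of that minimal User Data, assuming a baked AMI with the app pre-installed and an SSM parameter holding the environment name (parameter and service names are examples):

```shell
#!/bin/bash
# Minimal User Data for a pre-baked AMI: fetch only the
# environment-specific config at boot, then start the service.
APP_ENV=$(aws ssm get-parameter \
  --name /myapp/prod/environment \
  --with-decryption \
  --query 'Parameter.Value' --output text)
echo "APP_ENV=${APP_ENV}" >> /etc/myapp/env
systemctl start myapp
```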

🎯 What the interviewer is looking for

Understanding the trade-off: User Data = flexible but slow (downloads on every boot), AMI = fast but requires rebuild. The 'baked AMI + minimal User Data' answer shows production thinking.


Database & Storage

🟡

What is RDS Multi-AZ and how does failover work?

Multi-AZ creates a synchronous standby replica in another Availability Zone. The standby receives every write in real-time but is not accessible for reads.

During failover (hardware failure, AZ outage, maintenance): AWS automatically promotes the standby to primary. The DNS CNAME endpoint stays the same, so your application reconnects automatically. Failover typically takes 60-120 seconds.

🎯 What the interviewer is looking for

Understanding that Multi-AZ is for HA (not read scaling — that's Read Replicas). Knowing it's synchronous replication and that the DNS endpoint stays the same during failover.

🔴

A team's RDS PostgreSQL is hitting performance limits. Walk me through your diagnosis and remediation.

Step 1: Diagnose

  • CloudWatch: CPU, FreeableMemory, ReadIOPS, WriteIOPS, DatabaseConnections
  • Performance Insights: find top SQL queries by wait time
  • pg_stat_statements for query-level metrics

Step 2: Quick wins

  • Add missing indexes (from slow query analysis)
  • Connection pooling with PgBouncer or RDS Proxy
  • Enable query caching / optimize N+1 queries

Step 3: Scale

  • Vertical: Resize instance class (e.g., db.r6g.xlarge)
  • Read Replicas for read-heavy workloads
  • ElastiCache for frequently-read data

Step 4: Architecture

  • Consider Aurora PostgreSQL (up to ~3x PostgreSQL throughput, auto-scaling storage)
  • Shard if write-bottlenecked
  • CQRS pattern: separate read and write paths
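The pg_stat_statements step in the diagnosis can be as simple as this query, assuming the extension is enabled and you have psql connectivity (column is `total_time` on PostgreSQL 12 and older):

```shell
# Top 5 queries by cumulative execution time.
psql "$DATABASE_URL" -c "
  SELECT query, calls, total_exec_time
  FROM pg_stat_statements
  ORDER BY total_exec_time DESC
  LIMIT 5;"
```
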
🎯 What the interviewer is looking for

A methodical approach: diagnose before optimizing. Using Performance Insights (not just CloudWatch). Knowing the progression from indexing → pooling → vertical scaling → read replicas → Aurora. Seniors should mention RDS Proxy and connection pooling early.


CI/CD & Deployment

🟡

What is the difference between blue/green and rolling deployments?

Blue/Green: Run two identical environments. Deploy to green (idle), test, then route traffic from blue to green. Instant rollback = switch back to blue. Costs 2x resources during deployment.

Rolling: Update instances one (or a batch) at a time within the same environment. Lower cost but slower rollback — you must re-deploy the old version.

Use blue/green for critical services where instant rollback is essential. Use rolling for cost-sensitive or stateful services. AWS CodeDeploy supports both strategies.

🎯 What the interviewer is looking for

Clear distinction between the two, understanding rollback implications, and cost trade-offs. Knowing that CodeDeploy supports both natively.

🔴

Design a CI/CD pipeline for a microservices architecture on AWS.

```text
Developer Push
     │
     ▼
CodePipeline (per service)
  ├── Source:   GitHub (CodeStar Connection)
  ├── Build:    CodeBuild
  │   ├── Unit tests
  │   ├── SAST scan (Semgrep/SonarQube)
  │   ├── Docker build → ECR
  │   └── Artifact: Helm chart / task definition
  ├── Deploy Staging:
  │   ├── ECS/EKS deployment
  │   ├── Integration tests
  │   └── Manual approval gate
  └── Deploy Production:
      ├── Blue/green or canary
      ├── CloudWatch alarms → auto-rollback
      └── Post-deploy smoke tests
```

Key design decisions:

  • One pipeline per service — independent deployment cadence
  • Immutable artifacts — same Docker image from staging to prod
  • Auto-rollback — tied to CloudWatch alarms (5xx rate, latency)
  • Infrastructure changes in a separate pipeline with Terraform
  • Secrets via AWS Secrets Manager, never in code or env vars
🎯 What the interviewer is looking for

Pipeline-per-service design, security scanning built in, immutable artifacts, environment promotion (not rebuild), auto-rollback tied to metrics, and separation of app and infra pipelines. Mentioning canary deployments shows advanced thinking.


Infrastructure as Code

🟡

What is Terraform state and why is remote state important?

Terraform state (terraform.tfstate) is a JSON file that maps your HCL config to real AWS resources. It tracks resource IDs, dependency order, and metadata so Terraform knows what exists and what needs to change.

Remote state (e.g., S3 + DynamoDB locking) is important because:

  • Multiple team members can collaborate without conflicting
  • State often contains secrets — S3 encryption protects it
  • DynamoDB locking prevents concurrent apply operations
  • CI/CD pipelines need a shared source of truth
🎯 What the interviewer is looking for

Understanding what state tracks (mapping of config to real resources), why it shouldn't be in git (secrets), and the locking mechanism (DynamoDB). Bonus: mentioning state data sensitivity.

🔴

How do you handle Terraform state drift in production?

Prevention:

  • All changes through IaC pipelines — no manual console changes
  • Use terraform plan in scheduled CI jobs to detect drift
  • AWS Config rules to detect non-compliant resources

Detection:

```bash
# Scheduled drift detection
terraform plan -detailed-exitcode
# Exit 0 = no changes, Exit 2 = drift detected

# Refresh state to match reality
terraform refresh  # (or terraform apply -refresh-only)
```
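In a scheduled CI job, the exit code needs to be mapped to an outcome; a small sketch (the function name is illustrative):

```shell
# Interpret the exit code of `terraform plan -detailed-exitcode`.
classify_plan_exit() {
  case "$1" in
    0) echo "in-sync" ;;   # no changes pending
    2) echo "drift"   ;;   # real infrastructure differs from config
    *) echo "error"   ;;   # the plan itself failed
  esac
}
```

In CI you would run the plan, capture `$?`, and page or open a ticket when the result is `drift`.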

Resolution:

  • If manual change was needed: terraform import the resource
  • If manual change was wrong: terraform apply to override
  • If state is corrupted: terraform state rm + re-import
🎯 What the interviewer is looking for

A three-part strategy: prevent (pipeline-only changes), detect (scheduled plans, AWS Config), and resolve (import, apply, state surgery). Seniors should emphasize that preventing drift is more important than detecting it.


Security & IAM

🟡

Explain the principle of least privilege in IAM. How do you apply it?

Least privilege means granting only the minimum permissions required for a task. In practice:

  • Never use * for actions or resources in production policies
  • Scope permissions to specific resources using ARNs
  • Use conditions (e.g., aws:SourceIp, aws:RequestedRegion)
  • Prefer IAM Roles over access keys (no credentials to rotate)
  • Use IAM Access Analyzer to find unused permissions
```json
{
  "Effect": "Allow",
  "Action": [
    "s3:GetObject",
    "s3:PutObject"
  ],
  "Resource": "arn:aws:s3:::my-app-bucket/uploads/*",
  "Condition": {
    "StringEquals": {
      "aws:RequestedRegion": "us-east-1"
    }
  }
}
```
🎯 What the interviewer is looking for

Specific examples, not just the definition. Mentioning resource-scoped ARNs, conditions, preferring roles over keys, and practical tooling like IAM Access Analyzer.

🔴

How do you manage secrets in an AWS microservices architecture?

A layered approach:

  • AWS Secrets Manager for database credentials, API keys — supports auto-rotation
  • SSM Parameter Store (SecureString) for configuration values — free, KMS-encrypted
  • At deployment: Inject secrets as env vars via ECS task definitions or K8s ExternalSecrets
  • At build: Never in Dockerfiles or buildspec. Use CodeBuild env var references to Secrets Manager
  • Rotation: Automated rotation for RDS credentials. Lambda-based rotation for custom secrets
  • Audit: CloudTrail logs every Secrets Manager API call

Anti-patterns I'd flag: .env files in repos, hardcoded credentials in User Data, sharing access keys across services, and not rotating secrets.
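Injection at startup can be a single CLI call; a sketch (the secret name is an example, and in ECS you would normally use the task definition's `secrets` field instead of fetching in-process):

```shell
# Fetch a database credential at container start rather than
# baking it into the image or a .env file.
DB_SECRET=$(aws secretsmanager get-secret-value \
  --secret-id myapp/prod/db \
  --query SecretString --output text)
```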

🎯 What the interviewer is looking for

A comprehensive secrets strategy, not just 'use Secrets Manager'. Should cover storage, injection, rotation, and audit. Knowing anti-patterns shows real-world experience. Mentioning KMS, CloudTrail, and auto-rotation shows depth.


Monitoring & Incident Response

🟡

What CloudWatch metrics would you alert on for a web application?

Tier the alerts by severity:

| Priority | Metric | Threshold |
|---|---|---|
| 🔴 P1 | ALB 5xx rate | > 5% for 2 min |
| 🔴 P1 | Target healthy count | < 2 instances |
| 🟡 P2 | ALB response latency (p99) | > 2s for 5 min |
| 🟡 P2 | RDS CPU | > 80% for 10 min |
| 🟡 P2 | RDS free storage | < 20% |
| 🟢 P3 | ASG CPU average | > 70% for 15 min |
| 🟢 P3 | 4xx rate | > 10% (may indicate client bugs) |

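A P1 alarm sketch for the 5xx row (a true *rate* needs CloudWatch metric math; this count-based version is a simpler approximation, and the load balancer dimension, account ID, and SNS topic are placeholders):

```shell
# Page on-call when ALB 5xx count stays elevated for 2 minutes.
aws cloudwatch put-metric-alarm \
  --alarm-name alb-5xx-p1 \
  --namespace AWS/ApplicationELB \
  --metric-name HTTPCode_ELB_5XX_Count \
  --dimensions Name=LoadBalancer,Value=app/my-alb/1234567890abcdef \
  --statistic Sum \
  --period 60 \
  --evaluation-periods 2 \
  --threshold 50 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:oncall
```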
🎯 What the interviewer is looking for

Not just listing metrics but prioritizing them. Understanding that 5xx and healthy host count are P1 (customer-facing), while CPU is informational. Knowing specific thresholds shows operational experience.

🔴

Walk me through how you'd handle a production outage — your app is returning 503s.

First 5 minutes (Triage):

  • CloudWatch dashboard: ALB 5xx spike, healthy host count
  • If 0 healthy hosts → instance-level issue. If hosts up but 503 → app level
  • Check recent deployments in CodePipeline (most common cause)

Diagnose:

  • ALB target group: which instances are unhealthy?
  • SSH/SSM to an unhealthy instance → check app logs, port 3000, disk/memory
  • If app won't start → check /health endpoint locally, review recent code changes
  • If DB timeout → check RDS metrics, security group, connection count

Mitigate:

  • If deployment caused it → rollback immediately (don't debug in prod)
  • If capacity issue → manually scale ASG, consider instance type bump
  • If database → check connection pooling, kill long-running queries

After resolution:

  • Blameless post-mortem within 48 hours
  • Document: timeline, root cause, action items
  • Improve: add missing alerts, update runbooks, add integration tests
🎯 What the interviewer is looking for

A structured incident response (not jumping to conclusions). Checking the most common cause first (recent deployment). The instinct to rollback before debugging shows production maturity. Mentioning blameless post-mortems shows leadership.


Architecture & Scenario-Based

🔴

You're tasked with migrating a monolith to microservices on AWS. How do you approach this?

Phase 1: Strangler Fig Pattern

Don't rewrite everything at once. Put an ALB/API Gateway in front and incrementally route specific endpoints to new services while the monolith handles the rest.

Phase 2: Identify Seams

  • Domain boundaries (DDD bounded contexts)
  • Start with a low-risk, well-defined service (e.g., notifications, search)
  • Each service gets its own database (avoid shared DB anti-pattern)

Phase 3: Infrastructure

  • ECS Fargate or EKS for container orchestration
  • EventBridge or SQS for async communication between services
  • Service mesh (App Mesh) for observability across services
  • CloudMap for service discovery

Phase 4: Operationalize

  • Each service has its own CI/CD pipeline
  • Centralized logging (CloudWatch Logs / OpenSearch)
  • Distributed tracing (X-Ray)
🎯 What the interviewer is looking for

Strangler fig pattern (not big-bang rewrite). Starting with low-risk services. Database-per-service principle. Understanding of async communication, service discovery, and observability. This question tests architectural thinking and migration experience.

🔴

How would you design a system that handles 10,000 requests per second with sub-100ms latency?

Work backward from the requirements:

  • CDN layer: CloudFront for static content — offload 70-80% of requests
  • Compute: ECS with service auto scaling or EKS with HPA, or Lambda for truly stateless operations
  • Caching: ElastiCache Redis in front of the database — sub-ms reads
  • Database: Aurora with read replicas, or DynamoDB for simple key-value (single-digit ms)
  • Async: Decouple non-critical paths with SQS (write to queue, process later)

Latency budget:

```text
Total:   < 100ms
  ├── Network (CloudFront edge):  ~5ms
  ├── ALB + TLS:                  ~5ms
  ├── App logic:                 ~20ms
  ├── Cache hit (Redis):          ~1ms
  └── DB (if cache miss):        ~30ms
  Margin:                        ~39ms
```

Key: cache everything possible, scale horizontally, keep services stateless, use read replicas, and measure with X-Ray to find bottlenecks.
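The budget is simple arithmetic; a quick sanity check in shell, using the numbers from the breakdown above:

```shell
# Per-hop latency estimates in ms (worst case: cache checked, then DB).
edge=5; alb=5; app=20; cache=1; db=30
total=$((edge + alb + app + cache + db))
margin=$((100 - total))
echo "spent=${total}ms margin=${margin}ms"
```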

🎯 What the interviewer is looking for

A layered approach with specific numbers. CDN to reduce load, caching strategy, horizontal scaling, and a latency budget breakdown. This question separates seniors who think in systems from those who think in components.

🟡

What's the difference between SQS and SNS? When would you use each?

SQS (Simple Queue Service) is a message queue — one producer sends, one consumer processes. Messages are pulled and deleted after processing. Use for: decoupling, work queues, buffering.

SNS (Simple Notification Service) is pub/sub — one publisher, many subscribers (Lambda, SQS, email, HTTP). Messages are pushed immediately. Use for: fan-out, notifications, event broadcasting.

Common pattern: SNS → multiple SQS queues (fan-out). An order event triggers SNS, which fans out to an email queue, an inventory queue, and an analytics queue.
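Wiring one leg of that fan-out with the AWS CLI looks roughly like this (ARNs are placeholders, and the queue additionally needs an access policy allowing the topic to send to it):

```shell
# Subscribe the inventory queue to the order-events topic.
aws sns subscribe \
  --topic-arn arn:aws:sns:us-east-1:123456789012:order-events \
  --protocol sqs \
  --notification-endpoint arn:aws:sqs:us-east-1:123456789012:inventory-queue
```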

🎯 What the interviewer is looking for

Clear pull (SQS) vs push (SNS) distinction, and knowing the SNS → SQS fan-out pattern. Bonus: mentioning SQS FIFO for ordering guarantees and dead-letter queues for error handling.


Interview Tips

🏗️ Structure Answers

Use frameworks: diagnose → plan → execute → verify. Don't jump to solutions before understanding the problem.

📊 Use Numbers

Mention specific thresholds, costs, and trade-offs. "t3.micro is free tier" beats "use a small instance."

🤝 Show Trade-offs

Senior answers always include "the downside of this approach is..." and "alternatively, we could..."

💡 Relate to Experience

"In my last project, we chose X because..." is more memorable than a textbook answer.