Monitoring, Docker & AI
Role-specific deep dives into application monitoring (CloudWatch), Docker & ECR, DevOps toolchains (Jira/Git), generative AI for engineering, and customer-facing communication.
Role-Specific Focus Areas
This module targets skills frequently requested in DevOps Engineer roles in the mobility/automotive sector, covering application monitoring, operational troubleshooting, Docker workflows, DevOps toolchains, and leveraging AI for engineering productivity. These are the "glue" skills that complement your infrastructure knowledge.
Monitoring & Observability
CloudWatch, X-Ray, alarms, dashboards, and log analysis for production systems.
Docker Deep-Dive
Images, containers, Dockerfiles, multi-stage builds, and AWS ECR integration.
Toolchain & Workflows
Jira, Confluence, Git branching strategies, and incident management processes.
AI for DevOps
Using generative AI to accelerate scripting, troubleshooting, documentation, and analysis.
1. Application Monitoring & Observability
As a DevOps Engineer you're responsible for the smooth operation of the system. This means knowing when something breaks before users report it. AWS provides three key monitoring pillars:
The Three Pillars
Metrics
Numeric measurements over time: CPU usage, request count, error rate. Stored in CloudWatch Metrics.
Logs
Text records of events: application output, error traces, access logs. Stored in CloudWatch Logs.
Traces
Request paths across services: which microservice was slow, where the error originated. Powered by AWS X-Ray.
CloudWatch: Core Concepts
| Concept | What It Does | Example |
|---|---|---|
| Namespace | Groups related metrics | AWS/EC2, AWS/RDS, Custom/MyApp |
| Metric | A measurable value | CPUUtilization, RequestCount |
| Dimension | Filters a metric | InstanceId=i-abc123 |
| Period | Aggregation window | 60 seconds, 5 minutes |
| Alarm | Triggers when a metric crosses a threshold | CPU > 80% for 5 minutes → send email |
| Dashboard | Visual display of metrics | Combined view of EC2 + RDS + ALB health |
Console: CloudWatch → Alarms → Create alarm
Create a CPU utilization alarm for your ASG instances:
| Setting | Value |
|---|---|
| Metric | EC2 → Per-Instance Metrics → CPUUtilization |
| Statistic | Average |
| Period | 5 minutes |
| Threshold type | Static |
| Condition | Greater than 80 |
| Datapoints to alarm | 2 out of 3 |
Notification:
| Setting | Value |
|---|---|
| SNS Topic | Create new topic |
| Topic name | sandbox-alerts |
| Email endpoint | your-email@example.com |
Click Create alarm. Confirm the SNS subscription email.
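The same alarm can also be expressed as an AWS CLI call. The following is a minimal sketch, assuming the AWS CLI is installed and configured; the alarm name `sandbox-high-cpu`, the instance ID, and the topic ARN (including the account number) are hypothetical placeholders. The function only prints the command so it can be reviewed before anything is created.

```shell
#!/bin/sh
# Sketch: print (not run) the AWS CLI command for the CPU alarm above.
# ASSUMPTIONS: the alarm name and the arguments passed below are placeholders.

cpu_alarm_cmd() {
  # $1 = EC2 instance ID, $2 = SNS topic ARN
  # period 300 s = 5 minutes; 2 datapoints out of 3 evaluation periods
  printf '%s ' \
    aws cloudwatch put-metric-alarm \
    --alarm-name sandbox-high-cpu \
    --namespace AWS/EC2 \
    --metric-name CPUUtilization \
    --dimensions "Name=InstanceId,Value=$1" \
    --statistic Average \
    --period 300 \
    --evaluation-periods 3 \
    --datapoints-to-alarm 2 \
    --threshold 80 \
    --comparison-operator GreaterThanThreshold \
    --alarm-actions "$2"
  printf '\n'
}

cpu_alarm_cmd "i-abc123" "arn:aws:sns:us-east-1:123456789012:sandbox-alerts"
```

Printing first and running second keeps the alarm definition reviewable, and the printed line can be pasted into a terminal once the placeholders are replaced with real values.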
Console: CloudWatch → Dashboards → Create dashboard
Create a unified operations dashboard:
| Setting | Value |
|---|---|
| Dashboard name | sandbox-ops |
Add these widgets:
- Line chart: EC2 CPUUtilization (all instances)
- Line chart: ALB RequestCount + HTTPCode_Target_5XX_Count
- Number: RDS DatabaseConnections
- Line chart: RDS FreeStorageSpace
- Number: ALB HealthyHostCount / UnHealthyHostCount
CloudWatch Logs: Application Logging
Centralize your application logs so you can search, filter, and alert on them:
Console: CloudWatch → Log groups → Create log group
| Setting | Value |
|---|---|
| Log group name | /sandbox/application |
| Retention | 7 days (to save cost) |
Then install the CloudWatch Agent on your EC2 instances to ship logs. The permissions it needs are already in place through the CloudWatchAgentServerPolicy attached to the instance IAM role in Module 2.
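The log group settings from the table can also be applied from the command line. This is a small sketch, assuming the AWS CLI is installed and configured; the helper name `log_group_cmds` is hypothetical, and it prints the two commands instead of running them.

```shell
#!/bin/sh
# Sketch: print the AWS CLI commands that mirror the console settings above.
# ASSUMPTION: log_group_cmds is an illustrative helper, not part of any tool.

log_group_cmds() {
  # $1 = log group name, $2 = retention in days
  printf 'aws logs create-log-group --log-group-name %s\n' "$1"
  printf 'aws logs put-retention-policy --log-group-name %s --retention-in-days %s\n' "$1" "$2"
}

log_group_cmds "/sandbox/application" 7
```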
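Key metrics to always monitor: CPU, memory, disk, request rate, error rate (5xx), response time (latency), database connections, and queue depth. Together they cover the four "golden signals" of observability: latency, traffic, errors, and saturation. Memory and disk usage are not published by EC2 by default; the CloudWatch Agent collects them.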
2. Operational Troubleshooting
Identifying and resolving operational issues is a core DevOps responsibility. Here's a systematic approach:
The Troubleshooting Framework
1. DETECT → Alarm fires or user reports issue
2. TRIAGE → Severity? Blast radius? Who's affected?
3. DIAGNOSE → Check metrics, logs, traces, recent changes
4. RESOLVE → Apply fix (rollback, scale, config change)
5. POSTMORTEM → Document root cause, add monitoring to prevent recurrence
Common AWS Issues & Resolution
| Symptom | Check | Common Fix |
|---|---|---|
| ALB returns 502 | Target group health + app logs | App crashed → restart PM2, check memory |
| High CPU on EC2 | CloudWatch CPU metric + top | Scale out ASG or optimize code |
| RDS connections maxed | DatabaseConnections metric | Add connection pooling or scale instance |
| Deployment failed | CodeDeploy events + /var/log/aws/codedeploy-agent/ | Fix lifecycle hook scripts; rollback |
| Instances not launching | ASG Activity tab | Check launch template, SGs, subnet capacity |
| Slow response times | ALB TargetResponseTime | Optimize DB queries, add caching, scale out |
Hands-On: SSM Session Manager for Live Debugging
Connect to a running instance without SSH:
- Go to Systems Manager → Session Manager → Start session
- Select your EC2 instance → click Start session
- You get a terminal. Run these diagnostic commands:
# Check app status
pm2 list
pm2 logs sandbox --lines 50
# Check system resources
top -bn1 | head -20
df -h
free -m
# Check network connectivity to RDS
nc -zv sandbox-db.xxxxx.rds.amazonaws.com 5432
# Check recent deployments
ls -la /opt/codedeploy-agent/deployment-root/
# View environment variables
cat /home/ec2-user/sandbox-app/app/.env
3. Docker: Container Fundamentals
Containers package your application with all its dependencies into a portable unit. Docker is the most common container runtime and a core DevOps skill.
Key Concepts
Image
A read-only blueprint. Built from a Dockerfile. Like a VM snapshot but lighter.
Container
A running instance of an image. Isolated process with its own filesystem.
Dockerfile
A script that defines how to build the image. Each instruction creates a layer.
Registry
A storage service for images. Docker Hub (public) or AWS ECR (private).
Dockerfile Example
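# Multi-stage build: build stage installs dependencies
FROM node:18-alpine AS builder
WORKDIR /app
# Copy manifests first so the dependency layer is cached between builds
COPY package*.json ./
# --omit=dev replaces the deprecated --only=production flag (npm 7 and later)
RUN npm ci --omit=dev
COPY . .
# Production stage: minimal image
FROM node:18-alpine
WORKDIR /app
COPY --from=builder /app .
EXPOSE 3000
# Run as the unprivileged "node" user that ships with the official image
USER node
CMD ["node", "server.js"]
Multi-stage builds keep images small: build-time tooling stays in the build stage, and the production stage contains only what is needed to run. Depending on the base image and dependencies, this can cut image size substantially (for example, from roughly 900 MB to roughly 150 MB).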
Essential Docker Commands
| Command | Purpose |
|---|---|
| docker build -t myapp:v1 . | Build image from Dockerfile |
| docker run -d -p 3000:3000 myapp:v1 | Run container in background |
| docker ps | List running containers |
| docker logs -f <id> | Follow container logs |
| docker exec -it <id> sh | Open shell in container |
| docker stop <id> | Stop a container |
| docker images | List local images |
| docker system prune | Clean up unused resources |
AWS ECR: Private Container Registry
Elastic Container Registry (ECR) is AWS's private Docker registry. Store images securely and pull them into ECS, EKS, or EC2.
Console: ECR → Repositories → Create repository
| Setting | Value |
|---|---|
| Repository name | sandbox-app |
| Image tag mutability | Immutable (best practice) |
| Scan on push | Enabled |
Click Create repository.
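Images in this repository are addressed by a URI of the form ACCOUNT_ID.dkr.ecr.REGION.amazonaws.com/REPOSITORY:TAG. As a rough sketch, a small helper can assemble that URI and catch an obviously malformed account ID early; the function name `ecr_image_uri` and the account number shown are hypothetical.

```shell
#!/bin/sh
# Sketch: compose the ECR image URI for a repository and tag.
# ASSUMPTIONS: ecr_image_uri is an illustrative helper; 123456789012 is a
# placeholder account ID.

ecr_image_uri() {
  # $1 = 12-digit AWS account ID, $2 = region, $3 = repository, $4 = tag
  case "$1" in
    [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]) ;;
    *) echo "account ID must be 12 digits" >&2; return 1 ;;
  esac
  printf '%s.dkr.ecr.%s.amazonaws.com/%s:%s\n' "$1" "$2" "$3" "$4"
}

ecr_image_uri 123456789012 us-east-1 sandbox-app v1
# prints 123456789012.dkr.ecr.us-east-1.amazonaws.com/sandbox-app:v1
```

Because the repository uses immutable tags, each push needs a tag that has not been used before, so the tag argument would typically change with every build.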
Push an image to ECR:
# Authenticate Docker to ECR
aws ecr get-login-password --region us-east-1 | \
docker login --username AWS --password-stdin \
ACCOUNT_ID.dkr.ecr.us-east-1.amazonaws.com
# Build and tag
docker build -t sandbox-app .
docker tag sandbox-app:latest \
ACCOUNT_ID.dkr.ecr.us-east-1.amazonaws.com/sandbox-app:v1
# Push
docker push \
ACCOUNT_ID.dkr.ecr.us-east-1.amazonaws.com/sandbox-app:v1
Docker vs VM Comparison
| Aspect | VM | Container |
|---|---|---|
| Startup time | Minutes | Seconds |
| Size | GBs | MBs |
| Isolation | Full OS | Process-level |
| Resource use | Heavy | Lightweight |
| Portability | Limited | Run anywhere Docker runs |
4. DevOps Toolchain & Workflows
DevOps isn't just infrastructure; it's the processes and tools that enable teams to deliver software reliably. Here are the key workflow tools you'll encounter.
Git Branching Strategies
| Strategy | How It Works | Best For |
|---|---|---|
| GitFlow | main + develop + feature/release/hotfix branches | Scheduled releases, large teams |
| GitHub Flow | main + short-lived feature branches + PRs | Continuous deployment, small teams |
| Trunk-Based | Everyone commits to main with feature flags | Mature CI/CD, high-velocity teams |
GitHub Flow (recommended for most teams):
main ─────●────────────────────●─────
           \                  /
            feature/ABC ──────  (PR + merge)
                   ↑
              Code review
Jira & Confluence Integration
Jira manages tasks/sprints; Confluence manages documentation. Together they form the project management backbone:
| Tool | Used For | DevOps Connection |
|---|---|---|
| Jira | Sprint planning, bug tracking, task management | Link commits/PRs to tickets (e.g., PROJ-123) |
| Confluence | Runbooks, architecture docs, postmortems | Document SOP, incident response, deployment guides |
| Jira + GitHub | Traceability | Branch name: feature/PROJ-123-add-monitoring |
Best practice: Always reference ticket IDs in branch names and commit messages. This creates an audit trail: git commit -m "feat(PROJ-123): add CloudWatch dashboard"
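The naming convention above can be checked mechanically. The following is a minimal sketch, assuming branch names of the form `type/PROJ-123-description`; the function names `ticket_from_branch` and `commit_subject` are hypothetical helpers, not part of Git or Jira.

```shell
#!/bin/sh
# Sketch: derive the ticket ID from a branch name and build a commit subject.
# ASSUMPTION: branches follow "<type>/<TICKET>-<description>".

ticket_from_branch() {
  # Prints the ticket ID (e.g. PROJ-123), or nothing if none is present.
  printf '%s\n' "$1" | sed -n 's|^[a-z]*/\([A-Z][A-Z0-9]*-[0-9][0-9]*\).*$|\1|p'
}

commit_subject() {
  # $1 = branch name, $2 = commit type, $3 = description
  ticket=$(ticket_from_branch "$1")
  if [ -z "$ticket" ]; then
    echo "no ticket ID found in branch '$1'" >&2
    return 1
  fi
  printf '%s(%s): %s\n' "$2" "$ticket" "$3"
}

commit_subject "feature/PROJ-123-add-monitoring" feat "add CloudWatch dashboard"
# prints feat(PROJ-123): add CloudWatch dashboard
```

A check like this could run in a commit hook or CI job so that branches without a ticket ID are flagged before review.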
Incident Management Process
1. ALERT → CloudWatch alarm / PagerDuty fires
2. RESPOND → On-call engineer acknowledges (< 15 min)
3. COMMUNICATE → Update Jira ticket + Slack channel
4. MITIGATE → Rollback, scale, or hotfix
5. RESOLVE → Confirm service restored
6. POSTMORTEM → Blameless doc in Confluence:
- Timeline of events
- Root cause (5 Whys)
- Action items to prevent recurrence
5. Generative AI for DevOps Engineers
Employers increasingly expect familiarity with AI tools. The goal isn't to replace engineering judgment; it's to accelerate routine work and augment decision-making.
Practical AI Use Cases in DevOps
| Use Case | Example | Tools |
|---|---|---|
| Script Generation | Generate a Bash script to rotate RDS credentials | ChatGPT, Copilot, Gemini |
| Troubleshooting | Paste error logs and get root cause analysis | ChatGPT, Claude |
| IaC Generation | Describe infra in English → get Terraform/CloudFormation | Copilot, Amazon Q |
| Documentation | Generate runbooks, READMEs, and architecture docs | ChatGPT, Gemini |
| Dashboard Analysis | Interpret CloudWatch metrics and suggest optimizations | Amazon Q, ChatGPT |
| Code Review | Review Dockerfiles and CI configs for best practices | Copilot, CodeRabbit |
Effective Prompting for DevOps
The quality of AI output depends on your prompt quality. Use this structure:
ROLE: "You are a senior DevOps engineer specializing in AWS"
CONTEXT: "I have an EKS cluster running 3 nodes with nginx ingress"
TASK: "Write a Kubernetes HPA manifest that scales based on
request rate using custom metrics"
FORMAT: "Provide a complete YAML manifest with comments explaining
each field"
Example: Using AI to Generate a CloudWatch Alarm
Prompt:
You are a DevOps engineer working with AWS. I have an RDS PostgreSQL
instance called sandbox-db. Generate a CloudWatch alarm that:
- Triggers when free storage space drops below 2GB
- Sends notification to an SNS topic called sandbox-alerts
- Uses a 5-minute evaluation period
- Provide the AWS CLI command to create it.
What to validate in the output:
- Correct metric name and namespace (AWS/RDS, FreeStorageSpace)
- Correct unit (bytes, not GB: 2 GiB = 2147483648 bytes)
- Correct comparison operator (LessThanThreshold)
- SNS topic ARN format is valid
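A hand-written reference makes these checks concrete. The sketch below converts the threshold to bytes and prints one plausible form of the command for comparison against whatever the AI produces. The alarm name suffix, the topic ARN, and the account number are hypothetical, and the command is printed, not executed.

```shell
#!/bin/sh
# Sketch: convert GiB to bytes and print a reference RDS storage alarm command.
# ASSUMPTIONS: alarm name, topic ARN, and account number are placeholders.

gib_to_bytes() {
  # $1 = size in GiB
  echo $(( $1 * 1024 * 1024 * 1024 ))
}

rds_storage_alarm_cmd() {
  # $1 = DB instance identifier, $2 = threshold in GiB, $3 = SNS topic ARN
  printf '%s ' \
    aws cloudwatch put-metric-alarm \
    --alarm-name "$1-low-storage" \
    --namespace AWS/RDS \
    --metric-name FreeStorageSpace \
    --dimensions "Name=DBInstanceIdentifier,Value=$1" \
    --statistic Average \
    --period 300 \
    --evaluation-periods 1 \
    --threshold "$(gib_to_bytes "$2")" \
    --comparison-operator LessThanThreshold \
    --alarm-actions "$3"
  printf '\n'
}

gib_to_bytes 2   # prints 2147483648
rds_storage_alarm_cmd sandbox-db 2 "arn:aws:sns:us-east-1:123456789012:sandbox-alerts"
```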
Critical thinking is essential: AI generates plausible but sometimes wrong answers. Always verify IAM policies, security group rules, and database connection strings. Never deploy AI-generated infrastructure without review.
AI-Assisted Data Analysis
| Scenario | How AI Helps |
|---|---|
| CloudWatch metrics spike | Paste metrics data → AI correlates patterns, suggests root cause |
| Cost optimization | Share billing data → AI identifies underutilized resources |
| Capacity planning | Provide usage trends → AI projects scaling needs |
| Incident postmortem | Share timeline → AI structures postmortem doc with action items |
Behavioral Competencies
Beyond technical skills, employers value these traits when working with AI:
- Technological curiosity: Willingness to experiment with new AI tools and evaluate them honestly
- Critical thinking: Ability to assess AI output quality; don't blindly trust, always validate
- Adaptability: Comfort with rapidly evolving tools; today's best practice may change in 6 months
- Proactive improvement: Actively identifying where AI can automate repetitive tasks in your workflow
6. Customer-Facing Technical Communication
DevOps roles in product/mobility companies often include customer training sessions and being the primary technical contact. This requires translating complex infrastructure into understandable terms.
Technical Support Tiers
| Tier | Role | DevOps Involvement |
|---|---|---|
| L1 | Help desk / initial triage | Provide runbooks and FAQs |
| L2 | Technical investigation | Diagnose via logs, metrics. Config changes |
| L3 | Engineering escalation | This is you: root cause analysis, code fixes, infra changes |
Conducting Customer Training Sessions
- Know your audience: Developers need API details; managers need architecture overviews
- Lead with value: Start with "what this gives you" before "how it works"
- Use live demos: Show the Console, dashboards, and deployment flows in real-time
- Prepare runbooks: Document step-by-step procedures in Confluence for self-service
- Follow up: Send a summary email, record the session, create a FAQ from questions asked
Documentation standard: Every operational procedure should have a runbook in Confluence with: purpose, prerequisites, step-by-step instructions, expected output, troubleshooting section, and contact for escalation.
Key Takeaways
- Monitoring: Set up CloudWatch alarms, dashboards, and log groups before users report issues
- Troubleshooting: Follow a systematic framework: detect → triage → diagnose → resolve → postmortem
- Docker: Multi-stage builds keep images small; ECR stores them securely on AWS
- Toolchain: Jira for tracking, Confluence for docs, Git branching for code flow; connect them all
- AI: Use generative AI to accelerate scripting, troubleshooting, and documentation, but always validate output
- Communication: Translate technical complexity into business value for customers and stakeholders