Module 11

Monitoring, Docker & AI

Role-specific deep dives into application monitoring (CloudWatch), Docker & ECR, DevOps toolchains (Jira/Git), generative AI for engineering, and customer-facing communication.

CloudWatch · Docker · ECR · Jira · AI Tools · Runbooks

Role-Specific Focus Areas

This module targets skills frequently requested in DevOps Engineer roles in the mobility/automotive sector — covering application monitoring, operational troubleshooting, Docker workflows, DevOps toolchains, and leveraging AI for engineering productivity. These are the "glue" skills that complement your infrastructure knowledge.

📊 Monitoring & Observability

CloudWatch, X-Ray, alarms, dashboards, and log analysis for production systems.

🐳 Docker Deep-Dive

Images, containers, Dockerfiles, multi-stage builds, and AWS ECR integration.

🔧 Toolchain & Workflows

Jira, Confluence, Git branching strategies, and incident management processes.

🤖 AI for DevOps

Using generative AI to accelerate scripting, troubleshooting, documentation, and analysis.

1. Application Monitoring & Observability

As a DevOps Engineer, you're responsible for the smooth operation of the system, which means knowing that something has broken before users report it. AWS covers the three pillars of observability with CloudWatch and X-Ray:

The Three Pillars

📈 Metrics

Numeric measurements over time — CPU usage, request count, error rate. Stored in CloudWatch Metrics.

📝 Logs

Text records of events — application output, error traces, access logs. Stored in CloudWatch Logs.

🔍 Traces

Request paths across services — which microservice was slow, where the error originated. Powered by AWS X-Ray.

CloudWatch — Core Concepts

Concept	What It Does	Example
Namespace	Groups related metrics	`AWS/EC2`, `AWS/RDS`, `Custom/MyApp`
Metric	A measurable value	`CPUUtilization`, `RequestCount`
Dimension	Filters a metric	`InstanceId=i-abc123`
Period	Aggregation window	60 seconds, 5 minutes
Alarm	Triggers when a metric breaches a threshold (above or below)	CPU > 80% for 5 minutes → send email
Dashboard	Visual display of metrics	Combined view of EC2 + RDS + ALB health
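To make the namespace, metric, and dimension concepts concrete, here is a minimal sketch that publishes one custom datapoint with the AWS CLI. The `Custom/MyApp` namespace comes from the table above; the `QueueDepth` metric name and the `Environment=sandbox` dimension are hypothetical and only chosen for illustration.

```shell
# Publish a single datapoint to a custom CloudWatch namespace.
# Assumptions: "QueueDepth" and "Environment=sandbox" are hypothetical names;
# the caller has permission to call cloudwatch:PutMetricData.
publish_queue_depth() {
  # $1 = numeric value to record
  aws cloudwatch put-metric-data \
    --namespace "Custom/MyApp" \
    --metric-name "QueueDepth" \
    --dimensions "Environment=sandbox" \
    --unit Count \
    --value "$1"
}

# Usage (needs AWS credentials): publish_queue_depth 42
```

Once published, the metric shows up under the custom namespace and can be filtered by the dimension like any AWS-provided metric.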

🧪

Console: CloudWatch → Alarms → Create alarm

Create a CPU utilization alarm for your ASG instances:

Setting	Value
Metric	`EC2` → `Per-Instance Metrics` → `CPUUtilization`
Statistic	Average
Period	`5 minutes`
Threshold type	Static
Condition	Greater than `80`
Datapoints to alarm	`2 out of 3`

Notification:

Setting	Value
SNS Topic	Create new topic
Topic name	`sandbox-alerts`
Email endpoint	your-email@example.com

Click Create alarm. Confirm the SNS subscription email.
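The same alarm can be scripted. The sketch below mirrors the console settings above (Average, 5-minute period, threshold 80, 2 out of 3 datapoints) using `aws cloudwatch put-metric-alarm`. The alarm name prefix is an assumption, and the instance ID and SNS topic ARN are passed in rather than guessed.

```shell
# Create a per-instance CPU alarm equivalent to the console walkthrough.
# Assumptions: the alarm name prefix "sandbox-high-cpu-" is hypothetical;
# $1 is an EC2 instance ID and $2 is the ARN of the sandbox-alerts SNS topic.
create_cpu_alarm() {
  aws cloudwatch put-metric-alarm \
    --alarm-name "sandbox-high-cpu-$1" \
    --namespace AWS/EC2 \
    --metric-name CPUUtilization \
    --dimensions "Name=InstanceId,Value=$1" \
    --statistic Average \
    --period 300 \
    --evaluation-periods 3 \
    --datapoints-to-alarm 2 \
    --threshold 80 \
    --comparison-operator GreaterThanThreshold \
    --alarm-actions "$2"
}

# Usage (needs AWS credentials):
#   create_cpu_alarm i-abc123 arn:aws:sns:us-east-1:ACCOUNT_ID:sandbox-alerts
```

Here `--evaluation-periods 3` with `--datapoints-to-alarm 2` expresses the "2 out of 3" setting: the alarm fires when any two of the last three 5-minute datapoints exceed 80.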

🧪

Console: CloudWatch → Dashboards → Create dashboard

Create a unified operations dashboard:

Setting	Value
Dashboard name	`sandbox-ops`

Add these widgets:

Line chart → EC2 CPUUtilization (all instances)
Line chart → ALB RequestCount + HTTPCode_Target_5XX_Count
Number → RDS DatabaseConnections
Line chart → RDS FreeStorageSpace
Number → ALB HealthyHostCount / UnHealthyHostCount
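Dashboards can also be kept in version control and created from the CLI. This is a minimal sketch with only the first widget from the list above; the Auto Scaling group name `sandbox-asg` and the `us-east-1` region are assumptions, and the remaining widgets would follow the same pattern.

```shell
# Create the sandbox-ops dashboard with a single EC2 CPU line chart.
# Assumptions: "sandbox-asg" is a hypothetical Auto Scaling group name and
# us-east-1 is the region used elsewhere in this module.
create_ops_dashboard() {
  body=$(cat <<'EOF'
{
  "widgets": [
    {
      "type": "metric",
      "x": 0, "y": 0, "width": 12, "height": 6,
      "properties": {
        "title": "EC2 CPUUtilization",
        "view": "timeSeries",
        "region": "us-east-1",
        "stat": "Average",
        "period": 300,
        "metrics": [
          ["AWS/EC2", "CPUUtilization", "AutoScalingGroupName", "sandbox-asg"]
        ]
      }
    }
  ]
}
EOF
)
  aws cloudwatch put-dashboard \
    --dashboard-name sandbox-ops \
    --dashboard-body "$body"
}

# Usage (needs AWS credentials): create_ops_dashboard
```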

CloudWatch Logs — Application Logging

Centralize your application logs so you can search, filter, and alert on them:

🧪

Console: CloudWatch → Log groups → Create log group

Setting	Value
Log group name	`/sandbox/application`
Retention	`7 days` (to save cost)

Then install the CloudWatch Agent on your EC2 instances to ship logs. The permissions it needs are already in place if the instance role from Module 2 has the CloudWatchAgentServerPolicy managed policy attached.
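For reference, the log group and retention setting from the console steps can be created from the CLI, and the shipped logs can be searched the same way. This is a sketch under the assumption that the application writes the literal word `ERROR` on failures; adjust the filter pattern to your own log format.

```shell
# Create the log group and apply the 7-day retention from the table above.
setup_app_log_group() {
  aws logs create-log-group --log-group-name "/sandbox/application"
  aws logs put-retention-policy \
    --log-group-name "/sandbox/application" \
    --retention-in-days 7
}

# Search the last hour of logs for lines containing ERROR.
# Assumption: the app logs the literal string "ERROR" (hypothetical format).
find_recent_errors() {
  # CloudWatch Logs expects timestamps in milliseconds since the epoch
  start_ms=$(( ($(date +%s) - 3600) * 1000 ))
  aws logs filter-log-events \
    --log-group-name "/sandbox/application" \
    --filter-pattern "ERROR" \
    --start-time "$start_ms"
}

# Usage (needs AWS credentials): setup_app_log_group && find_recent_errors
```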

💡 Tip

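Key metrics to always monitor: CPU, memory, disk, request rate, error rate (5xx), response time (latency), database connections, and queue depth. Together they cover the four "golden signals" of monitoring: latency, traffic, errors, and saturation. Note that EC2 does not publish memory or disk usage by default; those metrics come from the CloudWatch Agent.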
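As a small worked example of one of these signals, the sketch below computes a 5xx error rate from a stream of HTTP status codes. The input format (one status code per line) is an assumption; in practice you would first extract the status column from your access logs.

```shell
# Print the percentage of 5xx responses in a list of HTTP status codes.
# Assumption: stdin carries one numeric status code per line.
error_rate_5xx() {
  awk '
    /^[0-9]+$/ { total++; if ($1 >= 500 && $1 <= 599) errors++ }
    END {
      if (total == 0) print "0.00"
      else printf "%.2f\n", errors * 100 / total
    }
  '
}

# Usage: two of these four requests failed with a 5xx, so the rate is 50.00
printf '200\n200\n502\n500\n' | error_rate_5xx
```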

2. Operational Troubleshooting

Identifying and resolving operational issues is a core DevOps responsibility. Here's a systematic approach:

The Troubleshooting Framework

text

1. DETECT   → Alarm fires or user reports issue
2. TRIAGE   → Severity? Blast radius? Who's affected?
3. DIAGNOSE → Check metrics, logs, traces, recent changes
4. RESOLVE  → Apply fix (rollback, scale, config change)
5. POSTMORTEM → Document root cause, add monitoring to prevent recurrence

Common AWS Issues & Resolution

Symptom	Check	Common Fix
ALB returns `502`	Target group health + app logs	App crashed — restart PM2, check memory
High CPU on EC2	CloudWatch CPU metric + `top`	Scale out ASG or optimize code
RDS connections maxed	`DatabaseConnections` metric	Add connection pooling or scale instance
Deployment failed	CodeDeploy events + `/var/log/aws/codedeploy-agent/`	Fix lifecycle hook scripts; rollback
Instances not launching	ASG Activity tab	Check launch template, SGs, subnet capacity
Slow response times	ALB `TargetResponseTime`	Optimize DB queries, add caching, scale out
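For the first row of the table, target group health can be checked from the CLI before opening the console. The sketch below lists only the targets that are not healthy; the target group ARN is passed in as an argument because it is specific to your account.

```shell
# List targets in a target group that are not in the "healthy" state,
# with the state and the reason reported by the load balancer.
# $1 = target group ARN (account-specific, not guessed here)
unhealthy_targets() {
  aws elbv2 describe-target-health \
    --target-group-arn "$1" \
    --query "TargetHealthDescriptions[?TargetHealth.State!='healthy'].[Target.Id,TargetHealth.State,TargetHealth.Reason]" \
    --output text
}

# Usage (needs AWS credentials): unhealthy_targets "<target-group-arn>"
```

An empty result suggests the targets themselves pass health checks, which points the investigation toward the application logs rather than instance health.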

🧪

Hands-On: SSM Session Manager for Live Debugging

Connect to a running instance without SSH:

Go to Systems Manager → Session Manager → Start session
Select your EC2 instance → click Start session
You get a terminal — run these diagnostic commands:

bash

# Check app status
pm2 list
pm2 logs sandbox --lines 50

# Check system resources
top -bn1 | head -20
df -h
free -m

# Check network connectivity to RDS
nc -zv sandbox-db.xxxxx.rds.amazonaws.com 5432

# Check recent deployments
ls -la /opt/codedeploy-agent/deployment-root/

# View environment variables (prints secrets to the terminal; avoid in logged or shared sessions)
cat /home/ec2-user/sandbox-app/app/.env

3. Docker — Container Fundamentals

Containers package your application with all its dependencies into a portable unit. Docker is the most common container runtime and a core DevOps skill.

Key Concepts

📦 Image

A read-only blueprint. Built from a Dockerfile. Like a VM snapshot but lighter.

🏃 Container

A running instance of an image. Isolated process with its own filesystem.

📋 Dockerfile

A script that defines how to build the image. Instructions such as `RUN`, `COPY`, and `ADD` each create a layer.

🗄️ Registry

A storage service for images: Docker Hub (public) or AWS ECR (private).

Dockerfile Example

dockerfile

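# Multi-stage build: build stage
FROM node:18-alpine AS builder
WORKDIR /app
# Copy the manifests first so the dependency layer is cached between builds
COPY package*.json ./
# Install production dependencies only (--omit=dev replaces the deprecated --only=production)
RUN npm ci --omit=dev
COPY . .

# Production stage: minimal image
FROM node:18-alpine
WORKDIR /app
# Copy the app and its installed dependencies from the build stage
COPY --from=builder /app .
EXPOSE 3000
# Run as the unprivileged user that ships with the official Node image
USER node
CMD ["node", "server.js"]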

📘 Key Concept

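Multi-stage builds keep images small. The build stage can carry compilers, dev dependencies, and other build tooling; the production stage copies over only what is needed to run. Depending on the base image and dependencies, this can shrink an image from roughly 900MB to roughly 150MB.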
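Size claims are easy to check for your own image. The helper below is a sketch that reads an image's size with `docker image inspect` and converts it to whole mebibytes; the `sandbox-app:builder` and `sandbox-app:v1` tags in the usage comment are hypothetical.

```shell
# Print the size of a local Docker image in whole MiB.
# $1 = image reference, e.g. sandbox-app:v1 (hypothetical tag)
image_size_mb() {
  bytes=$(docker image inspect --format '{{.Size}}' "$1") || return 1
  echo $(( bytes / 1024 / 1024 ))
}

# Usage (needs Docker): build the intermediate stage and the final image,
# then compare the two numbers.
#   docker build --target builder -t sandbox-app:builder .
#   docker build -t sandbox-app:v1 .
#   image_size_mb sandbox-app:builder
#   image_size_mb sandbox-app:v1
```

Building with `--target builder` stops at the named stage, so the two tags let you see how much the final stage actually saves.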

Essential Docker Commands

Command	Purpose
`docker build -t myapp:v1 .`	Build image from Dockerfile
`docker run -d -p 3000:3000 myapp:v1`	Run container in background
`docker ps`	List running containers
`docker logs -f <id>`	Follow container logs
`docker exec -it <id> sh`	Open shell in container
`docker stop <id>`	Stop a container
`docker images`	List local images
`docker system prune`	Remove stopped containers, unused networks, dangling images, and build cache

AWS ECR — Private Container Registry

Elastic Container Registry (ECR) is AWS's private Docker registry. Store images securely and pull them into ECS, EKS, or EC2.

🧪

Console: ECR → Repositories → Create repository

Setting	Value
Repository name	`sandbox-app`
Image tag mutability	Immutable (best practice)
Scan on push	✅ Enabled

Click Create repository.

Push an image to ECR:

bash

# Authenticate Docker to ECR
aws ecr get-login-password --region us-east-1 | \
  docker login --username AWS --password-stdin \
  ACCOUNT_ID.dkr.ecr.us-east-1.amazonaws.com

# Build and tag
docker build -t sandbox-app .
docker tag sandbox-app:latest \
  ACCOUNT_ID.dkr.ecr.us-east-1.amazonaws.com/sandbox-app:v1

# Push
docker push \
  ACCOUNT_ID.dkr.ecr.us-east-1.amazonaws.com/sandbox-app:v1
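The `ACCOUNT_ID` placeholder in the commands above can be looked up rather than typed by hand. This sketch assembles the full image URI from the caller's account ID; the region, repository, and tag are passed in as arguments.

```shell
# Build the full ECR image URI for the current AWS account.
# $1 = region, $2 = repository name, $3 = tag
ecr_image_uri() {
  account_id=$(aws sts get-caller-identity --query Account --output text) || return 1
  echo "${account_id}.dkr.ecr.$1.amazonaws.com/$2:$3"
}

# Usage (needs AWS credentials and Docker):
#   uri=$(ecr_image_uri us-east-1 sandbox-app v1)
#   docker tag sandbox-app:latest "$uri"
#   docker push "$uri"
```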

Docker vs VM Comparison

Aspect	VM	Container
Startup time	Minutes	Seconds
Size	GBs	MBs
Isolation	Full OS	Process-level
Resource use	Heavy	Lightweight
Portability	Limited	Run anywhere Docker runs

4. DevOps Toolchain & Workflows

DevOps isn't just infrastructure — it's the processes and tools that enable teams to deliver software reliably. Here are the key workflow tools you'll encounter.

Git Branching Strategies

Strategy	How It Works	Best For
GitFlow	`main` + `develop` + feature/release/hotfix branches	Scheduled releases, large teams
GitHub Flow	`main` + short-lived feature branches + PRs	Continuous deployment, small teams
Trunk-Based	Everyone commits small changes to `main`, hiding unfinished work behind feature flags	Mature CI/CD, high-velocity teams

text

GitHub Flow (recommended for most teams):

main ──────●──────●──────●──────●──────
            \              /
feature/ABC  ●────●────●──┘  (PR + merge)
                    ↑
              Code review

Jira & Confluence Integration

Jira manages tasks/sprints; Confluence manages documentation. Together they form the project management backbone:

Tool	Used For	DevOps Connection
Jira	Sprint planning, bug tracking, task management	Link commits/PRs to tickets (e.g., `PROJ-123`)
Confluence	Runbooks, architecture docs, postmortems	Document SOPs, incident response, deployment guides
Jira + GitHub	Traceability	Branch name: `feature/PROJ-123-add-monitoring`

💡 Tip

Best practice: Always reference ticket IDs in branch names and commit messages. This creates an audit trail, for example: `git commit -m "feat(PROJ-123): add CloudWatch dashboard"`
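A naming convention like this is easier to keep when it is checked automatically, for example in a Git hook or CI step. The sketch below validates a branch name and extracts its ticket ID. The allowed prefixes (`feature`, `bugfix`, `hotfix`) and the lowercase-slug rule are assumptions, not a Jira or Git requirement.

```shell
# Return success if the branch follows <type>/<TICKET-ID>-<slug>.
# Assumption: the allowed types and the lowercase slug are a hypothetical
# team convention.
valid_branch_name() {
  printf '%s\n' "$1" |
    grep -Eq '^(feature|bugfix|hotfix)/[A-Z][A-Z0-9]+-[0-9]+-[a-z0-9]+(-[a-z0-9]+)*$'
}

# Print the ticket ID embedded in a branch name (nothing if there is none).
ticket_from_branch() {
  printf '%s\n' "$1" |
    sed -nE 's#^[a-z]+/([A-Z][A-Z0-9]+-[0-9]+)-.*$#\1#p'
}

# Usage: prints PROJ-123
valid_branch_name "feature/PROJ-123-add-monitoring" &&
  ticket_from_branch "feature/PROJ-123-add-monitoring"
```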

Incident Management Process

text

1. ALERT    → CloudWatch alarm / PagerDuty fires
2. RESPOND  → On-call engineer acknowledges (< 15 min)
3. COMMUNICATE → Update Jira ticket + Slack channel
4. MITIGATE → Rollback, scale, or hotfix
5. RESOLVE  → Confirm service restored
6. POSTMORTEM → Blameless doc in Confluence:
               - Timeline of events
               - Root cause (5 Whys)
               - Action items to prevent recurrence

5. Generative AI for DevOps Engineers

Employers increasingly expect familiarity with AI tools. The goal isn't to replace engineering judgment — it's to accelerate routine work and augment decision-making.

Practical AI Use Cases in DevOps

Use Case	Example	Tools
Script Generation	Generate a Bash script to rotate RDS credentials	ChatGPT, Copilot, Gemini
Troubleshooting	Paste error logs and get root cause analysis	ChatGPT, Claude
IaC Generation	Describe infra in English → get Terraform/CloudFormation	Copilot, Amazon Q
Documentation	Generate runbooks, READMEs, and architecture docs	ChatGPT, Gemini
Dashboard Analysis	Interpret CloudWatch metrics and suggest optimizations	Amazon Q, ChatGPT
Code Review	Review Dockerfiles and CI configs for best practices	Copilot, CodeRabbit

Effective Prompting for DevOps

The quality of AI output depends on your prompt quality. Use this structure:

text

ROLE:    "You are a senior DevOps engineer specializing in AWS"
CONTEXT: "I have an EKS cluster running 3 nodes with nginx ingress"
TASK:    "Write a Kubernetes HPA manifest that scales based on
         request rate using custom metrics"
FORMAT:  "Provide a complete YAML manifest with comments explaining
         each field"
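If you reuse this structure often, it can be templated. The helper below is a minimal sketch that assembles the four parts into one prompt; the function name is hypothetical, and the resulting text can be pasted into whichever AI tool you use.

```shell
# Assemble a prompt from the four parts of the structure above.
# $1 = role, $2 = context, $3 = task, $4 = output format
build_prompt() {
  printf 'ROLE: %s\nCONTEXT: %s\nTASK: %s\nFORMAT: %s\n' "$1" "$2" "$3" "$4"
}

# Usage: prints the four labeled lines
build_prompt \
  "You are a senior DevOps engineer specializing in AWS" \
  "I have an EKS cluster running 3 nodes with nginx ingress" \
  "Write a Kubernetes HPA manifest that scales on request rate" \
  "Provide a complete YAML manifest with comments"
```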

🧪

Example: Using AI to Generate a CloudWatch Alarm

Prompt:

text

You are a DevOps engineer working with AWS. I have an RDS PostgreSQL
instance called sandbox-db. Generate a CloudWatch alarm that:
- Triggers when free storage space drops below 2GB
- Sends notification to an SNS topic called sandbox-alerts
- Uses a 5-minute evaluation period
- Provide the AWS CLI command to create it.

What to validate in the output:

✅ Correct metric name and namespace (AWS/RDS, FreeStorageSpace)
✅ Correct unit (bytes, not GB: 2 GiB = 2147483648 bytes; 2 decimal GB would be 2000000000)
✅ Correct comparison operator (LessThanThreshold)
✅ SNS topic ARN format is valid
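For comparison with whatever the AI returns, here is one hand-written sketch of a command that satisfies the checklist. It is an illustration, not the only correct answer: the alarm name suffix is hypothetical, the threshold is taken as 2 GiB, and the SNS topic ARN is passed in because it depends on your account.

```shell
# Create a low-storage alarm for an RDS instance.
# $1 = DB instance identifier (e.g. sandbox-db), $2 = SNS topic ARN
# Assumption: the "-low-free-storage" alarm name suffix is hypothetical.
create_rds_storage_alarm() {
  # FreeStorageSpace is reported in bytes: 2 GiB = 2147483648
  threshold_bytes=$(( 2 * 1024 * 1024 * 1024 ))
  aws cloudwatch put-metric-alarm \
    --alarm-name "$1-low-free-storage" \
    --namespace AWS/RDS \
    --metric-name FreeStorageSpace \
    --dimensions "Name=DBInstanceIdentifier,Value=$1" \
    --statistic Average \
    --period 300 \
    --evaluation-periods 1 \
    --threshold "$threshold_bytes" \
    --comparison-operator LessThanThreshold \
    --alarm-actions "$2"
}

# Usage (needs AWS credentials):
#   create_rds_storage_alarm sandbox-db arn:aws:sns:us-east-1:ACCOUNT_ID:sandbox-alerts
```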

⚠️ Warning

Critical thinking is essential: AI generates plausible but sometimes wrong answers. Always verify IAM policies, security group rules, and database connection strings. Never deploy AI-generated infrastructure without review.

AI-Assisted Data Analysis

Scenario	How AI Helps
CloudWatch metrics spike	Paste metrics data → AI correlates patterns, suggests root cause
Cost optimization	Share billing data → AI identifies underutilized resources
Capacity planning	Provide usage trends → AI projects scaling needs
Incident postmortem	Share timeline → AI structures postmortem doc with action items

Behavioral Competencies

Beyond technical skills, employers value these traits when working with AI:

Technological curiosity: Willingness to experiment with new AI tools and evaluate them honestly
Critical thinking: Ability to assess AI output quality — don't blindly trust, always validate
Adaptability: Comfort with rapidly evolving tools; today's best practice may change in 6 months
Proactive improvement: Actively identifying where AI can automate repetitive tasks in your workflow

6. Customer-Facing Technical Communication

DevOps roles in product/mobility companies often include customer training sessions and being the primary technical contact. This requires translating complex infrastructure into understandable terms.

Technical Support Tiers

Tier	Role	DevOps Involvement
L1	Help desk / initial triage	Provide runbooks and FAQs
L2	Technical investigation	Diagnose via logs and metrics; apply config changes
L3	Engineering escalation	This is you — root cause analysis, code fixes, infra changes

Conducting Customer Training Sessions

Know your audience: Developers need API details; managers need architecture overviews
Lead with value: Start with "what this gives you" before "how it works"
Use live demos: Show the Console, dashboards, and deployment flows in real-time
Prepare runbooks: Document step-by-step procedures in Confluence for self-service
Follow up: Send a summary email, record the session, create a FAQ from questions asked

💡 Tip

Documentation standard: Every operational procedure should have a runbook in Confluence with: purpose, prerequisites, step-by-step instructions, expected output, troubleshooting section, and contact for escalation.

Key Takeaways

Monitoring: Set up CloudWatch alarms, dashboards, and log groups so you catch issues before users report them
Troubleshooting: Follow a systematic framework: detect → triage → diagnose → resolve → postmortem
Docker: Multi-stage builds keep images small; ECR stores them securely on AWS
Toolchain: Jira for tracking, Confluence for docs, Git branching for code flow — connect them all
AI: Use generative AI to accelerate scripting, troubleshooting, and documentation — but always validate output
Communication: Translate technical complexity into business value for customers and stakeholders