Module 11

Monitoring, Docker & AI

Role-specific deep dives into application monitoring (CloudWatch), Docker & ECR, DevOps toolchains (Jira/Git), generative AI for engineering, and customer-facing communication.

CloudWatch, Docker, ECR, Jira, AI Tools, Runbooks

Role-Specific Focus Areas

This module targets skills frequently requested in DevOps Engineer roles in the mobility/automotive sector, covering application monitoring, operational troubleshooting, Docker workflows, DevOps toolchains, and leveraging AI for engineering productivity. These are the "glue" skills that complement your infrastructure knowledge.

📊 Monitoring & Observability

CloudWatch, X-Ray, alarms, dashboards, and log analysis for production systems.

🐳 Docker Deep-Dive

Images, containers, Dockerfiles, multi-stage builds, and AWS ECR integration.

🔧 Toolchain & Workflows

Jira, Confluence, Git branching strategies, and incident management processes.

🤖 AI for DevOps

Using generative AI to accelerate scripting, troubleshooting, documentation, and analysis.


1. Application Monitoring & Observability

As a DevOps Engineer you're responsible for the smooth operation of the system. This means knowing when something breaks before users report it. AWS provides three key monitoring pillars:

The Three Pillars

📈 Metrics

Numeric measurements over time: CPU usage, request count, error rate. Stored in CloudWatch Metrics.

📝 Logs

Text records of events: application output, error traces, access logs. Stored in CloudWatch Logs.

🔍 Traces

Request paths across services: which microservice was slow, where the error originated. Powered by AWS X-Ray.

CloudWatch: Core Concepts

Concept | What It Does | Example
Namespace | Groups related metrics | AWS/EC2, AWS/RDS, Custom/MyApp
Metric | A measurable value | CPUUtilization, RequestCount
Dimension | Filters a metric | InstanceId=i-abc123
Period | Aggregation window | 60 seconds, 5 minutes
Alarm | Triggers when metric exceeds threshold | CPU > 80% for 5 minutes → send email
Dashboard | Visual display of metrics | Combined view of EC2 + RDS + ALB health
🧪 Console: CloudWatch → Alarms → Create alarm

Create a CPU utilization alarm for your ASG instances:

Setting | Value
Metric | EC2 → Per-Instance Metrics → CPUUtilization
Statistic | Average
Period | 5 minutes
Threshold type | Static
Condition | Greater than 80
Datapoints to alarm | 2 out of 3

Notification:

Setting | Value
SNS Topic | Create new topic
Topic name | sandbox-alerts
Email endpoint | your-email@example.com

Click Create alarm. Confirm the SNS subscription email.
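
The same alarm can also be scripted. The following is a minimal sketch of a CLI equivalent of the console settings above, under a few assumptions: the alarm name is made up for illustration, you supply the instance ID and SNS topic ARN, and AWS_CLI is a hypothetical variable (not an AWS feature) that lets you dry-run the function with echo.

```bash
# Sketch: create the CPU alarm with the AWS CLI.
# 5-minute period = 300 s; "2 out of 3" = 3 evaluation periods, 2 datapoints to alarm.
create_cpu_alarm() {
  local instance_id="$1"   # e.g. i-abc123
  local topic_arn="$2"     # ARN of the sandbox-alerts SNS topic
  "${AWS_CLI:-aws}" cloudwatch put-metric-alarm \
    --alarm-name "sandbox-high-cpu-${instance_id}" \
    --namespace AWS/EC2 \
    --metric-name CPUUtilization \
    --dimensions "Name=InstanceId,Value=${instance_id}" \
    --statistic Average \
    --period 300 \
    --evaluation-periods 3 \
    --datapoints-to-alarm 2 \
    --threshold 80 \
    --comparison-operator GreaterThanThreshold \
    --alarm-actions "${topic_arn}"
}

# Dry run (prints the command instead of calling AWS):
#   AWS_CLI=echo create_cpu_alarm i-abc123 arn:aws:sns:us-east-1:ACCOUNT_ID:sandbox-alerts
```

A per-instance alarm does not follow instances as the ASG replaces them, so in practice you may prefer to alarm on the AutoScalingGroupName dimension instead.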

🧪 Console: CloudWatch → Dashboards → Create dashboard

Create a unified operations dashboard:

Setting | Value
Dashboard name | sandbox-ops

Add these widgets:

  1. Line chart → EC2 CPUUtilization (all instances)
  2. Line chart → ALB RequestCount + HTTPCode_Target_5XX_Count
  3. Number → RDS DatabaseConnections
  4. Line chart → RDS FreeStorageSpace
  5. Number → ALB HealthyHostCount / UnHealthyHostCount

CloudWatch Logs: Application Logging

Centralize your application logs so you can search, filter, and alert on them:

🧪 Console: CloudWatch → Log groups → Create log group

Setting | Value
Log group name | /sandbox/application
Retention | 7 days (to save cost)

Then install the CloudWatch Agent on your EC2 instances to ship logs. The permissions it needs are already in place: the instance's IAM role from Module 2 includes the CloudWatchAgentServerPolicy managed policy.
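
If you prefer the CLI to the console, the log group and its retention could be set up roughly as follows. This is a sketch, not the module's prescribed method: the function name is invented, and AWS_CLI is a hypothetical variable (not an AWS feature) that you can set to echo for a dry run.

```bash
# Sketch: create the log group and apply a retention policy.
create_app_log_group() {
  local group="${1:-/sandbox/application}"
  local days="${2:-7}"   # must be a retention value CloudWatch Logs accepts, e.g. 7
  "${AWS_CLI:-aws}" logs create-log-group --log-group-name "$group" &&
  "${AWS_CLI:-aws}" logs put-retention-policy \
    --log-group-name "$group" --retention-in-days "$days"
}

# Dry run:
#   AWS_CLI=echo create_app_log_group
```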

💡 Tip

Key metrics to always monitor: CPU, memory, disk, request rate, error rate (5xx), response time (latency), database connections, and queue depth. Together they cover the four "golden signals" of observability: latency, traffic, errors, and saturation.
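
Error rate is usually derived rather than read directly. As a rough illustration, assuming you already have one HTTP status code per line (for example, extracted from an access log), the 5xx share could be computed like this. The function name error_rate_5xx is hypothetical.

```bash
# Reads one HTTP status code per line on stdin and prints the share of
# 5xx responses as a percentage with one decimal place.
error_rate_5xx() {
  awk '
    /^[0-9][0-9][0-9]$/ { total++; if ($1 >= 500 && $1 <= 599) errors++ }
    END {
      if (total == 0) { print "0.0"; exit }
      printf "%.1f\n", 100 * errors / total
    }'
}

# Usage, assuming a log format whose 9th field is the status code:
#   awk '{ print $9 }' access.log | error_rate_5xx
```

Lines that are not a three-digit code are ignored, so a stray header or blank line does not skew the result.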


2. Operational Troubleshooting

Identifying and resolving operational issues is a core DevOps responsibility. Here's a systematic approach:

The Troubleshooting Framework

text
1. DETECT   → Alarm fires or user reports issue
2. TRIAGE   → Severity? Blast radius? Who's affected?
3. DIAGNOSE → Check metrics, logs, traces, recent changes
4. RESOLVE  → Apply fix (rollback, scale, config change)
5. POSTMORTEM → Document root cause, add monitoring to prevent recurrence

Common AWS Issues & Resolution

Symptom | Check | Common Fix
ALB returns 502 | Target group health + app logs | App crashed: restart PM2, check memory
High CPU on EC2 | CloudWatch CPU metric + top | Scale out ASG or optimize code
RDS connections maxed | DatabaseConnections metric | Add connection pooling or scale instance
Deployment failed | CodeDeploy events + /var/log/aws/codedeploy-agent/ | Fix lifecycle hook scripts; rollback
Instances not launching | ASG Activity tab | Check launch template, SGs, subnet capacity
Slow response times | ALB TargetResponseTime | Optimize DB queries, add caching, scale out
🧪 Hands-On: SSM Session Manager for Live Debugging

Connect to a running instance without SSH:

  1. Go to Systems Manager → Session Manager → Start session
  2. Select your EC2 instance → click Start session
  3. You get a terminal. Run these diagnostic commands:
bash
# Check app status
pm2 list
pm2 logs sandbox --lines 50

# Check system resources
top -bn1 | head -20
df -h
free -m

# Check network connectivity to RDS
nc -zv sandbox-db.xxxxx.rds.amazonaws.com 5432

# Check recent deployments
ls -la /opt/codedeploy-agent/deployment-root/

# View environment variables
cat /home/ec2-user/sandbox-app/app/.env
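
The output of df -h still has to be read by eye. A small helper like the one below, a sketch rather than a hardened tool, flags filesystems at or above a usage threshold. It assumes POSIX-style df -P output and mount points without spaces; disks_over_threshold is a made-up name.

```bash
# Prints "mountpoint usage%" for every filesystem at or above the
# threshold (default 80). Expects `df -P` output on stdin.
disks_over_threshold() {
  awk -v limit="${1:-80}" '
    NR > 1 {                      # skip the header row
      use = $5; sub(/%/, "", use) # "90%" -> "90"
      if (use + 0 >= limit + 0) print $6 " " use "%"
    }'
}

# Usage on the instance:
#   df -P | disks_over_threshold 80
```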

3. Docker: Container Fundamentals

Containers package your application with all its dependencies into a portable unit. Docker is the most common container runtime and a core DevOps skill.

Key Concepts

📦 Image

A read-only blueprint. Built from a Dockerfile. Like a VM snapshot but lighter.

🏃 Container

A running instance of an image. Isolated process with its own filesystem.

📋 Dockerfile

A script that defines how to build the image. Each instruction creates a layer.

🗄️ Registry

A storage for images. Docker Hub (public) or AWS ECR (private).

Dockerfile Example

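dockerfile
# Multi-stage build, stage 1: install dependencies and collect the app
FROM node:18-alpine AS builder
WORKDIR /app
# Copy the manifests first so the dependency layer is cached between builds
COPY package*.json ./
# Install production dependencies only (--omit=dev replaces the deprecated --only=production)
RUN npm ci --omit=dev
COPY . .

# Stage 2: minimal runtime image
FROM node:18-alpine
WORKDIR /app
# Bring over only the prepared /app directory from the builder stage
COPY --from=builder /app .
EXPOSE 3000
# Run as the unprivileged "node" user that ships with the official image
USER node
CMD ["node", "server.js"]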
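📘 Key Concept

Multi-stage builds keep images small. Build tooling and intermediate files stay in the build stage; the production stage contains only what is needed to run. This can shrink an image from roughly 900 MB to roughly 150 MB, though the actual saving depends on the base image and dependencies.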

Essential Docker Commands

Command | Purpose
docker build -t myapp:v1 . | Build image from Dockerfile
docker run -d -p 3000:3000 myapp:v1 | Run container in background
docker ps | List running containers
docker logs -f <id> | Follow container logs
docker exec -it <id> sh | Open shell in container
docker stop <id> | Stop a container
docker images | List local images
docker system prune | Clean up unused resources

AWS ECR: Private Container Registry

Elastic Container Registry (ECR) is AWS's private Docker registry. Store images securely and pull them into ECS, EKS, or EC2.

🧪 Console: ECR → Repositories → Create repository

Setting | Value
Repository name | sandbox-app
Image tag mutability | Immutable (best practice)
Scan on push | ✅ Enabled

Click Create repository.
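
For reference, a CLI sketch of the same repository settings might look like this. The function name is invented, the region default of us-east-1 is an assumption taken from the push example, and AWS_CLI is a hypothetical variable (not an AWS feature) that you can set to echo for a dry run.

```bash
# Sketch: create the ECR repository with immutable tags and scan-on-push.
create_sandbox_repo() {
  local region="${1:-us-east-1}"
  "${AWS_CLI:-aws}" ecr create-repository \
    --repository-name sandbox-app \
    --image-tag-mutability IMMUTABLE \
    --image-scanning-configuration scanOnPush=true \
    --region "$region"
}

# Dry run:
#   AWS_CLI=echo create_sandbox_repo
```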

Push an image to ECR:

bash
# Authenticate Docker to ECR
aws ecr get-login-password --region us-east-1 | \
  docker login --username AWS --password-stdin \
  ACCOUNT_ID.dkr.ecr.us-east-1.amazonaws.com

# Build and tag
docker build -t sandbox-app .
docker tag sandbox-app:latest \
  ACCOUNT_ID.dkr.ecr.us-east-1.amazonaws.com/sandbox-app:v1

# Push
docker push \
  ACCOUNT_ID.dkr.ecr.us-east-1.amazonaws.com/sandbox-app:v1
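
The registry address appears three times in the commands above, which makes typos easy. One way to avoid that, sketched here with a hypothetical helper named ecr_image_uri, is to build the image reference once and reuse it.

```bash
# Builds the fully qualified ECR image reference used by docker tag/push.
ecr_image_uri() {
  local account_id="$1" region="$2" repo="$3" tag="$4"
  printf '%s.dkr.ecr.%s.amazonaws.com/%s:%s\n' \
    "$account_id" "$region" "$repo" "$tag"
}

# Usage (123456789012 is a placeholder account ID):
#   uri="$(ecr_image_uri 123456789012 us-east-1 sandbox-app v1)"
#   docker tag sandbox-app:latest "$uri" && docker push "$uri"
```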

Docker vs VM Comparison

Aspect | VM | Container
Startup time | Minutes | Seconds
Size | GBs | MBs
Isolation | Full OS | Process-level
Resource use | Heavy | Lightweight
Portability | Limited | Run anywhere Docker runs

4. DevOps Toolchain & Workflows

DevOps isn't just infrastructure; it's the processes and tools that enable teams to deliver software reliably. Here are the key workflow tools you'll encounter.

Git Branching Strategies

Strategy | How It Works | Best For
GitFlow | main + develop + feature/release/hotfix branches | Scheduled releases, large teams
GitHub Flow | main + short-lived feature branches + PRs | Continuous deployment, small teams
Trunk-Based | Everyone commits to main with feature flags | Mature CI/CD, high-velocity teams
text
GitHub Flow (recommended for most teams):

main ──────●──────●──────●──────●──────
            \              /
feature/ABC  ●────●────●──┘  (PR + merge)
                    ↑
              Code review

Jira & Confluence Integration

Jira manages tasks/sprints; Confluence manages documentation. Together they form the project management backbone:

Tool | Used For | DevOps Connection
Jira | Sprint planning, bug tracking, task management | Link commits/PRs to tickets (e.g., PROJ-123)
Confluence | Runbooks, architecture docs, postmortems | Document SOP, incident response, deployment guides
Jira + GitHub | Traceability | Branch name: feature/PROJ-123-add-monitoring
💡 Tip

Best practice: Always reference ticket IDs in branch names and commit messages. This creates an audit trail: git commit -m "feat(PROJ-123): add CloudWatch dashboard"
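
A convention like this is easier to keep when it is checked automatically. The sketch below shows one possible check, which could be called from a Git hook or CI step. Both function names and the allowed branch prefixes (feature, bugfix, hotfix) are assumptions for illustration; adjust the patterns to your team's rules.

```bash
# Succeeds when the first line of a commit message looks like
# "type(TICKET-123): description", e.g. "feat(PROJ-123): add dashboard".
has_ticket_id() {
  printf '%s\n' "$1" | head -n 1 |
    grep -Eq '^[a-z]+\([A-Z][A-Z0-9]*-[0-9]+\): .+'
}

# Succeeds when a branch name looks like "feature/PROJ-123-short-description".
branch_has_ticket_id() {
  printf '%s\n' "$1" |
    grep -Eq '^(feature|bugfix|hotfix)/[A-Z][A-Z0-9]*-[0-9]+-[a-z0-9-]+$'
}

# Usage:
#   has_ticket_id "feat(PROJ-123): add CloudWatch dashboard" && echo "ok"
#   branch_has_ticket_id "$(git rev-parse --abbrev-ref HEAD)" || echo "rename branch"
```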

Incident Management Process

text
1. ALERT    → CloudWatch alarm / PagerDuty fires
2. RESPOND  → On-call engineer acknowledges (< 15 min)
3. COMMUNICATE → Update Jira ticket + Slack channel
4. MITIGATE → Rollback, scale, or hotfix
5. RESOLVE  → Confirm service restored
6. POSTMORTEM → Blameless doc in Confluence:
               - Timeline of events
               - Root cause (5 Whys)
               - Action items to prevent recurrence

5. Generative AI for DevOps Engineers

Employers increasingly expect familiarity with AI tools. The goal isn't to replace engineering judgment but to accelerate routine work and augment decision-making.

Practical AI Use Cases in DevOps

Use Case | Example | Tools
Script Generation | Generate a Bash script to rotate RDS credentials | ChatGPT, Copilot, Gemini
Troubleshooting | Paste error logs and get root cause analysis | ChatGPT, Claude
IaC Generation | Describe infra in English → get Terraform/CloudFormation | Copilot, Amazon Q
Documentation | Generate runbooks, READMEs, and architecture docs | ChatGPT, Gemini
Dashboard Analysis | Interpret CloudWatch metrics and suggest optimizations | Amazon Q, ChatGPT
Code Review | Review Dockerfiles and CI configs for best practices | Copilot, CodeRabbit

Effective Prompting for DevOps

The quality of AI output depends on your prompt quality. Use this structure:

text
ROLE:    "You are a senior DevOps engineer specializing in AWS"
CONTEXT: "I have an EKS cluster running 3 nodes with nginx ingress"
TASK:    "Write a Kubernetes HPA manifest that scales based on
         request rate using custom metrics"
FORMAT:  "Provide a complete YAML manifest with comments explaining
         each field"
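
If you reuse this structure often, it can help to template it. The following is a minimal sketch with a hypothetical build_prompt helper that assembles the four parts in a fixed order; it only formats text and does not call any AI service.

```bash
# Assembles a prompt from role, context, task, and output format.
build_prompt() {
  printf 'ROLE: %s\nCONTEXT: %s\nTASK: %s\nFORMAT: %s\n' "$1" "$2" "$3" "$4"
}

# Usage:
#   build_prompt \
#     "You are a senior DevOps engineer specializing in AWS" \
#     "I have an EKS cluster running 3 nodes with nginx ingress" \
#     "Write a Kubernetes HPA manifest that scales on request rate" \
#     "Provide a complete YAML manifest with comments"
```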
🧪 Example: Using AI to Generate a CloudWatch Alarm

Prompt:

text
You are a DevOps engineer working with AWS. I have an RDS PostgreSQL
instance called sandbox-db. Generate a CloudWatch alarm that:
- Triggers when free storage space drops below 2GB
- Sends notification to an SNS topic called sandbox-alerts
- Uses a 5-minute evaluation period
- Provide the AWS CLI command to create it.

What to validate in the output:

  • ✅ Correct metric name and namespace (AWS/RDS, FreeStorageSpace)
  • ✅ Correct unit (Bytes, not GB: 2 GB = 2 × 1024³ = 2147483648 bytes)
  • ✅ Correct comparison operator (LessThanThreshold)
  • ✅ SNS topic ARN format is valid
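
For comparison, here is a sketch of what an answer passing those checks might look like. It is not actual AI output. The alarm name, the DBInstanceIdentifier dimension value, and the single evaluation period are assumptions, and AWS_CLI is a hypothetical variable (not an AWS feature) used for dry runs with echo.

```bash
# Converts whole gigabytes (as multiples of 1024^3) to bytes.
gb_to_bytes() {
  echo $(( $1 * 1024 * 1024 * 1024 ))
}

# Sketch: alarm when free storage on sandbox-db drops below 2 GB.
create_free_storage_alarm() {
  local topic_arn="$1"   # ARN of the sandbox-alerts SNS topic
  "${AWS_CLI:-aws}" cloudwatch put-metric-alarm \
    --alarm-name sandbox-db-low-free-storage \
    --namespace AWS/RDS \
    --metric-name FreeStorageSpace \
    --dimensions Name=DBInstanceIdentifier,Value=sandbox-db \
    --statistic Average \
    --period 300 \
    --evaluation-periods 1 \
    --threshold "$(gb_to_bytes 2)" \
    --comparison-operator LessThanThreshold \
    --alarm-actions "$topic_arn"
}

# Dry run:
#   AWS_CLI=echo create_free_storage_alarm arn:aws:sns:us-east-1:ACCOUNT_ID:sandbox-alerts
```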
⚠️ Warning

Critical thinking is essential: AI generates plausible but sometimes wrong answers. Always verify IAM policies, security group rules, and database connection strings. Never deploy AI-generated infrastructure without review.

AI-Assisted Data Analysis

Scenario | How AI Helps
CloudWatch metrics spike | Paste metrics data → AI correlates patterns, suggests root cause
Cost optimization | Share billing data → AI identifies underutilized resources
Capacity planning | Provide usage trends → AI projects scaling needs
Incident postmortem | Share timeline → AI structures postmortem doc with action items

Behavioral Competencies

Beyond technical skills, employers value these traits when working with AI:

  • Technological curiosity: Willingness to experiment with new AI tools and evaluate them honestly
  • Critical thinking: Ability to assess AI output quality; don't blindly trust, always validate
  • Adaptability: Comfort with rapidly evolving tools; today's best practice may change in 6 months
  • Proactive improvement: Actively identifying where AI can automate repetitive tasks in your workflow

6. Customer-Facing Technical Communication

DevOps roles in product/mobility companies often include customer training sessions and being the primary technical contact. This requires translating complex infrastructure into understandable terms.

Technical Support Tiers

Tier | Role | DevOps Involvement
L1 | Help desk / initial triage | Provide runbooks and FAQs
L2 | Technical investigation | Diagnose via logs and metrics; make config changes
L3 | Engineering escalation | This is you: root cause analysis, code fixes, infra changes

Conducting Customer Training Sessions

  • Know your audience: Developers need API details; managers need architecture overviews
  • Lead with value: Start with "what this gives you" before "how it works"
  • Use live demos: Show the Console, dashboards, and deployment flows in real-time
  • Prepare runbooks: Document step-by-step procedures in Confluence for self-service
  • Follow up: Send a summary email, record the session, create a FAQ from questions asked
💡 Tip

Documentation standard: Every operational procedure should have a runbook in Confluence with: purpose, prerequisites, step-by-step instructions, expected output, troubleshooting section, and contact for escalation.


Key Takeaways

  • Monitoring: Set up CloudWatch alarms, dashboards, and log groups before users report issues
  • Troubleshooting: Follow a systematic framework: detect → triage → diagnose → resolve → postmortem
  • Docker: Multi-stage builds keep images small; ECR stores them securely on AWS
  • Toolchain: Jira for tracking, Confluence for docs, Git branching for code flow; connect them all
  • AI: Use generative AI to accelerate scripting, troubleshooting, and documentation, but always validate output
  • Communication: Translate technical complexity into business value for customers and stakeholders