ECS Fargate
MenoTime runs the FastAPI backend on AWS ECS Fargate, a serverless container orchestration platform that eliminates the need to manage underlying EC2 instances. This document covers task definitions, service configuration, scaling policies, container image management, and deployment strategies.
ECS Architecture
┌─────────────────────────────────────────────┐
│  ECS Clusters (one per environment)         │
├─────────────────────────────────────────────┤
│  menotime-dev-cluster                       │
│  menotime-staging-cluster                   │
│  menotime-prod-cluster                      │
└──────────────┬──────────────────────────────┘
               │
       ┌───────┴───────┐
       │               │
  ┌────▼────┐     ┌────▼────┐
  │ Service │     │ Service │
  │  (Dev)  │     │(Staging)│
  └────┬────┘     └────┬────┘
       │               │
  ┌────▼───────┬───────▼────┐
  │ ECS Tasks  │ ECS Tasks  │
  │ (Fargate)  │ (Fargate)  │
  └────────────┴────────────┘
Task Definition
Development & Staging
Family: menotime-backend-{env} (image tag: latest)
{
"family": "menotime-backend-dev",
"networkMode": "awsvpc",
"requiresCompatibilities": ["FARGATE"],
"cpu": "256",
"memory": "1024",
"containerDefinitions": [
{
"name": "menotime-backend",
"image": "ACCOUNT_ID.dkr.ecr.us-west-1.amazonaws.com/menotime/backend:latest",
"portMappings": [
{
"containerPort": 8000,
"hostPort": 8000,
"protocol": "tcp"
}
],
"logConfiguration": {
"logDriver": "awslogs",
"options": {
"awslogs-group": "/aws/ecs/menotime-dev",
"awslogs-region": "us-west-1",
"awslogs-stream-prefix": "ecs",
"awslogs-datetime-format": "%Y-%m-%d %H:%M:%S"
}
},
"environment": [
{
"name": "ENVIRONMENT",
"value": "development"
},
{
"name": "LOG_LEVEL",
"value": "DEBUG"
},
{
"name": "API_DEBUG",
"value": "true"
}
],
"secrets": [
{
"name": "DATABASE_URL",
"valueFrom": "arn:aws:secretsmanager:us-west-1:ACCOUNT_ID:secret:menotime/db/dev"
},
{
"name": "API_KEY_STRIPE",
"valueFrom": "arn:aws:secretsmanager:us-west-1:ACCOUNT_ID:secret:menotime/api/stripe-test"
}
],
"healthCheck": {
"command": [
"CMD-SHELL",
"curl -f http://localhost:8000/health || exit 1"
],
"interval": 30,
"timeout": 5,
"retries": 2,
"startPeriod": 60
}
}
],
"executionRoleArn": "arn:aws:iam::ACCOUNT_ID:role/ecsTaskExecutionRole",
"taskRoleArn": "arn:aws:iam::ACCOUNT_ID:role/menotime-ecs-task-role"
}
Production
Family: menotime-backend-prod (image tag: v1.2.3, semantic versioning)
{
"family": "menotime-backend-prod",
"networkMode": "awsvpc",
"requiresCompatibilities": ["FARGATE"],
"cpu": "1024",
"memory": "2048",
"containerDefinitions": [
{
"name": "menotime-backend",
"image": "ACCOUNT_ID.dkr.ecr.us-west-1.amazonaws.com/menotime/backend:v1.2.3",
"portMappings": [
{
"containerPort": 8000,
"hostPort": 8000,
"protocol": "tcp"
}
],
"logConfiguration": {
"logDriver": "awslogs",
"options": {
"awslogs-group": "/aws/ecs/menotime-prod",
"awslogs-region": "us-west-1",
"awslogs-stream-prefix": "ecs",
"awslogs-datetime-format": "%Y-%m-%d %H:%M:%S"
}
},
"environment": [
{
"name": "ENVIRONMENT",
"value": "production"
},
{
"name": "LOG_LEVEL",
"value": "WARNING"
},
{
"name": "API_DEBUG",
"value": "false"
},
{
"name": "SENTRY_SAMPLE_RATE",
"value": "0.1"
}
],
"secrets": [
{
"name": "DATABASE_URL",
"valueFrom": "arn:aws:secretsmanager:us-west-1:ACCOUNT_ID:secret:menotime/db/prod"
},
{
"name": "API_KEY_STRIPE",
"valueFrom": "arn:aws:secretsmanager:us-west-1:ACCOUNT_ID:secret:menotime/api/stripe-live"
}
],
"healthCheck": {
"command": [
"CMD-SHELL",
"curl -f http://localhost:8000/health || exit 1"
],
"interval": 30,
"timeout": 5,
"retries": 3,
"startPeriod": 60
}
}
],
"executionRoleArn": "arn:aws:iam::ACCOUNT_ID:role/ecsTaskExecutionRole",
"taskRoleArn": "arn:aws:iam::ACCOUNT_ID:role/menotime-ecs-task-role"
}
Task Definition Fields Explained
| Field | Value | Purpose |
|---|---|---|
| family | menotime-backend-{env} | Logical grouping of task definition revisions |
| networkMode | awsvpc | Each task gets its own ENI (elastic network interface) in the VPC |
| cpu | 256 (dev/staging), 1024 (prod) | CPU units (256 = 0.25 vCPU, 1024 = 1 vCPU) |
| memory | 1024 (dev/staging), 2048 (prod) | Memory in MiB |
| containerPort | 8000 | Port the FastAPI app listens on |
| logDriver | awslogs | Sends container logs to CloudWatch Logs |
| awslogs-group | /aws/ecs/menotime-{env} | Log group for centralized logging |
| healthCheck | curl /health endpoint | Container health validation |
| executionRoleArn | Task execution role | Pull images, fetch secrets, write logs |
| taskRoleArn | Task role | Application permissions (DB, S3, SES) |
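Fargate only accepts certain CPU and memory pairings, so it can help to check a proposed size before registering a task definition. The sketch below is a minimal, hypothetical helper (the names FARGATE_SIZES and is_valid_fargate_size are illustrative and not part of any AWS SDK). It covers sizes up to 4 vCPU; larger sizes exist but are left out here.

```python
# Valid Fargate CPU units -> memory (MiB) pairings, for sizes up to 4 vCPU.
FARGATE_SIZES = {
    256: [512, 1024, 2048],
    512: list(range(1024, 4096 + 1, 1024)),
    1024: list(range(2048, 8192 + 1, 1024)),
    2048: list(range(4096, 16384 + 1, 1024)),
    4096: list(range(8192, 30720 + 1, 1024)),
}


def is_valid_fargate_size(cpu_units: int, memory_mib: int) -> bool:
    """Return True if the cpu/memory pair is an allowed Fargate task size."""
    return memory_mib in FARGATE_SIZES.get(cpu_units, [])


def vcpus(cpu_units: int) -> float:
    """Convert ECS CPU units to vCPUs (1024 units = 1 vCPU)."""
    return cpu_units / 1024


# Sizes used in this document.
print(is_valid_fargate_size(256, 1024), vcpus(256))    # dev/staging
print(is_valid_fargate_size(1024, 2048), vcpus(1024))  # prod
```

A 0.25 vCPU task cannot be given 3 GB of memory, which matters when raising memory for dev or staging.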
ECS Service Configuration
Development Service
Service Name: menotime-dev-service
Cluster: menotime-dev-cluster
Task Definition: menotime-backend-dev (latest revision)
Desired Count: 1
Deployment Strategy: All-at-once (fastest, suitable for dev)
Health Check Grace Period: 60 seconds
Load Balancer Target Group: menotime-backend-tg
Subnets: Private subnets (us-west-1a, us-west-1b)
Security Group: menotime-ecs-sg
Staging Service
Service Name: menotime-staging-service
Cluster: menotime-staging-cluster
Task Definition: menotime-backend-staging (latest revision)
Desired Count: 2
Deployment Strategy: Rolling (1 task minimum always running)
Health Check Grace Period: 60 seconds
Load Balancer Target Group: menotime-backend-tg (shared ALB)
Deployment Configuration:
- Maximum percent: 200% (allows up to 4 tasks during rollout)
- Minimum healthy percent: 50% (at least 1 task running)
Subnets: Private subnets
Security Group: menotime-ecs-sg
Production Service
Service Name: menotime-prod-service
Cluster: menotime-prod-cluster
Task Definition: menotime-backend-prod (revision that references image v1.2.3)
Desired Count: 2
Deployment Strategy: Rolling (blue/green capable with manual triggering)
Health Check Grace Period: 60 seconds
Load Balancer Target Group: menotime-backend-tg
Deployment Configuration:
- Maximum percent: 150% (allows up to 3 tasks during rollout)
- Minimum healthy percent: 100% (minimum 2 tasks always running)
Subnets: Private subnets
Security Group: menotime-ecs-sg
Enable Circuit Breaker: Yes (automatic rollback on failed deployment)
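The percentages in the deployment configurations translate into task counts as follows. This is a small illustrative calculation, not AWS code; it assumes ECS rounds the minimum healthy count up and the maximum count down, and the function name rollout_bounds is hypothetical.

```python
def rollout_bounds(desired: int, min_healthy_percent: int, max_percent: int) -> tuple[int, int]:
    """Return (minimum healthy tasks, maximum total tasks) during a rollout."""
    # Ceiling division for the lower bound, floor division for the upper bound.
    min_healthy = -(-desired * min_healthy_percent // 100)
    max_total = desired * max_percent // 100
    return min_healthy, max_total


print("staging:", rollout_bounds(2, 50, 200))
print("prod:", rollout_bounds(2, 100, 150))
```

For staging (2 desired, 50%/200%) this gives between 1 and 4 tasks; for production (2 desired, 100%/150%) it gives between 2 and 3.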
Container Image Pipeline
Image Building & Registry
Repository: menotime/backend in ECR (Elastic Container Registry)
Image Tagging Strategy:
Development:
- branch: develop → tag: develop
- commit: a1b2c3d → tag: a1b2c3d
- PR: #42 → tag: pr-42
Staging:
- branch: main → tag: main
- commit: x9y8z7w → tag: x9y8z7w
- PR: #43 → tag: pr-43
Production:
- tag: v1.2.3 → tag: v1.2.3 (semantic version)
- tag: v1.2.3 → tag: latest (always points to most recent)
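The tagging rules above can be expressed as a small function. This is a sketch under assumptions: the ref formats are the ones GitHub uses (refs/heads/..., refs/pull/N/merge, refs/tags/...), the short SHA is taken as the first 7 characters, and the function name image_tags is hypothetical.

```python
import re


def image_tags(ref: str, sha: str) -> list[str]:
    """Map a Git ref and commit SHA to the image tags described above."""
    short_sha = sha[:7]
    branch = re.fullmatch(r"refs/heads/(.+)", ref)
    if branch:
        # Docker tags cannot contain "/", so feature/x becomes feature-x.
        return [branch.group(1).replace("/", "-"), short_sha]
    pull = re.fullmatch(r"refs/pull/(\d+)/merge", ref)
    if pull:
        return [f"pr-{pull.group(1)}", short_sha]
    release = re.fullmatch(r"refs/tags/(v\d+\.\d+\.\d+)", ref)
    if release:
        # Release tags also move the floating "latest" tag.
        return [release.group(1), "latest"]
    raise ValueError(f"no tagging rule for ref {ref!r}")


print(image_tags("refs/heads/develop", "a1b2c3d4e5f6"))
print(image_tags("refs/pull/42/merge", "a1b2c3d4e5f6"))
print(image_tags("refs/tags/v1.2.3", "a1b2c3d4e5f6"))
```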
Build Process (CI/CD)
Triggered on: Push to develop or main, or push of a v* tag
# GitHub Actions workflow (example)
# Assumes AWS credentials and the ECR_REGISTRY variable are already configured for the job.
name: Build and Push Docker Image
on:
  push:
    branches: [develop, main]
    tags: ['v*']
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Login to ECR
        run: |
          aws ecr get-login-password --region us-west-1 | \
            docker login --username AWS --password-stdin "$ECR_REGISTRY"
      - name: Build image
        run: |
          docker build -t menotime/backend:${{ github.sha }} .
      - name: Tag and push image
        run: |
          # Push with commit SHA
          docker tag menotime/backend:${{ github.sha }} \
            $ECR_REGISTRY/menotime/backend:${{ github.sha }}
          docker push $ECR_REGISTRY/menotime/backend:${{ github.sha }}
          # Push with branch (or release tag) name
          docker tag menotime/backend:${{ github.sha }} \
            $ECR_REGISTRY/menotime/backend:${{ github.ref_name }}
          docker push $ECR_REGISTRY/menotime/backend:${{ github.ref_name }}
          # Push latest for main branch
          if [ "${{ github.ref }}" = "refs/heads/main" ]; then
            docker tag menotime/backend:${{ github.sha }} \
              $ECR_REGISTRY/menotime/backend:latest
            docker push $ECR_REGISTRY/menotime/backend:latest
          fi
      - name: Scan image for vulnerabilities
        run: |
          aws ecr start-image-scan \
            --region us-west-1 \
            --repository-name menotime/backend \
            --image-id imageTag=${{ github.sha }}
Image Scanning
Every pushed image is scanned automatically, using Trivy or AWS native (ECR) scanning.
Findings:
- Critical/High CVEs: Block deployment
- Medium CVEs: Manual review required
- Low CVEs: Logged for tracking
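The severity policy above could be enforced in CI with a gate like the following. This is a hypothetical sketch: the function name scan_gate and the return values are illustrative, and the input is assumed to be a severity-to-count mapping such as the findingSeverityCounts object that ECR returns from describe-image-scan-findings.

```python
def scan_gate(severity_counts: dict[str, int]) -> str:
    """Return 'block', 'review', or 'pass' for a set of scan findings."""
    if severity_counts.get("CRITICAL", 0) > 0 or severity_counts.get("HIGH", 0) > 0:
        return "block"   # Critical/High CVEs block deployment
    if severity_counts.get("MEDIUM", 0) > 0:
        return "review"  # Medium CVEs need manual review
    return "pass"        # Low findings are only logged


print(scan_gate({"HIGH": 1, "LOW": 4}))
print(scan_gate({"MEDIUM": 2}))
print(scan_gate({"LOW": 7}))
```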
Base Image: python:3.11-slim
- Regularly patched (scan at least weekly)
- Prefer slim variant to reduce surface area
Lifecycle Policy
ECR Image Retention:
{
"rules": [
{
"rulePriority": 1,
"description": "Keep last 10 semantic versions",
"selection": {
"tagStatus": "tagged",
"tagPrefixList": ["v"],
"countType": "imageCountMoreThan",
"countNumber": 10
},
"action": {
"type": "expire"
}
},
{
"rulePriority": 2,
"description": "Delete untagged images after 30 days",
"selection": {
"tagStatus": "untagged",
"countType": "sinceImagePushed",
"countUnit": "days",
"countNumber": 30
},
"action": {
"type": "expire"
}
}
]
}
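To see what the two lifecycle rules would expire, the policy can be simulated locally. This is an illustrative model rather than ECR's implementation: the image records (digest, tags, pushed_at) and the function name expired_images are assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone


def expired_images(images, now, keep_versions=10, untagged_days=30):
    """Return the digests the two lifecycle rules would expire."""
    expired = set()
    # Rule 1: keep only the newest `keep_versions` images tagged with a "v" prefix.
    versioned = [i for i in images if any(t.startswith("v") for t in i["tags"])]
    versioned.sort(key=lambda i: i["pushed_at"], reverse=True)
    expired.update(i["digest"] for i in versioned[keep_versions:])
    # Rule 2: expire untagged images pushed more than `untagged_days` ago.
    cutoff = now - timedelta(days=untagged_days)
    expired.update(
        i["digest"] for i in images if not i["tags"] and i["pushed_at"] < cutoff
    )
    return expired


now = datetime(2024, 2, 1, tzinfo=timezone.utc)
images = [
    {"digest": f"sha-{n}", "tags": [f"v1.0.{n}"], "pushed_at": now - timedelta(days=n)}
    for n in range(12)
]
images.append({"digest": "sha-old", "tags": [], "pushed_at": now - timedelta(days=40)})
print(sorted(expired_images(images, now)))
```

Note that the "v" prefix rule matches any tag starting with "v", not only semantic versions, and that branch-tagged images (develop, main, pr-N) are not covered by either rule.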
Health Checks
Container-Level Health Check
Type: HTTP endpoint polling
Endpoint: GET /health
Response (production):
{
"status": "healthy",
"version": "1.2.3",
"database": "connected",
"timestamp": "2024-02-01T12:00:00Z"
}
Configuration:
Interval: 30 seconds
Timeout: 5 seconds
Retries: 3 consecutive failures mark the container unhealthy (2 in dev/staging)
Start Period: 60 seconds (failed checks during this startup window do not count toward retries)
What the endpoint checks:
1. FastAPI is running and responding
2. Database connection status
3. Required services are accessible (Secrets Manager, S3)
4. Response time is under 100 ms
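A minimal sketch of how the /health payload might be assembled is shown below. It uses only the standard library so it can run on its own; in the real service this logic would sit inside a FastAPI route. The function name build_health_response, the shape of the checks mapping, and the choice of 503 for an unhealthy result are assumptions, not the actual MenoTime implementation.

```python
import json
from datetime import datetime, timezone


def build_health_response(version: str, checks: dict[str, bool]) -> tuple[int, str]:
    """Return (HTTP status, JSON body) from dependency check results."""
    healthy = all(checks.values())
    body = {
        "status": "healthy" if healthy else "unhealthy",
        "version": version,
        "database": "connected" if checks.get("database") else "disconnected",
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    # A non-200 status makes both `curl -f` and the ALB matcher treat the task as failing.
    return (200 if healthy else 503), json.dumps(body)


status, body = build_health_response(
    "1.2.3", {"database": True, "secrets_manager": True, "s3": True}
)
print(status, body)
```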
ALB Target Group Health Check
Path: /health
Protocol: HTTP
Port: 8000
Configuration:
Interval: 30 seconds
Timeout: 5 seconds
Healthy Threshold: 2
Unhealthy Threshold: 3
Matcher: 200 (HTTP 200 response)
Deployment Health Check
During a rolling deployment:
1. New task starts
2. ALB health checks begin; failures are ignored during the 60-second grace period
3. Task must pass 2 consecutive health checks (about 60 seconds at a 30-second interval)
4. Old task is drained and terminated
5. The deployment is monitored for 30 minutes afterwards
Auto-Scaling
Development
Auto-scaling: Disabled (manual scaling only)
Staging
Auto-scaling: Minimal
Min Tasks: 2
Max Tasks: 3
Target Tracking: CPU 70% / Memory 80%
Scale-up Cooldown: 60 seconds
Scale-down Cooldown: 300 seconds
Rationale: Cost control while enabling performance testing
Production
Auto-scaling: Enabled (intelligent scaling)
Min Tasks: 2 (high availability; if 1 fails, service continues)
Max Tasks: 4 (accommodate growth; cost cap)
Target Tracking Metrics:
- CPU: 60% (scale up at 60% CPU usage)
- Memory: 75% (scale up at 75% memory usage)
- ALB Request Count Per Target: 1000 (scale if avg >1000 requests/task)
Scale-up Cooldown: 60 seconds (faster to handle spikes)
Scale-down Cooldown: 600 seconds (conservative to avoid thrashing)
Scaling Behavior:
Load Increase (Patient Activity):
Task count: 2 → 3 (60% CPU trigger)
Avg response time: 200ms → 150ms (better responsiveness)
Task health: All healthy
Load Decrease (Off-hours):
Task count: 3 → 2 (CPU at ~30%, well below the 60% target)
Cost reduction: ~$0.05/hour (1 fewer task: 1 vCPU + 2 GB at the rates under Cost Optimization)
Scale-down waits 10 minutes (avoid thrashing)
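The scaling behavior above can be approximated with a proportional rule: scale the task count by the ratio of the observed metric to its target, then clamp to the min/max limits. This is a simplified model for intuition only; real target tracking is driven by CloudWatch alarms and cooldowns and scales in more conservatively. The function name estimate_desired_count is hypothetical.

```python
import math


def estimate_desired_count(current: int, metric: float, target: float,
                           min_tasks: int, max_tasks: int) -> int:
    """Estimate the task count that brings `metric` back to `target`."""
    proposed = math.ceil(current * metric / target)
    return max(min_tasks, min(max_tasks, proposed))


# Production limits from this document: 2 to 4 tasks, CPU target 60%.
print(estimate_desired_count(2, 85, 60, 2, 4))  # load increase
print(estimate_desired_count(3, 30, 60, 2, 4))  # off-hours
```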
Scaling Commands
View Current Scaling:
aws application-autoscaling describe-scalable-targets \
--service-namespace ecs \
--resource-ids service/menotime-prod-cluster/menotime-prod-service
Update Scaling Limits (example raises the maximum from 4 to 5 tasks):
aws application-autoscaling register-scalable-target \
--service-namespace ecs \
--resource-id service/menotime-prod-cluster/menotime-prod-service \
--scalable-dimension ecs:service:DesiredCount \
--min-capacity 2 \
--max-capacity 5
Deployment Strategies
Development: All-at-Once
Speed: Fastest (30 seconds)
Downtime: Acceptable
Rollback: Manual restart if needed
1. New image deployed
2. All tasks (1) stop immediately
3. New task with updated image starts
4. Health check passes → Ready
Staging: Rolling Deployment
Speed: Moderate (2-3 minutes)
Downtime: None (minimum 1 task always running)
Rollback: Automatic on health check failure
Original State:
Task 1 (running) - Task 2 (running)
Step 1:
Task 1 (new image, starting) - Task 2 (running)
Step 2:
Task 1 (new image, healthy) - Task 2 (running)
Step 3:
Task 1 (new image, running) - Task 2 (old image, stopping)
Step 4:
Task 1 (new image, running) - Task 2 (new image, starting)
Final State:
Task 1 (new image) - Task 2 (new image)
Production: Rolling Deployment with Circuit Breaker
Speed: Moderate (3-5 minutes)
Downtime: None (minimum 2 tasks always running)
Rollback: Automatic on health check failure (circuit breaker)
Deployment Configuration:
{
"maximumPercent": 150,
"minimumHealthyPercent": 100,
"deploymentCircuitBreaker": {
"enable": true,
"rollback": true
}
}
What Circuit Breaker Does:
- If new tasks repeatedly fail to start or fail health checks → automatic rollback (ECS uses a failure threshold of half the desired count, with a minimum of 3 failed tasks)
- Reverts to the previous task definition
- Alerts the on-call engineer
- Further deployments are paused until manual review
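As a rough guide to when the circuit breaker trips, the failure threshold can be estimated from the desired count. The sketch assumes the rule described in the ECS documentation at the time of writing (half the desired count, rounded up, bounded between 3 and 200); treat the bounds as assumptions and check the current AWS documentation before relying on them. The function name is hypothetical.

```python
import math


def circuit_breaker_threshold(desired_count: int) -> int:
    """Estimate how many failed task launches trigger a rollback."""
    return max(3, min(200, math.ceil(0.5 * desired_count)))


print(circuit_breaker_threshold(2))  # production service in this document
```

With only 2 desired tasks the lower bound dominates, so a bad deployment needs a few failed task launches before the rollback starts.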
Example Deployment Process:
Initial State (2 tasks, v1.2.2):
Task 1 [v1.2.2, healthy]
Task 2 [v1.2.2, healthy]
Deployment Starts (target: v1.2.3):
Task 1 [v1.2.2, healthy]
Task 2 [v1.2.2, healthy]
Task 3 [v1.2.3, starting]
Health Check (Task 3):
✓ Passes 2 consecutive checks → Task 3 marked healthy
Replace Phase 1 (old Task 1 stopped, its replacement starting):
Task 1 [v1.2.3, starting]
Task 2 [v1.2.2, healthy]
Task 3 [v1.2.3, healthy]
Health Check (Task 1):
✓ Passes → Task 1 ready
Remove Old Task:
Task 1 [v1.2.3, healthy]
Task 2 [v1.2.2, stopping]
Task 3 [v1.2.3, healthy]
Final State (v1.2.3):
Task 1 [v1.2.3, healthy]
Task 3 [v1.2.3, healthy]
Post-Deployment Monitoring:
• CloudWatch alarms checked for 30 minutes
• Error rates, latency, database connections
• If anomalies detected → manual rollback
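The rollout sequence above follows from the deployment percentages. The simulation below is an illustrative model, not ECS scheduler code: it assumes new tasks become healthy as soon as they start, and the function name simulate_rolling is hypothetical. Each state is a pair (old tasks, new tasks).

```python
def simulate_rolling(desired: int, min_healthy_percent: int, max_percent: int):
    """Return the (old, new) task counts at each step of a rolling deployment."""
    min_healthy = -(-desired * min_healthy_percent // 100)  # rounded up
    max_total = desired * max_percent // 100                # rounded down
    old, new = desired, 0
    states = [(old, new)]
    while old > 0 or new < desired:
        # Start as many new tasks as the maximum percent allows.
        start = min(desired - new, max_total - (old + new))
        if start > 0:
            new += start
            states.append((old, new))
        # Stop as many old tasks as the minimum healthy percent allows.
        stop = min(old, (old + new) - min_healthy)
        if stop > 0:
            old -= stop
            states.append((old, new))
        if start <= 0 and stop <= 0:
            raise ValueError("deployment cannot make progress with these percentages")
    return states


print(simulate_rolling(2, 100, 150))  # production
print(simulate_rolling(2, 50, 200))   # staging
```

For production the model steps through one extra task at a time, which matches the 3-task peak in the example; with staging's 200% maximum, both replacements can start at once.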
Common Deployment Tasks
Manual Deployment
Update ECS Service (point to new image):
# Get the latest active task definition (family name without a revision)
aws ecs describe-task-definition \
--task-definition menotime-backend-prod \
--query 'taskDefinition' > task-def.json
# Update service to use the new task definition revision
# (ECS revisions are integers; replace <revision> with the revision that references image v1.2.4)
aws ecs update-service \
--cluster menotime-prod-cluster \
--service menotime-prod-service \
--task-definition menotime-backend-prod:<revision> \
--force-new-deployment
Monitor Deployment Progress
# Watch deployment progress
aws ecs describe-services \
--cluster menotime-prod-cluster \
--services menotime-prod-service \
--query 'services[0].deployments'
Rollback to Previous Version
# Get the previous task definition revision (second-newest ARN)
aws ecs list-task-definitions \
--family-prefix menotime-backend-prod \
--sort DESC \
--query 'taskDefinitionArns[1]' \
--output text
# Roll back to that revision (48 is an example revision number)
aws ecs update-service \
--cluster menotime-prod-cluster \
--service menotime-prod-service \
--task-definition menotime-backend-prod:48 \
--force-new-deployment
View Task Logs
# Real-time logs
aws logs tail /aws/ecs/menotime-prod --follow
# Last 100 lines from the past hour (aws logs tail has no line-count option, so pipe through tail)
aws logs tail /aws/ecs/menotime-prod --since 1h | tail -n 100
# Logs from specific time range
aws logs filter-log-events \
--log-group-name /aws/ecs/menotime-prod \
--start-time 1675000000000 \
--end-time 1675100000000
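The --start-time and --end-time values are Unix timestamps in milliseconds. A small helper like the one below can produce them from a readable UTC time; the function name to_epoch_ms is hypothetical.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_ms(dt: datetime) -> int:
    """Convert a timezone-aware datetime to Unix epoch milliseconds."""
    if dt.tzinfo is None:
        raise ValueError("datetime must be timezone-aware")
    # Integer division of timedeltas avoids floating-point rounding.
    return (dt - EPOCH) // timedelta(milliseconds=1)


start = datetime(2024, 2, 1, 12, 0, tzinfo=timezone.utc)
end = start + timedelta(hours=1)
print(f"--start-time {to_epoch_ms(start)} --end-time {to_epoch_ms(end)}")
```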
Shell Access to a Running Container (ECS Exec)
# Get task ID
TASK_ID=$(aws ecs list-tasks \
--cluster menotime-prod-cluster \
--service-name menotime-prod-service \
--query 'taskArns[0]' \
--output text | cut -d'/' -f3)
# Open an interactive shell in the container (ECS Exec must be enabled on the service)
aws ecs execute-command \
--cluster menotime-prod-cluster \
--task $TASK_ID \
--container menotime-backend \
--interactive \
--command "/bin/bash"
Cost Optimization
Current Pricing (Production)
vCPU: $0.04048 per vCPU-hour (1 vCPU per task × 2 tasks = $0.08096/hour)
Memory: $0.004445 per GB-hour (2 GB per task × 2 tasks = $0.01778/hour)
Monthly Cost (730 hours): ~$72 (compute only, 2 tasks)
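The monthly figure can be reproduced from the hourly rates. The sketch below uses the rates quoted above as constants; they are taken from this document rather than fetched from AWS, so update them if pricing or region changes. The function name monthly_cost is hypothetical.

```python
VCPU_HOUR = 0.04048   # USD per vCPU-hour (rate quoted above)
GB_HOUR = 0.004445    # USD per GB-hour (rate quoted above)


def monthly_cost(tasks: int, vcpu: float, memory_gb: float, hours: int = 730) -> float:
    """Return the monthly Fargate compute cost in USD for always-on tasks."""
    hourly = tasks * (vcpu * VCPU_HOUR + memory_gb * GB_HOUR)
    return round(hourly * hours, 2)


print("prod (2 tasks, 1 vCPU, 2 GB):", monthly_cost(2, 1, 2))
print("dev (1 task, 0.25 vCPU, 1 GB):", monthly_cost(1, 0.25, 1))
```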
Optimization Opportunities
- Right-size Dev/Staging: Already using 0.25 vCPU (256 CPU units, the smallest Fargate size)
- Fargate Spot: Up to 70% discount; capacity is best effort and tasks can be interrupted with a two-minute warning
- Savings Plans: Compute Savings Plans discount Fargate usage in exchange for a 1- or 3-year commitment
- Off-hours Scaling: Reduce to 1 task during off-hours (needs a scheduled scaling action)
Fargate Spot Configuration (Optional for Non-Production)
{
"capacityProviders": ["FARGATE", "FARGATE_SPOT"],
"defaultCapacityProviderStrategy": [
{
"capacityProvider": "FARGATE_SPOT",
"weight": 70,
"base": 0
},
{
"capacityProvider": "FARGATE",
"weight": 30,
"base": 1
}
]
}
Interpretation: The first task always runs On-Demand (base 1); additional tasks are split roughly 70% Spot (cheaper, interruptible) and 30% On-Demand (reliable)
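To see how base and weight interact, the strategy can be modeled as: satisfy each provider's base first, then split the remaining tasks by weight. This is an approximation for intuition; the exact rounding ECS applies when the split is uneven is an assumption here (largest remainder), and the function name split_tasks is hypothetical.

```python
def split_tasks(total: int, strategy: list[dict]) -> dict[str, int]:
    """Distribute `total` tasks across capacity providers by base, then weight."""
    counts = {s["capacityProvider"]: 0 for s in strategy}
    remaining = total
    # Step 1: satisfy each provider's base.
    for s in strategy:
        take = min(s.get("base", 0), remaining)
        counts[s["capacityProvider"]] += take
        remaining -= take
    total_weight = sum(s["weight"] for s in strategy)
    if remaining == 0 or total_weight == 0:
        return counts
    # Step 2: split the rest by weight, handing leftovers to the largest remainders.
    assigned, remainders = 0, []
    for s in strategy:
        share, rest = divmod(remaining * s["weight"], total_weight)
        counts[s["capacityProvider"]] += share
        assigned += share
        remainders.append((rest, s["capacityProvider"]))
    for _, name in sorted(remainders, reverse=True)[: remaining - assigned]:
        counts[name] += 1
    return counts


strategy = [
    {"capacityProvider": "FARGATE_SPOT", "weight": 70, "base": 0},
    {"capacityProvider": "FARGATE", "weight": 30, "base": 1},
]
print(split_tasks(11, strategy))
```

With a single task, the base guarantees it lands on On-Demand, which is why this strategy keeps one reliable task even when Spot capacity is reclaimed.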
Troubleshooting
Task Won't Start
Symptoms: Task stays in PENDING and never reaches RUNNING
Causes:
1. Insufficient capacity (Fargate has no vCPU available)
2. Invalid task definition (image or tag doesn't exist in ECR)
3. Networking blocks startup (security group or missing route prevents pulling the image or secrets)
Check:
# Inspect stopped task
aws ecs describe-tasks \
--cluster menotime-prod-cluster \
--tasks <task-arn> \
--query 'tasks[0].{lastStatus:lastStatus, stoppedReason:stoppedReason}'
# Check CloudWatch Logs
aws logs tail /aws/ecs/menotime-prod --follow
Task Health Check Failing
Symptoms: Task reaches RUNNING but the ALB marks it unhealthy
Causes:
1. Container listening on the wrong port (check task definition)
2. Health check endpoint unreachable (check security group)
3. Application error (check logs)
Debug:
# Get task ENI (network interface)
aws ecs describe-tasks \
--cluster menotime-prod-cluster \
--tasks <task-arn> \
--query 'tasks[0].attachments[0].details[?name==`networkInterfaceId`].value'
# Check security group allows 8000
aws ec2 describe-security-groups \
--group-ids <security-group-id> \
--query 'SecurityGroups[0].IpPermissions'
High Memory Usage
Symptoms: Memory utilization >85%, tasks killed for running out of memory (OOM)
Causes:
1. Memory leak in application (check logs)
2. Working set larger than allocated (increase task memory)
3. Database connection pool too large
Fix:
1. Increase task definition memory (3 GB is valid for prod at 1 vCPU; a 0.25 vCPU dev/staging task tops out at 2 GB, so raise CPU to 0.5 vCPU to go higher)
2. Profile application memory usage
3. Review logs for "OutOfMemory" messages
Summary
ECS Fargate provides a reliable, serverless container platform for MenoTime. Key operational points:
- Task Definitions: Dev/Staging (0.25 vCPU, 1GB) vs Prod (1 vCPU, 2GB)
- Deployments: Rolling strategy with health checks and circuit breaker
- Scaling: Auto-scaling in production (2-4 tasks); manual in dev
- Health: Continuous monitoring via CloudWatch and ALB target groups
- Troubleshooting: Logs in CloudWatch, task inspection via AWS CLI
For networking configuration, see Networking. For monitoring, see Monitoring.