ECS Fargate

MenoTime runs the FastAPI backend on AWS ECS Fargate, a serverless container orchestration platform that eliminates the need to manage underlying EC2 instances. This document covers task definitions, service configuration, scaling policies, container image management, and deployment strategies.

ECS Architecture

┌─────────────────────────────────────────────┐
│         ECS Clusters (one per env)          │
├─────────────────────────────────────────────┤
│  menotime-dev-cluster                       │
│  menotime-staging-cluster                   │
│  menotime-prod-cluster                      │
└──────────────┬──────────────────────────────┘
               │
       ┌───────┴───────┐
       │               │
  ┌────▼────┐     ┌────▼────┐
  │ Service │     │ Service │
  │(Dev)    │     │(Staging)│
  └────┬────┘     └────┬────┘
       │               │
  ┌────▼────────┬─────▼────┐
  │ ECS Tasks   │ ECS Tasks │
  │(Fargate)    │(Fargate)  │
  └─────────────┴───────────┘

Task Definition

Development & Staging

Family: menotime-backend-{env} (image tag: latest)

{
  "family": "menotime-backend-dev",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "256",
  "memory": "1024",
  "containerDefinitions": [
    {
      "name": "menotime-backend",
      "image": "ACCOUNT_ID.dkr.ecr.us-west-1.amazonaws.com/menotime/backend:latest",
      "portMappings": [
        {
          "containerPort": 8000,
          "hostPort": 8000,
          "protocol": "tcp"
        }
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/aws/ecs/menotime-dev",
          "awslogs-region": "us-west-1",
          "awslogs-stream-prefix": "ecs",
          "awslogs-datetime-format": "%Y-%m-%d %H:%M:%S"
        }
      },
      "environment": [
        {
          "name": "ENVIRONMENT",
          "value": "development"
        },
        {
          "name": "LOG_LEVEL",
          "value": "DEBUG"
        },
        {
          "name": "API_DEBUG",
          "value": "true"
        }
      ],
      "secrets": [
        {
          "name": "DATABASE_URL",
          "valueFrom": "arn:aws:secretsmanager:us-west-1:ACCOUNT_ID:secret:menotime/db/dev"
        },
        {
          "name": "API_KEY_STRIPE",
          "valueFrom": "arn:aws:secretsmanager:us-west-1:ACCOUNT_ID:secret:menotime/api/stripe-test"
        }
      ],
      "healthCheck": {
        "command": [
          "CMD-SHELL",
          "curl -f http://localhost:8000/health || exit 1"
        ],
        "interval": 30,
        "timeout": 5,
        "retries": 2,
        "startPeriod": 60
      }
    }
  ],
  "executionRoleArn": "arn:aws:iam::ACCOUNT_ID:role/ecsTaskExecutionRole",
  "taskRoleArn": "arn:aws:iam::ACCOUNT_ID:role/menotime-ecs-task-role"
}

Production

Family: menotime-backend-prod, image tag v1.2.3 (semantic versioning; task definition revisions themselves are integers)

{
  "family": "menotime-backend-prod",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "1024",
  "memory": "2048",
  "containerDefinitions": [
    {
      "name": "menotime-backend",
      "image": "ACCOUNT_ID.dkr.ecr.us-west-1.amazonaws.com/menotime/backend:v1.2.3",
      "portMappings": [
        {
          "containerPort": 8000,
          "hostPort": 8000,
          "protocol": "tcp"
        }
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/aws/ecs/menotime-prod",
          "awslogs-region": "us-west-1",
          "awslogs-stream-prefix": "ecs",
          "awslogs-datetime-format": "%Y-%m-%d %H:%M:%S"
        }
      },
      "environment": [
        {
          "name": "ENVIRONMENT",
          "value": "production"
        },
        {
          "name": "LOG_LEVEL",
          "value": "WARNING"
        },
        {
          "name": "API_DEBUG",
          "value": "false"
        },
        {
          "name": "SENTRY_SAMPLE_RATE",
          "value": "0.1"
        }
      ],
      "secrets": [
        {
          "name": "DATABASE_URL",
          "valueFrom": "arn:aws:secretsmanager:us-west-1:ACCOUNT_ID:secret:menotime/db/prod"
        },
        {
          "name": "API_KEY_STRIPE",
          "valueFrom": "arn:aws:secretsmanager:us-west-1:ACCOUNT_ID:secret:menotime/api/stripe-live"
        }
      ],
      "healthCheck": {
        "command": [
          "CMD-SHELL",
          "curl -f http://localhost:8000/health || exit 1"
        ],
        "interval": 30,
        "timeout": 5,
        "retries": 3,
        "startPeriod": 60
      }
    }
  ],
  "executionRoleArn": "arn:aws:iam::ACCOUNT_ID:role/ecsTaskExecutionRole",
  "taskRoleArn": "arn:aws:iam::ACCOUNT_ID:role/menotime-ecs-task-role"
}

Task Definition Fields Explained

Field	Value	Purpose
family	`menotime-backend-{env}`	Logical grouping of task definitions
networkMode	`awsvpc`	Uses ENI (elastic network interface) in VPC
cpu	256 (dev/staging), 1024 (prod)	vCPU allocation (256 = 0.25 vCPU)
memory	1024 (dev/staging), 2048 (prod)	Memory in MB
containerPort	8000	FastAPI default port
logDriver	`awslogs`	CloudWatch Logs
logGroup	`/aws/ecs/menotime-{env}`	Log group for centralized logging
healthCheck	curl /health endpoint	Container health validation
executionRoleArn	Task execution role	Pull images, write logs
taskRoleArn	Task role	Application permissions (DB, S3, SES)
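
Fargate only accepts certain cpu/memory pairs, so it can help to check a size before registering a task definition. The following is a minimal sketch, not part of the MenoTime tooling: the function name `is_valid_fargate_size` is hypothetical, and the table covers only sizes up to 4 vCPU.

```python
# Supported Fargate memory values (MiB) for each cpu setting, up to 4 vCPU.
# Larger sizes exist but are left out of this sketch.
FARGATE_MEMORY_BY_CPU = {
    256: [512, 1024, 2048],
    512: list(range(1024, 4096 + 1, 1024)),
    1024: list(range(2048, 8192 + 1, 1024)),
    2048: list(range(4096, 16384 + 1, 1024)),
    4096: list(range(8192, 30720 + 1, 1024)),
}


def is_valid_fargate_size(cpu: int, memory: int) -> bool:
    """Return True if the cpu/memory pair is a supported Fargate combination."""
    return memory in FARGATE_MEMORY_BY_CPU.get(cpu, [])


print(is_valid_fargate_size(256, 1024))   # dev/staging size from this document
print(is_valid_fargate_size(1024, 2048))  # production size from this document
print(is_valid_fargate_size(256, 3072))   # 3 GB needs at least 512 cpu units
```

The first two calls print True and the third prints False, which is why raising memory beyond 2 GB on a 256-unit task also requires raising cpu.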

ECS Service Configuration

Development Service

Service Name: menotime-dev-service
Cluster: menotime-dev-cluster
Task Definition: menotime-backend-dev (latest revision)
Desired Count: 1
Deployment Strategy: All-at-once (fastest, suitable for dev)
Health Check Grace Period: 60 seconds
Load Balancer Target Group: menotime-backend-tg
Subnets: Private subnets (us-west-1a, us-west-1b)
Security Group: menotime-ecs-sg

Staging Service

Service Name: menotime-staging-service
Cluster: menotime-staging-cluster
Task Definition: menotime-backend-staging (latest revision)
Desired Count: 2
Deployment Strategy: Rolling (1 task minimum always running)
Health Check Grace Period: 60 seconds
Load Balancer Target Group: menotime-backend-tg (shared ALB)
Deployment Configuration:
  - Maximum percent: 200% (allows up to 4 tasks during rollout)
  - Minimum healthy percent: 50% (at least 1 task running)
Subnets: Private subnets
Security Group: menotime-ecs-sg

Production Service

Service Name: menotime-prod-service
Cluster: menotime-prod-cluster
Task Definition: menotime-backend-prod (revision pinned to image v1.2.3)
Desired Count: 2
Deployment Strategy: Rolling (blue/green capable with manual triggering)
Health Check Grace Period: 60 seconds
Load Balancer Target Group: menotime-backend-tg
Deployment Configuration:
  - Maximum percent: 150% (allows up to 3 tasks during rollout)
  - Minimum healthy percent: 100% (minimum 2 tasks always running)
Subnets: Private subnets
Security Group: menotime-ecs-sg
Enable Circuit Breaker: Yes (automatic rollback on failed deployment)
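
The deployment percentages translate into task counts relative to the desired count. As a rough sketch (the helper name `rolling_bounds` is hypothetical), ECS rounds the minimum healthy count up and the maximum count down to whole tasks:

```python
def rolling_bounds(desired: int, min_healthy_pct: int, max_pct: int) -> tuple[int, int]:
    """Return (minimum healthy tasks, maximum total tasks) during a rollout."""
    min_healthy = -(-desired * min_healthy_pct // 100)  # ceiling division
    max_total = desired * max_pct // 100                # floor division
    return min_healthy, max_total


print("staging:", rolling_bounds(2, 50, 200))      # desired 2, 50% / 200%
print("production:", rolling_bounds(2, 100, 150))  # desired 2, 100% / 150%
```

For staging this gives at least 1 healthy task and at most 4 running; for production, at least 2 healthy and at most 3 running.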

Container Image Pipeline

Image Building & Registry

Repository: menotime/backend in ECR (Elastic Container Registry)

Image Tagging Strategy:

Development:
  - branch: develop → tag: develop
  - commit: a1b2c3d → tag: a1b2c3d
  - PR: #42 → tag: pr-42

Staging:
  - branch: main → tag: main
  - commit: x9y8z7w → tag: x9y8z7w
  - PR: #43 → tag: pr-43

Production:
  - tag: v1.2.3 → tag: v1.2.3 (semantic version)
  - tag: v1.2.3 → tag: latest (always points to most recent)
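
The tagging rules above can be expressed as a small mapping from a Git ref to image tags. This is an illustrative sketch only: `image_tags` is a hypothetical helper, the `refs/pull/<n>/merge` form is how GitHub names pull request refs, and pairing every branch or PR build with a commit tag is an assumption based on the list above.

```python
import re


def image_tags(ref: str, short_sha: str) -> list[str]:
    """Map a Git ref to the ECR tags described in the tagging strategy."""
    if ref.startswith("refs/tags/v"):
        # Release tag: publish the version and move `latest` to it.
        return [ref.removeprefix("refs/tags/"), "latest"]
    if ref.startswith("refs/heads/"):
        # Branch build: tag with the branch name and the commit.
        return [ref.removeprefix("refs/heads/"), short_sha]
    pull_request = re.fullmatch(r"refs/pull/(\d+)/merge", ref)
    if pull_request:
        return [f"pr-{pull_request.group(1)}", short_sha]
    raise ValueError(f"unsupported ref: {ref}")


print(image_tags("refs/heads/develop", "a1b2c3d"))
print(image_tags("refs/pull/42/merge", "a1b2c3d"))
print(image_tags("refs/tags/v1.2.3", "a1b2c3d"))
```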

Build Process (CI/CD)

Triggered on: Pushes to develop and main, and v* tags

# GitHub Actions Workflow (example)
name: Build and Push Docker Image

on:
  push:
    branches: [develop, main]
    tags: ['v*']

# Assumption: AWS credentials are already available to the job, and the
# registry host is stored in a repository secret named ECR_REGISTRY.
env:
  ECR_REGISTRY: ${{ secrets.ECR_REGISTRY }}

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Login to ECR
        run: |
          aws ecr get-login-password --region us-west-1 | \
            docker login --username AWS --password-stdin "$ECR_REGISTRY"

      - name: Build image
        run: |
          docker build -t menotime/backend:${{ github.sha }} .

      - name: Tag and push image
        run: |
          # Push with commit SHA
          docker tag menotime/backend:${{ github.sha }} \
            "$ECR_REGISTRY/menotime/backend:${{ github.sha }}"
          docker push "$ECR_REGISTRY/menotime/backend:${{ github.sha }}"

          # Push with branch (or release tag) name
          docker tag menotime/backend:${{ github.sha }} \
            "$ECR_REGISTRY/menotime/backend:${{ github.ref_name }}"
          docker push "$ECR_REGISTRY/menotime/backend:${{ github.ref_name }}"

          # Push latest for main branch
          if [ "${{ github.ref }}" = "refs/heads/main" ]; then
            docker tag menotime/backend:${{ github.sha }} \
              "$ECR_REGISTRY/menotime/backend:latest"
            docker push "$ECR_REGISTRY/menotime/backend:latest"
          fi

      - name: Scan image for vulnerabilities
        run: |
          aws ecr start-image-scan \
            --region us-west-1 \
            --repository-name menotime/backend \
            --image-id imageTag=${{ github.sha }}

Image Scanning

Automatic scanning on every push using Trivy (or AWS native scanning).

Findings:
- Critical/High CVEs: Block deployment
- Medium CVEs: Manual review required
- Low CVEs: Logged for tracking

Base Image: python:3.11-slim
- Regularly patched (scan at least weekly)
- Prefer slim variant to reduce surface area
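
The severity policy above could be enforced in CI with a small gate. The sketch below assumes the counts arrive keyed by severity name, which is the shape of `findingSeverityCounts` returned by `aws ecr describe-image-scan-findings`; the function name `scan_gate` and the three return values are hypothetical.

```python
def scan_gate(severity_counts: dict[str, int]) -> str:
    """Return "block", "review", or "pass" for a scan's finding counts."""
    if severity_counts.get("CRITICAL", 0) or severity_counts.get("HIGH", 0):
        return "block"    # Critical/High CVEs stop the deployment
    if severity_counts.get("MEDIUM", 0):
        return "review"   # Medium CVEs need a manual decision
    return "pass"         # Low findings are only logged


print(scan_gate({"HIGH": 1, "LOW": 4}))
print(scan_gate({"MEDIUM": 2}))
print(scan_gate({"LOW": 7}))
```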

Lifecycle Policy

ECR Image Retention:

{
  "rules": [
    {
      "rulePriority": 1,
      "description": "Keep last 10 semantic versions",
      "selection": {
        "tagStatus": "tagged",
        "tagPrefixList": ["v"],
        "countType": "imageCountMoreThan",
        "countNumber": 10
      },
      "action": {
        "type": "expire"
      }
    },
    {
      "rulePriority": 2,
      "description": "Delete untagged images after 30 days",
      "selection": {
        "tagStatus": "untagged",
        "countType": "sinceImagePushed",
        "countUnit": "days",
        "countNumber": 30
      },
      "action": {
        "type": "expire"
      }
    }
  ]
}
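
To preview what the two rules would remove, the policy can be approximated offline. This is a simplified sketch under stated assumptions: images are plain dicts with hypothetical keys `digest`, `tags`, and `pushed_at`, and rule evaluation is reduced to "keep the newest 10 images with a v-prefixed tag" plus "expire untagged images older than 30 days". ECR's own evaluation has more nuances, so treat the result as an estimate.

```python
from datetime import datetime, timedelta, timezone


def expired_images(images, now, keep_versions=10, untagged_days=30):
    """Return the digests the two lifecycle rules would expire."""
    expired = []
    # Rule 1: among images with a tag starting with "v", keep the newest N.
    versioned = sorted(
        (image for image in images if any(tag.startswith("v") for tag in image["tags"])),
        key=lambda image: image["pushed_at"],
        reverse=True,
    )
    expired += [image["digest"] for image in versioned[keep_versions:]]
    # Rule 2: untagged images pushed more than N days ago.
    cutoff = now - timedelta(days=untagged_days)
    expired += [
        image["digest"]
        for image in images
        if not image["tags"] and image["pushed_at"] < cutoff
    ]
    return expired


now = datetime(2024, 2, 1, tzinfo=timezone.utc)
sample = [
    {"digest": "sha256:aaa", "tags": ["v1.2.3"], "pushed_at": now - timedelta(days=2)},
    {"digest": "sha256:bbb", "tags": [], "pushed_at": now - timedelta(days=45)},
]
print(expired_images(sample, now))  # only the old untagged image qualifies
```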

Health Checks

Container-Level Health Check

Type: HTTP endpoint polling

Endpoint: GET /health

Response (production):

{
  "status": "healthy",
  "version": "1.2.3",
  "database": "connected",
  "timestamp": "2024-02-01T12:00:00Z"
}

Configuration:

Interval: 30 seconds
Timeout: 5 seconds
Retries: 3 consecutive failures mark the container unhealthy (2 in dev/staging)
Start Period: 60 seconds (failed checks in this window do not count toward retries)

What the endpoint checks:
1. FastAPI running and responding
2. Database connection status
3. Required services accessible (Secrets Manager, S3)
4. Response time <100ms
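
The task definitions call `curl` for the container health check, and slim base images often do not include it. If that turns out to be the case for this image, a probe written against the Python standard library is one alternative. The sketch below is an assumption-laden example, not the project's actual check: `probe` is a hypothetical name, and the default URL and timeout simply mirror the values used in this document.

```python
import urllib.error
import urllib.request


def probe(url: str = "http://localhost:8000/health", timeout: float = 5.0) -> int:
    """Return 0 if the endpoint answers with HTTP 200, otherwise 1."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 0 if response.status == 200 else 1
    except (urllib.error.URLError, OSError):
        # Covers refused connections, timeouts, and HTTP error statuses.
        return 1
```

A health check script could end with `raise SystemExit(probe())` so that the exit code carries the result, which is the contract a `CMD-SHELL` health check relies on.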

ALB Target Group Health Check

Path: /health

Protocol: HTTP

Port: 8000

Configuration:

Interval: 30 seconds
Timeout: 5 seconds
Healthy Threshold: 2
Unhealthy Threshold: 3
Matcher: 200 (HTTP 200 response)

Deployment Health Check

During rolling deployment:
1. New task starts
2. ALB health checks begin; failures are ignored during the 60-second grace period
3. Task must pass 2 consecutive health checks (60 seconds at a 30-second interval)
4. Old task terminates
5. Monitor for 30 minutes post-deployment

Auto-Scaling

Development

Auto-scaling: Disabled (manual scaling only)

Staging

Auto-scaling: Minimal

Min Tasks: 2
Max Tasks: 3
Target Tracking: CPU 70% / Memory 80%
Scale-up Cooldown: 60 seconds
Scale-down Cooldown: 300 seconds

Rationale: Cost control while enabling performance testing

Production

Auto-scaling: Enabled (intelligent scaling)

Min Tasks: 2 (high availability; if 1 fails, service continues)
Max Tasks: 4 (accommodate growth; cost cap)
Target Tracking Metrics:
  - CPU: 60% (scale up at 60% CPU usage)
  - Memory: 75% (scale up at 75% memory usage)
  - ALB Request Count Per Target: 1000 (scale if avg >1000 requests/task)

Scale-up Cooldown: 60 seconds (faster to handle spikes)
Scale-down Cooldown: 600 seconds (conservative to avoid thrashing)

Scaling Behavior:

Load Increase (Patient Activity):
  Task count: 2 → 3 (60% CPU trigger)
  Avg response time: 200ms → 150ms (better responsiveness)
  Task health: All healthy

Load Decrease (Off-hours):
  Task count: 3 → 2 (CPU well below the 60% target, e.g. 30%)
  Cost reduction: ~$0.05/hour (1 fewer task: 1 vCPU + 2 GB)
  Scale-down waits 10 minutes (avoid thrashing)
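
Target tracking tries to keep a metric near its target by changing the task count roughly in proportion to how far the metric is from that target. The function below is only an approximation for reasoning about the numbers above: the real service also accounts for cooldowns and alarm evaluation periods, and the name `estimate_desired_tasks` is hypothetical.

```python
import math


def estimate_desired_tasks(current: int, metric: float, target: float,
                           min_tasks: int, max_tasks: int) -> int:
    """Estimate the task count that would bring `metric` back to `target`."""
    proportional = math.ceil(current * metric / target)
    # Clamp to the configured scaling limits.
    return max(min_tasks, min(max_tasks, proportional))


# Production limits from this document: 2 to 4 tasks, 60% CPU target.
print(estimate_desired_tasks(2, 90, 60, 2, 4))   # load increase
print(estimate_desired_tasks(3, 30, 60, 2, 4))   # off-hours
print(estimate_desired_tasks(2, 200, 60, 2, 4))  # spike capped by max tasks
```

With these inputs the estimates are 3, 2, and 4 tasks respectively.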

Scaling Commands

View Current Scaling:

aws application-autoscaling describe-scalable-targets \
  --service-namespace ecs \
  --resource-ids service/menotime-prod-cluster/menotime-prod-service

Update Scaling Limits (this example raises the maximum from 4 to 5 tasks):

aws application-autoscaling register-scalable-target \
  --service-namespace ecs \
  --resource-id service/menotime-prod-cluster/menotime-prod-service \
  --scalable-dimension ecs:service:DesiredCount \
  --min-capacity 2 \
  --max-capacity 5

Deployment Strategies

Development: All-at-Once

Speed: Fastest (30 seconds)
Downtime: Acceptable
Rollback: Manual restart if needed

1. New image deployed
2. All tasks (1) stop immediately
3. New task with updated image starts
4. Health check passes → Ready

Staging: Rolling Deployment

Speed: Moderate (2-3 minutes)
Downtime: None (minimum 1 task always running)
Rollback: Automatic on health check failure

Original State:
  Task 1 (running) - Task 2 (running)

Step 1:
  Task 1 (new image, starting) - Task 2 (running)

Step 2:
  Task 1 (new image, healthy) - Task 2 (running)

Step 3:
  Task 1 (new image, running) - Task 2 (old image, stopping)

Step 4:
  Task 1 (new image, running) - Task 2 (new image, starting)

Final State:
  Task 1 (new image) - Task 2 (new image)

Production: Rolling Deployment with Circuit Breaker

Speed: Moderate (3-5 minutes)
Downtime: None (minimum 2 tasks always running)
Rollback: Automatic on health check failure (circuit breaker)

Deployment Configuration:

{
  "maximumPercent": 150,
  "minimumHealthyPercent": 100,
  "deploymentCircuitBreaker": {
    "enable": true,
    "rollback": true
  }
}

What Circuit Breaker Does:
- If new tasks repeatedly fail to start or fail health checks → automatic rollback
- Reverts to the last successfully deployed task definition
- Alerts triggered to on-call engineer
- Changes halted; no additional deployments allowed until manual review

Example Deployment Process:

Initial State (2 tasks, v1.2.2):
  Task 1 [v1.2.2, healthy]
  Task 2 [v1.2.2, healthy]

Deployment Starts (target: v1.2.3):
  Task 1 [v1.2.2, healthy]
  Task 2 [v1.2.2, healthy]
  Task 3 [v1.2.3, starting]

Health Check (Task 3):
  ✓ Passes 2 consecutive checks → Task 3 marked healthy

Scale Down Phase 1:
  Task 1 [v1.2.3, starting]
  Task 2 [v1.2.2, healthy]
  Task 3 [v1.2.3, healthy]

Health Check (Task 1):
  ✓ Passes → Task 1 ready

Remove Old Task:
  Task 1 [v1.2.3, healthy]
  Task 2 [v1.2.2, stopping]
  Task 3 [v1.2.3, healthy]

Final State (v1.2.3):
  Task 1 [v1.2.3, healthy]
  Task 3 [v1.2.3, healthy]

Post-Deployment Monitoring:
  • CloudWatch alarms checked for 30 minutes
  • Error rates, latency, database connections
  • If anomalies detected → manual rollback
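
The sequence above follows from the two deployment percentages. The simulation below is a simplified sketch: it treats every new task as healthy as soon as it starts and tracks only counts, whereas the scheduler waits for health checks. The function name `simulate_rolling` is hypothetical.

```python
def simulate_rolling(desired: int, min_healthy_pct: int, max_pct: int):
    """Return the (old tasks, new tasks) counts after each rollout step."""
    min_healthy = -(-desired * min_healthy_pct // 100)  # rounded up
    max_total = desired * max_pct // 100                # rounded down
    old, new = desired, 0
    states = [(old, new)]
    while old > 0 or new < desired:
        # Start as many new tasks as the maximum percent allows.
        can_start = min(desired - new, max_total - (old + new))
        if can_start > 0:
            new += can_start
            states.append((old, new))
        # Stop as many old tasks as the minimum healthy percent allows.
        can_stop = min(old, (old + new) - min_healthy)
        if can_stop > 0:
            old -= can_stop
            states.append((old, new))
        if can_start <= 0 and can_stop <= 0:
            raise ValueError("deployment cannot make progress with these limits")
    return states


print(simulate_rolling(2, 100, 150))  # production settings
print(simulate_rolling(2, 50, 200))   # staging settings
```

With the production settings the counts move through (2, 0), (2, 1), (1, 1), (1, 2), (0, 2), so the running total alternates between 2 and 3 tasks. A configuration such as 100% minimum with 100% maximum raises an error because no task can be started or stopped.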

Common Deployment Tasks

Manual Deployment

Update ECS Service (point to new image):

# Save the latest ACTIVE task definition for reference
# (omit the revision to get the newest one)
aws ecs describe-task-definition \
  --task-definition menotime-backend-prod \
  --query 'taskDefinition' > task-def.json

# Update the service. With no revision given, ECS uses the latest ACTIVE
# revision; append an integer revision (for example :49) to pin one.
aws ecs update-service \
  --cluster menotime-prod-cluster \
  --service menotime-prod-service \
  --task-definition menotime-backend-prod \
  --force-new-deployment

Monitor Deployment Progress

# Watch deployment progress
aws ecs describe-services \
  --cluster menotime-prod-cluster \
  --services menotime-prod-service \
  --query 'services[0].deployments'

Rollback to Previous Version

# Get the previous task definition revision (index 0 is the newest)
PREVIOUS=$(aws ecs list-task-definitions \
  --family-prefix menotime-backend-prod \
  --sort DESC \
  --query 'taskDefinitionArns[1]' \
  --output text)

# Roll back to it (same as passing family:revision, e.g. menotime-backend-prod:48)
aws ecs update-service \
  --cluster menotime-prod-cluster \
  --service menotime-prod-service \
  --task-definition "$PREVIOUS" \
  --force-new-deployment

View Task Logs

# Real-time logs
aws logs tail /aws/ecs/menotime-prod --follow

# Last 100 lines from the past hour
aws logs tail /aws/ecs/menotime-prod --since 1h | tail -n 100

# Logs from specific time range
aws logs filter-log-events \
  --log-group-name /aws/ecs/menotime-prod \
  --start-time 1675000000000 \
  --end-time 1675100000000
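
The `--start-time` and `--end-time` values are milliseconds since the Unix epoch. A small helper (hypothetical name `to_epoch_millis`) can produce them from readable timestamps:

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(moment: datetime) -> int:
    """Convert a timezone-aware datetime to milliseconds since the Unix epoch."""
    if moment.tzinfo is None:
        raise ValueError("use a timezone-aware datetime to avoid local-time surprises")
    # Integer division of timedeltas keeps the result exact.
    return (moment - EPOCH) // timedelta(milliseconds=1)


start = to_epoch_millis(datetime(2023, 1, 29, 13, 46, 40, tzinfo=timezone.utc))
end = to_epoch_millis(datetime(2023, 1, 30, 17, 33, 20, tzinfo=timezone.utc))
print(start, end)  # 1675000000000 1675100000000, the range used above
```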

Open a Shell in a Running Container (ECS Exec)

# Get task ID
TASK_ID=$(aws ecs list-tasks \
  --cluster menotime-prod-cluster \
  --service-name menotime-prod-service \
  --query 'taskArns[0]' \
  --output text | cut -d'/' -f3)

# Execute command in container (requires ECS Exec enabled on the service
# and SSM messaging permissions on the task role)
aws ecs execute-command \
  --cluster menotime-prod-cluster \
  --task $TASK_ID \
  --container menotime-backend \
  --interactive \
  --command "/bin/bash"

Cost Optimization

Current Pricing (Production)

vCPU (1 per task × 2 tasks): $0.04048 per vCPU-hour
Memory (2 GB per task × 2 tasks): $0.004445 per GB-hour
Monthly Cost (730 hours): ~$72 at the 2-task baseline (compute only; more when scaled out)
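
The monthly figure can be reproduced from the rates. This sketch assumes the two prices above are per-vCPU-hour and per-GB-hour rates and that tasks run for the full 730 hours; the function name is hypothetical, and current regional pricing should be checked before relying on the result.

```python
def monthly_compute_cost(tasks: int, vcpu_per_task: float, gb_per_task: float,
                         vcpu_hour_rate: float = 0.04048,
                         gb_hour_rate: float = 0.004445,
                         hours: int = 730) -> float:
    """Return the Fargate compute cost in dollars for a month of steady usage."""
    per_task_hour = vcpu_per_task * vcpu_hour_rate + gb_per_task * gb_hour_rate
    return tasks * per_task_hour * hours


print(round(monthly_compute_cost(2, 1, 2), 2))  # 2-task baseline
print(round(monthly_compute_cost(4, 1, 2), 2))  # scaled out to the 4-task maximum
```

At these rates the baseline comes to about $72 per month and the 4-task maximum to about $144, so the actual bill depends on how often the service scales out.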

Optimization Opportunities

Right-size Dev/Staging: Already using 0.25 vCPU (the smallest Fargate size)
Spot Instances: ECS Fargate Spot offers up to 70% discount (best effort, can be interrupted)
Savings Plans: Compute Savings Plans discount Fargate On-Demand usage with 1- or 3-year commitments
Off-hours Scaling: Reduce to 1 task during off-hours (needs manual schedule)

Fargate Spot Configuration (Optional for Non-Production)

{
  "capacityProviders": ["FARGATE", "FARGATE_SPOT"],
  "defaultCapacityProviderStrategy": [
    {
      "capacityProvider": "FARGATE_SPOT",
      "weight": 70,
      "base": 0
    },
    {
      "capacityProvider": "FARGATE",
      "weight": 30,
      "base": 1
    }
  ]
}

Interpretation: the first task always runs On-Demand (base: 1); additional tasks are split roughly 70% Spot (cheaper, interruptible) and 30% On-Demand (reliable)

Troubleshooting

Task Won't Start

Symptoms: Task enters PENDING state and never becomes RUNNING

Causes:
1. Insufficient capacity (Fargate doesn't have vCPU available)
2. Invalid task definition (image doesn't exist)
3. Security group blocking task

Check:

# Inspect stopped task
aws ecs describe-tasks \
  --cluster menotime-prod-cluster \
  --tasks <task-arn> \
  --query 'tasks[0].{lastStatus:lastStatus, stoppedReason:stoppedReason}'

# Check CloudWatch Logs
aws logs tail /aws/ecs/menotime-prod --follow

Task Health Check Failing

Symptoms: Task enters RUNNING state but ALB marks as unhealthy

Causes:
1. Container listening on wrong port (check task definition)
2. Health check endpoint unreachable (check security group)
3. Application error (check logs)

Debug:

# Get task ENI (network interface)
aws ecs describe-tasks \
  --cluster menotime-prod-cluster \
  --tasks <task-arn> \
  --query 'tasks[0].attachments[0].details[?name==`networkInterfaceId`].value'

# Check security group allows 8000
aws ec2 describe-security-groups \
  --group-ids <security-group-id> \
  --query 'SecurityGroups[0].IpPermissions'

High Memory Usage

Symptoms: Memory utilization >85%, tasks OOM killed

Causes:
1. Memory leak in application (check logs)
2. Working set larger than allocated (increase task memory)
3. Database connection pool too large

Fix:
1. Raise task memory in the task definition (for example 3 GB in prod; staging's 0.25 vCPU supports at most 2 GB, so raise CPU as well if it needs more)
2. Profile application memory usage
3. Review logs for "OutOfMemory" messages

Summary

ECS Fargate provides a reliable, serverless container platform for MenoTime. Key operational points:

Task Definitions: Dev/Staging (0.25 vCPU, 1GB) vs Prod (1 vCPU, 2GB)
Deployments: Rolling strategy with health checks and circuit breaker
Scaling: Auto-scaling in production (2-4 tasks); manual in dev
Health: Continuous monitoring via CloudWatch and ALB target groups
Troubleshooting: Logs in CloudWatch, task inspection via AWS CLI

For networking configuration, see Networking. For monitoring, see Monitoring.