Monitoring
MenoTime relies on comprehensive monitoring to maintain platform health, detect issues early, and ensure HIPAA compliance. This document covers CloudWatch dashboards, metrics, alarms, GuardDuty, log aggregation, and alerting strategies.
Monitoring Architecture
┌─────────────────────────────────────────────────┐
│ Application & Infrastructure Metrics            │
├─────────────────────────────────────────────────┤
│                                                 │
│ ECS Tasks → CloudWatch Metrics                  │
│ RDS Database → Performance Insights             │
│ ALB → Request/Response Metrics                  │
│ API → Custom Metrics (response time, errors)    │
│                                                 │
└──────────────┬──────────────────────────────────┘
               │
    ┌──────────┼──────────┐
    │          │          │
    ▼          ▼          ▼
┌────────┐ ┌────────┐ ┌──────────┐
│ Cloud  │ │ Cloud  │ │          │
│ Watch  │ │ Watch  │ │GuardDuty │
│Metrics │ │ Logs   │ │(Security)│
│        │ │ (Agg)  │ │          │
└────┬───┘ └──┬─────┘ └────┬─────┘
     │        │            │
     └────────┼────────────┘
              │
     ┌────────▼────────┐
     │   CloudWatch    │
     │  Alarms (SNS)   │
     └────────┬────────┘
              │
     ┌────────┼────────┐
     │        │        │
     ▼        ▼        ▼
 PagerDuty  Slack    Email
CloudWatch Dashboards
Production Dashboard
Name: MenoTime-Production
Refresh Rate: 1 minute
Sections:
1. Application Health (Top-Level)
Metrics Displayed:
┌───────────────────────────────────────────┐
│ ECS Task Count         ALB Status         │
│ 2 tasks running        Healthy: 2/2       │
│ (Green if 2+, Red < 2) (Green if all)     │
│                                           │
│ Database Status        Error Rate         │
│ Connected: ✓           0.2% (Red > 1%)    │
│                                           │
│ Response Time P99      Memory Usage       │
│ 245ms (Yellow > 500ms) 72% (Yellow > 75%) │
└───────────────────────────────────────────┘
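The colour rules in the mock-up above can be expressed as a small function. This is a minimal sketch for reasoning about the thresholds, not the dashboard's actual configuration; the `HealthSnapshot` type and the widget key names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HealthSnapshot:
    """Hypothetical container for the top-level health values."""
    running_tasks: int
    error_rate_pct: float
    p99_ms: float
    memory_pct: float


def widget_colors(s: HealthSnapshot) -> dict[str, str]:
    """Map each top-level value to the colour rule shown in the mock-up."""
    return {
        # Green if 2+ tasks are running, red below 2.
        "ecs_task_count": "green" if s.running_tasks >= 2 else "red",
        # Red once the error rate exceeds 1%.
        "error_rate": "red" if s.error_rate_pct > 1.0 else "green",
        # Yellow once P99 exceeds 500 ms.
        "response_time_p99": "yellow" if s.p99_ms > 500 else "green",
        # Yellow once memory usage exceeds 75%.
        "memory_usage": "yellow" if s.memory_pct > 75 else "green",
    }
```

With the sample values from the mock-up (2 tasks, 0.2% errors, 245 ms, 72% memory), every widget evaluates to green.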
2. API Performance
| Metric | Threshold | Unit | Visualization |
|---|---|---|---|
| ALB Target Response Time (P50) | <100ms | ms | Line chart (1h) |
| ALB Target Response Time (P95) | <500ms | ms | Line chart (1h) |
| ALB Target Response Time (P99) | <1000ms | ms | Line chart (1h) |
| Request Count (per minute) | Baseline | count | Area chart (6h) |
| HTTP 4xx Errors | <1% | % | Line chart (1h) |
| HTTP 5xx Errors | 0 | count | Line chart (1h, alarm on >0) |
3. ECS Task Health
| Metric | Threshold | Unit | Visualization |
|---|---|---|---|
| Task Count | 2-4 | count | Number widget |
| CPU Utilization | <60% | % | Line chart (1h) |
| Memory Utilization | <75% | % | Line chart (1h) |
| Network In | Baseline | bytes/sec | Area chart (6h) |
| Network Out | Baseline | bytes/sec | Area chart (6h) |
4. Database Health
| Metric | Threshold | Unit | Visualization |
|---|---|---|---|
| Database Connections | <400 (80%) | count | Line chart (1h) |
| Database CPU | <75% | % | Line chart (1h) |
| Database Memory | <85% | % | Line chart (1h) |
| Read Latency | <5ms | ms | Line chart (1h) |
| Write Latency | <10ms | ms | Line chart (1h) |
| Disk Queue Depth | <10 | count | Line chart (1h) |
| Storage Used | <80% | % | Line chart (24h) |
5. Load Balancer
| Metric | Threshold | Unit | Visualization |
|---|---|---|---|
| New Connection Count | Baseline | count | Area chart (6h) |
| Active Connection Count | Baseline | count | Line chart (6h) |
| Processed Bytes | Baseline | bytes | Area chart (6h) |
| Target Connection Errors | 0 | count | Line chart (1h) |
6. Security & Compliance
| Metric | Threshold | Unit | Visualization |
|---|---|---|---|
| GuardDuty Finding Count | 0 (medium+) | count | Number widget |
| WAF Blocked Requests | Baseline | count | Line chart (1h) |
| Unauthorized API Calls | 0 | count | Number widget (1h) |
Staging & Development Dashboards
Name: MenoTime-Staging, MenoTime-Development
Simplified views (same metrics as production but without severity-based formatting)
Key Metrics & Thresholds
ECS Metrics
Metric: ecs:service:DesiredCount
Description: Number of tasks you want running
Threshold: Prod: 2-4, Staging: 2-3, Dev: 1
Action: Manual adjustment (dev), auto-scaling (prod)
Metric: ecs:service:RunningCount
Description: Number of tasks actually running
Threshold: Should equal DesiredCount
Action: Alert if less (unhealthy tasks)
Metric: CPUUtilization
Description: CPU usage per task
Threshold: Prod: >60% (scale up), <30% (scale down)
Action: Auto-scaling trigger
Metric: MemoryUtilization
Description: Memory usage per task
Threshold: Prod: >75% (scale up), <40% (scale down)
Action: Auto-scaling trigger
Metric: NetworkIn
Description: Inbound network traffic
Threshold: Baseline ~1MB/min (spike detection)
Action: Investigate if >10× baseline
Metric: NetworkOut
Description: Outbound network traffic
Threshold: Baseline ~2MB/min (spike detection)
Action: Investigate if >10× baseline
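The production CPU and memory thresholds above imply a scaling rule along the following lines. This is a sketch only: the real scaling is performed by the auto-scaling policy, and the one-task step size and the choice to scale down only when both metrics are low are assumptions made for illustration.

```python
def scaling_decision(cpu_pct: float, memory_pct: float, running: int,
                     min_tasks: int = 2, max_tasks: int = 4) -> int:
    """Return the desired task count for the production thresholds.

    Assumptions (not from the source): scale one task at a time, scale up
    when either metric is high, scale down only when both are low.
    """
    if cpu_pct > 60 or memory_pct > 75:
        return min(running + 1, max_tasks)   # scale up, capped at 4
    if cpu_pct < 30 and memory_pct < 40:
        return max(running - 1, min_tasks)   # scale down, floor of 2
    return running                           # inside the target band
```

Requiring both metrics to be low before scaling down is the conservative choice: it avoids removing a task while memory is still under pressure.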
RDS Metrics
Metric: DatabaseConnections
Description: Active database connections
Threshold: Prod: >400 (80% of 500 max), Staging: >300
Action: Alert; investigate connection leak
Recommendation: Upgrade to xlarge if sustained
Metric: CPUUtilization
Description: Database CPU usage
Threshold: Prod: >75%, Staging: >85%
Action: Alert; review slow queries
Recommendation: Optimize queries or scale up
Metric: DatabaseMemoryUsagePercentage
Description: RAM utilization
Threshold: >85%
Action: Alert; consider scaling or query optimization
Note: db.m7g.large = 8GB total
Metric: DiskQueueDepth
Description: Count of I/O requests waiting
Threshold: >10
Action: Alert; I/O bottleneck detected
Recommendation: Increase IOPS or investigate slow queries
Metric: ReadLatency
Description: Time to read from disk
Threshold: >5ms (sustained)
Action: Alert; investigate slow I/O
Note: Normal: 1-3ms for gp3 SSD
Metric: WriteLatency
Description: Time to write to disk
Threshold: >10ms (sustained)
Action: Alert; check WAL activity
Note: Normal: 2-5ms for gp3 SSD
Metric: TransactionLogsDiskUsage
Description: Transaction log disk usage
Threshold: >80GB
Action: Alert; backup/archive logs
Note: Prevents "low storage" errors
Metric: StorageSpace
Description: Total database storage used
Threshold: >80% of allocated
Action: Alert; expand storage before hitting limit
Note: Prod: 1TB, Staging: 500GB
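Several RDS thresholds above are qualified as "sustained". One way to read that, assuming it means the last N consecutive datapoints all exceed the threshold, is sketched below; the function name and the choice of N are hypothetical.

```python
from typing import Sequence


def sustained_breach(datapoints: Sequence[float], threshold: float,
                     periods: int) -> bool:
    """True if the most recent `periods` datapoints all exceed `threshold`.

    A single spike does not count; too few datapoints is treated as
    "not breaching" (an assumption, similar to ignoring missing data).
    """
    if periods <= 0 or len(datapoints) < periods:
        return False
    return all(value > threshold for value in datapoints[-periods:])
```

For example, read latencies of 2, 6, 7 and 8 ms breach the 5 ms threshold for three consecutive periods, while 6, 7, 2 and 8 ms do not.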
ALB Metrics
Metric: TargetResponseTime (P95)
Description: 95th percentile response time
Threshold: Prod: >500ms
Action: Scale up or optimize backend
Note: Includes network latency + processing time
Metric: TargetResponseTime (P99)
Description: 99th percentile response time
Threshold: Prod: >1000ms
Action: Alert; investigate slow requests
Note: P99 indicates tail latency issues
Metric: HTTPCode_Target_5XX_Count
Description: Backend (ECS) error responses
Threshold: >0 per minute
Action: Alert; critical issue
Note: Indicates application crash or unhandled exception
Metric: HTTPCode_Target_4XX_Count
Description: Client error responses (400, 404, etc.)
Threshold: >5% of total requests
Action: Investigate; may indicate bad data or API changes
Metric: RequestCount
Description: Total requests processed
Threshold: Baseline metric (trend analysis)
Action: Spike >2× baseline warrants investigation
Note: Used for capacity planning
Metric: ActiveConnectionCount
Description: Open connections to targets
Threshold: Baseline metric
Action: Spike indicates heavy traffic or connection leak
Metric: TargetConnectionErrorCount
Description: Failed connections to targets
Threshold: >0
Action: Alert; target is unhealthy or overloaded
Note: Usually correlates with task scaling
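To make the P95 and P99 thresholds concrete, the sketch below computes a percentile from raw response-time samples using the nearest-rank method. CloudWatch computes percentile statistics itself and may use a different interpolation, so treat this as an illustration of what the numbers mean rather than a reproduction of the ALB metric.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def breaches_latency_thresholds(samples_ms: Sequence[float]) -> dict[str, bool]:
    """Compare P95 and P99 against the production thresholds above."""
    return {
        "p95": percentile(samples_ms, 95) > 500,
        "p99": percentile(samples_ms, 99) > 1000,
    }
```

Because P99 looks only at the slowest 1% of requests, a handful of slow requests can breach it while P95 stays healthy.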
Alarms & Notifications
Critical Alarms (Prod Only)
Alarm 1: Database Down
Metric: RDS DatabaseConnections
Condition: ≤ 0 for 1 minute
Action: SNS → PagerDuty (page on-call)
Severity: P1 (immediate response)
Runbook: https://wiki.menotime.ai/runbooks/db-down
Alarm 2: All ECS Tasks Unhealthy
Metric: ECS TargetHealthCheckCount (failed)
Condition: ≥ DesiredCount (all tasks failed) for 1 minute
Action: SNS → PagerDuty (page on-call)
Severity: P1 (immediate response)
Runbook: https://wiki.menotime.ai/runbooks/ecs-down
Alarm 3: High Error Rate
Metric: ALB HTTPCode_Target_5XX_Count
Condition: ≥ 10 errors per minute (>1%)
Action: SNS → PagerDuty (page on-call)
Severity: P1 (immediate response)
Runbook: https://wiki.menotime.ai/runbooks/high-errors
Warning Alarms (Prod & Staging)
Alarm 4: Database Connections High
Metric: RDS DatabaseConnections
Condition: ≥ 400 for 5 minutes
Action: SNS → Slack #alerts
Severity: P2 (investigate within 1 hour)
Response: Review connection pool; scale up if trending
Alarm 5: Database CPU High
Metric: RDS CPUUtilization
Condition: ≥ 75% for 10 minutes
Action: SNS → Slack #alerts
Severity: P2
Response: Investigate slow queries; scale up if persistent
Alarm 6: Task Memory Usage High
Metric: ECS MemoryUtilization
Condition: ≥ 80% for 5 minutes
Action: SNS → Slack #alerts
Severity: P2
Response: Increase task memory or add tasks to spread the load
Alarm 7: API Response Time High
Metric: ALB TargetResponseTime (P99)
Condition: ≥ 1000ms for 5 minutes
Action: SNS → Slack #alerts
Severity: P2
Response: Investigate slow endpoints; scale ECS if needed
Alarm 8: WAF Blocked Requests Spike
Metric: WAF BlockedRequests
Condition: ≥ 50 per minute (unusual traffic)
Action: SNS → Slack #security
Severity: P2 (security investigation)
Response: Review WAF logs for attack pattern
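As an illustration of how one of these warning alarms might be defined, the sketch below builds the parameters for Alarm 4 (Database Connections High) in the shape expected by the CloudWatch `PutMetricAlarm` API. The alarm name, the `Average` statistic, and the 60-second period evaluated five times are assumptions; the instance identifier and SNS topic ARN are supplied by the caller.

```python
def db_connections_alarm(db_instance_id: str, sns_topic_arn: str) -> dict:
    """Build PutMetricAlarm parameters for 'connections >= 400 for 5 minutes'."""
    return {
        "AlarmName": "menotime-prod-db-connections-high",  # hypothetical name
        "Namespace": "AWS/RDS",
        "MetricName": "DatabaseConnections",
        "Dimensions": [
            {"Name": "DBInstanceIdentifier", "Value": db_instance_id},
        ],
        "Statistic": "Average",        # assumption: average per period
        "Period": 60,                  # seconds per datapoint (assumption)
        "EvaluationPeriods": 5,        # 5 x 60 s = 5 minutes
        "Threshold": 400,              # 80% of the 500-connection maximum
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [sns_topic_arn],
        "TreatMissingData": "missing",
    }
```

With boto3 installed and credentials configured, the dictionary could be passed as `boto3.client("cloudwatch").put_metric_alarm(**params)`; building it separately keeps the threshold logic testable without AWS access.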
Informational Alarms (Staging & Dev)
Alarm 9: Disk Storage Low
Metric: RDS StorageSpace
Condition: ≥ 80% of allocated
Action: SNS → Slack #ops
Severity: Info (plan expansion)
Alarm 10: Backup Failed
Metric: RDS AutomatedBackupCount (or SNS from backup Lambda)
Condition: No backup in last 24 hours
Action: SNS → Slack #ops
Severity: Info (verify backup health)
Composite Alarms
Alarm 11: Database Performance Degradation
Triggers if ALL of:
- Database CPU > 70%
- Read Latency > 5ms
- Write Latency > 10ms
- Disk Queue Depth > 10
Action: SNS → PagerDuty (P2)
Recommendation: Scale database; optimize queries
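A composite alarm that fires only when all four conditions hold is expressed in CloudWatch as an alarm rule that joins child alarms with `AND`. The sketch below builds such a rule string; the child alarm names are hypothetical and would need to match metric alarms that already exist.

```python
from typing import Sequence


def all_of_rule(alarm_names: Sequence[str]) -> str:
    """Build a composite AlarmRule that is true only when every
    named child alarm is in the ALARM state."""
    if not alarm_names:
        raise ValueError("at least one child alarm is required")
    return " AND ".join(f'ALARM("{name}")' for name in alarm_names)


# Hypothetical child alarms, one per condition in Alarm 11.
DB_DEGRADATION_CHILDREN = [
    "menotime-prod-db-cpu-70",
    "menotime-prod-db-read-latency",
    "menotime-prod-db-write-latency",
    "menotime-prod-db-queue-depth",
]
```

The resulting string is what would be supplied as the `AlarmRule` of the composite alarm. Requiring all four conditions keeps a single noisy metric from paging anyone.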
CloudWatch Logs
Log Groups
| Log Group | Retention | Source | Purpose |
|---|---|---|---|
| /aws/ecs/menotime-prod | 30 days | ECS tasks | Application logs |
| /aws/ecs/menotime-staging | 7 days | ECS tasks | Application logs |
| /aws/ecs/menotime-dev | 7 days | ECS tasks | Application logs |
| /aws/rds/menotime/postgresql | 7 days | RDS | Database logs (errors, slow queries) |
| /aws/alb/menotime | 7 days | ALB | Access logs |
| /aws/waf/menotime | 7 days | WAF | Web application firewall logs |
| /aws/cloudtrail/menotime | 365 days | CloudTrail | API audit trail |
Log Encryption
All log groups encrypted with KMS customer-managed key (alias/menotime-master)
Log Queries (Examples)
Find errors in ECS logs:
fields @timestamp, @message, @logStream
| filter @message like /ERROR|Exception|error/
| stats count() as error_count by @logStream
Find slow database queries:
fields @timestamp, duration
| filter duration > 1000
| stats max(duration) as max_duration, avg(duration) as avg_duration by log_identifier
| sort avg_duration desc
Find 5xx errors by endpoint:
fields @timestamp, request_path, status_code
| filter status_code like /5\d\d/
| stats count() as error_count by request_path
Database connection attempts:
fields user_name, database_name, status
| filter status like /FAILED/
| stats count() as failed_attempts by user_name
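When checking what one of these queries should return, it can help to run the same aggregation locally on a few exported records. The sketch below mirrors the "5xx errors by endpoint" query; it assumes each record is a dictionary with the `request_path` and `status_code` fields used in that query.

```python
from collections import Counter
from typing import Iterable, Mapping


def count_5xx_by_path(records: Iterable[Mapping[str, object]]) -> Counter:
    """Count server errors (status 500-599) per request path."""
    counts: Counter = Counter()
    for record in records:
        status = int(str(record["status_code"]))  # accept "502" or 502
        if 500 <= status <= 599:
            counts[str(record["request_path"])] += 1
    return counts
```

`Counter.most_common()` then gives the endpoints ordered by error count, which the Logs Insights version would need an extra `sort` stage to produce.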
GuardDuty (Threat Detection)
GuardDuty Status
Enabled: Yes
Coverage: All ECS, RDS, S3, API calls
Finding Types: Network activity, API calls, resource interactions
Finding Severity Levels
| Level | Color | Response Time | Example |
|---|---|---|---|
| High | Red | Immediate | Cryptocurrency mining detected, data exfiltration |
| Medium | Orange | 1-24 hours | Unusual API calls, port scanning |
| Low | Yellow | As time allows | Known malware IP accessed, test traffic |
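GuardDuty reports severity as a number, so the table above needs a mapping from score to level. The sketch below assumes the commonly documented bands (Low below 4.0, Medium from 4.0 to 6.9, High from 7.0 upward) and folds anything at or above 7.0 into High, which matches the three levels used here; verify the bands against the current GuardDuty documentation before relying on them.

```python
def classify_finding(severity: float) -> tuple[str, str]:
    """Map a numeric GuardDuty severity to (level, response time).

    Band boundaries are an assumption based on GuardDuty's documented
    severity ranges; response times come from the table above.
    """
    if severity >= 7.0:
        return ("High", "Immediate")
    if severity >= 4.0:
        return ("Medium", "1-24 hours")
    return ("Low", "As time allows")
```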
High-Severity Findings
Trigger Immediate Investigation:
1. Unusual EC2 behavior (cryptocurrency mining, botnet)
2. Potential data exfiltration
3. Unauthorized API access
4. Unusual network traffic patterns
Response Process:
1. Alert triggers SNS → PagerDuty
2. On-call engineer reviews in GuardDuty console
3. Investigation and containment within 1 hour
4. Post-incident review if confirmed threat
Command to review findings:
aws guardduty list-findings \
  --detector-id xxxxx \
  --finding-criteria '{"Criterion": {"severity": {"Gte": 7}}}'
Medium-Severity Findings
Weekly Review:
- Check GuardDuty console Thursday morning
- Assess false positives vs. legitimate detections
- Document patterns (e.g., expected cross-region replication)
False Positive Examples
Common benign findings:
- Cross-account access (intentional S3 replication)
- Scheduled AWS Lambda backups
- Internal IAM role access
- CloudFormation stack operations
Performance Insights
Performance Insights Setup
Enabled in: RDS Production and Staging
Retention: 7 days (production), 30 days (staging with extended retention)
Metrics:
- Active Sessions: Concurrent connections and activity
- Database Load: CPU usage, I/O wait, lock contention
- Wait Events: Where database time is spent
- Top Dimensions: SQL, Users, Hosts
- Top SQL: Slowest queries by total time
Performance Insights Query Analysis
Top Queries (by load):
-- PostgreSQL 13+ column names; on 12 and earlier use total_time and mean_time
SELECT query, calls, total_exec_time, mean_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
Lock Contention:
SELECT pid, usename, state, wait_event_type, query
FROM pg_stat_activity
WHERE wait_event_type = 'Lock'
ORDER BY query_start;
Connection Idle Time:
SELECT usename, state, count(*) AS connections,
       max(now() - state_change) AS longest_in_state
FROM pg_stat_activity
GROUP BY usename, state
ORDER BY connections DESC;
Log Aggregation Strategy
Log Collection Flow
Application Container
↓ (stdout/stderr)
CloudWatch Logs
↓ (awslogs driver)
LogGroup: /aws/ecs/menotime-prod
↓ (Subscription Filter)
CloudWatch Logs Insights (ad-hoc queries)
↓ (for-each finding)
SNS Topic (critical errors)
↓
PagerDuty / Slack
Subscription Filters
Filter 1: Errors → PagerDuty
Filter Pattern: [... level = ERROR* ...]
Action: SNS → PagerDuty
Condition: Triggers on every ERROR logged
Filter 2: Slow Queries → Slack
Filter Pattern: [... duration > 5000 ...]
Action: SNS → Slack #database
Condition: Triggers when query > 5 seconds
Filter 3: Security Events → Slack
Filter Pattern: [... event_type = AUTHENTICATION_FAILED ...]
Action: SNS → Slack #security
Condition: Triggers on auth failures
Alerting Configuration
SNS Topics
| Topic | Subscribers | Purpose |
|---|---|---|
| menotime-critical | PagerDuty (P1) | Database down, all tasks down, high error rate |
| menotime-alerts | Slack #alerts | P2 issues (memory, CPU, disk) |
| menotime-security | Slack #security | GuardDuty findings, WAF blocks |
| menotime-ops | Email, Slack #ops | Backups, scaling, routine notifications |
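The topic table above amounts to a routing rule from an alert's severity and category to an SNS topic. A minimal sketch of that rule follows; the `category` argument and the precedence of security over severity are assumptions made for illustration, and the sketch does not model paging for high-severity security findings.

```python
TOPIC_SUBSCRIBERS = {
    "menotime-critical": ["PagerDuty"],
    "menotime-alerts": ["Slack #alerts"],
    "menotime-security": ["Slack #security"],
    "menotime-ops": ["Email", "Slack #ops"],
}


def topic_for(severity: str, category: str = "operational") -> str:
    """Pick the SNS topic for an alert.

    Assumption: security events always go to the security topic,
    otherwise routing follows severity (P1, P2, everything else).
    """
    if category == "security":
        return "menotime-security"
    if severity == "P1":
        return "menotime-critical"
    if severity == "P2":
        return "menotime-alerts"
    return "menotime-ops"
```

For example, `TOPIC_SUBSCRIBERS[topic_for("P1")]` resolves to the PagerDuty subscriber, while an informational backup notification resolves to email and Slack #ops.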
Email Alerts
Frequency: Daily digest at 08:00 UTC
Content:
- Summary of alarms triggered
- Health check status
- Top errors from CloudWatch Logs
- Unresolved findings from GuardDuty
Slack Integration
Channel Mappings:
- #alerts — P2+ operational issues
- #security — Security findings and WAF events
- #ops — Routine maintenance notifications
- #deployments — ECS deployment events
- #database — Database performance notifications
Example Alert Format:
🚨 Warning (P2): Database CPU ≥ 75%
Metric: menotime-prod CPU Utilization
Current: 78%
Duration: 10 minutes
Threshold: ≥ 75%
Action: Investigate slow queries or scale database
Runbook: https://wiki.menotime.ai/runbooks/db-cpu-high
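A message in the format above can be assembled from the alarm's fields with a small helper. This is a sketch of the text layout only; the function and parameter names are hypothetical, and posting the result to Slack is out of scope.

```python
def format_alert(level: str, title: str, metric: str, current: str,
                 duration_min: int, threshold: str, action: str,
                 runbook_url: str) -> str:
    """Render an alert as seven lines, in the same field order as the
    example alert format."""
    return "\n".join([
        f"🚨 {level}: {title}",
        f"Metric: {metric}",
        f"Current: {current}",
        f"Duration: {duration_min} minutes",
        f"Threshold: {threshold}",
        f"Action: {action}",
        f"Runbook: {runbook_url}",
    ])
```

Keeping the runbook link as a required argument makes it harder to ship an alert that has no runbook attached.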
Monitoring Runbook
Daily Checks (Automated)
08:00 UTC:
- Automated daily health check runs
- Email summary sent to ops
- Dashboard reviewed by on-call engineer
Every 5 minutes:
- CloudWatch Alarms check metrics
- Critical alarms trigger PagerDuty
- Warning alarms post to Slack
Every hour:
- Performance Insights reviewed for anomalies
- Log Insights queries run for errors
- GuardDuty findings reviewed (high severity)
Weekly Checks (Manual)
Every Monday:
- Review past week's alarms and incidents
- Check CloudWatch dashboard for anomalies
- Verify backup status
- Review cost trends
Every Thursday:
- GuardDuty findings review
- WAF rule effectiveness assessment
- Database performance optimization review
Every Friday:
- Prepare production deployment checklist
- Verify all monitoring systems operational
- Test alert routing (send test SNS message)
Monthly Checks
First Tuesday:
- Full monitoring system health check
- Verify all logs being collected
- Test SNS→PagerDuty→Slack chain
- Review and update runbooks
Cost Review:
- CloudWatch Logs retention costs
- GuardDuty charges
- WAF charges
- Optimize if >20% of total infrastructure cost
Monitoring Best Practices
- Alert Fatigue: Tune thresholds to reduce false positives (>10 alerts/day is too many)
- Response SLAs: P1 response <15 min, P2 <1 hour, P3 <24 hours
- Runbooks: Every alarm should have an associated runbook
- Testing: Monthly alert test to verify SNS→PagerDuty→Slack chain
- Retention: Keep production application logs long enough for audit (min. 30 days); the CloudTrail audit trail is kept for 365 days
- Encryption: All logs encrypted with KMS
- Access Control: IAM roles restrict who can view/modify alarms
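The response SLAs and the alert-fatigue guideline above are easy to check mechanically during the weekly incident review. The sketch below assumes response time is measured from alert to acknowledgement; the function names are hypothetical.

```python
from datetime import timedelta

# Response SLAs from the best-practices list.
RESPONSE_SLA = {
    "P1": timedelta(minutes=15),
    "P2": timedelta(hours=1),
    "P3": timedelta(hours=24),
}


def within_sla(priority: str, response_time: timedelta) -> bool:
    """True if the response was faster than the SLA for this priority."""
    return response_time < RESPONSE_SLA[priority]


def indicates_alert_fatigue(alerts_per_day: int) -> bool:
    """More than 10 alerts per day suggests thresholds need tuning."""
    return alerts_per_day > 10
```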
Cost of Monitoring
| Service | Monthly Cost | Notes |
|---|---|---|
| CloudWatch Metrics | ~$10 | Custom metrics for API endpoints |
| CloudWatch Logs | ~$15-30 | Ingestion + storage at 7-30 day retention |
| GuardDuty | ~$30 | Account-wide threat detection |
| WAF | ~$5-20 | Per rule evaluation |
| Total | ~$60-90 | ~10% of infrastructure cost |
Summary
MenoTime's monitoring stack provides:
- Visibility: Real-time dashboards and metrics
- Alerting: Multi-channel notifications (PagerDuty, Slack, email)
- Security: GuardDuty threat detection + WAF
- Compliance: Comprehensive audit logging
- Performance: Performance Insights for database optimization
Key dashboards, alarms, and runbooks ensure rapid issue detection and response, protecting patient data and platform availability.
For operational procedures, see Environments and ECS Fargate.