Monitoring

MenoTime relies on comprehensive monitoring to maintain platform health, detect issues early, and ensure HIPAA compliance. This document covers CloudWatch dashboards, metrics, alarms, GuardDuty, log aggregation, and alerting strategies.

Monitoring Architecture

┌─────────────────────────────────────────────────┐
│  Application & Infrastructure Metrics            │
├─────────────────────────────────────────────────┤
│                                                  │
│  ECS Tasks → CloudWatch Metrics                  │
│  RDS Database → Performance Insights             │
│  ALB → Request/Response Metrics                  │
│  API → Custom Metrics (response time, errors)    │
│                                                  │
└──────────────┬──────────────────────────────────┘
               │
    ┌──────────┼──────────┐
    │          │          │
    ▼          ▼          ▼
┌────────┐ ┌──────────┐ ┌──────────┐
│ Cloud  │ │CloudWatch│ │          │
│ Watch  │ │   Logs   │ │GuardDuty │
│Metrics │ │  (Agg)   │ │(Security)│
└────┬───┘ └──┬───────┘ └────┬─────┘
     │        │              │
     └────────┼──────────────┘
              │
     ┌────────▼────────┐
     │ CloudWatch      │
     │ Alarms (SNS)    │
     └────────┬────────┘
              │
     ┌────────┼────────┐
     │        │        │
     ▼        ▼        ▼
 PagerDuty  Slack    Email

CloudWatch Dashboards

Production Dashboard

Name: MenoTime-Production

Refresh Rate: 1 minute

Sections:

1. Application Health (Top-Level)

Metrics Displayed:
┌────────────────────────────────────────────┐
│ ECS Task Count          ALB Status         │
│ 2 tasks running         Healthy: 2/2       │
│ (Green if 2+, Red < 2)  (Green if all)     │
│                                            │
│ Database Status         Error Rate         │
│ Connected: ✓            0.2% (Red > 1%)    │
│                                            │
│ Response Time P99       Memory Usage       │
│ 245ms (Yellow > 500ms)  72% (Yellow > 75%) │
└────────────────────────────────────────────┘
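
The color rules in the panel above can be written down as small predicates. This is a minimal sketch for illustration only: the function names and the color strings are assumptions, not part of any CloudWatch API, and the thresholds are copied from the panel.

```python
# Illustrative status rules for the Application Health panel.
# Function names and color strings are hypothetical; thresholds come from the panel.

def task_count_color(running_tasks: int) -> str:
    """Green with 2 or more running tasks, red below 2."""
    return "green" if running_tasks >= 2 else "red"


def error_rate_color(error_pct: float) -> str:
    """Red when the error rate exceeds 1%."""
    return "red" if error_pct > 1.0 else "green"


def p99_color(latency_ms: float) -> str:
    """Yellow when P99 response time exceeds 500 ms."""
    return "yellow" if latency_ms > 500 else "green"


def memory_color(memory_pct: float) -> str:
    """Yellow when memory usage exceeds 75%."""
    return "yellow" if memory_pct > 75 else "green"


if __name__ == "__main__":
    # Sample values shown in the panel above.
    print(task_count_color(2), error_rate_color(0.2), p99_color(245), memory_color(72))
```

With the sample values from the panel (2 tasks, 0.2% errors, 245 ms, 72% memory), every tile evaluates to green.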

2. API Performance

Metric	Threshold	Unit	Visualization
ALB Target Response Time (P50)	<100ms	ms	Line chart (1h)
ALB Target Response Time (P95)	<500ms	ms	Line chart (1h)
ALB Target Response Time (P99)	<1000ms	ms	Line chart (1h)
Request Count (per minute)	Baseline	count	Area chart (6h)
HTTP 4xx Errors	<1%	%	Line chart (1h)
HTTP 5xx Errors	0	count	Line chart (1h, alarm on >0)

3. ECS Task Health

Metric	Threshold	Unit	Visualization
Task Count	2-4	count	Number widget
CPU Utilization	<60%	%	Line chart (1h)
Memory Utilization	<75%	%	Line chart (1h)
Network In	Baseline	bytes/sec	Area chart (6h)
Network Out	Baseline	bytes/sec	Area chart (6h)

4. Database Health

Metric	Threshold	Unit	Visualization
Database Connections	<400 (80%)	count	Line chart (1h)
Database CPU	<75%	%	Line chart (1h)
Database Memory	<85%	%	Line chart (1h)
Read Latency	<5ms	ms	Line chart (1h)
Write Latency	<10ms	ms	Line chart (1h)
Disk Queue Depth	<10	count	Line chart (1h)
Storage Used	<80%	%	Line chart (24h)

5. Load Balancer

Metric	Threshold	Unit	Visualization
New Connection Count	Baseline	count	Area chart (6h)
Active Connection Count	Baseline	count	Line chart (6h)
Processed Bytes	Baseline	bytes	Area chart (6h)
Target Connection Errors	0	count	Line chart (1h)

6. Security & Compliance

Metric	Threshold	Unit	Visualization
GuardDuty Finding Count	0 (medium+)	count	Number widget
WAF Blocked Requests	Baseline	count	Line chart (1h)
Unauthorized API Calls	0	count	Number widget (1h)

Staging & Development Dashboards

Name: MenoTime-Staging, MenoTime-Development

Simplified views (same metrics as production but without severity-based formatting)

Key Metrics & Thresholds

ECS Metrics

Metric: ecs:service:DesiredCount
Description: Number of tasks you want running
Threshold: Prod: 2-4, Staging: 2-3, Dev: 1
Action: Manual adjustment (dev), auto-scaling (prod)

Metric: ecs:service:RunningCount
Description: Number of tasks actually running
Threshold: Should equal DesiredCount
Action: Alert if less (unhealthy tasks)

Metric: CPUUtilization
Description: CPU usage per task
Threshold: Prod: >60% (scale up), <30% (scale down)
Action: Auto-scaling trigger

Metric: MemoryUtilization
Description: Memory usage per task
Threshold: Prod: >75% (scale up), <40% (scale down)
Action: Auto-scaling trigger

Metric: NetworkIn
Description: Inbound network traffic
Threshold: Baseline ~1MB/min (spike detection)
Action: Investigate if >10× baseline

Metric: NetworkOut
Description: Outbound network traffic
Threshold: Baseline ~2MB/min (spike detection)
Action: Investigate if >10× baseline
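
The production CPU and memory thresholds above can be read as a simple decision rule. The sketch below is an assumption about how they combine (scale up when either metric is high, scale down only when both are low, always within the 2-4 task range); in practice ECS Service Auto Scaling applies its own policies per metric, so treat this as a reasoning aid rather than the platform's actual policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Band:
    """Scale-up and scale-down thresholds for one metric, in percent."""
    scale_up_above: float
    scale_down_below: float


# Production thresholds from the ECS metrics above.
CPU = Band(scale_up_above=60.0, scale_down_below=30.0)
MEMORY = Band(scale_up_above=75.0, scale_down_below=40.0)
MIN_TASKS, MAX_TASKS = 2, 4  # production task range


def scaling_decision(cpu_pct: float, mem_pct: float, running: int) -> str:
    """Return 'scale_up', 'scale_down', or 'hold' (hypothetical combination rule)."""
    wants_up = cpu_pct > CPU.scale_up_above or mem_pct > MEMORY.scale_up_above
    wants_down = cpu_pct < CPU.scale_down_below and mem_pct < MEMORY.scale_down_below
    if wants_up and running < MAX_TASKS:
        return "scale_up"
    if wants_down and running > MIN_TASKS:
        return "scale_down"
    return "hold"
```

Requiring both metrics to be low before scaling down is a conservative choice that avoids removing a task while one resource is still under pressure.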

RDS Metrics

Metric: DatabaseConnections
Description: Active database connections
Threshold: Prod: >400 (80% of 500 max), Staging: >300
Action: Alert; investigate connection leak
Recommendation: Upgrade to xlarge if sustained

Metric: CPUUtilization
Description: Database CPU usage
Threshold: Prod: >75%, Staging: >85%
Action: Alert; review slow queries
Recommendation: Optimize queries or scale up

Metric: FreeableMemory
Description: RAM utilization (RDS reports free memory in bytes; percent used is derived from instance RAM)
Threshold: >85% used
Action: Alert; consider scaling or query optimization
Note: db.m7g.large = 8GB total

Metric: DiskQueueDepth
Description: Count of I/O requests waiting
Threshold: >10
Action: Alert; I/O bottleneck detected
Recommendation: Increase IOPS or investigate slow queries

Metric: ReadLatency
Description: Time to read from disk
Threshold: >5ms (sustained)
Action: Alert; investigate slow I/O
Note: Normal: 1-3ms for gp3 SSD

Metric: WriteLatency
Description: Time to write to disk
Threshold: >10ms (sustained)
Action: Alert; check WAL activity
Note: Normal: 2-5ms for gp3 SSD

Metric: TransactionLogsDiskUsage
Description: Transaction log (WAL) disk usage
Threshold: >80GB
Action: Alert; backup/archive logs
Note: Prevents "low storage" errors

Metric: FreeStorageSpace
Description: Free database storage (RDS reports free bytes; percent used is derived from allocated storage)
Threshold: >80% of allocated storage used
Action: Alert; expand storage before hitting limit
Note: Prod: 1TB, Staging: 500GB
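
CloudWatch reports RDS storage and memory headroom as free bytes (FreeStorageSpace, FreeableMemory) rather than as percent used, so the percentage thresholds above have to be converted before they can be used in an alarm. A minimal sketch of that arithmetic, assuming sizes are given in GiB and that "1TB" means 1000 GiB of allocated storage:

```python
GIB = 1024 ** 3  # bytes per GiB


def free_bytes_threshold(total_gib: int, max_used_pct: int) -> int:
    """Free-byte level at which usage reaches max_used_pct of total_gib.

    An alarm on a free-bytes metric would fire when the metric drops
    below this value.
    """
    if not 0 <= max_used_pct <= 100:
        raise ValueError("max_used_pct must be between 0 and 100")
    # Integer arithmetic avoids floating-point rounding on large byte counts.
    return total_gib * GIB * (100 - max_used_pct) // 100


if __name__ == "__main__":
    print("Prod storage:", free_bytes_threshold(1000, 80))     # 80% of 1000 GiB used
    print("Staging storage:", free_bytes_threshold(500, 80))   # 80% of 500 GiB used
    print("Instance memory:", free_bytes_threshold(8, 85))     # 85% of 8 GiB used
```

For production storage this works out to 200 GiB free, and for staging to 100 GiB free.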

ALB Metrics

Metric: TargetResponseTime (P95)
Description: 95th percentile response time
Threshold: Prod: >500ms
Action: Scale up or optimize backend
Note: Includes network latency + processing time

Metric: TargetResponseTime (P99)
Description: 99th percentile response time
Threshold: Prod: >1000ms
Action: Alert; investigate slow requests
Note: P99 indicates tail latency issues

Metric: HTTPCode_Target_5XX_Count
Description: Backend (ECS) error responses
Threshold: >0 per minute
Action: Alert; critical issue
Note: Indicates application crash or unhandled exception

Metric: HTTPCode_Target_4XX_Count
Description: Client error responses (400, 404, etc.)
Threshold: >5% of total requests
Action: Investigate; may indicate bad data or API changes

Metric: RequestCount
Description: Total requests processed
Threshold: Baseline metric (trend analysis)
Action: Spike >2× baseline warrants investigation
Note: Used for capacity planning

Metric: ActiveConnectionCount
Description: Open connections to targets
Threshold: Baseline metric
Action: Spike indicates heavy traffic or connection leak

Metric: TargetConnectionErrorCount
Description: Failed connections to targets
Threshold: >0
Action: Alert; target is unhealthy or overloaded
Note: Usually correlates with task scaling

Alarms & Notifications

Critical Alarms (Prod Only)

Alarm 1: Database Down

Metric: RDS DatabaseConnections
Condition: ≤ 0 for 1 minute
Action: SNS → PagerDuty (page on-call)
Severity: P1 (immediate response)
Runbook: https://wiki.menotime.ai/runbooks/db-down
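
As a rough illustration, Alarm 1 could be expressed as arguments to the CloudWatch `PutMetricAlarm` API. The alarm name, instance identifier, and topic ARN below are hypothetical placeholders, and treating missing data as breaching is an assumption (an unreachable database may stop reporting the metric altogether).

```python
def database_down_alarm(db_instance_id: str, sns_topic_arn: str) -> dict:
    """Build PutMetricAlarm arguments for 'DatabaseConnections <= 0 for 1 minute'."""
    return {
        "AlarmName": "menotime-prod-database-down",  # hypothetical name
        "Namespace": "AWS/RDS",
        "MetricName": "DatabaseConnections",
        "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": db_instance_id}],
        "Statistic": "Maximum",
        "Period": 60,               # seconds
        "EvaluationPeriods": 1,     # 1 x 60 s = "for 1 minute"
        "Threshold": 0,
        "ComparisonOperator": "LessThanOrEqualToThreshold",
        "TreatMissingData": "breaching",  # assumption: no data means the DB is down
        "AlarmActions": [sns_topic_arn],
    }
```

With boto3 installed and credentials configured, the dictionary could be passed as `boto3.client("cloudwatch").put_metric_alarm(**database_down_alarm(instance_id, topic_arn))`.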

Alarm 2: All ECS Tasks Unhealthy

Metric: ALB UnHealthyHostCount (targets failing health checks)
Condition: ≥ DesiredCount (all tasks failed) for 1 minute
Action: SNS → PagerDuty (page on-call)
Severity: P1 (immediate response)
Runbook: https://wiki.menotime.ai/runbooks/ecs-down

Alarm 3: High Error Rate

Metric: ALB HTTPCode_Target_5XX_Count
Condition: ≥ 10 errors per minute (>1%)
Action: SNS → PagerDuty (page on-call)
Severity: P1 (immediate response)
Runbook: https://wiki.menotime.ai/runbooks/high-errors

Warning Alarms (Prod & Staging)

Alarm 4: Database Connections High

Metric: RDS DatabaseConnections
Condition: ≥ 400 for 5 minutes
Action: SNS → Slack #alerts
Severity: P2 (investigate within 1 hour)
Remediation: Review connection pool; scale up if trending

Alarm 5: Database CPU High

Metric: RDS CPUUtilization
Condition: ≥ 75% for 10 minutes
Action: SNS → Slack #alerts
Severity: P2
Remediation: Investigate slow queries; scale up if persistent

Alarm 6: Task Memory Usage High

Metric: ECS MemoryUtilization
Condition: ≥ 80% for 5 minutes
Action: SNS → Slack #alerts
Severity: P2
Remediation: Increase task memory or add tasks to spread load

Alarm 7: API Response Time High

Metric: ALB TargetResponseTime (P99)
Condition: ≥ 1000ms for 5 minutes
Action: SNS → Slack #alerts
Severity: P2
Remediation: Investigate slow endpoints; scale ECS if needed

Alarm 8: WAF Blocked Requests Spike

Metric: WAF BlockedRequests
Condition: ≥ 50 per minute (unusual traffic)
Action: SNS → Slack #security
Severity: P2 (security investigation)
Remediation: Review WAF logs for attack pattern

Informational Alarms (Staging & Dev)

Alarm 9: Disk Storage Low

Metric: RDS FreeStorageSpace
Condition: Used storage ≥ 80% of allocated (free space ≤ 20%)
Action: SNS → Slack #ops
Severity: Info (plan expansion)

Alarm 10: Backup Failed

Metric: RDS AutomatedBackupCount (or SNS from backup Lambda)
Condition: No backup in last 24 hours
Action: SNS → Slack #ops
Severity: Info (verify backup health)

Composite Alarms

Alarm 11: Database Performance Degradation

Triggers if ALL of:
- Database CPU > 70%
- Read Latency > 5ms
- Write Latency > 10ms
- Disk Queue Depth > 10

Action: SNS → PagerDuty (P2)
Recommendation: Scale database; optimize queries
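
A CloudWatch composite alarm combines existing alarms through a rule expression such as `ALARM("a") AND ALARM("b")`. The sketch below builds that expression for the four conditions above; the child alarm names are hypothetical and each would need to exist as its own metric alarm first.

```python
def all_in_alarm(alarm_names: list[str]) -> str:
    """Build a composite AlarmRule that fires only when every child alarm is in ALARM."""
    if not alarm_names:
        raise ValueError("at least one child alarm is required")
    return " AND ".join(f'ALARM("{name}")' for name in alarm_names)


# Hypothetical child alarms, one per condition in Alarm 11.
DB_DEGRADATION_CHILDREN = [
    "menotime-prod-db-cpu-70",
    "menotime-prod-db-read-latency-5ms",
    "menotime-prod-db-write-latency-10ms",
    "menotime-prod-db-queue-depth-10",
]

if __name__ == "__main__":
    print(all_in_alarm(DB_DEGRADATION_CHILDREN))
```

The resulting string could be supplied as the `AlarmRule` argument of `put_composite_alarm`, together with an `AlarmName` and the SNS topic in `AlarmActions`.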

CloudWatch Logs

Log Groups

Log Group	Retention	Source	Purpose
`/aws/ecs/menotime-prod`	30 days	ECS tasks	Application logs
`/aws/ecs/menotime-staging`	7 days	ECS tasks	Application logs
`/aws/ecs/menotime-dev`	7 days	ECS tasks	Application logs
`/aws/rds/menotime/postgresql`	7 days	RDS	Database logs (errors, slow queries)
`/aws/alb/menotime`	7 days	ALB	Access logs
`/aws/waf/menotime`	7 days	WAF	Web application firewall logs
`/aws/cloudtrail/menotime`	365 days	CloudTrail	API audit trail

Log Encryption

All log groups encrypted with KMS customer-managed key (alias/menotime-master)

Log Queries (Examples)

Find errors in ECS logs:

fields @timestamp, @message, @logStream
| filter @message like /ERROR|Exception|error/
| stats count() as error_count by @logStream

Find slow database queries:

fields @timestamp, duration
| filter duration > 1000
| stats max(duration) as max_duration, avg(duration) as avg_duration by log_identifier
| sort avg_duration desc

Find 5xx errors by endpoint:

fields @timestamp, request_path, status_code
| filter status_code like /5\d\d/
| stats count() as error_count by request_path

Database connection attempts:

fields user_name, database_name, status
| filter status like /FAILED/
| stats count() as failed_attempts by user_name
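
These queries can also be run programmatically through the CloudWatch Logs `StartQuery` API. The sketch below only builds the request arguments for a trailing time window; the helper name and the default limit are assumptions, and the query text is the first example above.

```python
# Query text copied from "Find errors in ECS logs".
ERRORS_BY_STREAM = r"""fields @timestamp, @message, @logStream
| filter @message like /ERROR|Exception|error/
| stats count() as error_count by @logStream"""


def insights_query_params(log_group: str, query: str, end_epoch: int,
                          hours: int = 1, limit: int = 100) -> dict:
    """Build StartQuery arguments covering the `hours` before `end_epoch` (epoch seconds)."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    return {
        "logGroupName": log_group,
        "startTime": end_epoch - hours * 3600,
        "endTime": end_epoch,
        "queryString": query,
        "limit": limit,
    }
```

With boto3, `boto3.client("logs").start_query(**params)` returns a `queryId`, and results are then polled with `get_query_results(queryId=...)` until the query status is complete.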

GuardDuty (Threat Detection)

GuardDuty Status

Enabled: Yes
Coverage: All ECS, RDS, S3, API calls
Finding Types: Network activity, API calls, resource interactions

Finding Severity Levels

Level	Color	Response Time	Example
High	Red	Immediate	Cryptocurrency mining detected, data exfiltration
Medium	Orange	1-24 hours	Unusual API calls, port scanning
Low	Yellow	As time allows	Known malware IP accessed, test traffic
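
GuardDuty attaches a numeric severity score to each finding (Low 1.0-3.9, Medium 4.0-6.9, High 7.0-8.9). A minimal sketch of mapping that score onto the three levels and response times in the table above; scores of 9.0 and higher, which GuardDuty labels Critical, are folded into High here as an assumption, since this document defines only three levels.

```python
RESPONSE_TIME = {
    "High": "Immediate",
    "Medium": "1-24 hours",
    "Low": "As time allows",
}


def severity_level(score: float) -> str:
    """Map a GuardDuty severity score to the High/Medium/Low levels used in this document."""
    if score >= 7.0:
        return "High"    # includes 9.0+ (assumption: treated the same as High)
    if score >= 4.0:
        return "Medium"
    return "Low"


def response_time(score: float) -> str:
    """Look up the documented response time for a severity score."""
    return RESPONSE_TIME[severity_level(score)]
```

For example, a finding with severity 8.0 maps to High with an immediate response, and one with severity 5.0 maps to Medium.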

High-Severity Findings

Trigger Immediate Investigation:
1. Unusual EC2 behavior (cryptocurrency mining, botnet)
2. Potential data exfiltration
3. Unauthorized API access
4. Unusual network traffic patterns

Response Process:
1. Alert triggers SNS → PagerDuty
2. On-call engineer reviews in GuardDuty console
3. Investigation and containment within 1 hour
4. Post-incident review if confirmed threat

Command to review findings:

aws guardduty list-findings \
  --detector-id xxxxx \
  --finding-criteria '{"Criterion": {"severity": {"Gte": 7}}}'

Medium-Severity Findings

Weekly Review:
- Check GuardDuty console Thursday morning
- Assess false positives vs. legitimate detections
- Document patterns (e.g., expected cross-region replication)

False Positive Examples

Common benign findings:
- Cross-account access (intentional S3 replication)
- Scheduled AWS Lambda backups
- Internal IAM role access
- CloudFormation stack operations

Performance Insights

Performance Insights Setup

Enabled in: RDS Production and Staging

Retention: 7 days (production), 30 days (staging with extended retention)

Metrics:
- Active Sessions: Concurrent connections and activity
- Database Load: CPU usage, I/O wait, lock contention
- Wait Events: Where database time is spent
- Top Dimensions: SQL, Users, Hosts
- Top SQL: Slowest queries by total time

Performance Insights Query Analysis

Top Queries (by load):

-- Column names for PostgreSQL 13+; on 12 and earlier use total_time and mean_time
SELECT query, calls, total_exec_time, mean_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;

Lock Contention:

SELECT pid, usename, state, wait_event_type, query
FROM pg_stat_activity
WHERE wait_event_type = 'Lock'
ORDER BY query_start;

Connection Idle Time:

SELECT usename, state, count(*)
FROM pg_stat_activity
GROUP BY usename, state;

Log Aggregation Strategy

Log Collection Flow

Application Container
  ↓ (stdout/stderr, awslogs driver)
CloudWatch Logs
  ↓
LogGroup: /aws/ecs/menotime-prod
  ↓ (Subscription Filter; Logs Insights for ad-hoc queries)
SNS Topic (critical errors)
  ↓
PagerDuty / Slack

Subscription Filters

Filter 1: Errors → PagerDuty

Filter Pattern: [... level = ERROR* ...]
Action: SNS → PagerDuty
Condition: Triggers on every ERROR logged

Filter 2: Slow Queries → Slack

Filter Pattern: [... duration > 5000 ...]
Action: SNS → Slack #database
Condition: Triggers when query > 5 seconds

Filter 3: Security Events → Slack

Filter Pattern: [... event_type = AUTHENTICATION_FAILED ...]
Action: SNS → Slack #security
Condition: Triggers on auth failures
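
The filter patterns above are shown schematically. CloudWatch Logs subscription filters deliver to Lambda, Kinesis Data Streams, or Firehose rather than straight to SNS, so one common route to an SNS notification is a metric filter that counts matching events plus an alarm on the resulting metric. The sketch below builds `PutMetricFilter` arguments for Filter 1, assuming the application writes JSON log lines with a `level` field; the filter name, metric name, and namespace are hypothetical.

```python
def error_metric_filter(log_group: str) -> dict:
    """Build PutMetricFilter arguments that count JSON log events with level ERROR."""
    return {
        "logGroupName": log_group,
        "filterName": "menotime-error-count",       # hypothetical name
        "filterPattern": '{ $.level = "ERROR" }',   # assumes JSON logs with a "level" field
        "metricTransformations": [
            {
                "metricName": "ErrorCount",                 # hypothetical metric
                "metricNamespace": "MenoTime/Application",  # hypothetical namespace
                "metricValue": "1",   # add 1 per matching event
                "defaultValue": 0,    # report 0 when nothing matches
            }
        ],
    }
```

With boto3, the dictionary could be passed as `boto3.client("logs").put_metric_filter(**error_metric_filter("/aws/ecs/menotime-prod"))`; an alarm on `ErrorCount` would then publish to the relevant SNS topic.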

Alerting Configuration

Topic	Subscribers	Purpose
`menotime-critical`	PagerDuty (P1)	Database down, all tasks down, high error rate
`menotime-alerts`	Slack #alerts	P2 issues (memory, CPU, disk)
`menotime-security`	Slack #security	GuardDuty findings, WAF blocks
`menotime-ops`	Email, Slack #ops	Backups, scaling, routine notifications
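
The topic table above implies a routing rule from alarm severity to SNS topic. A minimal sketch, assuming that P1 always pages regardless of category, that other security events go to the security topic, and that anything below P2 is routine; the `category` parameter is a hypothetical label, not an AWS field.

```python
# Subscribers per topic, copied from the table above.
TOPIC_SUBSCRIBERS = {
    "menotime-critical": ["PagerDuty"],
    "menotime-alerts": ["Slack #alerts"],
    "menotime-security": ["Slack #security"],
    "menotime-ops": ["Email", "Slack #ops"],
}


def topic_for(severity: str, category: str = "operational") -> str:
    """Pick the SNS topic for an alarm (hypothetical routing rule)."""
    if severity == "P1":
        return "menotime-critical"   # page on-call first, whatever the category
    if category == "security":
        return "menotime-security"
    if severity == "P2":
        return "menotime-alerts"
    return "menotime-ops"            # informational and routine notifications
```

For example, the WAF spike alarm (P2, security) resolves to `menotime-security`, while a failed-backup notice resolves to `menotime-ops` and reaches both email and Slack #ops.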

Email Alerts

Frequency: Daily digest at 08:00 AM UTC

Content:
- Summary of alarms triggered
- Health check status
- Top errors from CloudWatch Logs
- Unresolved findings from GuardDuty

Slack Integration

Channel Mappings:
- #alerts: P2+ operational issues
- #security: Security findings and WAF events
- #ops: Routine maintenance notifications
- #deployments: ECS deployment events
- #database: Database performance notifications

Example Alert Format:

⚠️ Warning (P2): Database CPU ≥ 75%

Metric: menotime-prod CPU Utilization
Current: 78%
Duration: 10 minutes
Threshold: ≥ 75%

Action: Investigate slow queries or scale database
Runbook: https://wiki.menotime.ai/runbooks/db-cpu-high

Monitoring Runbook

Daily Checks (Automated)

08:00 UTC:
  - Automated daily health check runs
  - Email summary sent to ops
  - Dashboard reviewed by on-call engineer

Every 5 minutes:
  - CloudWatch Alarms check metrics
  - Critical alarms trigger PagerDuty
  - Warning alarms post to Slack

Every hour:
  - Performance Insights reviewed for anomalies
  - Log Insights queries run for errors
  - GuardDuty findings reviewed (high severity)

Weekly Checks (Manual)

Every Monday:
- Review past week's alarms and incidents
- Check CloudWatch dashboard for anomalies
- Verify backup status
- Review cost trends

Every Thursday:
- GuardDuty findings review
- WAF rule effectiveness assessment
- Database performance optimization review

Every Friday:
- Prepare production deployment checklist
- Verify all monitoring systems operational
- Test alert routing (send test SNS message)

Monthly Checks

First Tuesday:
- Full monitoring system health check
- Verify all logs being collected
- Test SNS→PagerDuty→Slack chain
- Review and update runbooks

Cost Review:
- CloudWatch Logs retention costs
- GuardDuty charges
- WAF charges
- Optimize if >20% of total infrastructure cost

Monitoring Best Practices

Alert Fatigue: Tune thresholds to reduce false positives (>10 alerts/day is too many)
Response SLAs: P1 response <15 min, P2 <1 hour, P3 <24 hours
Runbooks: Every alarm should have associated runbook
Testing: Monthly alert test to verify SNS→PagerDuty→Slack chain
Retention: Keep logs long enough for audit (min. 30 days)
Encryption: All logs encrypted with KMS
Access Control: IAM roles restrict who can view/modify alarms
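
The response SLAs and the alert-fatigue guideline above are easy to check mechanically during a weekly review. A small sketch, with hypothetical function names and the limits taken from the list above:

```python
# Response-time limits in minutes, from the Response SLAs above.
SLA_MINUTES = {"P1": 15, "P2": 60, "P3": 24 * 60}
MAX_ALERTS_PER_DAY = 10  # more than this counts as alert fatigue


def sla_breached(severity: str, minutes_to_response: float) -> bool:
    """True when the response did not come in under the SLA for this severity."""
    return minutes_to_response >= SLA_MINUTES[severity]


def alert_fatigue(alerts_today: int) -> bool:
    """True when the daily alert volume exceeds the guideline."""
    return alerts_today > MAX_ALERTS_PER_DAY
```

For example, a P1 acknowledged after 14 minutes is within SLA, while one acknowledged at 15 minutes is not, because the SLA is stated as strictly under 15 minutes.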

Cost of Monitoring

Service	Monthly Cost	Notes
CloudWatch Metrics	~$10	Custom metrics for API endpoints
CloudWatch Logs	~$15-30	Ingestion + storage at 7-30 day retention
GuardDuty	~$30	Account-wide threat detection
WAF	~$5-20	Per rule evaluation
Total	~$60-90	~10% of infrastructure cost
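
The total row can be checked by summing the low and high ends of each line item. The figures below are copied from the table; single-value rows are treated as a range with equal ends.

```python
# (low, high) monthly cost in USD, from the table above.
MONTHLY_COST_USD = {
    "CloudWatch Metrics": (10, 10),
    "CloudWatch Logs": (15, 30),
    "GuardDuty": (30, 30),
    "WAF": (5, 20),
}


def total_range(costs: dict[str, tuple[int, int]]) -> tuple[int, int]:
    """Sum the low ends and the high ends of every line item."""
    low = sum(lo for lo, _ in costs.values())
    high = sum(hi for _, hi in costs.values())
    return low, high


if __name__ == "__main__":
    print(total_range(MONTHLY_COST_USD))
```

The sums come to 60 and 90, matching the stated total of roughly $60-90 per month.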

Summary

MenoTime's monitoring stack provides:
- Visibility: Real-time dashboards and metrics
- Alerting: Multi-channel notifications (PagerDuty, Slack, email)
- Security: GuardDuty threat detection + WAF
- Compliance: Comprehensive audit logging
- Performance: Performance Insights for database optimization

Key dashboards, alarms, and runbooks ensure rapid issue detection and response, protecting patient data and platform availability.

For operational procedures, see Environments and ECS Fargate.