Database

MenoTime uses Amazon RDS PostgreSQL as the primary relational database for patient records, clinical data, appointments, and platform state. This document covers database configuration, connection management, backup strategy, and migration procedures.

RDS PostgreSQL Configuration

Instance Specifications

All three environments (Dev, Staging, Production) use the same instance class to ensure parity and realistic staging testing.

Configuration          Value
Engine                 PostgreSQL 15.x
Instance Class         db.m7g.large
vCPU                   2 cores
Memory                 8 GB RAM
Storage Type           gp3 (General Purpose SSD)
Storage Size           Dev: 100 GB, Staging: 500 GB, Prod: 1 TB
Storage Auto-Scaling   Enabled (max 20% beyond allocated)
IOPS                   3,000 baseline (configurable)
Throughput             125 MB/s baseline (configurable)
Region                 us-west-1 (N. California)
Availability           Single-AZ (Multi-AZ upgrade available)

Why db.m7g.large?

Capacity Analysis:
- vCPU: 2 cores, sufficient for ~500 concurrent connections
- Memory: 8 GB provides roughly 25% headroom for caching and the working set
- Network: up to 10 Gbps burst capability
- Cost: ~$460/month per instance (3 environments = ~$1,380/month baseline)
- Performance: consistent general-purpose compute (m7g instances are not burstable)

Upgrade Path:
- If production patient volume exceeds 5,000: upgrade to db.m7g.xlarge (4 vCPU, 16 GB)
- Multi-AZ is recommended for production at scale (adds ~50% cost)

Network Configuration

Subnet Group: menotime-db-subnet-group
- Spans private subnets in us-west-1a and us-west-1b
- No public internet access (secure by default)

Security Group: menotime-db-sg

Inbound Rules:
- PostgreSQL (5432) from menotime-ecs-sg (ECS security group)
- PostgreSQL (5432) from menotime-bastion-sg (bastion host, if used)

Outbound Rules:
- Deny all (RDS doesn't initiate outbound connections)

Encryption:
- At Rest: KMS-managed encryption (customer-managed key alias/menotime-master)
- In Transit: SSL/TLS required for all connections
- Connection String: postgresql://user:password@host:5432/menotime_prod?sslmode=require

Parameter Group

Name: menotime-postgresql-15-params

Key Parameters:

max_connections = 500
shared_buffers = 2GB (25% of total memory)
effective_cache_size = 6GB (75% of total memory)
maintenance_work_mem = 512MB
work_mem = 8MB (roughly shared_buffers / max_connections * 2)
wal_buffers = 16MB
min_wal_size = 2GB
max_wal_size = 8GB
checkpoint_completion_target = 0.9
log_min_duration_statement = 1000 (log queries >1s)
log_connections = on
log_disconnections = on
log_statement = 'all' (development), 'mod' (production - log modifications only)
log_duration = on
shared_preload_libraries = 'pgaudit,pg_stat_statements'
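The memory-related values above are simple ratios of instance RAM. A minimal sketch (a hypothetical helper, not part of MenoTime's tooling) that recomputes them for a different instance class:

```python
def derive_memory_params(total_mem_gb: int) -> dict:
    """Derive PostgreSQL memory parameters from instance RAM using the
    ratios documented above: 25% shared_buffers, 75% effective_cache_size,
    and maintenance_work_mem at 64 MB per GB of RAM, capped at 1 GB."""
    return {
        "shared_buffers_gb": total_mem_gb * 0.25,
        "effective_cache_size_gb": total_mem_gb * 0.75,
        "maintenance_work_mem_mb": min(total_mem_gb * 64, 1024),
    }

# db.m7g.large (8 GB) reproduces the parameter-group values above;
# db.m7g.xlarge (16 GB) gives the values for the upgrade path.
large = derive_memory_params(8)
xlarge = derive_memory_params(16)
```

For db.m7g.large this yields shared_buffers = 2 GB, effective_cache_size = 6 GB, and maintenance_work_mem = 512 MB, matching the parameter group.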

pgAudit Configuration (HIPAA compliance):

pgaudit.log = 'ALL'
pgaudit.log_level = 'LOG'
pgaudit.role = 'audit'

Logs all database activity, including:
- SELECT (with pgaudit.log = 'ALL')
- INSERT, UPDATE, DELETE
- DDL (CREATE, ALTER, DROP)
- User authentication

Maintenance Windows

Backup Window: 03:00 UTC daily (8:00 PM PDT)
- Automated snapshots retained for 7 days
- With Multi-AZ, backups are taken from the standby with no downtime; Single-AZ instances may see a brief I/O suspension

Maintenance Window: Sunday 04:00-05:00 UTC (Saturday 9:00-10:00 PM PDT)
- Applied automatically
- Minor version upgrades (e.g., 15.2 → 15.3)
- Security patches applied with minimal downtime
- Coordinate with the on-call engineer for production

Backup Strategy

Automated Backups:
- Retention: 7 days
- Frequency: Daily at 03:00 UTC
- Location: AWS-managed (us-west-1 region)
- RPO (Recovery Point Objective): 24 hours from daily snapshots alone; ~5 minutes when using point-in-time recovery
- RTO (Recovery Time Objective): 5-10 minutes to initiate a restore from backup

Point-in-Time Recovery:
- Enabled (leverages transaction logs)
- Can restore to any point within the 7-day retention window
- Useful for recovering from accidental data deletion

Manual Snapshots:
- Created before major application changes
- Created before environment promotions (for easy rollback)
- Command:

aws rds create-db-snapshot \
  --db-instance-identifier menotime-prod \
  --db-snapshot-identifier menotime-prod-pre-v1.2.3-20240201
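To keep pre-deployment snapshot identifiers consistent, the name can be generated rather than typed by hand. A small sketch (hypothetical helper, following the naming shown above):

```python
from datetime import date

def pre_deploy_snapshot_id(instance: str, version: str, on: date) -> str:
    """Build a snapshot identifier in the form
    {instance}-pre-{version}-{YYYYMMDD}, matching the convention above."""
    return f"{instance}-pre-{version}-{on.strftime('%Y%m%d')}"

# e.g. before promoting v1.2.3 on 2024-02-01:
snapshot_id = pre_deploy_snapshot_id("menotime-prod", "v1.2.3", date(2024, 2, 1))
```

The generated identifier can then be passed to `aws rds create-db-snapshot --db-snapshot-identifier`.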

Snapshot Lifecycle:

Automated Backup (7 days) → Manual Snapshot (indefinite)
                                ↓
                    Copy to menotime-backups S3
                                ↓
                        Lifecycle: 30 days
                        (Move to Glacier)

Copy to S3:
- Automated Lambda function triggers nightly
- Creates an RDS snapshot export to S3 in Parquet format
- Retention: 2 years in Glacier (lowest-cost archival)
- Use case: long-term compliance, historical analysis


Database Schema & Migrations

Schema Overview

Primary Tables:
- users — Patient accounts and clinic staff
- patients — Detailed patient profiles with HIPAA PII
- appointments — Clinic visits and consultations
- clinical_notes — Medical records and observations
- medications — Prescribed treatments
- measurements — Vital signs and symptom tracking
- audit_log — All data access and modifications (HIPAA)

Migration Tool: Alembic

Purpose: Database schema versioning and migration management

Location: /migrations/ directory (in application repository)

Structure:

migrations/
├── alembic.ini
├── env.py
├── script.py.mako
└── versions/
    ├── 001_initial_schema.py
    ├── 002_add_patient_preferences.py
    ├── 003_add_audit_logging.py
    └── ...

Migration Naming Convention:

{sequence}_{short_description}.py
001_initial_schema.py
002_add_patient_preferences.py
003_add_audit_logging.py (auto-incrementing)

Migration Workflow

Before Deployment:

  1. Create Migration:

     alembic revision --autogenerate -m "Add patient preferences table"

  2. Test Migration (local & dev):

     alembic upgrade head

  3. Test Rollback:

     alembic downgrade -1
     alembic upgrade head

  4. Include in Commit:
     - Migration files in /migrations/versions/
     - Application code using the new schema
     - Database tests validating the migration

During Deployment:

  1. ECS Task Definition: Includes a migration step in container startup:

     alembic upgrade head
     gunicorn main:app

  2. Health Check: Waits for migration completion before running the app

  3. Timeout: 60 seconds for migration (increase for large migrations)
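The startup timeout can be enforced with a thin wrapper around the migration step. A sketch of the idea (assuming a Python entrypoint, which is not necessarily how the container is currently wired):

```python
import subprocess
import sys

def run_migration(cmd: list, timeout_s: int = 60) -> bool:
    """Run the migration command (e.g. ["alembic", "upgrade", "head"]),
    killing it and failing container startup if it exceeds timeout_s."""
    try:
        completed = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        print(f"migration exceeded {timeout_s}s; aborting startup", file=sys.stderr)
        return False
    return completed.returncode == 0

# Entry point would then only exec gunicorn if run_migration(...) is True.
```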

Post-Deployment:

  1. Validation: Schema version in alembic_version table matches migration number

  2. Rollback (if needed):

     alembic downgrade {previous_revision}

Example Migration File

# migrations/versions/003_add_audit_logging.py

from alembic import op
import sqlalchemy as sa

revision = '003'
down_revision = '002'
branch_labels = None
depends_on = None

def upgrade():
    op.create_table(
        'audit_log',
        sa.Column('id', sa.Integer, primary_key=True),
        sa.Column('user_id', sa.Integer, sa.ForeignKey('users.id')),
        sa.Column('table_name', sa.String(255)),
        sa.Column('action', sa.String(10)),  # INSERT, UPDATE, DELETE
        sa.Column('old_values', sa.JSON),
        sa.Column('new_values', sa.JSON),
        sa.Column('timestamp', sa.DateTime, server_default=sa.func.now()),  # DB-side default
    )
    op.create_index('ix_audit_log_timestamp', 'audit_log', ['timestamp'])

def downgrade():
    op.drop_table('audit_log')

Connection Management

Connection Strings by Environment

Development:

postgresql://menotime_user:${DB_PASSWORD}@menotime-dev.xxxxx.us-west-1.rds.amazonaws.com:5432/menotime_dev?sslmode=require

Staging:

postgresql://menotime_user:${DB_PASSWORD}@menotime-staging.xxxxx.us-west-1.rds.amazonaws.com:5432/menotime_staging?sslmode=require

Production:

postgresql://menotime_user:${DB_PASSWORD}@menotime-prod.xxxxx.us-west-1.rds.amazonaws.com:5432/menotime_prod?sslmode=require

The xxxxx segments are placeholders for each instance's unique endpoint identifier; DB_PASSWORD is injected from Secrets Manager at deploy time.

Connection Pool Configuration

Application: FastAPI + SQLAlchemy

Pool Settings (in main.py):

from sqlalchemy import create_engine
from sqlalchemy.pool import QueuePool

engine = create_engine(
    DATABASE_URL,
    poolclass=QueuePool,
    pool_size=10,  # Number of connections to keep in pool
    max_overflow=20,  # Additional connections if needed
    pool_pre_ping=True,  # Verify connection before using
    pool_recycle=3600,  # Recycle connections every hour
    echo_pool=False,  # Disable pool logging (verbose)
)

Pool Monitoring:
- Monitor via CloudWatch: DatabaseConnections metric
- Alert if connections exceed 400 (80% of the 500 max)
- Idle connections are recycled after 1 hour
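Whether the ECS fleet can exceed the 400-connection alert line follows directly from arithmetic on the pool settings above; a quick sketch (hypothetical helper):

```python
def fleet_connection_ceiling(tasks: int, pool_size: int = 10, max_overflow: int = 20) -> int:
    """Worst-case connections the ECS fleet can open: each task may use
    up to pool_size + max_overflow connections (30 with the settings above)."""
    return tasks * (pool_size + max_overflow)

def within_alert_threshold(tasks: int, max_connections: int = 500, alert_ratio: float = 0.8) -> bool:
    """True if the fleet's worst case stays under the 80% alert line (400)."""
    return fleet_connection_ceiling(tasks) <= max_connections * alert_ratio
```

With these defaults, 13 tasks (390 connections) stay under the alert line while 14 tasks (420) can trip it, which is worth checking before scaling out the service.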

Direct Database Access (for operations)

Via AWS Console:
  1. RDS Dashboard → Databases → menotime-prod
  2. Click "Create database proxy" for connection pooling (optional)
  3. The database cannot be publicly accessed; use AWS Secrets Manager for credentials

Via Bastion Host (recommended for production access):

# SSH to bastion (jump host in public subnet)
ssh ec2-user@bastion.menotime.ai

# Connect to database
psql -h menotime-prod.xxxxx.us-west-1.rds.amazonaws.com -U menotime_user -d menotime_prod

Via AWS Systems Manager Session Manager (zero-trust):

aws ssm start-session --target i-xxxxxxxxx

# Inside session:
psql -h menotime-prod.xxxxx.us-west-1.rds.amazonaws.com -U menotime_user -d menotime_prod

Credentials from Secrets Manager:

aws secretsmanager get-secret-value \
  --secret-id menotime/db/prod \
  --query SecretString \
  --output text | jq .
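The same secret can be turned into a connection string in application code. A sketch assuming the standard RDS secret shape (username, password, host, port, dbname keys — verify against the actual secret):

```python
import json
from urllib.parse import quote

def dsn_from_secret(secret_string: str) -> str:
    """Build an sslmode=require DSN from a Secrets Manager SecretString.
    Credentials are URL-encoded so special characters survive the DSN."""
    s = json.loads(secret_string)
    return (
        f"postgresql://{quote(s['username'])}:{quote(s['password'])}"
        f"@{s['host']}:{s['port']}/{s['dbname']}?sslmode=require"
    )
```

In practice the SecretString would come from boto3's `get_secret_value` rather than the CLI, but the parsing is the same.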

Performance Insights & Monitoring

Enable Performance Insights

Already enabled in staging and production.

What it tracks:
- Active session analytics (database load over time)
- SQL queries and execution plans
- Wait events (where database time is spent)
- Resource utilization (CPU, I/O, lock contention)

CloudWatch Metrics (Primary Monitoring)

Metric                     Threshold         Action
DatabaseConnections        >400              Investigate connection pool; consider scale-up
CPUUtilization             >75% sustained    Scale up to db.m7g.xlarge or optimize queries
FreeableMemory             <1 GB             Reduce working set, tune shared_buffers, or scale up memory
DiskQueueDepth             >10               Investigate slow queries; increase IOPS
ReadLatency                >5 ms sustained   Check I/O utilization; consider provisioned IOPS
WriteLatency               >10 ms sustained  Check WAL activity; increase max_wal_size
TransactionLogsDiskUsage   >80 GB            Investigate stalled replication slots or long-running transactions
ReplicaLag                 >1 s              Not applicable (Single-AZ currently)

Performance Insights Queries

Top Queries by Load (sample):

SELECT
  queryid,
  query,
  calls,
  total_exec_time,
  mean_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;

Identifying Lock Contention:

SELECT
  pid,
  usename,
  query_start,
  state_change,
  query
FROM pg_stat_activity
WHERE wait_event_type = 'Lock'
ORDER BY query_start;

Table Sizes (to identify bloat):

SELECT
  schemaname,
  tablename,
  pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
FROM pg_tables
WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;

Scaling & Capacity Planning

Current Capacity

db.m7g.large supports:
- ~500 active connections
- On the order of a few thousand transactions/second for a typical OLTP workload (heavily dependent on query mix)
- ~1 TB of data (expandable)

Growth Milestones

Patient Volume   Recommendations                         Action
<500             Current config (db.m7g.large)           Monitor CPU/memory
500-2,000        Consider Multi-AZ (high availability)   Enable Multi-AZ in production
2,000-5,000      Monitor closely; prepare upgrade plan   Test upgrade to db.m7g.xlarge in staging
>5,000           Upgrade to db.m7g.xlarge or larger      Implement upgrade during maintenance
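The milestones above reduce to a simple threshold lookup; a sketch (hypothetical helper, thresholds taken from the table):

```python
def scaling_action(patients: int) -> str:
    """Map patient volume to the recommended action from the
    growth-milestones table above."""
    if patients < 500:
        return "monitor"          # current config is sufficient
    if patients < 2000:
        return "enable-multi-az"  # high availability for production
    if patients <= 5000:
        return "plan-upgrade"     # test db.m7g.xlarge in staging
    return "upgrade-instance"     # move to db.m7g.xlarge or larger
```

A check like this could gate a monthly capacity-review alert rather than relying on someone re-reading the table.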

Upgrade Path

Scaling Up (minimal downtime with Multi-AZ):
  1. Upgrade the RDS instance class in the AWS Console
  2. If Multi-AZ: the upgrade is applied to the standby first, then a failover occurs (30-60 seconds of downtime)
  3. If Single-AZ: downtime during the upgrade (1-2 minutes)
  4. Monitor for 30 minutes post-upgrade

Command:

aws rds modify-db-instance \
  --db-instance-identifier menotime-prod \
  --db-instance-class db.m7g.xlarge \
  --apply-immediately  # or --no-apply-immediately for maintenance window

Multi-AZ Migration

Current: Single-AZ (baseline cost)
Multi-AZ: Synchronous replication across availability zones (cost: +50%)

Benefits:
- Automatic failover if an AZ fails (RTO: 60-120 seconds)
- Backups taken from the standby (no performance impact)
- Planned maintenance with minimal downtime

Process:

aws rds modify-db-instance \
  --db-instance-identifier menotime-prod \
  --multi-az \
  --apply-immediately

Backup & Disaster Recovery

Backup Verification Checklist

Weekly Task:

□ Verify automated backup completed (CloudWatch Backup dashboard)
□ Test restore from backup to staging (every month)
□ Check backup size and retention policy
□ Review S3 snapshot copy status

Disaster Recovery Scenarios

Scenario 1: Accidental Data Deletion

Recovery time: <5 minutes to initiate (provisioning the restored instance takes longer, depending on database size)

# 1. Identify the time of deletion
# 2. Restore to a point in time before the deletion
aws rds restore-db-instance-to-point-in-time \
  --source-db-instance-identifier menotime-prod \
  --target-db-instance-identifier menotime-prod-recovery \
  --restore-time 2024-02-01T10:00:00Z
# (use --use-latest-restorable-time INSTEAD of --restore-time to
#  restore to the most recent possible point; the flags are mutually exclusive)

# 3. Verify data restored correctly
# 4. Promote recovery instance to primary (DNS cutover)
# 5. Monitor for 30 minutes

Scenario 2: RDS Instance Failure

Recovery time: <2 minutes (Multi-AZ) or <10 minutes (restore from backup)

With Multi-AZ:
- Automatic failover to the standby replica
- No manual intervention required
- DNS updated automatically

Without Multi-AZ:
- Restore from the latest automated backup
- May lose up to 24 hours of data
- Consider enabling Multi-AZ immediately

Scenario 3: Entire Region Failure (unlikely)

Recovery time: >30 minutes

# 1. Copy the RDS snapshot to an alternate region (run the command
#    against the destination region; cross-region copies require the
#    source snapshot ARN, and encrypted snapshots also need --kms-key-id)
aws rds copy-db-snapshot \
  --source-db-snapshot-identifier arn:aws:rds:us-west-1:ACCOUNT_ID:snapshot:menotime-prod-backup \
  --target-db-snapshot-identifier menotime-prod-recovery-us-east-1 \
  --source-region us-west-1 \
  --region us-east-1

# 2. Restore from snapshot in us-east-1
# 3. Update application DNS to us-east-1 endpoint
# 4. Validate data integrity

Database Maintenance

Monthly Tasks

First Week:
- Review CPU/memory utilization trends
- Check connection pool health
- Verify no bottlenecks in Performance Insights

Second Week:
- Run VACUUM ANALYZE (reclaims dead tuples and refreshes planner statistics)
- Review and archive old audit logs
- Check index bloat (reindex if >20% bloat)

Third Week:
- Test backup restoration (staging environment)
- Review slow query logs
- Identify candidates for query optimization

Fourth Week:
- Update the parameter group if needed
- Plan capacity for the upcoming month
- Review cost allocation and instance utilization

Quarterly Tasks

  • Full index rebuild (REINDEX) and data integrity checks
  • Major version compatibility testing (in staging)
  • Disaster recovery drill (restore and validate)
  • Security audit (IAM access, audit logs review)

Operational Commands

Common Operations

Check Database Size:

aws rds describe-db-instances \
  --db-instance-identifier menotime-prod \
  --query 'DBInstances[0].AllocatedStorage'

View Current Connections:

# Connect to database
psql -h menotime-prod.xxxxx.us-west-1.rds.amazonaws.com -U menotime_user -d menotime_prod

# List active connections
SELECT count(*) FROM pg_stat_activity;

Kill a Long-Running Query:

SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE query LIKE 'SELECT %' AND state = 'active' AND query_start < now() - interval '1 hour';

Export Data to S3 (for analysis):

# The source must be a snapshot ARN (live instances cannot be exported
# directly); --iam-role-arn and --kms-key-id are required parameters.
aws rds start-export-task \
  --export-task-identifier menotime-prod-export-20240201 \
  --source-arn arn:aws:rds:us-west-1:ACCOUNT_ID:snapshot:menotime-prod-backup \
  --s3-bucket-name menotime-backups \
  --s3-prefix exports/ \
  --iam-role-arn arn:aws:iam::ACCOUNT_ID:role/EXPORT_ROLE \
  --kms-key-id alias/menotime-master \
  --export-only menotime_prod.public.patients menotime_prod.public.clinical_notes  # optional: specific tables

Summary

MenoTime's database foundation is built on RDS PostgreSQL with automated backups, encryption, and HIPAA-compliant audit logging. The db.m7g.large instance provides ample capacity for current and near-term growth, with clear upgrade and Multi-AZ migration paths for future scaling.

Key operational practices:
- Daily automated backups (7-day retention)
- Alembic-managed schema migrations
- Performance Insights for optimization
- Monthly maintenance and validation
- Regular disaster recovery testing

For deployment procedures, see ECS Fargate. For monitoring, see Monitoring.