
Designing HIPAA-Compliant AWS Infrastructure: VPC Architecture, Encryption, and Disaster Recovery

Key Takeaways

  • The AWS BAA covers specific HIPAA-eligible services, but signing a BAA does not make your infrastructure compliant. That is entirely on your architecture decisions — which services you use, how you configure them, and whether you enforce encryption and access controls consistently. The shared responsibility model is the most important concept to internalize before you start building.
  • VPC architecture for PHI isolation follows a straightforward pattern: three-tier subnets across multiple AZs, VPC endpoints to keep AWS API traffic off the public internet, security groups referencing group IDs (not CIDR blocks) for access control, and no direct internet access for anything except the load balancer. The pattern is well-established, but the details matter — a single misconfigured security group can undo the entire design.
  • Disaster recovery planning for healthcare is not optional, and an untested DR plan is not a plan. We target a 15-minute RPO and 45-minute RTO using cross-region RDS backup replication, S3 CRR, and Terraform-managed infrastructure. Quarterly DR tests revealed problems (missing Terraform modules, incorrect IAM policies in the DR region) that would otherwise have surfaced during an actual outage.

The Compliance Starting Point

HIPAA compliance for a healthcare SaaS platform is fundamentally an infrastructure problem. Application-level security matters, but if the database sits in a public subnet with default credentials, no amount of input validation will save you. This post covers how we designed a HIPAA-compliant AWS infrastructure from the ground up, including the trade-offs we made and the things we learned the hard way.

The starting point was an existing platform that had grown organically: EC2 instances in a default VPC, an RDS database with encryption disabled, S3 buckets with inconsistent policies, and CloudWatch logs that nobody monitored. It functioned, but it would not survive a compliance audit. Rather than retrofitting, we designed a new infrastructure and executed a zero-downtime migration.

The scope included: VPC redesign, encryption at every layer, WAF and DDoS protection, centralized audit logging, disaster recovery procedures, and documentation sufficient for both automated scanning tools (AWS Config, Security Hub) and human auditors.

The Shared Responsibility Model

The most important concept to understand before building HIPAA infrastructure on AWS is the shared responsibility model. AWS provides HIPAA-eligible services and signs a Business Associate Agreement (BAA), but the customer is responsible for configuring those services correctly, managing access controls, encrypting PHI, and maintaining audit logs. "AWS is HIPAA-eligible" does not mean "my workload is HIPAA-compliant." That distinction trips up a lot of teams.

AWS HIPAA-Eligible Service Selection

The first constraint: every service that touches, processes, stores, or transmits PHI must be on the AWS HIPAA-eligible services list and covered under the BAA. This eliminated several convenient services the team was using and required finding compliant alternatives.

Service Architecture Decisions

  • Compute: ECS Fargate. We chose Fargate over EC2 to eliminate OS patching and host hardening. Each service runs in its own task definition with least-privilege IAM roles. The operational simplicity trade-off is worth it — one less category of things to patch and audit.
  • Database: RDS PostgreSQL with Multi-AZ, KMS encryption (AES-256), 35-day backup retention. We evaluated Aurora but chose standard RDS for cost predictability at our scale (about 200GB, growing 5GB/month). Aurora's pricing model makes more sense at higher throughput or when you need read replicas for analytics.
  • Storage: S3 with default SSE-KMS encryption, versioning, and bucket policies that deny non-TLS requests. A dedicated PHI bucket has additional access logging and MFA-delete protection. Non-PHI assets (CSS, JS, images) live in a separate bucket with simpler policies.
  • Caching: ElastiCache Redis with encryption in transit and at rest. Used for session management, rate limiting, and caching non-PHI reference data (fee schedules, code descriptions). PHI is never cached — always served from the encrypted database. This is a deliberate simplification that avoids cache invalidation bugs leading to stale PHI.
  • Messaging: SQS with SSE-KMS for queues carrying PHI references. Dead letter queues with encryption and 14-day retention for debugging failed processing.
  • Search: OpenSearch (Elasticsearch) in VPC mode with at-rest encryption, node-to-node encryption, and fine-grained access control. PHI fields stored as encrypted attributes with field-level policies.
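The storage tier's "deny non-TLS, require SSE-KMS" enforcement can be sketched as a small policy builder. This is a minimal illustration, not our exact policy; the bucket name is a hypothetical placeholder, though the statement shapes and condition keys are standard IAM.

```python
def phi_bucket_policy(bucket: str) -> dict:
    """Bucket policy with two deny statements: refuse any non-TLS request,
    and refuse PutObject calls that do not ask for SSE-KMS encryption."""
    arn = f"arn:aws:s3:::{bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Deny every action on the bucket when transport is not TLS.
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [arn, f"{arn}/*"],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            },
            {
                # Deny uploads missing the SSE-KMS header. An absent header
                # also matches StringNotEquals, so plain puts are rejected.
                "Sid": "DenyUnencryptedUploads",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": f"{arn}/*",
                "Condition": {
                    "StringNotEquals": {
                        "s3:x-amz-server-side-encryption": "aws:kms"
                    }
                },
            },
        ],
    }

policy = phi_bucket_policy("example-phi-bucket")  # hypothetical bucket name
```

The same JSON is what ends up in the Terraform `aws_s3_bucket_policy` resource; building it in one place keeps the PHI and non-PHI buckets from drifting apart.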

Services We Avoided

Several common AWS services were not on the HIPAA-eligible list or presented compliance risks. We avoided Lambda@Edge for anything that might encounter PHI (regional Lambda behind API Gateway instead). We deployed Keycloak on Fargate rather than using Cognito, due to concerns about PHI leaking into user attributes. We limited CloudFront to static assets with sanitized URLs, using ALB directly for API traffic. Each of these decisions reduced convenience but simplified the compliance posture — fewer services touching PHI means fewer things to audit and fewer potential exposure points.

VPC Architecture & Network Security

The VPC follows defense-in-depth. Nothing is publicly accessible except the ALB, and that sits behind WAF. Everything else — application services, databases, caches, queues — runs in private subnets with no direct internet access.

Subnet Layout

Three-tier architecture across three Availability Zones (us-east-1a, 1b, 1c). Public tier has only the ALB and NAT Gateways. Application tier has ECS Fargate tasks. Data tier has RDS, ElastiCache, and OpenSearch. Traffic between tiers is controlled by security groups with explicit allow rules — default is deny-all.

  • VPC endpoints: A gateway endpoint for S3 plus interface endpoints for SQS, KMS, CloudWatch Logs, ECR, and Secrets Manager keep all AWS API traffic within the AWS network. This is a security improvement (traffic never crosses the public internet, even through NAT) and also reduced NAT Gateway data processing costs by about 40%, since the S3 gateway endpoint is free and bypasses NAT entirely.
  • Security groups as identity: Access rules reference security group IDs, not CIDR blocks. The database security group allows PostgreSQL traffic only from the application security group. Because access is granted by group membership rather than IP ranges, autoscaled tasks inherit the correct permissions automatically, and nothing outside the application tier can open a database connection: the security groups enforce the communication topology.
  • Network ACLs: An additional defense layer blocking known-bad IP ranges, restricting outbound traffic to necessary destinations (AWS endpoints, specific third-party API ranges), and logging all denied traffic.
  • Flow logs: VPC Flow Logs enabled on all subnets and ENIs. Logs go to CloudWatch for real-time alerting and S3 in Parquet format for long-term retention and Athena analysis during investigations.
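The security-group-as-identity rule can be illustrated with the ingress payload shape that boto3's authorize_security_group_ingress accepts (we manage ours in Terraform, but the structure is the same). The group IDs and description below are hypothetical.

```python
def sg_reference_rule(port: int, source_sg_id: str, description: str) -> dict:
    """Ingress permission that admits traffic by security-group membership
    rather than by CIDR block. This is the IpPermissions element shape that
    boto3's authorize_security_group_ingress accepts."""
    return {
        "IpProtocol": "tcp",
        "FromPort": port,
        "ToPort": port,
        # UserIdGroupPairs (not IpRanges) is what makes the rule track
        # group membership instead of addresses.
        "UserIdGroupPairs": [
            {"GroupId": source_sg_id, "Description": description}
        ],
    }

# Database tier: PostgreSQL only from the application tier's security group.
db_ingress = sg_reference_rule(5432, "sg-0a1b2c3d4e5f6a7b8", "app tier to postgres")
```

Because no CIDR appears anywhere, the rule stays correct as Fargate tasks come and go with fresh IPs.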

Access Patterns

There is no bastion host. All production access goes through AWS Systems Manager Session Manager — shell access without opening inbound ports, without SSH keys to manage, and with a complete audit trail of every command. Sessions log to CloudWatch and S3. For database access during incidents, we spin up a temporary ECS task running pgAdmin connected to RDS through the private subnet, accessible only via Session Manager port forwarding. This is more cumbersome than a direct database connection, but it means there is no persistent database access path that could be compromised.
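The Session Manager flow above looks roughly like the following CLI invocations. The instance ID and hostname are hypothetical placeholders, and the port-forwarding target in our setup is the temporary pgAdmin task rather than a plain instance; this is the general shape, not our exact runbook.

```shell
# Interactive shell on a private instance: no inbound ports, no SSH keys,
# every session logged to CloudWatch and S3 per the session preferences.
aws ssm start-session --target i-0123456789abcdef0

# Port forwarding for the temporary database-access path: tunnel local 5432
# through a private host to RDS, still with a full session audit trail.
aws ssm start-session \
  --target i-0123456789abcdef0 \
  --document-name AWS-StartPortForwardingSessionToRemoteHost \
  --parameters '{"host":["mydb.example.us-east-1.rds.amazonaws.com"],"portNumber":["5432"],"localPortNumber":["5432"]}'
```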

Encryption Strategy: At Rest & In Transit

HIPAA classifies encryption as an "addressable" implementation specification, which does not mean optional — it means you must implement it or document why an equivalent alternative is appropriate. We chose to encrypt everything without exception. It is simpler to enforce a universal encryption policy than to track which data is PHI and which is not, and the performance overhead of modern encryption is negligible.

Encryption at Rest

  • KMS key management: Customer-managed keys (CMKs) rather than AWS-managed keys for all PHI resources. CMKs give control over rotation schedules, key policies (separate encrypt/decrypt permissions), and the ability to revoke access by disabling the key — required by our data destruction policy when offboarding clients.
  • RDS encryption: AES-256 via KMS for all data at rest, including backups, replicas, and snapshots. Important: encryption must be enabled at database creation; it cannot be switched on for an existing instance (the standard workaround, an encrypted snapshot copy and restore, is itself a migration). This is why we built new infrastructure rather than retrofitting the old database.
  • S3 encryption: Bucket policies deny any PutObject without the SSE-KMS header. The PHI bucket uses a dedicated KMS key separate from the application logs key, enabling independent access control and audit trails.
  • Field-level encryption: Beyond storage encryption, we use the AWS Encryption SDK for application-level field encryption of the most sensitive elements (SSNs, full names, dates of birth). These fields are encrypted before reaching the database, so a DBA with full RDS access sees only ciphertext. Decryption requires a separate KMS key granted only to specific services with documented need. This is envelope encryption in practice — the field data is encrypted with a data key, and the data key is encrypted with the KMS key.
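The envelope pattern described above can be shown with a deliberately simplified sketch. The XOR keystream below is a toy stand-in, NOT real cryptography; production code uses the AWS Encryption SDK, which wraps AES-GCM data keys via KMS GenerateDecrypt/GenerateDataKey calls. Only the key-wrapping structure is the point here.

```python
import hashlib
import secrets

def _keystream_xor(key: bytes, data: bytes) -> bytes:
    """Toy symmetric cipher (SHA-256 keystream XOR). Stand-in only:
    real envelope encryption uses AES-GCM."""
    out = bytearray()
    counter = 0
    while len(out) < len(data):
        out.extend(hashlib.sha256(key + counter.to_bytes(8, "big")).digest())
        counter += 1
    return bytes(a ^ b for a, b in zip(data, out))

def encrypt_field(master_key: bytes, plaintext: bytes) -> dict:
    """Envelope encryption: a fresh data key encrypts the field, and the
    master key (KMS's role in production) encrypts the data key. Only the
    two ciphertexts are stored, so a DBA sees ciphertext only."""
    data_key = secrets.token_bytes(32)
    return {
        "ciphertext": _keystream_xor(data_key, plaintext),
        "encrypted_data_key": _keystream_xor(master_key, data_key),
    }

def decrypt_field(master_key: bytes, record: dict) -> bytes:
    """Unwrap the data key with the master key, then decrypt the field.
    In production the unwrap step is a kms:Decrypt call, which is where
    the separate key policy is enforced."""
    data_key = _keystream_xor(master_key, record["encrypted_data_key"])
    return _keystream_xor(data_key, record["ciphertext"])
```

Revoking the KMS key (or denying kms:Decrypt to a service) instantly renders every wrapped data key, and therefore every field, unreadable, which is what makes this the enforcement point for the data destruction policy.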

Encryption in Transit

All communication uses TLS 1.2 minimum, with TLS 1.3 preferred. The ALB terminates external TLS using ACM certificates with automatic renewal. Internal service-to-service communication uses mutual TLS (mTLS) with certificates from ACM Private CA. Even within the private VPC, traffic between services is encrypted and authenticated — a compromised container cannot impersonate another service without the correct client certificate.

Database connections use SSL with sslmode=verify-full. Redis uses ElastiCache in-transit encryption. DNS queries go through Route 53 Resolver DNS Firewall to prevent exfiltration, with query logging for monitoring. The mTLS setup adds operational complexity (certificate rotation, CA management) but eliminates an entire class of lateral movement attacks.
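On the client side, the transport rules reduce to a strict TLS context and a pinned database DSN. A sketch using the stdlib ssl module and a libpq-style connection string; hostnames and certificate paths are placeholders.

```python
import ssl

def strict_client_context() -> ssl.SSLContext:
    """TLS client context for internal callers: TLS 1.2 floor with full
    verification. (mTLS client certs would be added via load_cert_chain.)"""
    ctx = ssl.create_default_context()
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    ctx.check_hostname = True
    ctx.verify_mode = ssl.CERT_REQUIRED
    return ctx

def pg_dsn(host: str, db: str, user: str, ca_file: str) -> str:
    """libpq-style DSN pinning sslmode=verify-full: encrypt, verify the
    server certificate against our CA, and check the hostname."""
    return (f"host={host} dbname={db} user={user} "
            f"sslmode=verify-full sslrootcert={ca_file}")

dsn = pg_dsn("db.internal.example", "phi", "app", "/etc/ssl/rds-ca.pem")
```

verify-full matters: the weaker sslmode=require encrypts but accepts any certificate, which would let a spoofed endpoint terminate the connection.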

WAF, Shield & DDoS Protection

Healthcare data is a high-value target — patient records sell for significantly more than credit card numbers because they enable identity theft, insurance fraud, and prescription fraud. Our security posture assumes the application will be actively targeted.

WAF Configuration

AWS WAF on the ALB with a layered rule set combining managed rules and custom rules.

  • Managed rules: Core Rule Set, Known Bad Inputs, SQL Injection, and Linux OS rule groups. These catch the bulk of common attacks with minimal tuning.
  • Rate limiting: 2,000 requests per 5 minutes for general endpoints, 100 per 5 minutes for auth endpoints. Thresholds are set per-endpoint based on observed legitimate traffic — too aggressive and you block real users, too permissive and you let attacks through.
  • Geo-blocking: The platform serves US-based practices only, so we block traffic from countries with no users. This reduced automated scanning noise by roughly 70% and simplified monitoring.
  • Custom rules: Block suspicious header and parameter patterns, enforce max request body sizes, and require correct content-type headers for API endpoints.
  • Bot control: AWS Bot Control identifies automated traffic from known bot networks while allowing legitimate automation (monitoring, partner APIs) via allowlist.

Shield Advanced

We enrolled the ALB in Shield Advanced for DDoS protection. The main value is always-on network monitoring, automatic mitigation within seconds, and access to the AWS DDoS Response Team. The cost protection feature matters too — Shield Advanced refunds scaling charges during attacks, capping the financial impact. At $3,000/month, it is a straightforward cost for a platform where downtime affects patient care. The decision was easy once we framed it as insurance rather than a feature.

Audit Logging with CloudTrail and GuardDuty

Audit logging serves two purposes: HIPAA compliance and practical security operations. The logging architecture captures three event categories: AWS API activity (CloudTrail), application-level events (who accessed which patient record), and network activity (VPC Flow Logs).

CloudTrail Configuration

  • Organization trail: A single trail at the AWS Organization level captures management and data events across all accounts. No individual account can disable its own logging.
  • Data events: Enabled selectively for S3 operations on PHI buckets, Lambda invocations, and DynamoDB operations on session tables. Data events are expensive at scale, so we only enable them on PHI-touching resources rather than blanket-enabling everything.
  • Log integrity: SHA-256 digest files generated hourly verify that logs have not been tampered with. During audits, we run the validate-logs command against sample time ranges to demonstrate integrity.
  • Log storage: Logs go to a dedicated S3 bucket in a separate security account that application engineers cannot access. Versioning, MFA-delete, Glacier transition after 90 days, and 7-year retention, comfortably above the six years HIPAA requires for audit documentation.
  • Real-time alerting: CloudTrail events stream to CloudWatch Logs where metric filters trigger alarms for high-severity events: root account usage, IAM policy changes, security group modifications, KMS key deletion attempts. Alerts go to the security team via PagerDuty within 5 minutes.
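The metric-filter logic above can be expressed as a simple matcher over CloudTrail's JSON event shape. The event-name set below is illustrative, not our full filter list, and the severity labels are our own.

```python
# Event names that modify security posture and page the team (illustrative
# subset; these are real CloudTrail eventName values).
HIGH_SEVERITY_EVENTS = {
    "DeleteTrail", "StopLogging",                     # audit-trail tampering
    "ScheduleKeyDeletion", "DisableKey",              # KMS key destruction
    "AuthorizeSecurityGroupIngress",                  # network rule changes
    "RevokeSecurityGroupIngress",
    "PutUserPolicy", "AttachRolePolicy",              # IAM policy changes
}

def should_alert(event: dict) -> bool:
    """Mirror of the CloudWatch metric filters: any root-account API call
    alerts unconditionally; otherwise match on the event name."""
    if event.get("userIdentity", {}).get("type") == "Root":
        return True
    return event.get("eventName") in HIGH_SEVERITY_EVENTS

root_login = {"userIdentity": {"type": "Root"}, "eventName": "ConsoleLogin"}
sg_change = {"userIdentity": {"type": "IAMUser"},
             "eventName": "AuthorizeSecurityGroupIngress"}
read_only = {"userIdentity": {"type": "IAMUser"},
             "eventName": "DescribeInstances"}
```

In production this logic lives in CloudWatch metric filter patterns rather than application code, but writing it out makes the alert surface easy to review in an audit.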

GuardDuty and Security Hub

GuardDuty provides continuous threat detection by analyzing CloudTrail logs, VPC Flow Logs, and DNS logs for suspicious patterns — unusual API calls, unauthorized access attempts, compromised instances communicating with known malicious IPs. Security Hub aggregates findings from GuardDuty, CloudTrail, Inspector, Macie, and Config into a single view. We enable the CIS AWS Foundations Benchmark and AWS Foundational Security Best Practices standards, which evaluate infrastructure against 200+ controls continuously. Any HIGH or CRITICAL finding creates a Jira ticket with a 24-hour investigation SLA and 72-hour remediation SLA.

Disaster Recovery & Business Continuity

Healthcare platforms cannot tolerate extended outages — clinicians rely on them for records, treatment planning, and billing during patient appointments. Our DR design targets RPO of 15 minutes and RTO of 45 minutes for a full regional failure.

Replication Strategy

  • Primary region (us-east-1): Multi-AZ for all production workloads. RDS Multi-AZ provides synchronous replication with automatic failover (under 60 seconds). ECS services run across three AZs with minimum 2 healthy tasks per service.
  • DR region (us-west-2): RDS automated backups copied cross-region. S3 PHI buckets replicated via CRR with encryption maintained through a KMS key grant in the destination region. These replicated resources allow standing up a functional environment in us-west-2 within our RTO target.
  • Point-in-time recovery: RDS PITR enabled with continuous backup, allowing restoration to any second within a 35-day window. This addresses data corruption scenarios (accidental deletions, application bugs) as opposed to infrastructure failures.
  • Infrastructure as Code: The entire infrastructure is in Terraform with state stored in S3 in the DR region. During regional failure, terraform apply in us-west-2 provisions identical infrastructure, the database restores from the latest cross-region backup, and DNS updates to the new region. The full procedure is documented as a runbook.

DR Testing

An untested DR plan is a hope, not a plan. We run quarterly DR tests: full failover to us-west-2, verify all functionality against restored data, measure actual RPO/RTO, then fail back. Each test is documented with timestamps, deviations, and lessons learned.

The first test revealed two missing Terraform modules and took 73 minutes (against our 45-minute target). By the fourth test, we consistently hit 38-42 minutes. Every test has found at least one issue — an expired cross-region IAM role, a missing DNS record, a Terraform provider version mismatch. These are exactly the problems you want to discover during a planned test rather than during an actual outage at 2 AM.

  • Backup verification: Daily automated restore of the latest RDS snapshot to a temporary instance, running data integrity checks (row counts, checksum validation, referential integrity), then terminating. This catches backup corruption that would otherwise only surface during actual recovery.
  • Chaos engineering: AWS Fault Injection Simulator periodically injects failures — terminating ECS tasks, degrading cross-AZ connectivity, simulating RDS failovers — during business hours without advance notice to the operations team. This verifies that auto-scaling, health checks, and alerting respond correctly.
  • Communication plan: The DR plan includes templates and a distribution process. Automated StatusPage updates trigger within 5 minutes of an outage, detailed communications go to administrators within 15 minutes, and post-incident reports publish within 48 hours.
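The daily backup-verification job boils down to comparing per-table fingerprints (row count plus a digest over ordered rows) between production and the restored snapshot. A sketch using sqlite3 as a stand-in for the two PostgreSQL connections the real job opens; the table and key names are hypothetical.

```python
import hashlib
import sqlite3

def table_fingerprint(conn: sqlite3.Connection, table: str, key: str):
    """Row count and a digest over rows in key order. (Interpolating table
    names into SQL is fine here only because they come from our own config.)"""
    rows = conn.execute(f"SELECT * FROM {table} ORDER BY {key}").fetchall()
    digest = hashlib.sha256()
    for row in rows:
        digest.update(repr(row).encode())
    return len(rows), digest.hexdigest()

def verify_restore(source, restored, tables):
    """Fail loudly if any table's count or digest diverges after restore;
    a silent pass is what lets backup corruption hide until a real recovery."""
    for table, key in tables:
        if table_fingerprint(source, table, key) != table_fingerprint(restored, table, key):
            raise RuntimeError(f"integrity check failed for {table}")
```

The real job adds referential-integrity queries and runs against a temporary RDS instance that is terminated afterward, but the compare-and-raise core is this small.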

The total infrastructure cost runs about $14,200/month — roughly 2.3x the original non-compliant setup. The premium covers Multi-AZ deployments, encryption overhead, WAF, Shield Advanced, enhanced logging, and DR replication. This is the cost of operating in healthcare. The infrastructure is table stakes for enterprise healthcare contracts, and the compliance posture it provides is what enables the platform to compete for those contracts.

HIPAA-Compliant Infrastructure

Need to design HIPAA-compliant cloud infrastructure?

We have built and operated HIPAA-compliant AWS environments for healthcare SaaS platforms. If you are planning a compliance migration or building from scratch, we can help with architecture and implementation.

Talk to Our Healthcare Team
