Key Takeaways
- Healthcare workflows span multiple bounded contexts — scheduling, insurance, clinical records, billing — that cannot be wrapped in a single database transaction. Event sourcing gives you a natural audit trail as a byproduct of the architecture rather than a bolt-on system, which matters in a domain where regulators need to know not just the current state of a record but exactly how it got there.
- Choosing between RabbitMQ, SQS, and Kafka for healthcare messaging involves real trade-offs. We landed on SQS with SNS fan-out for most workflows (zero infrastructure management with a small DevOps team) and RabbitMQ via Amazon MQ for workflows needing priority queues and sophisticated routing. Kafka was overkill for our throughput.
- The saga pattern with compensating transactions is essential for multi-step clinical processes, but getting compensation logic right is harder than the happy path. The most common mistake is treating compensation as simple undo — in healthcare, rolling back a step often means transitioning to a different valid state rather than deleting the work.
The Workflow Challenge in Healthcare Operations
A single patient encounter in healthcare triggers a cascade of dependent processes: scheduling confirmation, insurance eligibility verification, clinical documentation, charge capture, claim submission, payment posting, and patient billing. Each step involves different systems, different teams, and different timing constraints. When any step fails or stalls, the downstream impact ripples through the entire workflow.
The systems we inherited were running these workflows through a combination of manual handoffs, cron-scheduled batch jobs, and point-to-point REST integrations. The problems were predictable: tasks fell through the cracks, staff spent hours chasing status updates across systems, and cascading failures happened several times per week. When the practice management system was slow to respond, the eligibility check would time out, the scheduling confirmation never fired, and the patient showed up for an appointment that the front desk did not know was unverified. Each incident took about 45 minutes of staff time to untangle.
The core engineering challenge was clear: we needed to decouple these interdependent processes so that failures in one system do not cascade, while maintaining end-to-end visibility and a complete audit trail for compliance.
Requirements and Constraints
- Process thousands of patient encounters per week across practices with varying EHR systems (Dentrix, Eaglesoft, Open Dental, and a custom clinical platform). No single integration approach works for all of them.
- Maintain complete audit trails for every workflow step — timestamps, actor identification, state transitions. Required for HIPAA and state regulatory compliance, and non-negotiable.
- Support both synchronous workflows (real-time eligibility checks during scheduling) and asynchronous workflows (claim submission batching, appeal processing with multi-day payer response times).
- Handle partial failures gracefully. A workflow that fails at step 4 of 7 should not leave orphaned records in the first 3 systems and require manual cleanup.
Why Event-Driven Architecture for Healthcare
We evaluated three architectural approaches: pure orchestration (a central workflow engine like Camunda or Temporal running every process), pure choreography (event-driven with no central coordinator), and a hybrid. We chose the hybrid — event-driven choreography for loosely coupled processes and orchestration via Temporal for tightly coupled clinical workflows that require strict ordering guarantees. Pure choreography was tempting for its simplicity, but clinical workflows have enough ordering constraints that a choreographed approach would have pushed too much coordination logic into individual services.
Message Broker Selection: RabbitMQ vs SQS vs Kafka
This decision came down to operational complexity versus feature requirements. Kafka was the first option we ruled out — our peak throughput is around 800 events per second, and Kafka's strengths (log compaction, high-throughput streaming, consumer group rebalancing) solve problems we do not have. The operational overhead of running Kafka clusters was not justified for our scale.
Between SQS and RabbitMQ, we ended up using both. SQS with SNS fan-out handles the majority of our messaging: it requires zero infrastructure management, scales automatically, and is covered under the AWS BAA for HIPAA compliance. For workflows needing more sophisticated routing — topic-based routing, dead letter exchanges with configurable retry policies, and priority queues for urgent clinical events — we run RabbitMQ on Amazon MQ. The trade-off is that RabbitMQ requires managing broker instances, monitoring queue depth, and handling failover, but for the subset of workflows that need these features, it is worth it.
For event storage and archival, we use Amazon EventBridge as the event bus with S3-backed archival. Every event is automatically archived in partitioned Parquet format — this gives us both real-time routing and cost-effective long-term storage. The archive retains events for 7 years per HIPAA requirements, and we can replay any time window into any consumer for debugging or reprocessing.
Event Schema Design
We defined a standardized event envelope that every service must use: event ID, timestamp, correlation ID, causation ID, source service, schema version, and a domain-specific payload. The correlation ID is the most operationally useful part — it chains related events across the entire workflow, so tracing a patient's journey from scheduling through claim payment is a single query against the event store. Schema versioning follows a backward-compatible evolution strategy: new fields can be added but existing fields are never removed or renamed. This avoids the painful coordination of synchronized service deployments.
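The envelope described above can be sketched as a TypeScript type with a helper that chains correlation and causation IDs. This is an illustrative sketch, not our exact production schema — field names are assumptions, and a sequential ID generator stands in for UUIDs:

```typescript
// Sketch of the standardized event envelope. Field names are illustrative.
interface EventEnvelope<T> {
  eventId: string;
  timestamp: string;          // ISO-8601, UTC
  correlationId: string;      // constant across one workflow instance
  causationId: string | null; // eventId of the event that caused this one
  source: string;             // emitting service
  schemaVersion: number;      // bumped only for backward-compatible additions
  payload: T;
}

// Stand-in for a UUID generator, to keep the sketch self-contained.
let idSeq = 0;
const nextId = () => `evt-${++idSeq}`;

// Start a new workflow: the correlation ID is the root event's own ID.
function rootEvent<T>(source: string, payload: T): EventEnvelope<T> {
  const eventId = nextId();
  return {
    eventId, timestamp: new Date().toISOString(), correlationId: eventId,
    causationId: null, source, schemaVersion: 1, payload,
  };
}

// Derive a follow-up event: correlation carries forward, causation points
// at the triggering event — this is what makes end-to-end tracing a single
// query on correlationId.
function follow<T>(cause: EventEnvelope<unknown>, source: string, payload: T): EventEnvelope<T> {
  return {
    eventId: nextId(), timestamp: new Date().toISOString(),
    correlationId: cause.correlationId, causationId: cause.eventId,
    source, schemaVersion: 1, payload,
  };
}
```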
Event Sourcing Implementation
For core domain entities — patient encounters, claims, and treatment plans — we implemented full event sourcing. Instead of storing the current state of a claim in a database row and overwriting it on each update, we store the complete sequence of events: ClaimCreated, LineItemAdded, EligibilityVerified, ClaimSubmitted, RemittanceReceived, PaymentPosted, and so on.
The motivation was a compliance requirement that turned out to be architecturally beneficial. Regulators need to know not just the current state of a claim, but how it got there — who touched it, when, and what changed. With a traditional CRUD model, you need a separate audit log that can drift out of sync. With event sourcing, the audit trail is the data model. You get it for free.
Event Store Design
The event store runs on PostgreSQL with a schema optimized for append-only writes and stream reads. Each aggregate (claim, encounter, treatment plan) has its own event stream identified by a composite key of aggregate type and ID. We use optimistic concurrency control with stream version numbers — if two services try to append to the same stream simultaneously, one gets a concurrency conflict and must reload current state before retrying. This is rare in practice (conflicts happen when two users modify the same claim simultaneously), but the system handles it gracefully rather than silently losing writes.
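The append path with optimistic concurrency can be sketched as follows. Production uses PostgreSQL with the stream version enforced by a constraint; an in-memory map stands in here so the control flow is visible:

```typescript
// In-memory sketch of optimistic-concurrency appends. In production the
// (streamId, version) pair is enforced by the database, not by this class.
type StoredEvent = { streamId: string; version: number; type: string; data: unknown };

class ConcurrencyConflict extends Error {}

class EventStore {
  private streams = new Map<string, StoredEvent[]>();

  // expectedVersion is the stream version the writer last read. If another
  // writer appended in the meantime, the caller must reload and retry.
  append(streamId: string, expectedVersion: number, type: string, data: unknown): number {
    const stream = this.streams.get(streamId) ?? [];
    if (stream.length !== expectedVersion) {
      throw new ConcurrencyConflict(
        `expected v${expectedVersion}, stream is at v${stream.length}`,
      );
    }
    const event = { streamId, version: stream.length + 1, type, data };
    stream.push(event);
    this.streams.set(streamId, stream);
    return event.version;
  }

  read(streamId: string): StoredEvent[] {
    return this.streams.get(streamId) ?? [];
  }
}
```

The key property is that a stale writer fails loudly instead of silently overwriting — which is exactly the behavior described above for two users touching the same claim.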
- Write performance: The event store handles 1,200 events per second sustained, with burst capacity to 3,500. Our peak load is about 800 events per second during Monday morning scheduling rushes, so we have comfortable headroom.
- Read projections: Event streams are projected into read-optimized views using async projections. The claims dashboard, for example, consumes claim events and maintains a denormalized view in Elasticsearch for filtering, aggregation, and full-text search. The projection can fall a few seconds behind during peak load, which is acceptable for dashboard views.
- Snapshots: For aggregates with long event histories (some treatment plans accumulate 200+ events over multi-year treatment courses), we generate periodic snapshots every 50 events to avoid replaying the entire stream on every read. Without snapshots, loading a long-running treatment plan takes several hundred milliseconds — with snapshots, it is consistently under 20ms.
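The snapshot mechanics in the last bullet reduce to a fold that starts from the snapshot instead of the empty state. A minimal sketch, with the 50-event interval from above (the `apply` function is whatever event-fold the aggregate defines):

```typescript
// Snapshot-assisted aggregate loading: replay only the events newer than
// the snapshot. Interval matches the article; everything else is a sketch.
const SNAPSHOT_INTERVAL = 50;

type Snapshot<S> = { version: number; state: S };

function load<S, E>(
  events: E[],                       // full ordered stream for the aggregate
  apply: (state: S, event: E) => S,  // the aggregate's event-fold function
  initial: S,
  snapshot?: Snapshot<S>,
): { state: S; version: number } {
  let state = snapshot ? snapshot.state : initial;
  const from = snapshot ? snapshot.version : 0;
  for (const event of events.slice(from)) state = apply(state, event);
  return { state, version: events.length };
}

// Decide at append time whether to persist a fresh snapshot.
function shouldSnapshot(version: number): boolean {
  return version > 0 && version % SNAPSHOT_INTERVAL === 0;
}
```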
Saga Patterns for Clinical Workflows
Healthcare workflows frequently span multiple bounded contexts. A patient onboarding process touches scheduling, insurance verification, clinical records, and billing setup. These cross-context processes cannot be wrapped in a single database transaction, so we use the saga pattern to coordinate them with explicit compensation logic for failure scenarios.
Patient Onboarding Saga Example
The onboarding saga has seven steps: (1) create patient record in the clinical system, (2) verify insurance eligibility, (3) check prior authorization requirements, (4) create the financial account, (5) assign a care coordinator, (6) send welcome communications, (7) schedule the initial consultation. Each step publishes a completion event triggering the next, and each has a compensating action for rollback.
The important nuance is that compensating actions in healthcare are not simple deletes. If step 4 fails because the insurance plan is not contracted, the compensation for step 5 is not "delete the coordinator assignment" — it is "mark the assignment as pending insurance resolution." The patient record from step 1 is not deleted either; it transitions to a holding state. Healthcare data generally cannot be destroyed, only transitioned. Getting this right required close collaboration with the clinical and billing teams to understand every failure mode and its appropriate resolution state.
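The "transition, not undo" idea can be sketched as a compensation table that maps each completed step to its resolution state. Step names follow the onboarding saga above; the status values are illustrative, not our production vocabulary:

```typescript
// Sketch: compensation in healthcare transitions work to a valid holding
// state rather than deleting it. Statuses here are illustrative.
type StepRecord = { step: string; status: string };

const compensations: Record<string, (r: StepRecord) => StepRecord> = {
  createPatientRecord:   r => ({ ...r, status: "holding" }),
  createFinancialAccount: r => ({ ...r, status: "suspended" }),
  assignCareCoordinator: r => ({ ...r, status: "pending-insurance-resolution" }),
};

// Compensate completed steps in reverse order, as a saga coordinator would.
// Note that no record is ever removed — each one is transitioned.
function compensate(completed: StepRecord[]): StepRecord[] {
  return [...completed].reverse().map(r => {
    const fn = compensations[r.step];
    return fn ? fn(r) : r; // steps with no compensation are left as-is
  });
}
```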
- Saga coordinator: We use Temporal as the saga coordinator for complex workflows. Temporal maintains saga state durably, handles retries with configurable backoff, and provides visibility into which step each instance is executing. The observability alone justified the choice — before Temporal, debugging a stuck workflow meant SSH-ing into servers and reading log files.
- Timeout tuning: Each saga step has a configurable timeout. Insurance eligibility checks have a 30-second timeout with 3 retries. Some payer APIs are slow — we learned to set generous timeouts (up to 2 minutes for certain payers) to avoid unnecessary saga failures from slow external dependencies.
- Dead letter processing: Sagas that exhaust all retries go to a dead letter queue with full context about the failure point, current state of all participants, and compensation actions already executed. An operations dashboard surfaces these for manual resolution.
Claim Lifecycle Saga
The claim lifecycle saga is the most complex workflow — it spans from charge capture through final payment posting. It includes conditional branches (does the claim need prior authorization?), parallel execution paths (submit to primary and secondary insurance simultaneously when applicable), and long-running waits (payer adjudication can take 30-45 days). Temporal's durable execution model handles these well — a saga can sleep for weeks waiting on a payer response and resume exactly where it left off when the response arrives. Without durable execution, we would need to persist workflow state manually and rebuild it on every external event, which is error-prone.
Clinical State Machines
Within each workflow step, clinical entities follow well-defined state machines. A treatment plan moves through a defined set of states: Draft, PendingReview, Approved, Active, InProgress, PendingModification, Completed, and Cancelled. Each transition has guard conditions (a plan cannot move to Active until insurance authorization is confirmed) and triggers side effects (moving to Active creates the first appliance order and notifies the lab).
Why XState
We model state machines using XState because it gives us formal definitions that are both executable and visualizable. The visualization aspect proved more valuable than we expected. During the design phase, we showed state diagrams to clinical staff and they immediately identified three missing transitions: "patient requests hold" (pause without canceling), "lab requests clarification" (bounce an order back for more clinical information), and "insurance downgrades coverage" (requiring a treatment plan modification mid-stream). These were real scenarios that happen regularly and would have been bugs in production if we had not caught them during design review.
- Guard conditions: Each transition evaluates preconditions at runtime. The transition from PendingReview to Approved requires that the reviewing clinician has appropriate credentials for the treatment type and that the patient's insurance verification has not expired during the review period. Guards prevent invalid state transitions that could lead to compliance issues.
- Parallel states: Treatment plans can be in multiple states simultaneously using XState's parallel state feature. A plan might be Active (clinical state), PendingReauthorization (insurance state), and InFabrication (lab state). Each region evolves independently while the composite state drives the overall workflow logic. This models the real world accurately — clinical, insurance, and lab processes do not wait for each other.
- State persistence: Machine instances are serialized to PostgreSQL on every transition, with the full context stored as JSONB. This allows reconstructing the exact state of any workflow at any point in time, which is essential for dispute resolution and compliance investigations.
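To make the guard idea concrete, here is a dependency-free sketch that mirrors the semantics. We use XState in production; this hand-rolled version exists only so the example runs standalone, and the event names and context fields are illustrative:

```typescript
// Minimal guarded state machine, mirroring XState-style guard semantics
// without the library. Context fields and event names are illustrative.
type Context = {
  reviewerCredentialed: boolean;        // clinician may approve this treatment type
  insuranceVerificationValid: boolean;  // verification has not expired mid-review
};

type Transition = {
  from: string;
  event: string;
  to: string;
  guard?: (ctx: Context) => boolean;
};

const transitions: Transition[] = [
  { from: "Draft", event: "SUBMIT_FOR_REVIEW", to: "PendingReview" },
  {
    from: "PendingReview", event: "APPROVE", to: "Approved",
    // The guard described above: credentials plus unexpired verification.
    guard: ctx => ctx.reviewerCredentialed && ctx.insuranceVerificationValid,
  },
  { from: "Approved", event: "ACTIVATE", to: "Active" },
];

// Returns the next state, or the current state if no transition applies
// or its guard rejects — invalid transitions are simply impossible.
function next(state: string, event: string, ctx: Context): string {
  const tr = transitions.find(t => t.from === state && t.event === event);
  return tr && (!tr.guard || tr.guard(ctx)) ? tr.to : state;
}
```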
Audit Logging & Compliance
The most practical benefit of event-driven architecture in healthcare is that audit logging comes as a natural byproduct. Every state change, decision point, and inter-service communication is an immutable event with a timestamp, actor identity, and full before/after state. This audit trail is the primary source of truth, not a secondary system that might fall out of sync.
Compliance Report Generation
We built a compliance reporting service that queries the event store and generates reports tailored to specific regulatory requirements. For HIPAA audits, it shows every PHI access with the user, timestamp, purpose, and data elements viewed or modified. For SOC 2, it demonstrates segregation of duties and change management controls. For state health department reviews, it provides workflow evidence showing that required steps (like informed consent) were completed before treatment. Each of these reports previously required manual assembly from multiple systems — now they are generated from a single event store query.
- Access audit: Every API call touching PHI is logged with the authenticated user, their role, the specific data elements accessed, and the business justification derived from the workflow context. This satisfies HIPAA's minimum necessary requirement by demonstrating that each access was scoped to the data needed for the specific task.
- Change tracking: Because event sourcing stores every state change, we can produce a complete change history for any entity — who changed what, when, and why. This eliminated the need for a separate change data capture system.
- Anomaly detection: A real-time analytics pipeline monitors the event stream for access patterns that deviate from normal behavior — a staff member accessing records outside their assigned location or viewing an unusually high volume of records. Anomalies trigger alerts to the compliance team within 15 minutes.
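The volume-based check in the last bullet reduces to counting PHI accesses per user over a window and flagging outliers. A toy sketch — the real pipeline is streaming, and the threshold and field names here are assumptions:

```typescript
// Toy batch version of the volume-anomaly check. Production runs this
// continuously over the event stream; field names are illustrative.
type Access = { userId: string; recordId: string; location: string };

// Flag any user whose access count within the window exceeds the threshold.
function flagHighVolume(accesses: Access[], threshold: number): string[] {
  const counts = new Map<string, number>();
  for (const a of accesses) counts.set(a.userId, (counts.get(a.userId) ?? 0) + 1);
  return Array.from(counts.entries())
    .filter(([, n]) => n > threshold)
    .map(([userId]) => userId);
}
```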
Results & Operational Impact
After six months in production, here is what we can measure. These are real numbers, not projections.
- Workflow reliability: End-to-end workflow failures dropped from about 12% to under 2%. The remaining failures are almost entirely caused by external system outages (payer APIs, clearinghouse downtime) that the saga compensation logic handles without manual intervention.
- Claim processing speed: Average time from encounter to clean claim submission dropped from 9.2 days to 2.8 days. The improvement came from eliminating manual handoffs and retry logic — the system automatically handles the steps that previously required someone to notice a failure and restart the process.
- Manual intervention: The operations team reduced manual workflow fixes from around 340 hours per month to about 45. Most of what remains is handling edge cases that the system correctly identifies as requiring human judgment.
- Audit preparation: Compliance audit preparation dropped from weeks of work to days. Auditors have commented on the completeness and traceability of the evidence, which is a direct result of the event sourcing approach — every piece of audit evidence is a query, not a manual assembly job.
- Scalability: The platform onboarded 12 new practices during the first six months without architectural changes or performance degradation. Message throughput scaled linearly, and infrastructure cost increase was modest despite significant encounter volume growth.
The architecture is not without trade-offs. Event sourcing adds complexity to the development model — developers need to think in terms of events and projections rather than CRUD operations, which has a learning curve. Schema evolution, while manageable with our backward-compatible strategy, requires more discipline than simply adding a database column. And the eventual consistency of read projections occasionally surprises users who expect immediate read-after-write consistency. These are real costs that we accepted in exchange for the audit, reliability, and decoupling benefits.
Building event-driven systems for healthcare?
We have experience designing event sourcing, saga patterns, and state machine architectures for clinical and administrative healthcare workflows. Happy to compare notes on your approach.
Talk to Our Healthcare Team