
Architecting an NLP Pipeline for Ambient Clinical Documentation

Key Takeaways

  • Clinical speech-to-text is a different problem than general ASR. Background noise, overlapping speakers, and medical jargon require domain-specific models and edge preprocessing to get usable transcripts.
  • A two-stage NLP pipeline that separates transcription from clinical entity extraction makes each component independently testable and lets you swap out the ASR layer without retraining downstream models.
  • FHIR R4 integration with EHR systems like Epic requires careful mapping between free-text clinical concepts and structured vocabulary codes (SNOMED CT, ICD-10, RxNorm), and the edge cases in that mapping dominate development time.
  • Edge processing for voice activity detection and speaker diarization reduces the volume of audio sent to the cloud, which simplifies HIPAA compliance by limiting PHI transmission.
  • Physician trust is the real bottleneck. Showing confidence scores and evidence traces for each generated note section mattered more for adoption than raw accuracy numbers.

The Charting Burden in Modern Healthcare

Documentation consumes a disproportionate share of a physician's day. Studies consistently show that for every hour of direct patient care, physicians spend one to two hours on EHR tasks. The consequences are well-documented: burnout, reduced patient throughput, and after-hours "pajama time" charting that erodes quality of life. This is the problem space ambient clinical documentation systems aim to address.

The core engineering challenge is straightforward to describe and difficult to execute: record the natural conversation between a physician and patient during an encounter, extract the clinically relevant information, structure it into a SOAP note (Subjective, Objective, Assessment, Plan), and push that note into the EHR for physician review. The system needs to do this without requiring the physician to change how they practice medicine, which rules out dictation-style workflows where the doctor speaks in structured sentences.

This post walks through the architecture we built to solve this problem, the trade-offs we made along the way, and the parts that turned out to be harder than we expected.

Architecture Overview: From Speech to Structured Note

The system is a four-layer pipeline. Each layer is independently deployable and has its own accuracy metrics, which lets us isolate regressions quickly when something breaks in production. The layers are: audio capture and edge processing, speech-to-text with medical vocabulary, clinical NLP and note structuring, and SOAP note assembly with EHR write-back.

Layer 1: Audio Capture and Edge Processing

The first layer runs on a compact ARM-based device in each exam room. We chose edge processing over streaming raw audio to the cloud for two reasons. First, latency: performing voice activity detection (VAD) and initial speaker diarization on-device means we only transmit speech segments that contain clinically relevant content. This cut the volume of audio leaving the exam room by roughly 60%. Second, compliance: less PHI in transit means a simpler HIPAA posture. The edge device does not store any audio persistently; it processes and forwards.

The device uses a dual-microphone array with beamforming for speaker isolation. We run a lightweight speaker identification model (around 12MB) that classifies who is speaking in under 8ms. This is necessary because SOAP note generation depends on attribution: when a patient says "chest pain for three days," that maps to the Subjective section, while a physician saying "lungs clear bilaterally" maps to Objective. Getting attribution wrong cascades into incorrect note structure.
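The attribution-to-section dependency can be sketched as follows. This is a minimal illustration, not our production router: the `SpeechSegment` type and `route_section` function are hypothetical names, and real routing uses segment content as well as speaker identity.

```python
from dataclasses import dataclass

# Hypothetical segment type produced by the edge diarization step.
@dataclass
class SpeechSegment:
    speaker: str       # "physician" or "patient"
    text: str
    start_ms: int

def route_section(segment: SpeechSegment) -> str:
    """Map a diarized segment to a likely SOAP section.

    Patient speech tends to carry subjective history, while physician
    speech during the exam carries objective findings. Speaker
    attribution is the first signal; content-based routing refines it.
    """
    if segment.speaker == "patient":
        return "Subjective"
    if segment.speaker == "physician":
        return "Objective"
    return "Unattributed"

seg = SpeechSegment("patient", "chest pain for three days", 1200)
print(route_section(seg))  # Subjective
```

A misclassified speaker flips the section assignment, which is why the 8ms speaker model sits on the critical path.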

Layer 2: Speech-to-Text with Medical Vocabulary

Processed audio segments stream to our ASR service, which runs a fine-tuned Whisper-large-v3 model augmented with a medical vocabulary layer. Off-the-shelf ASR models typically land around 85-88% word accuracy on medical speech. The gap to clinically useful accuracy (we targeted 95%+) comes from medical jargon, drug names that sound alike, and the way physicians speak in shorthand. We fine-tuned on de-identified clinical conversations across multiple specialties, which moved accuracy into the mid-90s on our benchmark set.

The most impactful addition was a contextual language model (CLM) that runs alongside the primary ASR. Before each encounter, the CLM loads the patient's active problem list, current medications, and the physician's specialty from the EHR. This context adjusts token probabilities during decoding. When a cardiologist says something ambiguous between "Metoprolol" and "Metformin," the CLM resolves it based on the patient's medication list and the physician's prescribing patterns. This context-aware boosting was worth several percentage points of accuracy on its own.
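The biasing mechanism resembles shallow fusion: candidate tokens that appear in the patient's context get a log-probability bonus before renormalization. The sketch below is illustrative only; the `boost` value and the word-level granularity are simplifying assumptions (production biasing operates on subword tokens inside the decoder).

```python
import math

def bias_logprobs(logprobs: dict[str, float],
                  context_terms: set[str],
                  boost: float = 2.0) -> dict[str, float]:
    """Shallow-fusion-style contextual biasing (a sketch).

    Adds a log-probability bonus to candidates that appear in the
    patient's context (medication list, problem list, specialty
    vocabulary), then renormalizes so the result is a distribution.
    """
    biased = {w: lp + (boost if w.lower() in context_terms else 0.0)
              for w, lp in logprobs.items()}
    log_total = math.log(sum(math.exp(lp) for lp in biased.values()))
    return {w: lp - log_total for w, lp in biased.items()}

# Ambiguous audio: the acoustic model is torn between two drug names.
candidates = {"Metoprolol": math.log(0.48), "Metformin": math.log(0.52)}
patient_meds = {"metoprolol", "lisinopril"}
biased = bias_logprobs(candidates, patient_meds)
print(max(biased, key=biased.get))  # Metoprolol
```

The patient's medication list tips an otherwise losing hypothesis into the winning one, which is exactly the Metoprolol/Metformin case described above.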

Layer 3: Clinical NLP and Note Structuring

The raw transcript flows into a clinical NLP engine that performs medical named entity recognition (NER), relation extraction, negation detection, and temporal reasoning. Negation detection deserves special mention because it is where clinical NLP diverges most from general NLP. "No chest pain" and "chest pain" have opposite clinical meanings, and the scope of negation in medical language is often ambiguous ("denies chest pain, shortness of breath, or palpitations" negates all three). We trained a dedicated negation model that resolves scope with high accuracy, because errors here produce clinically dangerous notes.

Layer 4: SOAP Note Assembly and Review

The final layer assembles extracted entities into a structured SOAP note using specialty-specific templates. An orthopedic encounter produces different note structures than a psychiatric evaluation. We built templates for each supported specialty in collaboration with practicing physicians. Each template encodes the required documentation elements for that specialty's billing and clinical standards. The assembly layer uses constrained generation rather than free-form LLM output, which largely eliminated the hallucination problems we saw in early prototypes.

Building the Ambient Listening Engine

Ambient documentation is a fundamentally different problem than dictation. In dictation, physicians speak directly into a microphone using structured language. In ambient mode, the system captures natural conversation. Physicians and patients interrupt each other, use colloquial language, discuss non-medical topics (weekend plans, family updates), and reference documents or prior visits that are not present in the audio stream. The system needs to separate signal from noise in real time.

We built a conversation segmentation model that classifies speech segments into categories: clinically relevant dialogue, social conversation, administrative discussion, and physician thinking aloud. Only clinically relevant segments proceed to the full NLP pipeline. This reduces processing load and, more importantly, improves note quality by excluding content that would confuse the entity extraction stage.
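The filtering stage looks roughly like the sketch below. The classifier interface and the confidence policy are illustrative assumptions; the one real design decision it encodes is that low-confidence segments are kept rather than dropped, because a missed clinical statement costs more than extra downstream processing.

```python
from typing import Callable, Iterable, Iterator

CATEGORIES = ("clinical", "social", "administrative", "thinking_aloud")

def keep_clinical(segments: Iterable[dict],
                  classify: Callable[[str], tuple[str, float]],
                  min_conf: float = 0.7) -> Iterator[dict]:
    """Filter segments before the full NLP pipeline.

    Drops a segment only when the classifier is confidently sure it
    is non-clinical; uncertain segments pass through for safety.
    """
    for seg in segments:
        label, conf = classify(seg["text"])
        if label == "clinical" or conf < min_conf:
            yield seg

# Stub classifier for illustration only.
def stub_classify(text: str) -> tuple[str, float]:
    return ("social", 0.95) if "weekend" in text else ("clinical", 0.9)

segs = [{"text": "How was your weekend?"},
        {"text": "The cough started five days ago."}]
print([s["text"] for s in keep_clinical(segs, stub_classify)])
# ['The cough started five days ago.']
```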

  • Noise suppression: Custom model trained on exam room audio handling HVAC systems, equipment beeps, paper rustling, and hallway noise. Exam rooms are acoustically hostile environments compared to typical ASR training conditions.
  • Speaker diarization: Real-time speaker separation with reasonable accuracy in two-speaker scenarios. Multi-speaker scenarios (attending, resident, patient, family member) are harder and accuracy drops noticeably. This remains an active area of improvement.
  • Overlapping speech: When two people talk at once, we use a multi-channel separation model to recover content. Recovery rates are decent for short overlaps but degrade for sustained crosstalk.
  • Context persistence: Session-level memory maintains clinical context across conversation gaps. A lab value mentioned at minute 12 needs to be associated with a diagnosis discussed at minute 3.
  • Consent management: Automatic recording pause when the system detects sensitive non-clinical discussions, triggered by keyword detection and conversation pattern analysis.

One challenge we did not anticipate was code-switching. In practices with bilingual patient populations, physicians frequently switch between English and Spanish mid-sentence. Our initial English-only model produced garbled output for these encounters. We addressed this by training a bilingual variant that handles mixed-language clinical conversations. Accuracy on code-switched speech is a few points below the English-only benchmark, but it is usable where the English-only model was not.

NLP Pipeline for SOAP Note Generation

Converting a free-form clinical conversation into a structured SOAP note is more than extraction. The pipeline must understand medical reasoning, map colloquial descriptions to standardized vocabulary, and produce documentation that meets both clinical and billing requirements.

Medical Entity Recognition and Linking

Our NER model identifies entity types including symptoms, diagnoses, medications, dosages, procedures, anatomical locations, and lab values. Beyond extraction, we implemented entity linking to SNOMED CT, ICD-10-CM, and RxNorm vocabularies. When a physician says "the patient's sugar has been running high," the system maps this to hyperglycemia in SNOMED, links to the appropriate ICD-10 code, and checks the patient's problem list for existing diabetes diagnoses that might alter the coding. This linking step is where much of the development effort lives, because clinical language is imprecise and the mapping to standardized codes is many-to-many.
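A stripped-down version of the linking logic illustrates why the problem-list check matters. The lookup tables are toy stand-ins for UMLS-backed dictionaries plus a learned ranker; the codes shown are, to the best of our knowledge, the standard SNOMED CT and ICD-10-CM codes for hyperglycemia, but verify against current code sets before relying on them.

```python
# Toy tables; real linking uses vocabulary services, not literals.
COLLOQUIAL_TO_CONCEPT = {
    "sugar has been running high": "hyperglycemia",
}
CONCEPT_CODES = {
    "hyperglycemia": {"snomed": "80394007", "icd10cm": "R73.9"},
}

def link_entity(phrase: str, problem_list: set[str]) -> dict:
    """Map a colloquial phrase to vocabulary codes, adjusting for
    existing diagnoses on the patient's problem list."""
    concept = COLLOQUIAL_TO_CONCEPT.get(phrase.lower())
    if concept is None:
        return {"status": "unlinked", "phrase": phrase}
    codes = dict(CONCEPT_CODES[concept])
    # Hyperglycemia in a known type 2 diabetic is usually coded to
    # the more specific diabetes code, not the standalone finding.
    if concept == "hyperglycemia" and "diabetes mellitus type 2" in problem_list:
        codes["icd10cm"] = "E11.65"  # T2DM with hyperglycemia
    return {"status": "linked", "concept": concept, **codes}

print(link_entity("sugar has been running high", {"diabetes mellitus type 2"}))
```

The branch at the end is the many-to-many mapping in miniature: the same utterance yields different codes depending on what is already on the chart.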

Clinical Reasoning Extraction

Physicians rarely state their reasoning explicitly during encounters. They observe, assess, and form plans through a combination of spoken observations and unstated clinical knowledge. We built a model that infers clinical reasoning from the sequence of actions: when a physician asks about chest pain, then auscultates the heart, then orders a troponin level, the model infers an acute coronary syndrome workup. These inferences are presented in the Assessment section as suggestions flagged for physician review, not as assertions. In validation, physicians accepted about three-quarters of inferred reasoning without modification, which suggests the model captures real reasoning patterns, though the remaining quarter is a reminder that clinical inference is genuinely hard.
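As a simplified illustration of the idea, a pattern over observed actions can trigger a flagged suggestion. The pattern table and action vocabulary here are hypothetical; the production model is learned, not rule-based, but the input/output shape is the same: action sequence in, review-flagged suggestion out.

```python
# Illustrative rule: a set of actions matching a known workup pattern
# triggers a suggested assessment, always flagged for physician review.
WORKUP_PATTERNS = {
    frozenset({"ask:chest pain", "exam:cardiac auscultation", "order:troponin"}):
        "acute coronary syndrome workup",
}

def infer_workups(actions: list[str]) -> list[dict]:
    """Return review-flagged workup suggestions whose full action
    pattern appears in the encounter."""
    observed = set(actions)
    return [{"suggestion": label,
             "evidence": sorted(pattern),
             "needs_review": True}
            for pattern, label in WORKUP_PATTERNS.items()
            if pattern <= observed]

actions = ["ask:chest pain", "exam:cardiac auscultation", "order:troponin"]
print(infer_workups(actions)[0]["suggestion"])  # acute coronary syndrome workup
```

Attaching the evidence list to each suggestion is what lets the review UI show the physician *why* the inference was made, which ties directly into the trust lesson discussed later.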

Structured Output Generation

The SOAP note assembly uses constrained generation. Each section is generated within guardrails defined by the specialty template: the Subjective section must include chief complaint, history of present illness, and review of systems; the Objective section must include vital signs, physical exam findings, and relevant test results. This constraint-based approach was our response to hallucination problems in early prototypes that used unconstrained generation. Constraining the output space reduced clinically inaccurate statements significantly, though it occasionally produces notes that feel formulaic. That trade-off was acceptable given the clinical stakes.
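The guardrail half of constrained generation is a validation pass like the one below. The template and note shapes are simplified assumptions; the key behavior is that a missing required element is surfaced to the physician rather than filled in by the model.

```python
# Hypothetical specialty template: required elements per section.
TEMPLATE = {
    "Subjective": ["chief_complaint", "hpi", "ros"],
    "Objective": ["vital_signs", "physical_exam", "test_results"],
}

def validate_note(note: dict, template: dict = TEMPLATE) -> list[str]:
    """Return dotted paths of required elements missing from the note.

    Slots are filled only from extracted entities; anything still
    empty is flagged for the physician instead of being generated.
    """
    missing = []
    for section, required in template.items():
        present = note.get(section, {})
        missing += [f"{section}.{e}" for e in required if not present.get(e)]
    return missing

note = {"Subjective": {"chief_complaint": "chest pain", "hpi": "3 days", "ros": "+"},
        "Objective": {"vital_signs": "BP 128/82", "physical_exam": "lungs clear"}}
print(validate_note(note))  # ['Objective.test_results']
```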

  • Subjective section: High agreement with physician-written notes on chief complaint capture, somewhat lower on HPI completeness where physicians often include contextual information the system cannot observe.
  • Objective section: Strong accuracy on vital signs transcription and physical exam findings, since these tend to be stated explicitly during the encounter.
  • Assessment: The weakest section, because assessment requires clinical reasoning that is often not spoken aloud. This is where physician review adds the most value.
  • Plan: Good capture rate for medication changes, referrals, and follow-up instructions, since these are typically stated clearly to the patient.

Deep EHR Integration Strategy

A documentation system that lives outside the physician's existing workflow will not get used. Our EHR integration was built on a principle: the system should feel like a feature of Epic, not a separate application. We achieved this through bidirectional HL7 FHIR R4 integration that reads patient context before the encounter and writes structured notes back afterward.

Reading Context from the EHR

Before each encounter, the system pulls the patient's active problem list, medication list, allergy list, recent lab results, and prior notes via the FHIR API. This context serves two purposes. It primes the NLP pipeline with patient-specific terminology, improving recognition accuracy for that patient's medications and conditions. And it allows the SOAP note generator to reference prior findings for continuity-of-care documentation. The FHIR read calls are fast (a few hundred milliseconds) and reliable in our experience, though the completeness of data returned varies by EHR configuration.

We implemented a SMART on FHIR application that launches within Epic's workflow. Physicians review, edit, and approve notes within the EHR interface. The review step is important: this is not a fully autonomous system. Every note is physician-reviewed before it becomes part of the medical record. Some notes get approved without edits, many get minor modifications, and a smaller percentage get substantially rewritten. All three outcomes feed data back into the model for continuous improvement.

Writing Notes Back to the EHR

Approved notes are written back using the DocumentReference FHIR resource, with structured data elements mapped to discrete EHR fields. Diagnoses link to the problem list, medications to the medication list, and orders to the order entry system. This structured write-back is the part that took the most iteration to get right. FHIR resources are well-specified, but the way individual EHR instances accept and store them varies. We spent considerable time handling edge cases in the mapping layer.
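A minimal write-back payload looks like this. The resource shape follows the FHIR R4 `DocumentReference` spec (note text travels base64-encoded in `content.attachment.data`); the LOINC code 34133-9 ("Summarization of episode note") is one common choice, but as noted above, the right `type` and accepted fields vary per EHR instance.

```python
import base64
from datetime import datetime, timezone

def build_document_reference(patient_id: str, encounter_id: str,
                             note_text: str) -> dict:
    """Assemble a minimal FHIR R4 DocumentReference for note write-back."""
    return {
        "resourceType": "DocumentReference",
        "status": "current",
        "type": {"coding": [{"system": "http://loinc.org",
                             "code": "34133-9"}]},
        "subject": {"reference": f"Patient/{patient_id}"},
        "context": {"encounter": [{"reference": f"Encounter/{encounter_id}"}]},
        "date": datetime.now(timezone.utc).isoformat(),
        "content": [{"attachment": {
            "contentType": "text/plain",
            # FHIR Attachment.data is base64-encoded per the spec.
            "data": base64.b64encode(note_text.encode()).decode()}}],
    }

doc = build_document_reference("123", "456", "S: chest pain x3 days ...")
print(doc["subject"]["reference"])  # Patient/123
```

Discrete elements (Condition, MedicationRequest, ServiceRequest) are posted alongside this resource; the mapping edge cases mentioned above live in how each EHR instance reconciles those discrete writes.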

  • FHIR resources used: Patient, Encounter, Condition, MedicationRequest, DocumentReference, DiagnosticReport, Observation, and ServiceRequest.
  • API latency: A few hundred milliseconds for context reads and note writes, well within what physicians tolerate during their workflow.
  • Fallback strategy: If the FHIR API is unavailable, notes queue locally and sync when connectivity resumes. We have not lost a note in production, though the queue has been exercised during EHR maintenance windows.

Performance Results and Clinical Validation

We rolled the system out in phases, starting with a pilot group before expanding. The metrics we track are charting time, clinical accuracy, and physician edit rates. Here is what we observed.

Documentation Time

Average time from encounter end to note approval dropped substantially. The system generates a draft note in near real-time as the conversation happens, so when the encounter ends, the physician is reviewing a mostly-complete note rather than starting from a blank screen. The reduction in after-hours documentation was the metric physicians cared about most. Less "pajama time" charting was the most commonly cited benefit in feedback.

Clinical Accuracy

Board-certified physicians performed an independent clinical review on a sample of AI-generated notes. The clinically significant error rate was low, and comparable to error rates in manually written notes. The AI system also flagged potential documentation issues during generation, such as undocumented medication allergies or dosage inconsistencies in the plan section. These flags turned out to be one of the most valued features because they function as a safety check that manual documentation does not provide.

Practical Impact

  • Coding completeness: More thorough documentation supported more specific coding, which improved reimbursement per encounter by a moderate amount. This was a secondary benefit, not a design goal.
  • Physician retention: Documentation burden is consistently cited as a burnout driver. Reducing it had a noticeable effect on satisfaction scores, though we are cautious about attributing retention changes to a single intervention.
  • Note quality: Notes became more standardized and complete. The templated structure enforces documentation of required elements that physicians sometimes skip when typing free-form.

The metric we watch most closely is the edit rate: what percentage of generated notes do physicians modify before approval? A declining edit rate over time indicates the model is learning from corrections. A rising edit rate would signal a regression. So far the trend has been in the right direction, but it requires ongoing monitoring.
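The edit-rate monitor itself is simple; the sketch below shows the idea, with an illustrative `tolerance` threshold standing in for whatever regression sensitivity a given deployment needs.

```python
def edit_rate(window: list[dict]) -> float:
    """Fraction of notes modified before approval in a review window."""
    if not window:
        return 0.0
    return sum(1 for n in window if n["edits"] > 0) / len(window)

def trend_alert(weekly_rates: list[float], tolerance: float = 0.05) -> bool:
    """Flag a possible regression when the latest weekly edit rate
    exceeds the trailing average by more than `tolerance`."""
    if len(weekly_rates) < 2:
        return False
    baseline = sum(weekly_rates[:-1]) / (len(weekly_rates) - 1)
    return weekly_rates[-1] > baseline + tolerance

# Four weeks of improvement, then an uptick worth investigating.
rates = [0.42, 0.38, 0.35, 0.33, 0.45]
print(trend_alert(rates))  # True
```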

Lessons Learned and What We Would Do Differently

Physician trust is earned through transparency, not accuracy numbers. Our initial prototype presented AI-generated notes as finished products. Physicians did not trust them. When we redesigned the interface to show confidence scores for each section and highlight AI-inferred content in a different color, adoption improved dramatically. Showing your work matters more than being right.

Specialty-specific training data is non-negotiable. Our early model was trained on general medical transcription data. It performed poorly on dermatology and psychiatry encounters where vocabulary and documentation patterns differ from internal medicine. We invested time in collecting and annotating specialty-specific data, and the accuracy improvements were significant. If you are building a system like this, budget for specialty-specific data collection from the start.

  • Start with physician champions: Identify early-adopter physicians who are willing to provide feedback. Their peer advocacy drives adoption more effectively than mandates from administration.
  • Build the feedback loop first: The ability for physicians to correct AI output creates a continuous improvement dataset. This correction data is more valuable than any pre-training dataset because it captures the specific documentation preferences of each practice.
  • Plan for edge cases early: Pediatric encounters with parents speaking for children, patients with speech impediments, and group sessions all require specialized handling that is easy to underestimate in initial scoping.
  • Invest in monitoring: A real-time accuracy dashboard that flags notes where AI confidence drops below thresholds enables proactive quality assurance. You need to know when the model is struggling before physicians report it.

If we were starting this project today, we would invest more in synthetic data generation for rare specialties and edge cases. We would also implement a more granular consent framework from day one, allowing patients to opt out of ambient recording while still enabling physician-initiated dictation as a fallback. Both are refinements we have since incorporated into subsequent builds.
