Key Takeaways
- Insurance EOBs arrive in dozens of inconsistent formats — PDFs, 835 EDI files, scanned images — and payers routinely reorder line items, bundle codes differently, and apply rates inconsistently. A reliable reconciliation engine needs fuzzy matching that accounts for all of these variations, not just exact primary-key joins.
- We built a discrepancy scoring system that weighs dollar impact, appeal success likelihood (trained on historical outcomes), and filing deadline proximity. The scoring model is what makes the system practical — without prioritization, the volume of small discrepancies would overwhelm any review team.
- The hardest engineering problem was not the NLP or the matching algorithm. It was modeling payer contracts accurately — contracts have temporal validity windows, rate escalation clauses, volume tiers, and carve-outs for specific code ranges, all of which affect what the "correct" reimbursement should be.
The Underpayment Problem in Healthcare Revenue Cycle
Healthcare providers submit claims and receive Explanation of Benefits (EOB) documents back from payers. In theory, the EOB reflects the contracted rate for each service. In practice, discrepancies are common — modifiers get down-coded, line items get bundled incorrectly, or the contracted rate simply is not applied. Industry estimates put underpayment rates at 1-3% of net revenue, which for mid-size practice groups can mean six figures annually going unrecovered.
The standard approach is manual: billing staff compare claims against EOBs line by line, cross-referencing fee schedules stored in spreadsheets. A skilled analyst reviews maybe 40-60 EOBs per day. Most organizations only audit a sample — typically 10-15% of remittances — which means the majority of discrepancies are never even looked at.
We set out to build a system that could reconcile every remittance against the original claim and contracted rates, automatically and at scale. This post covers the architecture decisions, the interesting technical problems we ran into, and what we learned about building document comparison systems for messy real-world healthcare data.
Why Existing Tools Fall Short
Most contract management platforms assume structured data inputs. The reality is messier: EOBs arrive as PDFs with wildly different layouts, 835 electronic remittance files have inconsistent formatting across payers, and contract terms live in amendment letters that nobody has digitized. We needed a system that could handle unstructured and semi-structured documents natively, extract financial data reliably, and perform comparison at the line-item level — even when the two sides of the comparison do not agree on how to represent the same service.
Architecture Overview
The system is a four-stage pipeline: document ingestion, NLP-based parsing, fuzzy matching and reconciliation, and discrepancy scoring with appeal generation. Each stage runs as an independent service communicating through SQS queues. We chose this over a monolithic design primarily for independent scaling — the NLP parsing stage is CPU-intensive and needs different instance types than the matching engine, which is memory-bound due to the contract rate lookup cache.
- Document Ingestion Service: Accepts EOBs (PDF, 835/EDI, and scanned images) via SFTP, email parsing, and clearinghouse API integrations. Files are normalized into a common internal format and stored in S3 with SSE-KMS encryption.
- NLP Parsing Engine: A Python service using spaCy and a fine-tuned RoBERTa model that extracts structured claim data — CPT codes, modifiers, billed amounts, allowed amounts, and adjustment reason codes — from unstructured documents.
- Reconciliation Engine: A Node.js/TypeScript service that performs line-item matching between parsed EOB data and original claim submissions pulled from the practice management system via FHIR APIs.
- Scoring & Appeal Service: A rules engine combined with a gradient-boosted classifier that scores discrepancies by dollar impact, appeal success probability, and filing deadline proximity. High-confidence discrepancies feed into an appeal letter generation pipeline.
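The queue-decoupled handoff between the four stages can be sketched in miniature. Here `queue.Queue` stands in for SQS, and the stage functions and payload fields are purely illustrative, not the real service contracts:

```python
import queue

# In production each stage is a separate ECS service polling its own SQS
# queue; queue.Queue stands in here to show the handoff shape. Stage
# names and payload fields are illustrative.

def ingest(raw_doc: bytes) -> dict:
    return {"stage": "ingested", "body": raw_doc.decode()}

def parse(doc: dict) -> dict:
    # Stand-in for the NLP parsing stage: split the body into "line items".
    return {**doc, "stage": "parsed", "lines": doc["body"].split("|")}

def reconcile(doc: dict) -> dict:
    return {**doc, "stage": "reconciled", "matches": len(doc["lines"])}

def run_pipeline(raw_docs):
    q1, q2, out = queue.Queue(), queue.Queue(), []
    for raw in raw_docs:
        q1.put(ingest(raw))
    while not q1.empty():
        q2.put(parse(q1.get()))
    while not q2.empty():
        out.append(reconcile(q2.get()))
    return out

results = run_pipeline([b"99213|99214"])
```

The point of the shape is that each stage only sees its inbound queue, so any one of them can be rescaled or redeployed without touching the others.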
Technology Stack Decisions
Python was the natural choice for the NLP pipeline — spaCy, Hugging Face, and scikit-learn are all mature and well-supported for this kind of work. We went with Node.js/TypeScript for the reconciliation engine because the financial calculations benefit from strict type safety, and the team was already comfortable with TypeScript. Infrastructure runs on AWS with ECS Fargate for containers, RDS PostgreSQL for persistence, and ElastiCache Redis for caching contract rate lookups. The rate cache matters because a single EOB reconciliation might need to look up 15-20 contracted rates, and those lookups need to be fast.
All services sit behind API Gateway with mutual TLS. PHI is encrypted at rest with KMS-managed keys and in transit via TLS 1.3. We added field-level encryption for patient identifiers so that internal services see tokenized references unless they have explicit decryption permissions — this keeps the blast radius small if any single service is compromised.
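The tokenization idea can be sketched with a deterministic HMAC token. This is a simplified stand-in: in the real system the key lives in KMS and detokenization requires an explicit decrypt permission, none of which is shown here.

```python
import hmac
import hashlib

# Hypothetical sketch: deterministic tokenization of a patient identifier
# with HMAC-SHA256, so internal services can join on a stable token
# without ever seeing the raw value. The key below is a placeholder; in
# production it is KMS-managed and never appears in code.

TOKEN_KEY = b"demo-key-not-for-production"

def tokenize(patient_id: str) -> str:
    digest = hmac.new(TOKEN_KEY, patient_id.encode(), hashlib.sha256)
    return "pt_" + digest.hexdigest()[:16]

t1 = tokenize("MRN-000123")
t2 = tokenize("MRN-000123")  # same input, same token, so joins still work
```

Deterministic tokens preserve joinability across services; the trade-off is that they are linkable, which is acceptable here because the token space is internal-only.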
NLP Document Parsing Pipeline
The parsing pipeline was the most technically interesting component. We encountered over 340 distinct EOB layouts across 47 payers. No two payers format their EOBs the same way, and several payers changed their layouts during the project without any notice. The system needed to be robust to layout variation without requiring manual template creation for every payer.
OCR and Pre-Processing
For scanned PDFs and images, we use Amazon Textract as the primary OCR engine with Tesseract as a fallback when Textract confidence scores drop below 85%. The pre-processing pipeline applies deskewing, contrast normalization, and noise reduction before OCR. For native PDFs, we skip OCR entirely and use pdfplumber for direct text extraction — faster and more accurate when the text layer is available.
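The fallback decision itself is simple enough to sketch. The OCR results are stubbed here; in production they come from Textract and Tesseract, and the 0.85 cutoff matches the threshold described above.

```python
# Sketch of the OCR fallback decision. Inputs are stubbed dicts rather
# than real Textract/Tesseract responses; only the decision logic is real.

TEXTRACT_MIN_CONFIDENCE = 0.85

def choose_ocr_text(textract_result: dict, tesseract_text: str):
    """Return (engine_name, text), falling back when Textract is unsure."""
    if textract_result["mean_confidence"] >= TEXTRACT_MIN_CONFIDENCE:
        return "textract", textract_result["text"]
    return "tesseract", tesseract_text

engine, text = choose_ocr_text(
    {"text": "CPT 99213  $110.00", "mean_confidence": 0.62},
    "CPT 99213 $110.00",
)
```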
After text extraction, a layout analysis module identifies table structures, header-detail relationships, and page continuation patterns. This step is important because a single EOB might span 12 pages with line items split across page breaks. Adjustment codes on page 7 need to be correctly associated with the CPT code from page 5. Getting this wrong means misattributing adjustments, which produces false positive discrepancies downstream.
Fine-Tuned NER for Financial Data Extraction
We fine-tuned a RoBERTa-base model on approximately 8,400 labeled EOB documents to recognize healthcare financial entities: CPT/HCPCS codes, ICD-10 codes, modifier codes, billed amounts, allowed amounts, copay amounts, claim adjustment reason codes (CARCs), and remittance advice remark codes (RARCs). The fine-tuned model reached an F1 of 0.923 on the held-out test set, compared to 0.671 from the base spaCy medical NER model. That gap is substantial — at 0.671, too many extracted values are wrong, and the downstream matching engine generates noise instead of signal.
- Entity validation: Extracted codes are validated against CMS reference databases. A CPT code is verified against the current codebook, and adjustment codes are mapped to standard CARC/RARC descriptions. This catches OCR errors that produce plausible but incorrect codes — "99213" versus "99214" is a common OCR confusion that changes the expected reimbursement.
- Confidence thresholds: Each extracted entity carries a confidence score. Entities below 0.80 are routed to human review rather than auto-processed. This trade-off between automation rate and accuracy was one we tuned over several weeks — lower thresholds mean more automation but more false positives; higher thresholds mean more human review but cleaner data.
- Retraining loop: Corrections from human reviewers feed back into the training pipeline weekly, with model retraining monthly. Over six months, average extraction confidence improved from 0.89 to 0.94 as the model saw more payer-specific variations.
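The confidence gate from the second bullet reduces to a small routing function. Entity structure here is illustrative, but the 0.80 threshold is the one described above:

```python
# Minimal sketch of the confidence gate: entities at or above the
# threshold flow through automatically, the rest go to human review.
# The entity dict shape is illustrative, not the production schema.

AUTO_THRESHOLD = 0.80

def route_entities(entities):
    auto, review = [], []
    for ent in entities:
        (auto if ent["confidence"] >= AUTO_THRESHOLD else review).append(ent)
    return auto, review

auto, review = route_entities([
    {"label": "CPT", "value": "99214", "confidence": 0.97},
    {"label": "ALLOWED_AMT", "value": "82.40", "confidence": 0.71},
])
```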
Fuzzy Matching & Line-Item Reconciliation
With structured data from both the original claim and the EOB, the reconciliation engine matches line items across the two documents. This sounds straightforward, but payers routinely reorder line items, split bundled codes, combine separate services, and use different date formatting. In our dataset, a direct primary-key join fails for roughly 23% of line items. That is too many to ignore and too many to review manually.
Multi-Factor Matching Algorithm
We built a multi-factor matching algorithm that scores potential line-item pairs across five dimensions: CPT code similarity (exact match, code family proximity, or known bundling relationship), date of service alignment (exact, adjacent day, or within the same encounter window), billed amount proximity (within 5% tolerance for fee schedule updates), modifier compatibility (accounting for known substitution patterns), and provider NPI matching.
Each factor produces a normalized 0-1 score, and factors are weighted per payer based on learned behavior. For example, one major payer frequently reorders line items but preserves exact CPT codes, so CPT match weight is high but sequence position weight is near zero for their remittances. These payer-specific weight profiles are learned automatically from historical matched pairs — we seed the weights with uniform values and let the system adapt as it processes more data from each payer.
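A stripped-down version of the composite score looks like this. Only three of the five factors are shown, and the factor implementations are deliberately simplified stand-ins for the real similarity functions:

```python
# Sketch of the composite match score: each factor returns a 0-1 score
# and per-payer weights combine them. Two of the five factors (modifier
# compatibility, NPI match) are omitted for brevity, and these factor
# functions are crude proxies for the real ones.

def cpt_score(a: str, b: str) -> float:
    if a == b:
        return 1.0
    return 0.6 if a[:3] == b[:3] else 0.0  # shared prefix as a crude code-family proxy

def amount_score(a: float, b: float, tolerance: float = 0.05) -> float:
    if max(a, b) == 0:
        return 1.0
    return 1.0 if abs(a - b) / max(a, b) <= tolerance else 0.0

def composite_score(claim_line, eob_line, weights):
    factors = {
        "cpt": cpt_score(claim_line["cpt"], eob_line["cpt"]),
        "amount": amount_score(claim_line["billed"], eob_line["billed"]),
        "dos": 1.0 if claim_line["dos"] == eob_line["dos"] else 0.0,
    }
    total_weight = sum(weights.values())
    return sum(weights[k] * factors[k] for k in factors) / total_weight

score = composite_score(
    {"cpt": "99214", "billed": 110.00, "dos": "2024-03-02"},
    {"cpt": "99214", "billed": 110.00, "dos": "2024-03-02"},
    {"cpt": 0.5, "amount": 0.3, "dos": 0.2},  # illustrative per-payer weights
)
```

Because the weights are an argument rather than constants, swapping in a learned per-payer profile is just a dictionary lookup at reconciliation time.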
- Match thresholds: A composite score above 0.92 is a confirmed match. Scores between 0.75 and 0.92 enter a probabilistic matching queue where the system considers multiple candidate pairs and selects the combination that maximizes global match coverage (this is essentially a weighted bipartite matching problem). Below 0.75, items are flagged as potential denials, new charges, or bundling events.
- Unmatched items: Line items that do not match anything are categorized by a separate classifier into actionable buckets — full denial, partial payment to wrong line, bundling event, or data entry error. This categorization drives different downstream workflows.
- Throughput: The matching engine processes roughly 2,400 line items per second on a single ECS task with 4 vCPUs and 8GB RAM. A typical day of 3,200 EOBs averaging 8 line items each completes reconciliation in under 15 seconds.
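The probabilistic queue's global-coverage selection can be illustrated with a tiny assignment problem. Brute force over permutations is fine for the handful of ambiguous lines per EOB; a real implementation would use the Hungarian algorithm for larger sets. The scores below are made up:

```python
from itertools import permutations

# Resolve the ambiguous 0.75-0.92 band as an assignment problem: pick
# the pairing of claim lines to EOB lines that maximizes the total
# composite score. score_matrix[i][j] is the composite score of claim
# line i against EOB line j (illustrative values).

def best_assignment(score_matrix):
    n = len(score_matrix)
    best_total, best_pairs = -1.0, []
    for perm in permutations(range(n)):
        total = sum(score_matrix[i][perm[i]] for i in range(n))
        if total > best_total:
            best_total, best_pairs = total, [(i, perm[i]) for i in range(n)]
    return best_total, best_pairs

# Greedy would grab (0, 0) at 0.90 and be stuck with (1, 1) at 0.60;
# the global optimum crosses the pairs instead.
scores = [
    [0.90, 0.89],
    [0.88, 0.60],
]
total, pairs = best_assignment(scores)
```

This is exactly why the text calls it a weighted bipartite matching problem: locally best pairs are not always globally best.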
Contract Rate Verification
After matching line items, the system pulls the applicable contracted rate from the contract database. This turned out to be harder than the fuzzy matching itself. Contracts have effective date ranges, rate escalation clauses, carve-out provisions for specific code ranges, and tiered rates based on volume thresholds. We modeled contracts as versioned documents with temporal validity windows and built a rate lookup service that accepts a CPT code, date of service, payer ID, and provider TIN, returning the applicable rate along with a confidence indicator reflecting whether any contract ambiguity exists (overlapping amendments, missing effective dates, etc.).
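The core of the temporal lookup can be sketched as a window scan with an ambiguity flag. Real contracts layer escalation clauses, volume tiers, and carve-outs on top of this, and the rate table below is entirely illustrative:

```python
from datetime import date

# Simplified sketch of the temporal rate lookup: each entry carries an
# effective-date window, and the lookup flags ambiguity when zero or
# multiple windows cover the date of service. All values are invented.

RATE_TABLE = [
    # (payer, cpt, effective_from, effective_to, rate)
    ("payer-A", "99214", date(2023, 1, 1), date(2023, 12, 31), 104.00),
    ("payer-A", "99214", date(2024, 1, 1), date(2024, 12, 31), 109.20),
]

def lookup_rate(payer: str, cpt: str, dos: date) -> dict:
    hits = [r for p, c, start, end, r in RATE_TABLE
            if p == payer and c == cpt and start <= dos <= end]
    if len(hits) == 1:
        return {"rate": hits[0], "ambiguous": False}
    # Overlapping amendments or no covering window: surface the ambiguity
    # instead of silently picking a rate.
    return {"rate": hits[0] if hits else None, "ambiguous": True}

result = lookup_rate("payer-A", "99214", date(2024, 3, 2))
```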
Discrepancy Scoring Engine
Not every discrepancy is worth pursuing. A $0.50 difference on a lab draw does not justify the staff time to file an appeal, but a $340 shortfall on a surgical procedure does. Without scoring and prioritization, the system would generate thousands of discrepancies daily and the review team would be worse off than before. The scoring engine prioritizes discrepancies using a composite score balancing dollar impact, appeal success probability, and time sensitivity.
Score Components
- Dollar impact: The absolute difference between expected and actual payment. We apply a logarithmic scale so a $500 discrepancy scores meaningfully higher than a $50 one, but a $5,000 discrepancy does not completely dominate the queue.
- Appeal success probability: A gradient-boosted classifier trained on historical appeal outcomes. Features include payer ID, adjustment reason code, discrepancy type, dollar amount, and days since date of service. The model achieves 0.81 AUC on held-out data — not outstanding, but useful enough for prioritization. The main limitation is that appeal outcomes depend heavily on the quality of supporting documentation, which is hard to quantify as a feature.
- Filing deadline proximity: Payer contracts specify appeal deadlines ranging from 60 to 365 days. Discrepancies approaching their deadline receive an exponentially increasing urgency multiplier. Missing a deadline means zero recovery regardless of how valid the discrepancy is.
- Pattern detection: The engine identifies systematic underpayment patterns — for example, a specific payer consistently under-reimbursing a particular CPT code. Pattern detection escalates the priority of all related discrepancies and flags them for contract review, since the root cause may be a contract interpretation disagreement rather than individual processing errors.
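The first three components combine into a composite priority score along these lines. The constants and functional forms are illustrative, not the production calibration:

```python
import math

# Sketch of the composite priority score: log-scaled dollar impact,
# model-predicted appeal success probability, and an urgency multiplier
# that ramps up as the filing deadline approaches. All constants are
# illustrative.

def priority_score(dollar_gap: float, success_prob: float,
                   days_to_deadline: int) -> float:
    impact = math.log1p(abs(dollar_gap))  # $500 >> $50, but $5,000 is not 10x $500
    # Urgency stays at 1.0 until the last 30 days, then grows exponentially.
    urgency = math.exp(max(0, 30 - days_to_deadline) / 10)
    return impact * success_prob * urgency

routine = priority_score(340.00, 0.7, days_to_deadline=120)
urgent = priority_score(340.00, 0.7, days_to_deadline=5)
small = priority_score(0.50, 0.7, days_to_deadline=120)
```

The same $340 discrepancy jumps the queue as its deadline closes in, while the $0.50 lab-draw difference stays near the bottom regardless.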
Threshold Calibration
We calibrated scoring thresholds through a four-week parallel run where the engine scored all discrepancies while the billing team continued their manual process. Comparing the engine's output against the team's actual decisions showed that the engine caught most of what the team found, plus additional discrepancies in complex bundling scenarios and modifier down-coding situations that require cross-referencing multiple CMS guidelines. The parallel run was essential — it gave the billing team confidence in the system before we asked them to change their workflow.
Automated Appeal Generation
For discrepancies above the action threshold, the system generates appeal packets. Each packet includes a cover letter citing the specific contract provision, a side-by-side comparison of the submitted claim and the remittance, the applicable fee schedule excerpt, and supporting clinical documentation pulled from the EHR via FHIR API.
Template Engine Architecture
Appeal letters use a template engine that combines payer-specific formatting requirements with dynamic content blocks. We maintain templates covering the most common appeal scenarios: rate discrepancy, modifier down-coding, bundling dispute, medical necessity, and timely filing. Each template is parameterized with claim details, contracted rates, and regulatory citations relevant to the discrepancy type.
For discrepancy types that do not fit existing templates, we use GPT-4 to draft the appeal narrative, constrained by a structured prompt that ensures all required elements (contract reference, regulatory citation, requested action) are present. These drafts go through human review before submission. The reviewer typically spends a few minutes per letter versus the 25-35 minutes required to draft from scratch — the LLM handles the boilerplate, and the reviewer focuses on accuracy and strategy.
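The required-elements constraint can be enforced with a structural check before the draft ever reaches a reviewer. The marker strings here are hypothetical; the real check operates on a structured draft rather than raw substring search:

```python
# Sketch of the pre-review check on LLM-drafted appeal letters: the
# draft must contain every required element named in the text. Section
# markers are hypothetical placeholders.

REQUIRED_SECTIONS = (
    "Contract Reference:",
    "Regulatory Citation:",
    "Requested Action:",
)

def missing_sections(draft: str) -> list:
    return [s for s in REQUIRED_SECTIONS if s not in draft]

draft = (
    "Contract Reference: Section 4.2, 2024 fee schedule\n"
    "Requested Action: Reprocess claim at the contracted rate\n"
)
gaps = missing_sections(draft)  # this draft is missing its citation
```

Drafts with gaps loop back through generation instead of consuming reviewer time, which is what keeps the per-letter review down to a few minutes.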
Submission and Tracking
- Multi-channel submission: Appeals are submitted through the optimal channel per payer — electronic submission via payer portals (automated with Playwright for major payers), clearinghouse 837 resubmission for coding corrections, or PDF upload for payers that only accept document-based appeals.
- Status tracking: Each appeal is tracked through its lifecycle with automated follow-up triggers. If a payer has not responded within their stated SLA (typically 30-45 days), the system generates a follow-up inquiry automatically.
- Outcome feedback: Appeal results feed back into the scoring model. Win/loss patterns by payer and discrepancy type inform both the appeal success predictor and the template selection logic — if a particular argument style consistently loses with a specific payer, the system learns to try a different approach.
Results & Performance Metrics
After several months in production, we have enough data to evaluate how the system performs. The numbers are grounded in actual production metrics, not projections.
- Throughput: The engine processes a full day of remittances (typically 300-350 EOBs) in under 4 minutes, compared to the multiple FTE-days previously required for partial manual review. This means every remittance gets reviewed, not just a 10-15% sample.
- Matching accuracy: The fuzzy matching algorithm correctly pairs line items with roughly 95% precision, verified against manual audit of a 500-discrepancy sample. The remaining 5% are mostly complex bundling scenarios where even experienced billing staff disagree on the correct match.
- Time-to-appeal: Average time from EOB receipt to appeal submission dropped from about two weeks to under two days. The biggest impact is on time-sensitive appeals where filing deadlines are tight — previously, some valid discrepancies were discovered after the filing window closed.
- Appeal quality: Appeals generated with complete contract citations and structured supporting documentation tend to perform better than manually drafted appeals. Consistent citation of contract terms and regulatory references makes a difference, particularly with payers that process appeals through rules-based adjudication systems.
- Staffing impact: The billing team was able to shift several FTEs from manual underpayment review to higher-value work like contract negotiation and denial prevention. The system handles the high-volume mechanical work; people handle the judgment calls.
The system is not perfect. The NLP parser still struggles with certain payer formats that use non-standard table layouts, and the contract rate lookup occasionally returns ambiguous results when amendments overlap. We are actively working on both issues. But the overall architecture has proven sound — the pipeline design means we can improve individual stages without touching the others, and the scoring system ensures that even when errors occur, they affect low-priority items rather than high-value discrepancies.
Revenue Cycle Intelligence
Interested in automating EOB reconciliation?
We build document comparison and reconciliation systems for healthcare revenue cycle workflows. If you are dealing with similar challenges, we would be happy to discuss the technical approach.
Talk to Our Healthcare Team