
Feature Engineering and Model Selection for Clinical Treatment Outcome Prediction

Key Takeaways

  • For tabular clinical data with moderate dataset sizes (thousands, not millions, of records), XGBoost consistently outperformed neural networks in our experiments. The neural network showed higher training variance across folds, suggesting overfitting — a common pattern with clinical datasets where the feature space is wide relative to the number of samples.
  • SHAP explainability was the difference between a model that sat on a shelf and one that clinicians actually used. Presenting a bare probability score resulted in near-zero adoption. Showing which patient features drove each prediction — and letting clinicians verify the reasoning against their clinical knowledge — changed the conversation entirely.
  • Defining the outcome variable precisely was harder than building the model. "Treatment success" in clinical settings has multiple valid definitions, and the choice of definition directly affects model behavior, clinical utility, and how stakeholders interpret results. We spent more time on this than on hyperparameter tuning.

The Clinical Decision Challenge

In dental sleep medicine, oral appliances treat obstructive sleep apnea by repositioning the mandible to maintain an open airway during sleep. The therapy works for many patients, but not all — published success rates range from 50% to 75% depending on patient selection criteria and how success is defined. For a treatment that costs $2,500-$4,500 and takes 3-6 months, better prediction of who will respond is worth pursuing.

Clinicians currently rely on a combination of experience, screening heuristics (Epworth Sleepiness Scale, AHI severity, BMI thresholds), and cephalometric analysis. These work reasonably well at the extremes — a thin patient with mild apnea and normal jaw anatomy is likely to succeed, while a patient with severe apnea and unfavorable anatomy is likely to fail. But the majority of patients fall in the ambiguous middle where clinical judgment alone is not reliable.

We had access to 8 years of treatment outcome data — clinical measurements, treatment parameters, and post-treatment sleep study results for over 6,400 patients. The question was whether this data contained enough signal to meaningfully improve treatment selection. This post covers the feature engineering, model selection trade-offs, and the validation methodology we used.

Defining the Outcome Variable

Before building any model, we needed to define what we were predicting. "Treatment success" in dental sleep medicine has at least four accepted definitions: AASM criteria (AHI below 5), Medicare criteria (AHI reduced by 50% and below 20), symptomatic improvement (Epworth Sleepiness Scale reduction), or patient-reported outcomes. These overlap but are not equivalent — a patient can meet Medicare criteria but still feel tired, or report subjective improvement without objective AHI reduction.

After working with the clinical team, we settled on a composite metric requiring both objective improvement (AHI reduction of at least 50% or below 10) and subjective improvement (ESS reduction of at least 4 points or below 10). This dual criterion better reflects what clinicians consider a successful outcome than any single metric. It also changed the class balance in our dataset — using the composite definition, 62% of patients met the success criteria, compared to 71% with the more lenient Medicare definition alone.

Dataset & Feature Engineering

From 6,412 patient records spanning 2017-2025, we retained 4,847 complete records after excluding those with missing post-treatment sleep studies or follow-up assessments. This 24% attrition is typical for clinical datasets — patients drop out of follow-up, transfer to other providers, or have incomplete records. The retained dataset had 3,014 successes (62%) and 1,833 non-responders.

Feature Categories

We engineered 73 candidate features across five clinical domains, drawing on both published predictive factors and features suggested by the clinical team from their practice experience.

  • Anthropometric features (12): BMI, neck circumference, waist-to-hip ratio, Mallampati score, tongue volume index, and body composition metrics. These capture the physical characteristics influencing airway collapsibility.
  • Sleep study parameters (15): Baseline AHI, oxygen desaturation index, minimum SpO2, time below 90% SpO2, REM/NREM AHI ratio, supine vs. non-supine AHI, respiratory event duration distribution, and arousal index. These quantify the severity and pattern of sleep-disordered breathing.
  • Craniofacial morphology (18): Cephalometric measurements — SNB angle, ANB angle, mandibular plane angle, hyoid-to-mandibular plane distance, posterior airway space, soft palate dimensions, and tongue-to-mandible ratio. Extracted from lateral cephalograms using automated cephalometric analysis.
  • Dental and occlusal features (14): Overjet, overbite, remaining teeth count, periodontal status, maximum comfortable protrusion distance, protrusion-to-maximum ratio, and interincisal opening. These determine the mechanical range available for mandibular advancement.
  • Demographics and comorbidities (14): Age, sex, smoking, alcohol consumption, hypertension, diabetes, GERD, depression, medication use (sedatives, muscle relaxants), and prior CPAP trial history.

Feature Selection and Preprocessing

We applied a three-stage feature selection process. First, features with more than 25% missing values were dropped (8 features that were consistently recorded only in recent years). Second, mutual information scoring and LASSO regularization identified features with meaningful signal, reducing the set from 65 to 41. Third, we removed highly correlated pairs (Pearson r > 0.85), keeping the more predictive feature of each pair. The final model uses 34 features.
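The correlation-pruning stage can be sketched as follows. This is a minimal illustration, not our production code: the feature names are synthetic, and ties within a correlated pair are broken by mutual information, echoing the second stage.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

def prune_correlated(X: pd.DataFrame, y: pd.Series, r_thresh: float = 0.85) -> list:
    """Drop the less predictive member of each highly correlated feature pair."""
    # Score every feature once so correlated pairs are resolved by signal strength.
    mi = pd.Series(mutual_info_classif(X, y, random_state=0), index=X.columns)
    corr = X.corr().abs()
    drop = set()
    cols = list(X.columns)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if corr.loc[a, b] > r_thresh and a not in drop and b not in drop:
                drop.add(a if mi[a] < mi[b] else b)  # keep the stronger feature
    return [c for c in cols if c not in drop]

# Toy demo: x2 is a noisy duplicate of x1, so one of the pair is dropped.
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
X = pd.DataFrame({"x1": x1,
                  "x2": x1 + rng.normal(scale=0.05, size=500),
                  "x3": rng.normal(size=500)})
y = pd.Series((x1 > 0).astype(int))
kept = prune_correlated(X, y)
print(kept)  # two features survive; the weaker of the x1/x2 pair is dropped
```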

Missing values in the retained features were handled with multiple imputation using chained equations (MICE), generating 5 imputed datasets. Models were trained on each and predictions averaged at inference using Rubin's pooling rules. We used robust scaling (median and IQR) rather than standard normalization because clinical measurements have a meaningful number of outliers — an AHI of 120 is rare but real, and standard scaling would compress the normal range to accommodate these extremes.
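The imputation-plus-scaling step can be sketched with scikit-learn, whose `IterativeImputer` approximates MICE. For brevity this shows a single imputed dataset rather than the five pooled draws used in production, and the input values are synthetic.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

# One chained-equations imputation followed by median/IQR scaling. The
# production pipeline drew 5 imputed datasets (varying random_state) and
# pooled model predictions across them.
prep = Pipeline([
    ("impute", IterativeImputer(max_iter=10, sample_posterior=True, random_state=0)),
    ("scale", RobustScaler()),  # robust to real-but-rare outliers like an AHI of 120
])

X = np.array([[25.0, 8.0], [31.0, np.nan], [40.0, 35.0], [np.nan, 12.0]])
X_prep = prep.fit_transform(X)
print(X_prep.shape)  # (4, 2), with no NaNs remaining
```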

Model Architecture & Training

We evaluated five architectures: logistic regression (baseline), random forest, XGBoost, SVM with RBF kernel, and a feedforward neural network (3-layer, 128-64-32). Model selection was driven by two criteria: predictive performance and interpretability. In clinical settings, interpretability is not a nice-to-have — clinicians will not change treatment decisions based on a model they cannot reason about.

Model Comparison

We held out 20% of the 4,847-record dataset (969 records) as a final test set never touched during development, and used stratified 5-fold cross-validation on the remaining records for model comparison.

  • Logistic regression: AUC 0.782, accuracy 74.1%. Fully interpretable but unable to capture non-linear feature interactions (like the relationship between BMI and airway anatomy) that clinicians know matter.
  • Random forest: AUC 0.841, accuracy 79.8%. Good performance with inherent feature importance, but poorly calibrated — predicted probabilities did not match observed frequencies, which is a problem when you need to communicate risk to clinicians.
  • XGBoost: AUC 0.891, accuracy 83.6%. Best raw performance. After Platt scaling for calibration, predicted probabilities aligned well with observed outcomes (Brier score 0.142). Selected as the production model.
  • SVM (RBF): AUC 0.867, accuracy 81.2%. Strong performance but inherently opaque — the kernel transformation that makes SVMs powerful makes them hard to explain to clinicians.
  • Neural network: AUC 0.879, accuracy 82.9%. Slightly lower than XGBoost with significantly higher training variance across folds. With ~5,000 records and 34 features, the neural network was overfitting despite regularization. This is a common pattern with tabular clinical data — the dataset is not large enough for the network to generalize reliably.

Hyperparameter Optimization

XGBoost hyperparameters were optimized using Bayesian optimization via Optuna with 200 trials, targeting AUC on validation folds. Final parameters: max_depth=6, learning_rate=0.043, n_estimators=842, min_child_weight=3, subsample=0.82, colsample_bytree=0.78, gamma=0.15. Early stopping with patience of 50 rounds prevented overfitting.

On the held-out test set, the model achieved AUC 0.873, accuracy 83.7%, sensitivity 89.1%, and specificity 84.2%. We set the decision threshold at 0.58 rather than the default 0.5, based on a clinical utility analysis that weighted false negatives (denying treatment to a patient who would benefit) more heavily than false positives (recommending treatment to someone who will not fully respond). This threshold choice reflects the clinical context — the downside of a false negative (patient misses an effective treatment) is worse than the downside of a false positive (patient tries treatment that works partially or not at all, then switches to an alternative).
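Mechanically, the threshold analysis is a sweep over candidate thresholds that minimizes an asymmetric misclassification cost. A simplified sketch on synthetic scores; the 3:1 cost ratio here is illustrative, not the clinician-elicited weighting used in the actual analysis:

```python
import numpy as np

def pick_threshold(y_true: np.ndarray, proba: np.ndarray,
                   cost_fn: float = 3.0, cost_fp: float = 1.0) -> float:
    """Choose the decision threshold minimizing total misclassification cost,
    with false negatives weighted more heavily than false positives."""
    thresholds = np.linspace(0.05, 0.95, 181)
    costs = []
    for t in thresholds:
        pred = (proba >= t).astype(int)
        fn = np.sum((y_true == 1) & (pred == 0))  # would-be responders turned away
        fp = np.sum((y_true == 0) & (pred == 1))  # non-responders recommended in
        costs.append(cost_fn * fn + cost_fp * fp)
    return float(thresholds[int(np.argmin(costs))])

# Toy scores loosely correlated with the label, noisier than a real model.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=2000)
proba = np.clip(0.3 * y + 0.35 + rng.normal(scale=0.15, size=2000), 0, 1)
t = pick_threshold(y, proba)
print(round(t, 2))
```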

Model Explainability with SHAP

During an initial pilot with three clinicians, we presented the model as a simple success probability. The response was unanimous: "I am not going to change my treatment recommendation based on a number from a black box." This was not resistance to technology — it was entirely reasonable. Clinicians need to understand why a prediction was made to evaluate whether the reasoning is sound for a specific patient. A bare probability does not give them anything to work with.

SHAP Implementation

We implemented SHAP using the TreeSHAP algorithm, which computes exact Shapley values in polynomial time for tree-based models (versus the exponential time of model-agnostic kernel SHAP). For each prediction, the system generates both a global explanation (which features matter most across the dataset) and a local explanation (which features drove this specific patient's prediction).

  • Waterfall plots: For each patient, a waterfall shows the base prediction (population average 62% success rate), then adds or subtracts each feature's contribution. A clinician can see that this patient's high BMI decreases predicted success by 8 points, but favorable airway anatomy increases it by 12. This gives clinicians something concrete to evaluate against their clinical knowledge.
  • Feature interaction detection: SHAP interaction values revealed clinically meaningful interactions not previously documented. The negative impact of high BMI is modulated by mandibular protrusion range — patients with high BMI but exceptional protrusion capacity (above 11mm) have significantly better outcomes than BMI alone would predict. The clinical team validated this and updated their screening protocol.
  • Counterfactual explanations: Beyond explaining the current prediction, we generate scenarios like: "If this patient's BMI were 28 instead of 34, predicted success would increase from 54% to 71%." These help clinicians discuss modifiable risk factors with patients in concrete terms.

Top Predictive Features

The global SHAP analysis revealed the top 10 features: (1) maximum comfortable protrusion distance, (2) baseline AHI, (3) BMI, (4) posterior airway space, (5) mandibular plane angle, (6) supine/non-supine AHI ratio, (7) age, (8) minimum SpO2, (9) Mallampati score, (10) hyoid-to-mandibular plane distance. The most interesting finding was that protrusion capacity — which is not part of standard screening protocols — emerged as the single most predictive feature. Several features clinicians traditionally emphasize (neck circumference, Epworth score) ranked lower than expected. This kind of finding is where ML adds genuine value to clinical practice: surfacing signal that exists in the data but is not captured by traditional heuristics.

Clinical Validation Study

Before deploying the model, we ran a prospective validation study across 8 practices over 6 months. The study enrolled 340 new patients randomized into standard care (n=168) and model-assisted care (n=172). In the intervention group, clinicians received the prediction and SHAP explanation as a decision support tool. The model was advisory — clinicians could override it freely, and we tracked override rates.

Study Results

The study design was straightforward but the results contained some nuance worth discussing.

  • Success rates: The model-assisted group achieved 78% composite success versus 64% in the control group. The improvement came primarily from better patient selection — the model helped clinicians identify borderline patients unlikely to respond and redirect them to alternatives earlier.
  • Time to efficacy: Among patients who succeeded, the model-assisted group reached therapeutic efficacy in an average of 67 days versus 101 days in the control group. The model's predictions influenced initial appliance parameters (protrusion targets), reducing the number of titration adjustments needed.
  • Clinician overrides: Clinicians overrode the model 18.6% of the time. Of these, 71% were cases where clinicians proceeded despite a low prediction, typically based on patient preference or factors not in the model. Override cases had a 47% success rate versus the model's predicted 38% — clinicians capture some valid signal beyond the model's features, but the model's skepticism was largely justified.
  • Patient communication: Clinicians reported that the SHAP explanations helped them have more transparent conversations about expected outcomes and risk factors. Patient satisfaction scores were 12% higher in the intervention group, attributed to better expectation-setting.

Production Integration

After the validation study, we deployed across all 45 practices. The integration needed to be seamless — clinicians were not going to adopt a tool that required switching applications or manual data entry.

Integration Architecture

The prediction service exposes a FHIR-compatible API that accepts a patient resource bundle and returns a RiskAssessment resource with the prediction, confidence interval, and SHAP explanations encoded as extensions. The clinical platform calls this API automatically when a clinician completes the initial assessment. The prediction appears as a decision support card within the consultation interface — no context switching.

  • Inference latency: The service runs on AWS Lambda with provisioned concurrency. Average latency is 340ms including SHAP computation, well within the 2-second threshold for clinical decision support responsiveness.
  • Model deployment: The XGBoost model is serialized via joblib and loaded into warm Lambda environments. We use blue-green deployment for updates — the new model runs alongside the old one, and traffic shifts after automated regression tests pass on a reference dataset.
  • Data drift monitoring: A pipeline monitors incoming feature distributions against training data using the Population Stability Index. A PSI above 0.25 for any feature triggers an alert. In 8 months of production, one alert, caused by a shift in the patient population's average BMI, has triggered a retrain cycle.
  • Feedback loop: Treatment outcomes are captured when post-treatment sleep studies complete (typically 3-6 months after delivery) and feed back into the training dataset. The model is retrained quarterly with the expanded dataset, and performance is compared against the previous version before deployment.
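The PSI check compares production feature distributions against quantile bins of the training data. A self-contained sketch of the drift monitor, using the 0.25 alert threshold and a simulated BMI shift:

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, n_bins: int = 10) -> float:
    """Population Stability Index between a training ('expected') and a
    production ('actual') sample, using quantile bins of the training data."""
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf          # catch values outside the range
    e = np.histogram(expected, bins=edges)[0] / len(expected)
    a = np.histogram(actual, bins=edges)[0] / len(actual)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)  # avoid log(0)
    return float(np.sum((a - e) * np.log(a / e)))

rng = np.random.default_rng(0)
train_bmi = rng.normal(31, 5, 10_000)
drifted_bmi = rng.normal(34, 5, 10_000)            # average BMI shifted upward
print(round(psi(train_bmi, train_bmi), 3))         # ~0.0: no drift
print(round(psi(train_bmi, drifted_bmi), 3))       # > 0.25: triggers an alert
```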

Outcomes & Clinical Impact

After 8 months in production across the network, here is what we can measure. The model has become a regular part of the clinical workflow, which itself is a meaningful outcome — adoption of clinical decision support tools is notoriously low.

  • Success rate improvement: Network-wide treatment success improved from 62% (historical) to 76%. Most of this comes from better patient selection rather than better treatment — the model identifies patients unlikely to respond before they start a multi-month treatment course.
  • Treatment abandonment: The rate of patients abandoning treatment within 90 days dropped from 21% to 11%. Better prediction and expectation-setting reduced the number of patients who start treatment and then give up.
  • Clinician adoption: 89% of clinicians report consulting the model for every new patient, and 73% say it has changed at least one treatment recommendation per month. SHAP explanations were cited as the primary driver of trust.
  • Clinical research: The SHAP analysis generated two findings that resulted in peer-reviewed publications — the predictive power of protrusion capacity (a novel finding) and the BMI-protrusion interaction effect. This is a nice secondary benefit of doing explainability well.
  • Insurance authorization: Several payers accepted the model's prediction as supporting documentation for prior authorization, reducing denial rates from 28% to 19% in participating payer networks. This was unexpected and is still limited to a few payers, but it suggests a direction.

The model is not without limitations. It was trained on data from a specific practice network, so its generalizability to different patient populations is unproven. The 34 features require a complete clinical assessment to be available, which means the prediction cannot be generated during a phone screening. And the model currently predicts binary success/failure — we are exploring extensions to predict continuous outcomes (expected AHI reduction) and optimal initial treatment parameters, which would be more clinically useful.


Building predictive models on clinical data?

We have experience with feature engineering, validation methodology, and SHAP-based explainability for healthcare ML. If you are working on similar problems, we would be glad to discuss the approach.

Talk to Our Healthcare Team
