Efficient Variance-reduced Estimation from Generative EHR Models: The SCOPE and REACH Estimators
- URL: http://arxiv.org/abs/2602.03730v1
- Date: Tue, 03 Feb 2026 16:49:44 GMT
- Title: Efficient Variance-reduced Estimation from Generative EHR Models: The SCOPE and REACH Estimators
- Authors: Luke Solo, Matthew B. A. McDermott, William F. Parker, Bashar Ramadan, Michael C. Burkhart, Brett K. Beaulieu-Jones,
- Abstract summary: Generative models trained using tokenized electronic health record timelines show promise for clinical outcome prediction.<n>Existing approaches suffer from three key limitations: sparse estimate distributions that poorly differentiate patient risk levels, extreme computational costs, and high sampling variance.<n>We propose two new estimators: the Sum of Conditional Outcome Probability Estimator (SCOPE) and Risk Estimation from Anticipated Conditional Hazards (REACH), that leverage next-token probability distributions discarded by standard Monte Carlo.
- Score: 1.0765331200875645
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generative models trained using self-supervision of tokenized electronic health record (EHR) timelines show promise for clinical outcome prediction. This is typically done using Monte Carlo simulation for future patient trajectories. However, existing approaches suffer from three key limitations: sparse estimate distributions that poorly differentiate patient risk levels, extreme computational costs, and high sampling variance. We propose two new estimators: the Sum of Conditional Outcome Probability Estimator (SCOPE) and Risk Estimation from Anticipated Conditional Hazards (REACH), that leverage next-token probability distributions discarded by standard Monte Carlo. We prove both estimators are unbiased and that REACH guarantees variance reduction over Monte Carlo sampling for any model and outcome. Empirically, on hospital mortality prediction in MIMIC-IV using the ETHOS-ARES framework, SCOPE and REACH match 100-sample Monte Carlo performance using only 10-11 samples (95% CI: [9,11]), representing a ~10x reduction in inference cost without degrading calibration. For ICU admission prediction, efficiency gains are more modest (~1.2x), which we attribute to the outcome's lower "spontaneity," a property we characterize theoretically and empirically. These methods substantially improve the feasibility of deploying generative EHR models in resource-constrained clinical settings.
Related papers
- Bootstrapping-based Regularisation for Reducing Individual Prediction Instability in Clinical Risk Prediction Models [2.1127261244588156]
We propose a novel bootstrapping-based regularisation framework that embeds the bootstrapping process directly into the training of deep neural networks.<n>This approach constrains prediction variability across resampled datasets, producing a single model with inherent stability properties.<n>We evaluated models constructed using the proposed regularisation approach against conventional and ensemble models.
arXiv Detail & Related papers (2026-02-11T20:47:30Z) - A Generative Framework for Causal Estimation via Importance-Weighted Diffusion Distillation [55.53426007439564]
Estimating individualized treatment effects from observational data is a central challenge in causal inference.<n>In inverse probability weighting (IPW) is a well-established solution to this problem, but its integration into modern deep learning frameworks remains limited.<n>We propose Importance-Weighted Diffusion Distillation (IWDD), a novel generative framework that combines the pretraining of diffusion models with importance-weighted score distillation.
arXiv Detail & Related papers (2025-05-16T17:00:52Z) - A Novel Characterization of the Population Area Under the Risk Coverage Curve (AURC) and Rates of Finite Sample Estimators [22.231565523932286]
Area Under the Risk-Coverage Curve (AURC) has emerged as the foremost evaluation metric for assessing the performance of SC systems.<n>We derive empirical AURC plug-in estimators for finite sample scenarios.<n>We empirically validate the effectiveness of our estimators through experiments across multiple datasets.
arXiv Detail & Related papers (2024-10-20T11:14:51Z) - Bayesian calibration of stochastic agent based model via random forest [1.4447019135112433]
Agent-based models (ABM) provide an excellent framework for modeling outbreaks and interventions in epidemiology.
These models are usually and highly parametrized, requiring precise calibration for predictive performance.
This paper presents a random forest based surrogate modeling technique to accelerate the evaluation of ABMs.
arXiv Detail & Related papers (2024-06-27T20:50:06Z) - High Precision Causal Model Evaluation with Conditional Randomization [10.23470075454725]
We introduce a novel low-variance estimator for causal error, dubbed as the pairs estimator.
By applying the same IPW estimator to both the model and true experimental effects, our estimator effectively cancels out the variance due to IPW and achieves a smaller variance.
Our method offers a simple yet powerful solution to evaluate causal inference models in conditional randomization settings without complicated modification of the IPW estimator itself.
arXiv Detail & Related papers (2023-11-03T13:22:27Z) - Quantifying predictive uncertainty of aphasia severity in stroke patients with sparse heteroscedastic Bayesian high-dimensional regression [47.1405366895538]
Sparse linear regression methods for high-dimensional data commonly assume that residuals have constant variance, which can be violated in practice.
This paper proposes estimating high-dimensional heteroscedastic linear regression models using a heteroscedastic partitioned empirical Bayes Expectation Conditional Maximization algorithm.
arXiv Detail & Related papers (2023-09-15T22:06:29Z) - Forecast reconciliation for vaccine supply chain optimization [61.13962963550403]
Vaccine supply chain optimization can benefit from hierarchical time series forecasting.
Forecasts of different hierarchy levels become incoherent when higher levels do not match the sum of the lower levels forecasts.
We tackle the vaccine sale forecasting problem by modeling sales data from GSK between 2010 and 2021 as a hierarchical time series.
arXiv Detail & Related papers (2023-05-02T14:34:34Z) - Clinical Deterioration Prediction in Brazilian Hospitals Based on
Artificial Neural Networks and Tree Decision Models [56.93322937189087]
An extremely boosted neural network (XBNet) is used to predict clinical deterioration (CD)
The XGBoost model obtained the best results in predicting CD among Brazilian hospitals' data.
arXiv Detail & Related papers (2022-12-17T23:29:14Z) - Bootstrapping Your Own Positive Sample: Contrastive Learning With
Electronic Health Record Data [62.29031007761901]
This paper proposes a novel contrastive regularized clinical classification model.
We introduce two unique positive sampling strategies specifically tailored for EHR data.
Our framework yields highly competitive experimental results in predicting the mortality risk on real-world COVID-19 EHR data.
arXiv Detail & Related papers (2021-04-07T06:02:04Z) - Increasing the efficiency of randomized trial estimates via linear
adjustment for a prognostic score [59.75318183140857]
Estimating causal effects from randomized experiments is central to clinical research.
Most methods for historical borrowing achieve reductions in variance by sacrificing strict type-I error rate control.
arXiv Detail & Related papers (2020-12-17T21:10:10Z) - STELAR: Spatio-temporal Tensor Factorization with Latent Epidemiological
Regularization [76.57716281104938]
We develop a tensor method to predict the evolution of epidemic trends for many regions simultaneously.
STELAR enables long-term prediction by incorporating latent temporal regularization through a system of discrete-time difference equations.
We conduct experiments using both county- and state-level COVID-19 data and show that our model can identify interesting latent patterns of the epidemic.
arXiv Detail & Related papers (2020-12-08T21:21:47Z) - WRSE -- a non-parametric weighted-resolution ensemble for predicting
individual survival distributions in the ICU [0.251657752676152]
Dynamic assessment of mortality risk in the intensive care unit (ICU) can be used to stratify patients, inform about treatment effectiveness or serve as part of an early-warning system.
We show competitive results with state-of-the-art probabilistic models, while greatly reducing training time by factors of 2-9x.
arXiv Detail & Related papers (2020-11-02T10:13:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.