Controlling for Unobserved Confounding with Large Language Model Classification of Patient Smoking Status
- URL: http://arxiv.org/abs/2411.03004v1
- Date: Tue, 05 Nov 2024 11:05:53 GMT
- Title: Controlling for Unobserved Confounding with Large Language Model Classification of Patient Smoking Status
- Authors: Samuel Lee, Zach Wood-Doughty
- Abstract summary: Causal understanding is a fundamental goal of evidence-based medicine.
Prior work has proposed to address unobserved confounding with machine learning.
This paper extends this methodology by using a large language model trained on clinical notes to predict patients' smoking status.
- Score: 0.7443139252028033
- Abstract: Causal understanding is a fundamental goal of evidence-based medicine. When randomization is impossible, causal inference methods allow the estimation of treatment effects from retrospective analysis of observational data. However, such analyses rely on a number of assumptions, often including that of no unobserved confounding. In many practical settings, this assumption is violated when important variables are not explicitly measured in the clinical record. Prior work has proposed to address unobserved confounding with machine learning by imputing unobserved variables and then correcting for the classifier's mismeasurement. When such a classifier can be trained and the necessary assumptions are met, this method can recover an unbiased estimate of a causal effect. However, such work has been limited to synthetic data, simple classifiers, and binary variables. This paper extends this methodology by using a large language model trained on clinical notes to predict patients' smoking status, which would otherwise be an unobserved confounder. We then apply a measurement error correction on the categorical predicted smoking status to estimate the causal effect of transthoracic echocardiography on mortality in the MIMIC dataset.
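The correction step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it shows one standard matrix-inversion correction for a misclassified categorical confounder. It assumes the classifier's error matrix is known and nondifferential (independent of treatment and outcome given the true status). All names and numbers below (`backdoor_ate`, `correct_joint`, the three smoking categories, the probabilities) are hypothetical and chosen only for illustration.

```python
import numpy as np

def backdoor_ate(joint):
    """Backdoor-adjusted effect of a binary treatment A on a binary outcome Y.

    joint[u, a, y] = P(U=u, A=a, Y=y) for a categorical confounder U.
    """
    p_u = joint.sum(axis=(1, 2))       # P(U=u)
    p_ua = joint.sum(axis=2)           # P(U=u, A=a)
    p_y1 = joint[:, :, 1] / p_ua       # P(Y=1 | U=u, A=a)
    return float(np.sum((p_y1[:, 1] - p_y1[:, 0]) * p_u))

def correct_joint(proxy_joint, error_matrix):
    """Recover P(U, A, Y) from P(U*, A, Y).

    error_matrix[i, j] = P(U*=i | U=j). Assumed known and nondifferential;
    in practice it would be estimated, e.g. from a labeled validation set.
    """
    k = proxy_joint.shape[0]
    flat = proxy_joint.reshape(k, -1)
    return np.linalg.solve(error_matrix, flat).reshape(proxy_joint.shape)

# Hypothetical ground truth: U in {never, former, current smoker}.
p_u = np.array([0.5, 0.3, 0.2])        # P(U=u)
p_a1 = np.array([0.3, 0.5, 0.7])       # P(A=1 | U=u)
base_risk = np.array([0.1, 0.2, 0.3])  # P(Y=1 | A=0, U=u)
true_ate = -0.05                       # constant treatment effect on risk

joint = np.zeros((3, 2, 2))
for u in range(3):
    for a in range(2):
        p_a = p_a1[u] if a == 1 else 1.0 - p_a1[u]
        p_y1 = base_risk[u] + true_ate * a
        joint[u, a, 1] = p_u[u] * p_a * p_y1
        joint[u, a, 0] = p_u[u] * p_a * (1.0 - p_y1)

# Hypothetical classifier error matrix; each column sums to 1.
error_matrix = np.array([
    [0.85, 0.10, 0.05],
    [0.10, 0.80, 0.10],
    [0.05, 0.10, 0.85],
])

# Distribution over the predicted status U* instead of the true status U.
proxy_joint = np.einsum("ij,jay->iay", error_matrix, joint)

naive_ate = backdoor_ate(proxy_joint)
corrected_ate = backdoor_ate(correct_joint(proxy_joint, error_matrix))
print(f"true={true_ate:.4f} naive={naive_ate:.4f} corrected={corrected_ate:.4f}")
```

Adjusting for the predicted status alone leaves residual confounding, whereas inverting the error matrix recovers the true joint distribution in this idealized, population-level setting. With finite samples and an estimated error matrix, the correction would carry additional variance that this sketch does not model.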
Related papers
- SepsisLab: Early Sepsis Prediction with Uncertainty Quantification and Active Sensing [67.8991481023825]
Sepsis is the leading cause of in-hospital mortality in the USA.
Existing predictive models are usually trained on high-quality data with little missing information.
For the potential high-risk patients with low confidence due to limited observations, we propose a robust active sensing algorithm.
arXiv Detail & Related papers (2024-07-24T04:47:36Z)
- Selective Nonparametric Regression via Testing [54.20569354303575]
We develop an abstention procedure via testing the hypothesis on the value of the conditional variance at a given point.
Unlike existing methods, the proposed one accounts not only for the value of the variance itself but also for the uncertainty of the corresponding variance predictor.
arXiv Detail & Related papers (2023-09-28T13:04:11Z)
- Identifiable causal inference with noisy treatment and no side information [6.432072145009342]
This study proposes a model that assumes a continuous treatment variable that is inaccurately measured.
We prove that our model's causal effect estimates are identifiable, even without side information and knowledge of the measurement error variance.
Our work extends the range of applications in which reliable causal inference can be conducted.
arXiv Detail & Related papers (2023-06-18T18:38:10Z)
- Benchmarking Heterogeneous Treatment Effect Models through the Lens of Interpretability [82.29775890542967]
Estimating personalized effects of treatments is a complex, yet pervasive problem.
Recent developments in the machine learning literature on heterogeneous treatment effect estimation gave rise to many sophisticated, but opaque, tools.
We use post-hoc feature importance methods to identify features that influence the model's predictions.
arXiv Detail & Related papers (2022-06-16T17:59:05Z)
- A Machine Learning Model for Predicting, Diagnosing, and Mitigating Health Disparities in Hospital Readmission [0.0]
We propose a machine learning pipeline capable of making predictions as well as detecting and mitigating biases in the data and model predictions.
We evaluate the performance of the proposed method on a clinical dataset using accuracy and fairness measures.
arXiv Detail & Related papers (2022-06-13T16:07:25Z)
- Calibration of prediction rules for life-time outcomes using prognostic Cox regression survival models and multiple imputations to account for missing predictor data with cross-validatory assessment [0.0]
Methods are described to combine imputation with predictive calibration in survival modeling subject to censoring.
Prediction-averaging appears to have superior statistical properties, especially smaller predictive variation, as opposed to a direct application of Rubin's rules.
arXiv Detail & Related papers (2021-05-04T20:10:12Z)
- Efficient Causal Inference from Combined Observational and Interventional Data through Causal Reductions [68.6505592770171]
Unobserved confounding is one of the main challenges when estimating causal effects.
We propose a novel causal reduction method that replaces an arbitrary number of possibly high-dimensional latent confounders.
We propose a learning algorithm to estimate the parameterized reduced model jointly from observational and interventional data.
arXiv Detail & Related papers (2021-03-08T14:29:07Z)
- Increasing the efficiency of randomized trial estimates via linear adjustment for a prognostic score [59.75318183140857]
Estimating causal effects from randomized experiments is central to clinical research.
Most methods for historical borrowing achieve reductions in variance by sacrificing strict type-I error rate control.
arXiv Detail & Related papers (2020-12-17T21:10:10Z)
- Impact of Medical Data Imprecision on Learning Results [9.379890125442333]
We study the impact of imprecision on prediction results in a healthcare application.
A pre-trained model is used to predict the future state of hyperthyroidism for patients.
arXiv Detail & Related papers (2020-07-24T06:54:57Z)
- Performance metrics for intervention-triggering prediction models do not reflect an expected reduction in outcomes from using the model [71.9860741092209]
Clinical researchers often select among and evaluate risk prediction models.
Standard metrics calculated from retrospective data are only related to model utility under certain assumptions.
When predictions are delivered repeatedly throughout time, the relationship between standard metrics and utility is further complicated.
arXiv Detail & Related papers (2020-06-02T16:26:49Z)