Model-assisted cohort selection with bias analysis for generating
large-scale cohorts from the EHR for oncology research
- URL: http://arxiv.org/abs/2001.09765v1
- Date: Mon, 13 Jan 2020 22:58:48 GMT
- Title: Model-assisted cohort selection with bias analysis for generating
large-scale cohorts from the EHR for oncology research
- Authors: Benjamin Birnbaum, Nathan Nussbaum, Katharina Seidl-Rathkopf, Monica
Agrawal, Melissa Estevez, Evan Estola, Joshua Haimson, Lucy He, Peter Larson,
Paul Richardson
- Abstract summary: We introduce a technique called Model-Assisted Cohort Selection (MACS) with Bias Analysis.
We trained a model on 17,263 patients using term-frequency inverse-document-frequency (TF-IDF) and logistic regression.
We used a test set of 17,292 patients to measure algorithm performance and perform Bias Analysis.
- Score: 1.25957368859589
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Objective: Electronic health records (EHRs) are a promising source of data for
health outcomes research in oncology. A challenge in using EHR data is that
selecting cohorts of patients often requires information in unstructured parts
of the record. Machine learning has been used to address this, but even
high-performing algorithms may select patients in a non-random manner and bias
the resulting cohort. To improve the efficiency of cohort selection while
measuring potential bias, we introduce a technique called Model-Assisted Cohort
Selection (MACS) with Bias Analysis and apply it to the selection of metastatic
breast cancer (mBC) patients. Materials and Methods: We trained a model on
17,263 patients using term-frequency inverse-document-frequency (TF-IDF) and
logistic regression. We used a test set of 17,292 patients to measure algorithm
performance and perform Bias Analysis. We compared the cohort generated by MACS
to the cohort that would have been generated without MACS as reference
standard, first by comparing distributions of an extensive set of clinical and
demographic variables and then by comparing the results of two analyses
addressing existing example research questions. Results: Our algorithm had an
area under the curve (AUC) of 0.976, a sensitivity of 96.0%, and an abstraction
efficiency gain of 77.9%. During Bias Analysis, we found no large differences
in baseline characteristics and no differences in the example analyses.
Conclusion: MACS with bias analysis can significantly improve the efficiency of
cohort selection on EHR data while instilling confidence that outcomes research
performed on the resulting cohort will not be biased.
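The abstract reports an AUC, a sensitivity, and an abstraction efficiency gain for the model's scores. The sketch below shows how such quantities could be computed from per-patient scores and reference labels. It is a minimal illustration, not the authors' code: the data are made up, the function names are hypothetical, and `fraction_screened_out` is only an assumed stand-in for "abstraction efficiency gain", which the abstract does not define.

```python
from typing import List

# Toy data (fabricated for illustration only): 1 = patient truly belongs in the
# cohort, 0 = does not. Scores are hypothetical model probabilities.
labels: List[int] = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores: List[float] = [0.95, 0.90, 0.60, 0.10, 0.70, 0.30, 0.20, 0.05, 0.02, 0.01]
threshold: float = 0.25  # hypothetical cutoff; patients at or above it go to manual review


def sensitivity(labels: List[int], scores: List[float], threshold: float) -> float:
    """Fraction of true cohort members whose score is at or above the threshold."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    if not positives:
        raise ValueError("labels contain no positive patients")
    return sum(s >= threshold for s in positives) / len(positives)


def auc(labels: List[int], scores: List[float]) -> float:
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive patient scores
    higher than a randomly chosen negative patient (ties count as 0.5).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative patient")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fraction_screened_out(scores: List[float], threshold: float) -> float:
    """ASSUMPTION: share of patients below the threshold, i.e. patients who
    would skip manual abstraction. The paper's exact definition of
    'abstraction efficiency gain' may differ."""
    if not scores:
        raise ValueError("scores is empty")
    return sum(s < threshold for s in scores) / len(scores)


if __name__ == "__main__":
    print("sensitivity:", sensitivity(labels, scores, threshold))
    print("AUC:", auc(labels, scores))
    print("fraction screened out:", fraction_screened_out(scores, threshold))
```

Lowering the threshold raises sensitivity but sends more patients to manual review, so in practice the cutoff would be chosen on held-out data to meet a target sensitivity before the efficiency figure is read off.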
Related papers
- Automatic Cough Analysis for Non-Small Cell Lung Cancer Detection [33.37223681850477]
Early detection of non-small cell lung cancer (NSCLC) is critical for improving patient outcomes.
We explore the use of automatic cough analysis as a pre-screening tool for distinguishing between NSCLC patients and healthy controls.
Recordings were analyzed using machine learning techniques, such as support vector machine (SVM) and XGBoost.
arXiv Detail & Related papers (2025-07-25T11:30:22Z) - A Goemans-Williamson type algorithm for identifying subcohorts in clinical trials [1.930852251165745]
We design an efficient algorithm that outputs a linear classifier for identifying homogeneous subsets from large inhomogeneous datasets.
As an application, we use our algorithm to design a simple test that can identify homogeneous subcohorts of patients.
We also use the test output by the algorithm to systematically identify subcohorts of patients in which statistically significant changes in methylation levels of tumor suppressor genes co-occur with statistically significant changes in nuclear receptor expression.
arXiv Detail & Related papers (2025-06-12T16:44:32Z) - Equitable Length of Stay Prediction for Patients with Learning Disabilities and Multiple Long-term Conditions Using Machine Learning [1.0064817439176887]
This study analyses hospitalisations of 9,618 patients identified with learning disabilities and long-term conditions for the population of Wales.
We describe the demographic characteristics, prevalence of long-term conditions, medication history, hospital visits, and lifestyle history for our study cohort.
We apply machine learning models to predict the length of hospital stays for this cohort.
arXiv Detail & Related papers (2024-11-03T20:14:20Z) - Optimizing Mortality Prediction for ICU Heart Failure Patients: Leveraging XGBoost and Advanced Machine Learning with the MIMIC-III Database [1.5186937600119894]
Heart failure affects millions of people worldwide, significantly reducing quality of life and leading to high mortality rates.
Despite extensive research, the relationship between heart failure and mortality rates among ICU patients is not fully understood.
This study analyzed data from 1,177 patients over 18 years old from the MIMIC-III database, identified using ICD-9 codes.
arXiv Detail & Related papers (2024-09-03T07:57:08Z) - Application of Machine Learning Algorithms in Classifying Postoperative Success in Metabolic Bariatric Surgery: A Comprehensive Study [0.32985979395737786]
This study presents a novel machine learning approach to classify patients in the context of metabolic bariatric surgery.
Various machine learning models, including GaussianNB, ComplementNB, KNN, Decision Tree, KNN with RandomOverSampler, and KNN with SMOTE, were applied to a dataset of 73 patients.
arXiv Detail & Related papers (2024-03-29T11:27:37Z) - TREEMENT: Interpretable Patient-Trial Matching via Personalized Dynamic
Tree-Based Memory Network [54.332862955411656]
Clinical trials are critical for drug development but often suffer from expensive and inefficient patient recruitment.
In recent years, machine learning models have been proposed for speeding up patient recruitment via automatically matching patients with clinical trials.
We introduce a dynamic tree-based memory network model named TREEMENT to provide accurate and interpretable patient trial matching.
arXiv Detail & Related papers (2023-07-19T12:35:09Z) - A method for comparing multiple imputation techniques: a case study on
the U.S. National COVID Cohort Collaborative [1.259457977936316]
We numerically evaluate strategies for handling missing data in the context of statistical analysis.
Our approach could effectively highlight the most valid and performant missing-data handling strategy.
arXiv Detail & Related papers (2022-06-13T19:49:54Z) - Bootstrapping Your Own Positive Sample: Contrastive Learning With
Electronic Health Record Data [62.29031007761901]
This paper proposes a novel contrastive regularized clinical classification model.
We introduce two unique positive sampling strategies specifically tailored for EHR data.
Our framework yields highly competitive experimental results in predicting the mortality risk on real-world COVID-19 EHR data.
arXiv Detail & Related papers (2021-04-07T06:02:04Z) - Adversarial Sample Enhanced Domain Adaptation: A Case Study on
Predictive Modeling with Electronic Health Records [57.75125067744978]
We propose a data augmentation method to facilitate domain adaptation.
Adversarially generated samples are used during domain adaptation.
Results confirm the effectiveness of our method and the generality on different tasks.
arXiv Detail & Related papers (2021-01-13T03:20:20Z) - Increasing the efficiency of randomized trial estimates via linear
adjustment for a prognostic score [59.75318183140857]
Estimating causal effects from randomized experiments is central to clinical research.
Most methods for historical borrowing achieve reductions in variance by sacrificing strict type-I error rate control.
arXiv Detail & Related papers (2020-12-17T21:10:10Z) - Hemogram Data as a Tool for Decision-making in COVID-19 Management:
Applications to Resource Scarcity Scenarios [62.997667081978825]
The COVID-19 pandemic has challenged emergency response systems worldwide, with widespread reports of essential services breaking down and health care structures collapsing.
This work describes a machine learning model derived from hemogram exam data performed in symptomatic patients.
Proposed models can predict COVID-19 qRT-PCR results in symptomatic individuals with high accuracy, sensitivity and specificity.
arXiv Detail & Related papers (2020-05-10T01:45:03Z) - Predictive Modeling of ICU Healthcare-Associated Infections from
Imbalanced Data. Using Ensembles and a Clustering-Based Undersampling
Approach [55.41644538483948]
This work is focused on both the identification of risk factors and the prediction of healthcare-associated infections in intensive-care units.
The aim is to support decision making addressed at reducing the incidence rate of infections.
arXiv Detail & Related papers (2020-05-07T16:13:12Z)