Attention-Based Synthetic Data Generation for Calibration-Enhanced Survival Analysis: A Case Study for Chronic Kidney Disease Using Electronic Health Records
- URL: http://arxiv.org/abs/2503.06096v1
- Date: Sat, 08 Mar 2025 06:58:33 GMT
- Title: Attention-Based Synthetic Data Generation for Calibration-Enhanced Survival Analysis: A Case Study for Chronic Kidney Disease Using Electronic Health Records
- Authors: Nicholas I-Hsien Kuo, Blanca Gallego, Louisa Jorm,
- Abstract summary: Masked Clinical Modelling (MCM) is an attention-based framework capable of generating high-fidelity synthetic datasets.<n>MCM preserves critical clinical insights, such as hazard ratios, while enhancing survival model calibration.
- Score: 1.7769033811751995
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Access to real-world healthcare data is limited by stringent privacy regulations and data imbalances, hindering advancements in research and clinical applications. Synthetic data presents a promising solution, yet existing methods often fail to ensure the realism, utility, and calibration essential for robust survival analysis. Here, we introduce Masked Clinical Modelling (MCM), an attention-based framework capable of generating high-fidelity synthetic datasets that preserve critical clinical insights, such as hazard ratios, while enhancing survival model calibration. Unlike traditional statistical methods like SMOTE and machine learning models such as VAEs, MCM supports both standalone dataset synthesis for reproducibility and conditional simulation for targeted augmentation, addressing diverse research needs. Validated on a chronic kidney disease electronic health records dataset, MCM reduced the general calibration loss over the entire dataset by 15%; and MCM reduced a mean calibration loss by 9% across 10 clinically stratified subgroups, outperforming 15 alternative methods. By bridging data accessibility with translational utility, MCM advances the precision of healthcare models, promoting more efficient use of scarce healthcare resources.
Related papers
- Prediction of Lung Metastasis from Hepatocellular Carcinoma using the SEER Database [0.9055332067000195]
Hepatocellular carcinoma (HCC) is a leading cause of cancer-related mortality.
predictive models for lung metastasis inHCC remain limited in scope and clinical applicability.
We develop and validate an end-to-end machine learning pipeline using data from the Surveillance, Epidemiology, and End Results (SEER) database.
arXiv Detail & Related papers (2025-01-20T20:06:31Z) - Masked Clinical Modelling: A Framework for Synthetic and Augmented Survival Data Generation [1.7769033811751995]
We present Masked Clinical Modelling (MCM), a framework inspired by masked language modelling.
MCM is designed for both data synthesis and conditional data augmentation.
We evaluate this prototype on the WHAS500 dataset using Cox Proportional Hazards models.
arXiv Detail & Related papers (2024-10-22T08:38:46Z) - Novel Development of LLM Driven mCODE Data Model for Improved Clinical Trial Matching to Enable Standardization and Interoperability in Oncology Research [0.15346678870160887]
Cancer costs reaching over $208 billion in 2023 alone.
Traditional methods regarding clinical trial enrollment and clinical care in oncology are often manual, time-consuming, and lack a data-driven approach.
This paper presents a novel framework to streamline standardization, interoperability, and exchange of cancer domains.
arXiv Detail & Related papers (2024-10-18T17:31:35Z) - Improving Multiple Sclerosis Lesion Segmentation Across Clinical Sites:
A Federated Learning Approach with Noise-Resilient Training [75.40980802817349]
Deep learning models have shown promise for automatically segmenting MS lesions, but the scarcity of accurately annotated data hinders progress in this area.
We introduce a Decoupled Hard Label Correction (DHLC) strategy that considers the imbalanced distribution and fuzzy boundaries of MS lesions.
We also introduce a Centrally Enhanced Label Correction (CELC) strategy, which leverages the aggregated central model as a correction teacher for all sites.
arXiv Detail & Related papers (2023-08-31T00:36:10Z) - Mixed-Integer Projections for Automated Data Correction of EMRs Improve
Predictions of Sepsis among Hospitalized Patients [7.639610349097473]
We introduce an innovative projections-based method that seamlessly integrates clinical expertise as domain constraints.
We measure the distance of corrected data from the constraints defining a healthy range of patient data, resulting in a unique predictive metric we term as "trust-scores"
We show an AUROC of 0.865 and a precision of 0.922, that surpasses conventional ML models without such projections.
arXiv Detail & Related papers (2023-08-21T15:14:49Z) - Automatic diagnosis of knee osteoarthritis severity using Swin
transformer [55.01037422579516]
Knee osteoarthritis (KOA) is a widespread condition that can cause chronic pain and stiffness in the knee joint.
We propose an automated approach that employs the Swin Transformer to predict the severity of KOA.
arXiv Detail & Related papers (2023-07-10T09:49:30Z) - Large Language Models for Healthcare Data Augmentation: An Example on
Patient-Trial Matching [49.78442796596806]
We propose an innovative privacy-aware data augmentation approach for patient-trial matching (LLM-PTM)
Our experiments demonstrate a 7.32% average improvement in performance using the proposed LLM-PTM method, and the generalizability to new data is improved by 12.12%.
arXiv Detail & Related papers (2023-03-24T03:14:00Z) - Towards Reliable Medical Image Segmentation by utilizing Evidential Calibrated Uncertainty [52.03490691733464]
We introduce DEviS, an easily implementable foundational model that seamlessly integrates into various medical image segmentation networks.
By leveraging subjective logic theory, we explicitly model probability and uncertainty for the problem of medical image segmentation.
DeviS incorporates an uncertainty-aware filtering module, which utilizes the metric of uncertainty-calibrated error to filter reliable data.
arXiv Detail & Related papers (2023-01-01T05:02:46Z) - Density-Aware Personalized Training for Risk Prediction in Imbalanced
Medical Data [89.79617468457393]
Training models with imbalance rate (class density discrepancy) may lead to suboptimal prediction.
We propose a framework for training models for this imbalance issue.
We demonstrate our model's improved performance in real-world medical datasets.
arXiv Detail & Related papers (2022-07-23T00:39:53Z) - BSM loss: A superior way in modeling aleatory uncertainty of
fine_grained classification [0.0]
We propose a modified Bootstrapping loss(BS loss) function with Mixup data augmentation strategy.
Our experiments indicated that BS loss with Mixup(BSM) model can halve the Expected Error(ECE) compared to standard data augmentation.
BSM model is able to perceive the semantic distance of out-of-domain data, demonstrating high potential in real-world clinical practice.
arXiv Detail & Related papers (2022-06-09T13:06:51Z) - UNITE: Uncertainty-based Health Risk Prediction Leveraging Multi-sourced
Data [81.00385374948125]
We present UNcertaInTy-based hEalth risk prediction (UNITE) model.
UNITE provides accurate disease risk prediction and uncertainty estimation leveraging multi-sourced health data.
We evaluate UNITE on real-world disease risk prediction tasks: nonalcoholic fatty liver disease (NASH) and Alzheimer's disease (AD)
UNITE achieves up to 0.841 in F1 score for AD detection, up to 0.609 in PR-AUC for NASH detection, and outperforms various state-of-the-art baselines by up to $19%$ over the best baseline.
arXiv Detail & Related papers (2020-10-22T02:28:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.