Integrated Analysis for Electronic Health Records with Structured and Sporadic Missingness
- URL: http://arxiv.org/abs/2506.09208v2
- Date: Fri, 10 Oct 2025 17:29:04 GMT
- Title: Integrated Analysis for Electronic Health Records with Structured and Sporadic Missingness
- Authors: Jianbin Tan, Yan Zhang, Chuan Hong, T. Tony Cai, Tianxi Cai, Anru R. Zhang
- Abstract summary: We propose a novel imputation method tailored for Electronic Health Records (EHRs) with structured and sporadic missingness. By addressing these gaps, our method provides a practical solution for integrated analysis, enhancing data utility and advancing the understanding of population health.
- Score: 14.824094401799556
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Objectives: We propose a novel imputation method tailored for Electronic Health Records (EHRs) with structured and sporadic missingness. Such missingness frequently arises in the integration of heterogeneous EHR datasets for downstream clinical applications. By addressing these gaps, our method provides a practical solution for integrated analysis, enhancing data utility and advancing the understanding of population health. Materials and Methods: We begin by demonstrating structured and sporadic missing mechanisms in the integrated analysis of EHR data. Following this, we introduce a novel imputation framework, Macomss, specifically designed to handle structurally and heterogeneously occurring missing data. We establish theoretical guarantees for Macomss, ensuring its robustness in preserving the integrity and reliability of integrated analyses. To assess its empirical performance, we conduct extensive simulation studies that replicate the complex missingness patterns observed in real-world EHR systems, complemented by validation using EHR datasets from the Duke University Health System (DUHS). Results: Simulation studies show that our approach consistently outperforms existing imputation methods. Using datasets from three hospitals within DUHS, Macomss achieves the lowest imputation errors for missing data in most cases and provides superior or comparable downstream prediction performance compared to benchmark methods. Conclusions: We provide a theoretically guaranteed and practically meaningful method for imputing structured and sporadic missing data, enabling accurate and reliable integrated analysis across multiple EHR datasets. The proposed approach holds significant potential for advancing research in population health.
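To make the two missingness patterns named in the abstract concrete, the following is a minimal sketch in Python. It is not the Macomss method and is not taken from the paper. It only builds a toy observation mask in which one block of rows and columns is entirely absent (structured missingness, as when a hospital never records certain variables) and individual entries drop out at random (sporadic missingness). The function name `make_missing_mask` and all of its parameters are hypothetical.

```python
import numpy as np


def make_missing_mask(n_rows, n_cols, block_rows, block_cols, sporadic_rate, seed=0):
    """Return a boolean mask (True = observed) for a toy data matrix.

    Hypothetical illustration only; this is not the paper's algorithm.
    """
    rng = np.random.default_rng(seed)
    observed = np.ones((n_rows, n_cols), dtype=bool)

    # Structured missingness: every entry in the chosen rows x columns
    # block is unobserved.
    observed[np.ix_(block_rows, block_cols)] = False

    # Sporadic missingness: each entry is independently dropped with
    # probability sporadic_rate.
    sporadic = rng.random((n_rows, n_cols)) < sporadic_rate
    observed &= ~sporadic
    return observed


# Toy example: 6 patients x 5 variables; the last three patients never
# have the last two variables recorded.
mask = make_missing_mask(
    6, 5, block_rows=[3, 4, 5], block_cols=[3, 4], sporadic_rate=0.1
)
```

An imputation method for this setting would take the data matrix together with such a mask and fill in the entries where the mask is False.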
Related papers
- Generative Conditional Missing Imputation Networks [0.0]
We introduce a sophisticated generative conditional strategy designed to impute missing values within datasets. Specifically, we elucidate the theoretical underpinnings of the Generative Conditional Missing Imputation Networks (GCMI). We enhance the robustness and accuracy of GCMI by integrating a multiple imputation framework using a chained equations approach.
arXiv Detail & Related papers (2026-01-02T00:39:12Z) - DANIEL: A Distributed and Scalable Approach for Global Representation Learning with EHR Applications [8.530466871734564]
Probabilistic clustering models face fundamental challenges in modern data environments. We develop a distributed framework that enables scalable and privacy-preserving representation learning from binary data. We evaluate our algorithm on multi-institutional electronic health record (EHR) datasets.
arXiv Detail & Related papers (2025-11-04T17:35:12Z) - Integrating Genomics into Multimodal EHR Foundation Models [56.31910745104141]
This paper introduces an innovative EHR foundation model that integrates Polygenic Risk Scores (PRS) as a foundational data modality. The framework aims to learn complex relationships between clinical data and genetic predispositions. This approach is pivotal for unlocking new insights into disease prediction, proactive health management, risk stratification, and personalized treatment strategies.
arXiv Detail & Related papers (2025-10-24T15:56:40Z) - Improving the Generation and Evaluation of Synthetic Data for Downstream Medical Causal Inference [89.5628648718851]
Causal inference is essential for developing and evaluating medical interventions. Real-world medical datasets are often difficult to access due to regulatory barriers. We present STEAM: a novel method for generating Synthetic data for Treatment Effect Analysis in Medicine.
arXiv Detail & Related papers (2025-10-21T16:16:00Z) - Analytical Survey of Learning with Low-Resource Data: From Analysis to Investigation [192.53529928861818]
Learning with high-resource data has demonstrated substantial success in artificial intelligence (AI). However, the costs associated with data annotation and model training remain significant. This survey employs active sampling theory to analyze the generalization error and label complexity associated with learning from low-resource data.
arXiv Detail & Related papers (2025-10-10T03:15:42Z) - Graph-Convolutional-Beta-VAE for Synthetic Abdominal Aorta Aneurysm Generation [4.363232795241618]
This study presents a beta-Variational Autoencoder Graph Convolutional Neural Network framework for generating synthetic Abdominal Aorta Aneurysms (AAA). Our approach extracts key anatomical features and captures complex statistical relationships within a compact disentangled latent space. The resulting synthetic AAA dataset preserves patient privacy while providing a scalable foundation for medical research, device testing, and computational modeling.
arXiv Detail & Related papers (2025-06-16T15:55:56Z) - Robust Molecular Property Prediction via Densifying Scarce Labeled Data [51.55434084913129]
In drug discovery, compounds most critical for advancing research often lie beyond the training set. We propose a novel meta-learning-based approach that leverages unlabeled data to interpolate between in-distribution (ID) and out-of-distribution (OOD) data. We demonstrate significant performance gains on challenging real-world datasets.
arXiv Detail & Related papers (2025-06-13T15:27:40Z) - Semi-supervised Clustering Through Representation Learning of Large-scale EHR Data [5.591260685112265]
SCORE is a semi-supervised representation learning framework that captures multi-domain disease profiles through patient embeddings. To handle the computational challenges of large-scale data, it introduces a hybrid Expectation-Maximization (EM) and Gaussian Variational Approximation (GVA) algorithm. Our analysis shows that incorporating unlabeled data enhances accuracy and reduces sensitivity to label scarcity.
arXiv Detail & Related papers (2025-05-27T05:20:17Z) - Targeted Data Fusion for Causal Survival Analysis Under Distribution Shift [46.84912148188679]
Causal inference across multiple data sources offers a promising avenue to enhance the generalizability and replicability of scientific findings. Existing approaches fail to address the unique challenges of survival analysis, such as censoring and the integration of discrete and continuous time. We propose two novel methods for estimating target site-specific causal effects in multi-source settings.
arXiv Detail & Related papers (2025-01-30T23:21:25Z) - Precision Adaptive Imputation Network : An Unified Technique for Mixed Datasets [0.0]
This study introduces the Precision Adaptive Imputation Network (PAIN), a novel algorithm designed to enhance data reconstruction. PAIN employs a tri-step process that integrates statistical methods, random forests, and autoencoders, ensuring balanced accuracy and efficiency in imputation. The findings highlight PAIN's superior ability to preserve data distributions and maintain analytical integrity, particularly in complex scenarios where missingness is not completely at random.
arXiv Detail & Related papers (2025-01-18T06:22:27Z) - Machine Learning for ALSFRS-R Score Prediction: Making Sense of the Sensor Data [44.99833362998488]
Amyotrophic Lateral Sclerosis (ALS) is a rapidly progressive neurodegenerative disease that presents individuals with limited treatment options.
The present investigation, spearheaded by the iDPP@CLEF 2024 challenge, focuses on utilizing sensor-derived data obtained through an app.
arXiv Detail & Related papers (2024-07-10T19:17:23Z) - ICU Bloodstream Infection Prediction: A Transformer-Based Approach for EHR Analysis [0.0]
We introduce RatchetEHR, a novel framework designed for the predictive analysis of electronic health records (EHR) data in intensive care unit (ICU) settings.
RatchetEHR demonstrates superior predictive performance compared to other methods, including RNN, LSTM, and XGBoost.
A key innovation in RatchetEHR is the integration of the Graph Convolutional Transformer (GCT) component, which significantly enhances the ability to identify hidden structural relationships.
arXiv Detail & Related papers (2024-05-01T19:00:30Z) - Towards Biologically Plausible and Private Gene Expression Data
Generation [47.72947816788821]
Generative models trained with Differential Privacy (DP) are becoming increasingly prominent in the creation of synthetic data for downstream applications.
Existing literature, however, primarily focuses on basic benchmarking datasets and tends to report promising results only for elementary metrics and relatively simple data distributions.
We initiate a systematic analysis of how DP generative models perform in their natural application scenarios, specifically focusing on real-world gene expression data.
arXiv Detail & Related papers (2024-02-07T14:39:11Z) - Missing Data Imputation Based on Dynamically Adaptable Structural Equation Modeling with Self-Attention [2.250580490933205]
This paper proposes dynamically adaptable structural equation modeling (SEM) using a self-attention method (SESA).
SESA innovates beyond traditional SEM-based methods by incorporating self-attention mechanisms.
Our experimental analyses demonstrate that SESA achieves robust predictive performance in handling missing data in EHR.
arXiv Detail & Related papers (2023-08-23T19:01:17Z) - CEDAR: Communication Efficient Distributed Analysis for Regressions [9.50726756006467]
There is growing interest in distributed learning over multiple EHR databases without sharing patient-level data.
We propose a novel communication efficient method that aggregates the local optimal estimates, by turning the problem into a missing data problem.
We provide theoretical investigation for the properties of the proposed method for statistical inference as well as differential privacy, and evaluate its performance in simulations and real data analyses.
arXiv Detail & Related papers (2022-07-01T09:53:44Z) - Reinforcement Learning with Heterogeneous Data: Estimation and Inference [84.72174994749305]
We introduce the K-Heterogeneous Markov Decision Process (K-Hetero MDP) to address sequential decision problems with population heterogeneity.
We propose the Auto-Clustered Policy Evaluation (ACPE) for estimating the value of a given policy, and the Auto-Clustered Policy Iteration (ACPI) for estimating the optimal policy in a given policy class.
We present simulations to support our theoretical findings, and we conduct an empirical study on the standard MIMIC-III dataset.
arXiv Detail & Related papers (2022-01-31T20:58:47Z) - MIRACLE: Causally-Aware Imputation via Learning Missing Data Mechanisms [82.90843777097606]
We propose a causally-aware imputation algorithm (MIRACLE) for missing data.
MIRACLE iteratively refines the imputation of a baseline by simultaneously modeling the missingness generating mechanism.
We conduct extensive experiments on synthetic and a variety of publicly available datasets to show that MIRACLE is able to consistently improve imputation.
arXiv Detail & Related papers (2021-11-04T22:38:18Z) - Bootstrapping Your Own Positive Sample: Contrastive Learning With Electronic Health Record Data [62.29031007761901]
This paper proposes a novel contrastive regularized clinical classification model.
We introduce two unique positive sampling strategies specifically tailored for EHR data.
Our framework yields highly competitive experimental results in predicting the mortality risk on real-world COVID-19 EHR data.
arXiv Detail & Related papers (2021-04-07T06:02:04Z) - UNITE: Uncertainty-based Health Risk Prediction Leveraging Multi-sourced Data [81.00385374948125]
We present the UNcertaInTy-based hEalth risk prediction (UNITE) model.
UNITE provides accurate disease risk prediction and uncertainty estimation leveraging multi-sourced health data.
We evaluate UNITE on real-world disease risk prediction tasks: nonalcoholic fatty liver disease (NASH) and Alzheimer's disease (AD).
UNITE achieves up to 0.841 in F1 score for AD detection and up to 0.609 in PR-AUC for NASH detection, improving on the best of various state-of-the-art baselines by up to 19%.
arXiv Detail & Related papers (2020-10-22T02:28:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.