Related papers: Semi-supervised Clustering Through Representation Learning of Large-scale EHR Data

URL: http://arxiv.org/abs/2505.20731v1
Date: Tue, 27 May 2025 05:20:17 GMT
Title: Semi-supervised Clustering Through Representation Learning of Large-scale EHR Data
Authors: Linshanshan Wang, Mengyan Li, Zongqi Xia, Molei Liu, Tianxi Cai
Abstract summary: SCORE is a semi-supervised representation learning framework that captures multi-domain disease profiles through patient embeddings. To handle the computational challenges of large-scale data, it introduces a hybrid Expectation-Maximization (EM) and Gaussian Variational Approximation (GVA) algorithm. Our analysis shows that incorporating unlabeled data enhances accuracy and reduces sensitivity to label scarcity.
Score: 5.591260685112265
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Electronic Health Records (EHR) offer rich real-world data for personalized medicine, providing insights into disease progression, treatment responses, and patient outcomes. However, their sparsity, heterogeneity, and high dimensionality make them difficult to model, while the lack of standardized ground truth further complicates predictive modeling. To address these challenges, we propose SCORE, a semi-supervised representation learning framework that captures multi-domain disease profiles through patient embeddings. SCORE employs a Poisson-Adapted Latent factor Mixture (PALM) Model with pre-trained code embeddings to characterize codified features and extract meaningful patient phenotypes and embeddings. To handle the computational challenges of large-scale data, it introduces a hybrid Expectation-Maximization (EM) and Gaussian Variational Approximation (GVA) algorithm, leveraging limited labeled data to refine estimates on a vast pool of unlabeled samples. We theoretically establish the convergence of this hybrid approach, quantify GVA errors, and derive SCORE's error rate under diverging embedding dimensions. Our analysis shows that incorporating unlabeled data enhances accuracy and reduces sensitivity to label scarcity. Extensive simulations confirm SCORE's superior finite-sample performance over existing methods. Finally, we apply SCORE to predict disability status for patients with multiple sclerosis (MS) using partially labeled EHR data, demonstrating that it produces more informative and predictive patient embeddings for multiple MS-related conditions compared to existing approaches.
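Illustrative sketch (not SCORE's algorithm): the abstract describes using limited labels to refine mixture estimates on a large pool of unlabeled count data. The idea can be shown, under strong simplifying assumptions, with a plain semi-supervised EM for a Poisson mixture. The example omits the latent factors, the pre-trained code embeddings, and the GVA step entirely. The function name `semi_supervised_poisson_em`, the `-1` convention for unlabeled rows, and all parameters are hypothetical choices made for this example, and it assumes every cluster has at least one labeled row.

```python
import numpy as np

def semi_supervised_poisson_em(X, y, n_clusters, n_iter=50, eps=1e-8):
    """EM for a Poisson mixture; y[i] = -1 marks an unlabeled row (assumed convention)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n = X.shape[0]
    labeled = y >= 0
    onehot = np.eye(n_clusters)[y[labeled]]

    # Start with one-hot responsibilities for labeled rows, uniform for the rest.
    resp = np.full((n, n_clusters), 1.0 / n_clusters)
    resp[labeled] = onehot

    for _ in range(n_iter):
        # M-step: mixing weights and per-cluster Poisson rates.
        weight = resp.sum(axis=0) + eps
        pi = weight / weight.sum()
        lam = (resp.T @ X + eps) / weight[:, None]

        # E-step: posterior cluster probabilities (log x! cancels across clusters).
        log_post = np.log(pi) + X @ np.log(lam).T - lam.sum(axis=1)
        log_post -= log_post.max(axis=1, keepdims=True)
        resp = np.exp(log_post)
        resp /= resp.sum(axis=1, keepdims=True)

        # Labeled rows keep their known cluster; this is the semi-supervised part.
        resp[labeled] = onehot
    return pi, lam, resp
```

Clamping the labeled rows after each E-step anchors every cluster to a known class, so the unlabeled rows sharpen the rate estimates without the label-switching ambiguity of fully unsupervised EM.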

Related papers

A Contrastive Variational AutoEncoder for NSCLC Survival Prediction with Missing Modalities [41.8469011437549]
Predicting survival outcomes for non-small cell lung cancer (NSCLC) patients is challenging due to the different individual prognostic features. State-of-the-art models rely on available data to create patient-level representations or use generative models to infer missing modalities. We propose a Multimodal Contrastive Variational AutoEncoder (MCVAE) to address this issue.
arXiv Detail & Related papers (2026-02-19T14:29:34Z)
Pretraining Transformer-Based Models on Diffusion-Generated Synthetic Graphs for Alzheimer's Disease Prediction [0.0]
We propose a Transformer-based diagnostic framework that combines synthetic data generation with graph representation learning and transfer learning. A class-conditional denoising diffusion probabilistic model (DDPM) is trained on the real-world NACC dataset to generate a large synthetic cohort. Modality-specific Graph Transformer encoders are first pretrained on this synthetic data to learn robust, class-discriminative representations.
arXiv Detail & Related papers (2025-11-24T19:34:53Z)
DANIEL: A Distributed and Scalable Approach for Global Representation Learning with EHR Applications [8.530466871734564]
Probabilistic clustering models face fundamental challenges in modern data environments. We develop a distributed framework that enables scalable and privacy-preserving representation learning from binary data. We evaluate our algorithm on multi-institutional electronic health record (EHR) datasets.
arXiv Detail & Related papers (2025-11-04T17:35:12Z)
Learning Robust Diffusion Models from Imprecise Supervision [75.53546939251146]
DMIS is a unified framework for training robust Conditional Diffusion Models from Imprecise Supervision. Our framework is derived from likelihood and decomposes the objective into generative and classification components. Experiments on diverse forms of imprecise supervision, covering image generation, weakly supervised learning, and dataset condensation, demonstrate that DMIS consistently produces high-quality and class-discriminative samples.
arXiv Detail & Related papers (2025-10-03T14:00:32Z)
Robust Molecular Property Prediction via Densifying Scarce Labeled Data [51.55434084913129]
In drug discovery, compounds most critical for advancing research often lie beyond the training set. We propose a novel meta-learning-based approach that leverages unlabeled data to interpolate between in-distribution (ID) and out-of-distribution (OOD) data. We demonstrate significant performance gains on challenging real-world datasets.
arXiv Detail & Related papers (2025-06-13T15:27:40Z)
MedDiffusion: Boosting Health Risk Prediction via Diffusion-based Data Augmentation [58.93221876843639]
This paper introduces a novel, end-to-end diffusion-based risk prediction model, named MedDiffusion. It enhances risk prediction performance by creating synthetic patient data during training to enlarge sample space. It discerns hidden relationships between patient visits using a step-wise attention mechanism, enabling the model to automatically retain the most vital information for generating high-quality data.
arXiv Detail & Related papers (2023-10-04T01:36:30Z)
PRISM: Mitigating EHR Data Sparsity via Learning from Missing Feature Calibrated Prototype Patient Representations [7.075420686441701]
PRISM is a framework that indirectly imputes data by leveraging prototype representations of similar patients. PRISM also includes a feature confidence module, which evaluates the reliability of each feature considering missing statuses. Our experiments on the MIMIC-III, MIMIC-IV, PhysioNet Challenge 2012, and eICU datasets demonstrate PRISM's superior performance in predicting in-hospital mortality and 30-day readmission tasks.
arXiv Detail & Related papers (2023-09-08T07:01:38Z)
ArSDM: Colonoscopy Images Synthesis with Adaptive Refinement Semantic Diffusion Models [69.9178140563928]
Colonoscopy analysis is essential for assisting clinical diagnosis and treatment. The scarcity of annotated data limits the effectiveness and generalization of existing methods. We propose an Adaptive Refinement Semantic Diffusion Model (ArSDM) to generate colonoscopy images that benefit the downstream tasks.
arXiv Detail & Related papers (2023-09-03T07:55:46Z)
Deep Stable Representation Learning on Electronic Health Records [8.256340233221112]
Causal Healthcare Embedding (CHE) aims at eliminating the spurious statistical relationship by removing the dependencies between diagnoses and procedures. Our proposed CHE method can be used as a flexible plug-and-play module that can enhance existing deep learning models on EHR.
arXiv Detail & Related papers (2022-09-03T04:10:45Z)
Differentiable Agent-based Epidemiology [71.81552021144589]
We introduce GradABM: a scalable, differentiable design for agent-based modeling that is amenable to gradient-based learning with automatic differentiation. GradABM can quickly simulate million-size populations in a few seconds on commodity hardware, integrate with deep neural networks, and ingest heterogeneous data sources.
arXiv Detail & Related papers (2022-07-20T07:32:02Z)
A Statistics and Deep Learning Hybrid Method for Multivariate Time Series Forecasting and Mortality Modeling [0.0]
Exponential Smoothing Recurrent Neural Network (ES-RNN) is a hybrid between a statistical forecasting model and a recurrent neural network variant. ES-RNN achieves a 9.4% improvement in absolute error in the Makridakis-4 Forecasting Competition.
arXiv Detail & Related papers (2021-12-16T04:44:19Z)
Bootstrapping Your Own Positive Sample: Contrastive Learning With Electronic Health Record Data [62.29031007761901]
This paper proposes a novel contrastive regularized clinical classification model. We introduce two unique positive sampling strategies specifically tailored for EHR data. Our framework yields highly competitive experimental results in predicting the mortality risk on real-world COVID-19 EHR data.
arXiv Detail & Related papers (2021-04-07T06:02:04Z)
Handling Non-ignorably Missing Features in Electronic Health Records Data Using Importance-Weighted Autoencoders [8.518166245293703]
We propose a novel extension of VAEs called Importance-Weighted Autoencoders (IWAEs) to flexibly handle Missing Not At Random patterns in the Physionet data. Our proposed method models the missingness mechanism using an embedded neural network, eliminating the need to specify the exact form of the missingness mechanism a priori.
arXiv Detail & Related papers (2021-01-18T22:53:29Z)
UNITE: Uncertainty-based Health Risk Prediction Leveraging Multi-sourced Data [81.00385374948125]
We present the UNcertaInTy-based hEalth risk prediction (UNITE) model. UNITE provides accurate disease risk prediction and uncertainty estimation leveraging multi-sourced health data. We evaluate UNITE on real-world disease risk prediction tasks: nonalcoholic fatty liver disease (NASH) and Alzheimer's disease (AD). UNITE achieves up to 0.841 in F1 score for AD detection and up to 0.609 in PR-AUC for NASH detection, and outperforms the best state-of-the-art baseline by up to 19%.
arXiv Detail & Related papers (2020-10-22T02:28:11Z)

This list is automatically generated from the titles and abstracts of the papers on this site.