Enhancing Phenotype Discovery in Electronic Health Records through Prior Knowledge-Guided Unsupervised Learning
- URL: http://arxiv.org/abs/2511.02102v1
- Date: Mon, 03 Nov 2025 22:25:07 GMT
- Title: Enhancing Phenotype Discovery in Electronic Health Records through Prior Knowledge-Guided Unsupervised Learning
- Authors: Melanie Mayer, Kimberly Lactaoen, Gary E. Weissman, Blanca E. Himes, Rebecca A. Hubbard,
- Abstract summary: We operationalize a Bayesian latent class framework for phenotyping.<n>We identify an asthma sub-phenotype informed by features of Type 2 (T2) inflammation.
- Score: 1.0414011156637284
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Objectives: Unsupervised learning with electronic health record (EHR) data has shown promise for phenotype discovery, but approaches typically disregard existing clinical information, limiting interpretability. We operationalize a Bayesian latent class framework for phenotyping that incorporates domain-specific knowledge to improve clinical meaningfulness of EHR-derived phenotypes and illustrate its utility by identifying an asthma sub-phenotype informed by features of Type 2 (T2) inflammation. Materials and methods: We illustrate a framework for incorporating clinical knowledge into a Bayesian latent class model via informative priors to guide unsupervised clustering toward clinically relevant subgroups. This approach models missingness, accounting for potential missing-not-at-random patterns, and provides patient-level probabilities for phenotype assignment with uncertainty. Using reusable and flexible code, we applied the model to a large asthma EHR cohort, specifying informative priors for T2 inflammation-related features and weakly informative priors for other clinical variables, allowing the data to inform posterior distributions. Results and Conclusion: Using encounter data from January 2017 to February 2024 for 44,642 adult asthma patients, we found a bimodal posterior distribution of phenotype assignment, indicating clear class separation. The T2 inflammation-informed class (38.7%) was characterized by elevated eosinophil levels and allergy markers, plus high healthcare utilization and medication use, despite weakly informative priors on the latter variables. These patterns suggest an "uncontrolled T2-high" sub-phenotype. This demonstrates how our Bayesian latent class modeling approach supports hypothesis generation and cohort identification in EHR-based studies of heterogeneous diseases without well-established phenotype definitions.
Related papers
- Investigating the Impact of Histopathological Foundation Models on Regressive Prediction of Homologous Recombination Deficiency [52.50039435394964]
We systematically evaluate foundation models for regression-based tasks.<n>We extract patch-level features from whole slide images (WSI) using five state-of-the-art foundation models.<n>Models are trained to predict continuous HRD scores based on these extracted features across breast, endometrial, and lung cancer cohorts.
arXiv Detail & Related papers (2026-01-29T14:06:50Z) - AdaFusion: Prompt-Guided Inference with Adaptive Fusion of Pathology Foundation Models [49.550545038402184]
We propose AdaFusion, a novel prompt-guided inference framework.<n>Our method compresses and aligns tile-level features from diverse models.<n>AdaFusion consistently surpasses individual PFMs across both classification and regression tasks.
arXiv Detail & Related papers (2025-08-07T07:09:31Z) - A Weakly Supervised Transformer for Rare Disease Diagnosis and Subphenotyping from EHRs with Pulmonary Case Studies [28.253741893497136]
We propose WEST (WEakly Supervised Transformer for rare disease phenotyping and subphenotyping from EHRs) to enable large-scale phenotyping.<n>We evaluate WEST on two rare pulmonary diseases using EHR data from Boston Children's Hospital and show that it outperforms existing methods in phenotype classification, identification of clinically meaningful subphenotypes, and prediction of disease progression.
arXiv Detail & Related papers (2025-07-01T23:11:20Z) - Supervised Coupled Matrix-Tensor Factorization (SCMTF) for Computational Phenotyping of Patient Reported Outcomes in Ulcerative Colitis [1.2545963971598164]
Phenotyping is the process of distinguishing groups of patients to identify different types of disease progression.<n>Patient-reported symptoms are typically noisy, subjective, and significantly more sparse than other data types.<n>This paper explores the application of computational phenotyping to leverage Patient-Reported Outcomes (PROs) using a novel supervised coupled matrix-tensor factorization (SCMTF) method.
arXiv Detail & Related papers (2025-06-24T23:55:11Z) - On the Within-class Variation Issue in Alzheimer's Disease Detection [66.70233526150723]
Alzheimer's Disease (AD) detection employs machine learning classification models to distinguish between individuals with AD and those without.<n>In this work, we found using a sample score estimator can generate sample-specific soft scores aligning with cognitive scores.<n>We propose two simple yet effective methods: Soft Target Distillation (SoTD) and Instance-level Re-balancing (InRe)
arXiv Detail & Related papers (2024-09-22T02:06:05Z) - Seeing Unseen: Discover Novel Biomedical Concepts via
Geometry-Constrained Probabilistic Modeling [53.7117640028211]
We present a geometry-constrained probabilistic modeling treatment to resolve the identified issues.
We incorporate a suite of critical geometric properties to impose proper constraints on the layout of constructed embedding space.
A spectral graph-theoretic method is devised to estimate the number of potential novel classes.
arXiv Detail & Related papers (2024-03-02T00:56:05Z) - PheME: A deep ensemble framework for improving phenotype prediction from
multi-modal data [42.56953523499849]
We present PheME, an Ensemble framework using Multi-modality data of structured EHRs and unstructured clinical notes for accurate Phenotype prediction.
We leverage ensemble learning to combine outputs from single-modal models and multi-modal models to improve phenotype predictions.
arXiv Detail & Related papers (2023-03-19T23:41:04Z) - Prediction of drug effectiveness in rheumatoid arthritis patients based
on machine learning algorithms [2.5759046095742453]
Rheumatoid arthritis (RA) is an autoimmune condition caused when patients' immune system mistakenly targets their own tissue.
Machine learning (ML) has the potential to identify patterns in patient electronic health records to forecast the best clinical treatment to improve patient outcomes.
This study introduced a Drug Response Prediction (TNF) framework with two main goals: 1) design a data processing pipeline to extract information from clinical data, and then preprocess it for functional use, and 2) predict RA patient's responses to drugs and evaluate classification models' performance.
arXiv Detail & Related papers (2022-10-14T15:15:37Z) - Unsupervised EHR-based Phenotyping via Matrix and Tensor Decompositions [0.6875312133832078]
We provide a comprehensive review of low-rank approximation-based approaches for computational phenotyping.
Recent developments have adapted low-rank data approximation methods by incorporating different constraints and regularizations that facilitate interpretability further.
arXiv Detail & Related papers (2022-09-01T09:47:27Z) - Variational Knowledge Distillation for Disease Classification in Chest
X-Rays [102.04931207504173]
We propose itvariational knowledge distillation (VKD), which is a new probabilistic inference framework for disease classification based on X-rays.
We demonstrate the effectiveness of our method on three public benchmark datasets with paired X-ray images and EHRs.
arXiv Detail & Related papers (2021-03-19T14:13:56Z) - G-MIND: An End-to-End Multimodal Imaging-Genetics Framework for
Biomarker Identification and Disease Classification [49.53651166356737]
We propose a novel deep neural network architecture to integrate imaging and genetics data, as guided by diagnosis, that provides interpretable biomarkers.
We have evaluated our model on a population study of schizophrenia that includes two functional MRI (fMRI) paradigms and Single Nucleotide Polymorphism (SNP) data.
arXiv Detail & Related papers (2021-01-27T19:28:04Z) - Attention-based Neural Bag-of-Features Learning for Sequence Data [143.62294358378128]
2D-Attention (2DA) is a generic attention formulation for sequence data.
The proposed attention module is incorporated into the recently proposed Neural Bag of Feature (NBoF) model to enhance its learning capacity.
Our empirical analysis shows that the proposed attention formulations can not only improve performances of NBoF models but also make them resilient to noisy data.
arXiv Detail & Related papers (2020-05-25T17:51:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.