Related papers: How Effectively Can Large Language Models Connect SNP Variants and ECG Phenotypes for Cardiovascular Risk Prediction?

How Effectively Can Large Language Models Connect SNP Variants and ECG Phenotypes for Cardiovascular Risk Prediction?

URL: http://arxiv.org/abs/2508.07127v1
Date: Sun, 10 Aug 2025 00:19:29 GMT
Title: How Effectively Can Large Language Models Connect SNP Variants and ECG Phenotypes for Cardiovascular Risk Prediction?
Authors: Niranjana Arun Menon, Iqra Farooq, Yulong Li, Sara Ahmed, Yutong Xie, Muhammad Awais, Imran Razzak,
Abstract summary: We explore the potential of fine-tuned LLMs to predict cardiac diseases and SNPs.<n>We evaluate how LLMs can learn latent biological relationships from structured and semi-structured genomic data.<n>The findings highlight the promise of LLMs in contributing to early detection, risk assessment, and ultimately, the advancement of personalized medicine in cardiac care.
Score: 20.329484401428815
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Cardiovascular disease (CVD) prediction remains a tremendous challenge due to its multifactorial etiology and global burden of morbidity and mortality. Despite the growing availability of genomic and electrophysiological data, extracting biologically meaningful insights from such high-dimensional, noisy, and sparsely annotated datasets remains a non-trivial task. Recently, LLMs has been applied effectively to predict structural variations in biological sequences. In this work, we explore the potential of fine-tuned LLMs to predict cardiac diseases and SNPs potentially leading to CVD risk using genetic markers derived from high-throughput genomic profiling. We investigate the effect of genetic patterns associated with cardiac conditions and evaluate how LLMs can learn latent biological relationships from structured and semi-structured genomic data obtained by mapping genetic aspects that are inherited from the family tree. By framing the problem as a Chain of Thought (CoT) reasoning task, the models are prompted to generate disease labels and articulate informed clinical deductions across diverse patient profiles and phenotypes. The findings highlight the promise of LLMs in contributing to early detection, risk assessment, and ultimately, the advancement of personalized medicine in cardiac care.

Related papers

Causal and Federated Multimodal Learning for Cardiovascular Risk Prediction under Heterogeneous Populations [0.0]
We create a single multimodal learning framework that integrates cross modal transformers with graph neural networks and causal representation learning to measure personalized CVD risk.<n>The model combines genomic variation, cardiac MRI, ECG waveforms, wearable streams, and structured EHR data to predict risk.<n>This study paves the way for a principled approach to clinically trustworthy, interpretable and privacy respecting CVD prediction at the population level.
arXiv Detail & Related papers (2026-01-05T14:32:49Z)
R-GenIMA: Integrating Neuroimaging and Genetics with Interpretable Multimodal AI for Alzheimer's Disease Progression [63.97617759805451]
Early detection of Alzheimer's disease requires models capable of integrating macro-scale neuroanatomical alterations with micro-scale genetic susceptibility.<n>We introduce R-GenIMA, an interpretable multimodal large language model that couples a novel ROI-wise vision transformer with genetic prompting.<n>R-GenIMA achieves state-of-the-art performance in four-way classification across normal cognition, subjective memory concerns, mild cognitive impairment, and AD.
arXiv Detail & Related papers (2025-12-22T02:54:10Z)
Few-Label Multimodal Modeling of SNP Variants and ECG Phenotypes Using Large Language Models for Cardiovascular Risk Stratification [21.890853284710776]
We present a few-label multimodal framework to combine genetic and electrophysiological information for cardiovascular risk stratification.<n>We frame the task as a Chain of Thought (CoT) reasoning problem, prompting the model to produce clinically relevant rationales alongside predictions.<n> Experimental results demonstrate that the integration of multimodal inputs, few-label supervision, and CoT reasoning improves robustness and generalizability across diverse patient profiles.
arXiv Detail & Related papers (2025-10-18T15:19:35Z)
Adaptable Cardiovascular Disease Risk Prediction from Heterogeneous Data using Large Language Models [70.64969663547703]
AdaCVD is an adaptable CVD risk prediction framework built on large language models extensively fine-tuned on over half a million participants from the UK Biobank.<n>It addresses key clinical challenges across three dimensions: it flexibly incorporates comprehensive yet variable patient information; it seamlessly integrates both structured data and unstructured text; and it rapidly adapts to new patient populations using minimal additional data.
arXiv Detail & Related papers (2025-05-30T14:42:02Z)
Multi-modal Integration Analysis of Alzheimer's Disease Using Large Language Models and Knowledge Graphs [0.33554367023486936]
We propose a novel framework for integrating fragmented multi-modal data in Alzheimer's disease (AD) research using large language models (LLMs) and knowledge graphs.<n>Our approach demonstrates population-level integration of MRI, gene expression, biomarkers, EEG, and clinical indicators from independent cohorts.
arXiv Detail & Related papers (2025-05-21T16:51:49Z)
Improving Diseases Predictions Utilizing External Bio-Banks [1.9336815376402723]
We demonstrate how machine learning can be leveraged to enhance explainability and uncover biologically meaningful associations.<n>We train LightGBM models from scratch on our dataset (10K) to impute metabolomics features.<n>The imputed metabolomics features are then used in survival analysis to assess their impact on disease-related risk factors.
arXiv Detail & Related papers (2025-03-30T13:05:20Z)
GENERator: A Long-Context Generative Genomic Foundation Model [66.46537421135996]
We present GENERator, a generative genomic foundation model featuring a context length of 98k base pairs (bp) and 1.2B parameters.<n>Trained on an expansive dataset comprising 386B bp of DNA, the GENERator demonstrates state-of-the-art performance across both established and newly proposed benchmarks.<n>It also shows significant promise in sequence optimization, particularly through the prompt-responsive generation of enhancer sequences with specific activity profiles.
arXiv Detail & Related papers (2025-02-11T05:39:49Z)
Penalized Deep Partially Linear Cox Models with Application to CT Scans of Lung Cancer Patients [42.09584755334577]
Lung cancer is a leading cause of cancer mortality globally, highlighting the importance of understanding its mortality risks to design effective therapies. The National Lung Screening Trial (NLST) employed computed tomography texture analysis to quantify the mortality risks of lung cancer patients. We propose a novel Penalized Deep Partially Linear Cox Model (Penalized DPLC), which incorporates the SCAD penalty to select important texture features and employs a deep neural network to estimate the nonparametric component of the model.
arXiv Detail & Related papers (2023-03-09T15:38:16Z)
Deep Learning of Semi-Competing Risk Data via a New Neural Expectation-Maximization Algorithm [5.253100011321437]
Our motivation comes from the Boston Lung Cancer Study, which investigates how risk factors influence a patient's disease trajectory. We propose a novel neural expectation-maximization algorithm to bridge the gap between classical statistical approaches and machine learning.
arXiv Detail & Related papers (2022-12-22T20:38:57Z)
COVID-Net Biochem: An Explainability-driven Framework to Building Machine Learning Models for Predicting Survival and Kidney Injury of COVID-19 Patients from Clinical and Biochemistry Data [66.43957431843324]
We introduce COVID-Net Biochem, a versatile and explainable framework for constructing machine learning models. We apply this framework to predict COVID-19 patient survival and the likelihood of developing Acute Kidney Injury during hospitalization.
arXiv Detail & Related papers (2022-04-24T07:38:37Z)
SurvLatent ODE : A Neural ODE based time-to-event model with competing risks for longitudinal data improves cancer-associated Deep Vein Thrombosis (DVT) prediction [68.8204255655161]
We propose a generative time-to-event model, SurvLatent ODE, which parameterizes a latent representation under irregularly sampled data. Our model then utilizes the latent representation to flexibly estimate survival times for multiple competing events without specifying shapes of event-specific hazard function. SurvLatent ODE outperforms the current clinical standard Khorana Risk scores for stratifying DVT risk groups.
arXiv Detail & Related papers (2022-04-20T17:28:08Z)
Low-Rank Reorganization via Proportional Hazards Non-negative Matrix Factorization Unveils Survival Associated Gene Clusters [9.773075235189525]
In this work, Cox proportional hazards regression is integrated with NMF by imposing survival constraints. Using human cancer gene expression data, the proposed technique can unravel critical clusters of cancer genes. The discovered gene clusters reflect rich biological implications and can help identify survival-related biomarkers.
arXiv Detail & Related papers (2020-08-09T17:59:30Z)
Trajectories, bifurcations and pseudotime in large clinical datasets: applications to myocardial infarction and diabetes data [94.37521840642141]
We suggest a semi-supervised methodology for the analysis of large clinical datasets, characterized by mixed data types and missing values. The methodology is based on application of elastic principal graphs which can address simultaneously the tasks of dimensionality reduction, data visualization, clustering, feature selection and quantifying the geodesic distances (pseudotime) in partially ordered sequences of observations.
arXiv Detail & Related papers (2020-07-07T21:04:55Z)

This list is automatically generated from the titles and abstracts of the papers in this site.