GV-Rep: A Large-Scale Dataset for Genetic Variant Representation Learning
- URL: http://arxiv.org/abs/2407.16940v2
- Date: Thu, 05 Dec 2024 13:30:16 GMT
- Title: GV-Rep: A Large-Scale Dataset for Genetic Variant Representation Learning
- Authors: Zehui Li, Vallijah Subasri, Guy-Bart Stan, Yiren Zhao, Bo Wang,
- Abstract summary: We introduce a large-scale Genetic Variant dataset, named GV-Rep, featuring variable-length contexts and detailed annotations.<n> GV-Rep is designed for deep learning models to learn GV representations across various traits, diseases, tissue types, and experimental contexts.
- Score: 7.365168184468284
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Genetic variants (GVs) are defined as differences in the DNA sequences among individuals and play a crucial role in diagnosing and treating genetic diseases. The rapid decrease in next generation sequencing cost has led to an exponential increase in patient-level GV data. This growth poses a challenge for clinicians who must efficiently prioritize patient-specific GVs and integrate them with existing genomic databases to inform patient management. To addressing the interpretation of GVs, genomic foundation models (GFMs) have emerged. However, these models lack standardized performance assessments, leading to considerable variability in model evaluations. This poses the question: How effectively do deep learning methods classify unknown GVs and align them with clinically-verified GVs? We argue that representation learning, which transforms raw data into meaningful feature spaces, is an effective approach for addressing both indexing and classification challenges. We introduce a large-scale Genetic Variant dataset, named GV-Rep, featuring variable-length contexts and detailed annotations, designed for deep learning models to learn GV representations across various traits, diseases, tissue types, and experimental contexts. Our contributions are three-fold: (i) Construction of a comprehensive dataset with 7 million records, each labeled with characteristics of the corresponding variants, alongside additional data from 17,548 gene knockout tests across 1,107 cell types, 1,808 variant combinations, and 156 unique clinically verified GVs from real-world patients. (ii) Analysis of the structure and properties of the dataset. (iii) Experimentation of the dataset with pre-trained GFMs. The results show a significant gap between GFMs current capabilities and accurate GV representation. We hope this dataset will help advance genomic deep learning to bridge this gap.
Related papers
- scReader: Prompting Large Language Models to Interpret scRNA-seq Data [12.767105992391555]
We propose an innovative hybrid approach that integrates the general knowledge capabilities of large language models with domain-specific representation models for single-cell omics data interpretation.
By inputting single-cell gene-level expression data with prompts, we effectively model cellular representations based on the differential expression levels of genes across various species and cell types.
arXiv Detail & Related papers (2024-12-24T04:28:42Z) - A Misclassification Network-Based Method for Comparative Genomic Analysis [3.7671415694914927]
Classifying genome sequences based on metadata has been an active area of research in comparative genomics for decades.
In this study, we integrate AI and network science approaches to develop a comparative genomic analysis framework.
arXiv Detail & Related papers (2024-12-09T23:22:15Z) - Integrating Large Language Models for Genetic Variant Classification [12.244115429231888]
Large Language Models (LLMs) have emerged as transformative tools in genetics.
This study investigates the integration of state-of-the-art LLMs, including GPN-MSA, ESM1b, and AlphaMissense.
Our approach evaluates these integrated models using the well-annotated ProteinGym and ClinVar datasets.
arXiv Detail & Related papers (2024-11-07T13:45:56Z) - Weighted Diversified Sampling for Efficient Data-Driven Single-Cell Gene-Gene Interaction Discovery [56.622854875204645]
We present an innovative approach utilizing data-driven computational tools, leveraging an advanced Transformer model, to unearth gene-gene interactions.
A novel weighted diversified sampling algorithm computes the diversity score of each data sample in just two passes of the dataset.
arXiv Detail & Related papers (2024-10-21T03:35:23Z) - Embedding-based Multimodal Learning on Pan-Squamous Cell Carcinomas for Improved Survival Outcomes [0.0]
PARADIGM is a framework that learns from multimodal, heterogeneous datasets to improve clinical outcome prediction.
We train GNNs on pan-Squamous Cell Carcinomas and validate our approach on Moffitt Cancer Center lung SCC data.
Our solution aims to understand the patient's circumstances comprehensively, offering insights on heterogeneous data integration and the benefits of converging maximum data views.
arXiv Detail & Related papers (2024-06-11T22:19:14Z) - Unlocking the Power of Multi-institutional Data: Integrating and Harmonizing Genomic Data Across Institutions [3.5489676012585236]
We introduce the Bridge model to derive integrated features to preserve information beyond common genes.
The model consistently excels in predicting patient survival across six cancer types in GENIE BPC data.
arXiv Detail & Related papers (2024-01-30T23:25:05Z) - Single-Cell Deep Clustering Method Assisted by Exogenous Gene
Information: A Novel Approach to Identifying Cell Types [50.55583697209676]
We develop an attention-enhanced graph autoencoder, which is designed to efficiently capture the topological features between cells.
During the clustering process, we integrated both sets of information and reconstructed the features of both cells and genes to generate a discriminative representation.
This research offers enhanced insights into the characteristics and distribution of cells, thereby laying the groundwork for early diagnosis and treatment of diseases.
arXiv Detail & Related papers (2023-11-28T09:14:55Z) - Genetic InfoMax: Exploring Mutual Information Maximization in
High-Dimensional Imaging Genetics Studies [50.11449968854487]
Genome-wide association studies (GWAS) are used to identify relationships between genetic variations and specific traits.
Representation learning for imaging genetics is largely under-explored due to the unique challenges posed by GWAS.
We introduce a trans-modal learning framework Genetic InfoMax (GIM) to address the specific challenges of GWAS.
arXiv Detail & Related papers (2023-09-26T03:59:21Z) - Epigenomic language models powered by Cerebras [0.0]
Epigenomic BERT (or EBERT) learns representations based on both DNA sequence and paired epigenetic state inputs.
We show EBERT's transfer learning potential by demonstrating strong performance on a cell type-specific transcription factor binding prediction task.
Our fine-tuned model exceeds state of the art performance on 4 of 13 evaluation datasets from ENCODE-DREAM benchmarks and earns an overall rank of 3rd on the challenge leaderboard.
arXiv Detail & Related papers (2021-12-14T17:23:42Z) - Multi-modal Self-supervised Pre-training for Regulatory Genome Across
Cell Types [75.65676405302105]
We propose a simple yet effective approach for pre-training genome data in a multi-modal and self-supervised manner, which we call GeneBERT.
We pre-train our model on the ATAC-seq dataset with 17 million genome sequences.
arXiv Detail & Related papers (2021-10-11T12:48:44Z) - Bootstrapping Your Own Positive Sample: Contrastive Learning With
Electronic Health Record Data [62.29031007761901]
This paper proposes a novel contrastive regularized clinical classification model.
We introduce two unique positive sampling strategies specifically tailored for EHR data.
Our framework yields highly competitive experimental results in predicting the mortality risk on real-world COVID-19 EHR data.
arXiv Detail & Related papers (2021-04-07T06:02:04Z) - Select-ProtoNet: Learning to Select for Few-Shot Disease Subtype
Prediction [55.94378672172967]
We focus on few-shot disease subtype prediction problem, identifying subgroups of similar patients.
We introduce meta learning techniques to develop a new model, which can extract the common experience or knowledge from interrelated clinical tasks.
Our new model is built upon a carefully designed meta-learner, called Prototypical Network, that is a simple yet effective meta learning machine for few-shot image classification.
arXiv Detail & Related papers (2020-09-02T02:50:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.