SNP2Vec: Scalable Self-Supervised Pre-Training for Genome-Wide
Association Study
- URL: http://arxiv.org/abs/2204.06699v1
- Date: Thu, 14 Apr 2022 01:53:58 GMT
- Title: SNP2Vec: Scalable Self-Supervised Pre-Training for Genome-Wide
Association Study
- Authors: Samuel Cahyawijaya, Tiezheng Yu, Zihan Liu, Tiffany T.W. Mak, Xiaopu
Zhou, Nancy Y. Ip, Pascale Fung
- Abstract summary: SNP2Vec is a scalable self-supervised pre-training approach for understanding SNPs.
We apply SNP2Vec to perform long-sequence genomics modeling.
We evaluate the effectiveness of our approach on predicting Alzheimer's disease risk in a Chinese cohort.
- Score: 48.75445626157713
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Self-supervised pre-training methods have brought remarkable breakthroughs in
the understanding of text, image, and speech. Recent developments in genomics
have also adopted these pre-training methods for genome understanding. However,
they focus only on understanding haploid sequences, which hinders their
applicability to understanding genetic variations, also known as single
nucleotide polymorphisms (SNPs), which are crucial for genome-wide association
studies. In this paper, we introduce SNP2Vec, a scalable self-supervised
pre-training approach for understanding SNPs. We apply SNP2Vec to perform
long-sequence genomics modeling, and we evaluate the effectiveness of our
approach on predicting Alzheimer's disease risk in a Chinese cohort. Our
approach significantly outperforms existing polygenic risk score methods and
all other baselines, including the model that is trained entirely with haploid
sequences. We release our code and dataset on
https://github.com/HLTCHKUST/snp2vec.
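To illustrate why haploid-only models miss SNPs, a diploid sequence can be collapsed into a single string using standard IUPAC ambiguity codes so heterozygous sites remain visible to a tokenizer. This is a minimal illustrative sketch, not SNP2Vec's actual encoding scheme:

```python
# Illustrative only: collapse two haploid sequences into one string using
# IUPAC ambiguity codes, so heterozygous SNP sites stay visible.
# SNP2Vec's actual input representation may differ.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
}

def encode_diploid(hap1: str, hap2: str) -> str:
    """Merge two equal-length haploid sequences into one IUPAC-coded string."""
    assert len(hap1) == len(hap2)
    return "".join(IUPAC[frozenset((a, b))] for a, b in zip(hap1, hap2))

print(encode_diploid("ACGT", "ACAT"))  # heterozygous G/A site -> "ACRT"
```

A haploid tokenizer sees only A/C/G/T and cannot represent the `R` site above, which is exactly the variation a GWAS model needs.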
Related papers
- U-learning for Prediction Inference via Combinatory Multi-Subsampling: With Applications to LASSO and Neural Networks [5.587500517608073]
Epigenetic aging clocks play a pivotal role in estimating an individual's biological age through the examination of DNA methylation patterns.
We introduce a novel U-sampling approach via multi-sublearning for making ensemble predictions.
More specifically, our approach conceptualizes the ensemble estimators within the framework of generalized U-statistics.
We apply our approach to two commonly used predictive algorithms, Lasso and deep neural networks (DNNs), and illustrate the validity of inferences with extensive numerical studies.
arXiv Detail & Related papers (2024-07-22T00:03:51Z)
- VQDNA: Unleashing the Power of Vector Quantization for Multi-Species Genomic Sequence Modeling [60.91599380893732]
VQDNA is a general-purpose framework that renovates genome tokenization from the perspective of genome vocabulary learning.
By leveraging vector-quantized codebooks as learnable vocabulary, VQDNA can adaptively tokenize genomes into pattern-aware embeddings.
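The core of vector-quantized tokenization is mapping each input embedding to the index of its nearest codebook entry. A toy sketch of that lookup (the codebook values here are made up, not VQDNA's learned vocabulary):

```python
# Minimal sketch of VQ tokenization: each embedding becomes the index of
# its nearest codebook vector (squared L2 distance). Toy values only.
def vq_tokenize(embeddings, codebook):
    """Map each embedding to the index of the closest codebook entry."""
    tokens = []
    for e in embeddings:
        dists = [sum((a - b) ** 2 for a, b in zip(e, c)) for c in codebook]
        tokens.append(dists.index(min(dists)))
    return tokens

codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
print(vq_tokenize([[0.1, -0.2], [0.9, 1.1]], codebook))  # [0, 1]
```

In VQDNA the codebook itself is learned end-to-end, which is what makes the resulting token vocabulary "pattern-aware" rather than fixed.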
arXiv Detail & Related papers (2024-05-13T20:15:03Z)
- Path-GPTOmic: A Balanced Multi-modal Learning Framework for Survival Outcome Prediction [14.204637932937082]
We introduce a new multi-modal "Path-GPTOmic" framework for cancer survival outcome prediction.
We regulate the embedding space of a foundation model, scGPT, initially trained on single-cell RNA-seq data.
We propose a gradient modulation mechanism tailored to the Cox partial likelihood loss for survival prediction.
arXiv Detail & Related papers (2024-03-18T00:02:48Z)
- Efficient and Scalable Fine-Tune of Language Models for Genome Understanding [49.606093223945734]
We present Lingo: Language prefix fIne-tuning for GenOmes.
Unlike DNA foundation models, Lingo strategically leverages natural language foundation models' contextual cues.
Lingo further accommodates numerous downstream fine-tuning tasks via an adaptive rank sampling method.
arXiv Detail & Related papers (2024-02-12T21:40:45Z)
- Toward Understanding BERT-Like Pre-Training for DNA Foundation Models [78.48760388079523]
Existing pre-training methods for DNA sequences rely on direct adoptions of BERT pre-training from NLP.
We introduce a novel approach called RandomMask, which gradually increases the task difficulty of BERT-like pre-training by continuously expanding its mask boundary.
RandomMask achieves a staggering 68.16% Matthews correlation coefficient on Epigenetic Mark Prediction, a groundbreaking increase of 19.85% over the baseline.
arXiv Detail & Related papers (2023-10-11T16:40:57Z)
- HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution [76.97231739317259]
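The RandomMask idea of a gradually expanding mask boundary can be sketched as a masking function whose maximum span width grows with the training step. The linear schedule and parameter names below are illustrative assumptions, not the paper's exact recipe:

```python
import random

# Hedged sketch of RandomMask-style curriculum masking: spans are sampled
# around a random center, and the maximum span width ("mask boundary")
# expands as training proceeds. The schedule here is illustrative only.
def random_mask(seq, step, max_steps, max_width=8, seed=0):
    rng = random.Random(seed)
    width = 1 + int((max_width - 1) * step / max_steps)  # boundary expands
    center = rng.randrange(len(seq))
    lo = max(0, center - width // 2)
    hi = min(len(seq), lo + width)
    return seq[:lo] + "M" * (hi - lo) + seq[hi:]

early = random_mask("ACGTACGTACGT", step=0, max_steps=100)   # narrow mask
late = random_mask("ACGTACGTACGT", step=100, max_steps=100)  # wide mask
print(early.count("M") <= late.count("M"))  # task difficulty increases
```

Widening the masked span forces the model to reconstruct longer stretches from context, which is how the pre-training task difficulty is ramped up.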
We present HyenaDNA, a genomic foundation model pretrained on the human reference genome with context lengths of up to 1 million tokens at the single nucleotide-level.
On fine-tuned benchmarks from the Nucleotide Transformer, HyenaDNA reaches state-of-the-art (SotA) on 12 of 18 datasets using a model with orders of magnitude fewer parameters and less pretraining data.
arXiv Detail & Related papers (2023-06-27T20:46:34Z)
- DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genome [10.051595222470304]
We argue that the computation and sample inefficiencies introduced by k-mer tokenization are primary obstacles in developing large genome foundational models.
We provide conceptual and empirical insights into genome tokenization, building on which we propose to replace k-mer tokenization with Byte Pair Encoding (BPE).
We introduce DNABERT-2, a refined genome foundation model that adapts an efficient tokenizer and employs multiple strategies to overcome input length constraints.
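The advantage of BPE over fixed k-mers is that frequent adjacent symbol pairs get merged into single vocabulary entries, so common motifs become one token. A toy single-merge step (DNABERT-2 uses a full BPE tokenizer; this only shows the core operation):

```python
from collections import Counter

# Toy sketch of one BPE merge step on a DNA string: the most frequent
# adjacent pair of tokens is fused into a new token. DNABERT-2's real
# tokenizer iterates this to build a whole vocabulary.
def bpe_one_merge(tokens):
    """Merge the most frequent adjacent pair into a single token."""
    pairs = Counter(zip(tokens, tokens[1:]))
    (a, b), _ = pairs.most_common(1)[0]
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
            merged.append(a + b)
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

print(bpe_one_merge(list("ACACACGT")))  # ['AC', 'AC', 'AC', 'G', 'T']
```

Unlike overlapping k-mers, the merged tokens never duplicate sequence content, which addresses the sample inefficiency the authors argue against.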
arXiv Detail & Related papers (2023-06-26T18:43:46Z)
- rfPhen2Gen: A machine learning based association study of brain imaging phenotypes to genotypes [71.1144397510333]
We trained machine learning models to predict SNPs using 56 brain imaging QTs.
SNPs within the known Alzheimer's disease (AD) risk gene APOE had the lowest RMSE for lasso and random forest.
Random forests identified additional SNPs that were not prioritized by the linear models but are known to be associated with brain-related disorders.
arXiv Detail & Related papers (2022-03-31T20:15:22Z)
- An Integrated Deep Learning and Dynamic Programming Method for Predicting Tumor Suppressor Genes, Oncogenes, and Fusion from PDB Structures [0.0]
Mutations in proto-oncogenes (ONGO) and the loss of regulatory function of tumor suppression genes (TSG) are the common underlying mechanism for uncontrolled tumor growth.
Computationally determining whether a gene's function relates to ONGO or TSG can help develop drugs that target the disease.
This paper proposes a classification method that starts with a preprocessing stage to extract the feature map sets from the input 3D protein structural information.
arXiv Detail & Related papers (2021-05-17T18:18:57Z)
- EPGAT: Gene Essentiality Prediction With Graph Attention Networks [1.1602089225841632]
We propose EPGAT, an approach for essentiality prediction based on Graph Attention Networks (GATs).
Our model directly learns patterns of gene essentiality from PPI networks, integrating additional evidence from multiomics data encoded as node attributes.
We benchmarked EPGAT on four organisms, including humans, accurately predicting gene essentiality with AUC scores ranging from 0.78 to 0.97.
arXiv Detail & Related papers (2020-07-19T13:47:15Z)
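The AUC metric used to evaluate EPGAT has a simple rank interpretation: the probability that a randomly chosen positive (essential) gene scores higher than a randomly chosen negative one. A self-contained computation on made-up scores (not EPGAT's data):

```python
# Pairwise (Mann-Whitney) form of AUC: fraction of positive/negative pairs
# where the positive example receives the higher score. Data is invented.
def auc(labels, scores):
    """AUC = P(score of random positive > score of random negative)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1]))  # 0.75
```

An AUC of 0.78-0.97, as reported above, means most essential genes are ranked above most non-essential ones across the benchmarked organisms.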
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.