Generative Language Models on Nucleotide Sequences of Human Genes
- URL: http://arxiv.org/abs/2307.10634v1
- Date: Thu, 20 Jul 2023 06:59:02 GMT
- Title: Generative Language Models on Nucleotide Sequences of Human Genes
- Authors: Musa Nuri Ihtiyar and Arzucan Ozgur
- Abstract summary: This study focuses on developing an autoregressive generative language model like GPT-3 for DNA sequences.
Because working with whole DNA sequences is challenging without substantial computational resources, we decided to carry out our study on a smaller scale.
First of all, we systematically examined an almost entirely unexplored problem and observed that RNNs performed the best.
We also observed how essential it is to use real-life tasks beyond classical metrics such as perplexity.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Language models, primarily transformer-based ones, have achieved
colossal success in NLP. More precisely, studies such as BERT in NLU and works
such as GPT-3 in NLG have been crucial. DNA sequences are structurally very
close to natural language, and in the DNA-related bioinformatics domain,
discriminative models such as DNABert already exist. Yet, to the best of our
knowledge, the generative side of the coin is mainly unexplored. Consequently, we
focused on developing an autoregressive generative language model like GPT-3
for DNA sequences. Because working with whole DNA sequences is challenging
without substantial computational resources, we decided to carry out our study
on a smaller scale, focusing on nucleotide sequences of human genes, unique
parts in DNA with specific functionalities, instead of the whole DNA. This
decision did not change the problem structure much, because both DNA and genes
can be seen as 1D sequences consisting of four different nucleotides without
losing much information or oversimplifying.
First of all, we systematically examined an almost entirely unexplored problem
and observed that RNNs performed the best while simple techniques like N-grams
were also promising. Another benefit was learning how to work with
generative models on languages we do not understand, unlike natural language.
We also observed how essential it is to use real-life tasks beyond classical
metrics such as perplexity. Furthermore, we examined whether the data-hungry
nature of these models can be changed by selecting a language with a minimal
vocabulary size, here four owing to the four types of nucleotides. We reviewed
this because choosing such a language might make the problem easier. However,
we observed that it did not change the amount of data needed by much.
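The abstract mentions N-gram baselines and perplexity over a four-symbol nucleotide vocabulary. The following is a minimal sketch of that idea, not the authors' implementation: the function names, the `^` start-padding symbol, the add-alpha smoothing, and the toy sequences are all assumptions made for illustration.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"  # the four-nucleotide vocabulary
PAD = "^"          # hypothetical start-padding symbol (an assumption)


def train_ngram(sequences, n=3):
    """Count how often each nucleotide follows each (n-1)-symbol context."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        padded = PAD * (n - 1) + seq
        for i in range(n - 1, len(padded)):
            context = padded[i - n + 1:i]
            counts[context][padded[i]] += 1
    return counts


def prob(counts, context, symbol, alpha=1.0):
    """Add-alpha smoothed estimate of P(symbol | context)."""
    ctx_counts = counts.get(context, {})
    total = sum(ctx_counts.values())
    return (ctx_counts.get(symbol, 0) + alpha) / (total + alpha * len(ALPHABET))


def perplexity(counts, sequences, n=3, alpha=1.0):
    """Exponential of the average negative log-likelihood per nucleotide."""
    log_prob = 0.0
    n_tokens = 0
    for seq in sequences:
        padded = PAD * (n - 1) + seq
        for i in range(n - 1, len(padded)):
            context = padded[i - n + 1:i]
            log_prob += math.log(prob(counts, context, padded[i], alpha))
            n_tokens += 1
    return math.exp(-log_prob / n_tokens)


# Toy data for illustration only; not real gene sequences.
toy_train = ["ACGTACGTACGT"] * 5
toy_model = train_ngram(toy_train, n=3)
toy_ppl = perplexity(toy_model, toy_train, n=3)
```

A model that assigns uniform probability to the four nucleotides has a perplexity of exactly 4, so any useful model should score below that on held-out sequences. As the abstract notes, perplexity alone may not reflect performance on real-life tasks.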
Related papers
- Training Neural Networks as Recognizers of Formal Languages [87.06906286950438]
Formal language theory pertains specifically to recognizers.
It is common to instead use proxy tasks that are similar in only an informal sense.
We correct this mismatch by training and evaluating neural networks directly as binary classifiers of strings.
(2024-11-11T16:33:25Z)
- DNAHLM -- DNA sequence and Human Language mixed large language Model [0.0]
This paper introduces a pre-trained model trained on the GPT-2 network, combining DNA sequences and English text.
We then convert classification and other downstream tasks into Alpaca format instruction data, and perform instruction fine-tuning.
The model has demonstrated its effectiveness in DNA related zero-shot prediction and multitask application.
(2024-10-22T11:51:09Z)
- A Benchmark Dataset for Multimodal Prediction of Enzymatic Function Coupling DNA Sequences and Natural Language [3.384797724820242]
Predicting gene function from its DNA sequence is a fundamental challenge in biology.
Deep learning models have been proposed to embed DNA sequences and predict their enzymatic function.
Much of the scientific community's knowledge of biological function is not represented in categorical labels.
(2024-07-21T19:27:43Z)
- Multi-modal Transfer Learning between Biological Foundation Models [2.6545450959042234]
We propose a multi-modal-specific model that connects DNA, RNA, and proteins by leveraging information from different pre-trained modality encoders.
We show that our model, dubbed IsoFormer, is able to accurately predict differential transcript expression, outperforming existing methods.
We open-source our model, paving the way for new multi-modal gene expression approaches.
(2024-06-20T09:44:53Z)
- VQDNA: Unleashing the Power of Vector Quantization for Multi-Species Genomic Sequence Modeling [60.91599380893732]
VQDNA is a general-purpose framework that renovates genome tokenization from the perspective of genome vocabulary learning.
By leveraging vector-quantized codebooks as learnable vocabulary, VQDNA can adaptively tokenize genomes into pattern-aware embeddings.
(2024-05-13T20:15:03Z)
- Efficient and Scalable Fine-Tune of Language Models for Genome Understanding [49.606093223945734]
We present Lingo: Language prefix fIne-tuning for GenOmes.
Unlike DNA foundation models, Lingo strategically leverages natural language foundation models' contextual cues.
Lingo further accommodates numerous downstream fine-tune tasks by an adaptive rank sampling method.
(2024-02-12T21:40:45Z)
- BEND: Benchmarking DNA Language Models on biologically meaningful tasks [7.005668635562045]
We introduce BEND, a Benchmark for DNA language models, featuring a collection of realistic and biologically meaningful downstream tasks.
We find that embeddings from current DNA LMs can approach performance of expert methods on some tasks, but only capture limited information about long-range features.
(2023-11-21T12:34:00Z)
- HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution [76.97231739317259]
We present HyenaDNA, a genomic foundation model pretrained on the human reference genome with context lengths of up to 1 million tokens at the single nucleotide-level.
On fine-tuned benchmarks from the Nucleotide Transformer, HyenaDNA reaches state-of-the-art (SotA) on 12 of 18 datasets using a model with orders of magnitude fewer parameters and less pretraining data.
(2023-06-27T20:46:34Z)
- Efficient Automation of Neural Network Design: A Survey on Differentiable Neural Architecture Search [70.31239620427526]
Differentiable Neural Architecture Search (DNAS) rapidly imposed itself as the trending approach to automate the discovery of deep neural network architectures.
This rise is mainly due to the popularity of DARTS, one of the first major DNAS methods.
In this comprehensive survey, we focus specifically on DNAS and review recent approaches in this field.
(2023-04-11T13:15:29Z)
- SNP2Vec: Scalable Self-Supervised Pre-Training for Genome-Wide Association Study [48.75445626157713]
SNP2Vec is a scalable self-supervised pre-training approach for understanding SNP.
We apply SNP2Vec to perform long-sequence genomics modeling.
We evaluate the effectiveness of our approach on predicting Alzheimer's disease risk in a Chinese cohort.
(2022-04-14T01:53:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.