Generative Language Models on Nucleotide Sequences of Human Genes
- URL: http://arxiv.org/abs/2307.10634v1
- Date: Thu, 20 Jul 2023 06:59:02 GMT
- Title: Generative Language Models on Nucleotide Sequences of Human Genes
- Authors: Musa Nuri Ihtiyar and Arzucan Ozgur
- Abstract summary: This study focuses on developing an autoregressive generative language model like GPT-3 for DNA sequences.
Because working with whole DNA sequences is challenging without substantial computational resources, we decided to carry out our study on a smaller scale.
We systematically examined this almost entirely unexplored problem and observed that RNNs performed best.
We also observed how essential it is to evaluate with real-life tasks beyond classical metrics such as perplexity.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Language models, primarily transformer-based ones, have achieved enormous
success in NLP; studies like BERT in NLU and works such as GPT-3 in NLG have been
particularly influential. DNA sequences are structurally very close to natural language,
and discriminative models such as DNABert already exist in the DNA-related bioinformatics
domain, yet, to the best of our knowledge, the generative side of the coin remains largely
unexplored. Consequently, we focused on developing an autoregressive generative language
model, in the spirit of GPT-3, for DNA sequences. Because working with whole DNA sequences
is challenging without substantial computational resources, we carried out our study on a
smaller scale, focusing on nucleotide sequences of human genes, i.e., unique parts of DNA
with specific functionalities, instead of whole DNA. This decision did not change the
problem structure much, since both DNA and genes can be viewed as 1D sequences over four
different nucleotides without losing much information or oversimplifying. First, we
systematically examined this almost entirely unexplored problem and observed that RNNs
performed best, while simple techniques such as N-grams were also promising. Another
benefit was learning how to work with generative models on languages we do not understand,
unlike natural language. We also observed how essential it is to evaluate with real-life
tasks beyond classical metrics such as perplexity. Furthermore, we examined whether the
data-hungry nature of these models changes when the language has a minimal vocabulary,
here of size four owing to the four types of nucleotides, the motivation being that such a
language might make the problem easier. However, we observed that it did not noticeably
reduce the amount of data needed.
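The abstract treats gene sequences as a four-symbol language and reports that autoregressive RNNs performed best under perplexity. As a rough illustration of that setup (not the authors' implementation), the sketch below trains a small character-level LSTM over the {A, C, G, T} vocabulary with next-nucleotide prediction and reports per-nucleotide perplexity. The PyTorch framework, all hyperparameters, and the random placeholder "gene" sequences are assumptions made for this example.

```python
# Illustrative sketch, not the authors' implementation: a character-level
# autoregressive LSTM language model over the four-nucleotide vocabulary
# {A, C, G, T}, trained by next-nucleotide prediction and evaluated with
# per-nucleotide perplexity. PyTorch, all hyperparameters, and the random
# placeholder "gene" sequences are assumptions made for this example.
import math
import random

import torch
import torch.nn as nn

VOCAB = ["A", "C", "G", "T"]
STOI = {ch: i for i, ch in enumerate(VOCAB)}


def encode(seq: str) -> torch.Tensor:
    """Map a nucleotide string to a 1D tensor of integer token ids."""
    return torch.tensor([STOI[ch] for ch in seq], dtype=torch.long)


class NucleotideLM(nn.Module):
    """Embedding -> LSTM -> linear head over the 4-symbol vocabulary."""

    def __init__(self, vocab_size: int = 4, embed_dim: int = 32, hidden_dim: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.lstm(self.embed(x))
        return self.head(out)  # (batch, seq_len, vocab_size)


def perplexity(model: nn.Module, seqs) -> float:
    """exp(mean next-nucleotide cross-entropy) over a list of sequences."""
    loss_fn = nn.CrossEntropyLoss()
    losses = []
    model.eval()
    with torch.no_grad():
        for s in seqs:
            ids = encode(s).unsqueeze(0)    # (1, len)
            logits = model(ids[:, :-1])     # predict nucleotides 2..len
            losses.append(loss_fn(logits.reshape(-1, 4), ids[:, 1:].reshape(-1)).item())
    return math.exp(sum(losses) / len(losses))


if __name__ == "__main__":
    # Placeholder data: random 200-nucleotide strings stand in for the
    # curated human gene sequences used in the paper.
    random.seed(0)
    data = ["".join(random.choices(VOCAB, k=200)) for _ in range(64)]

    model = NucleotideLM()
    optim = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    for epoch in range(3):  # tiny loop, for illustration only
        model.train()
        for s in data:
            ids = encode(s).unsqueeze(0)
            logits = model(ids[:, :-1])
            loss = loss_fn(logits.reshape(-1, 4), ids[:, 1:].reshape(-1))
            optim.zero_grad()
            loss.backward()
            optim.step()
        print(f"epoch {epoch}: perplexity = {perplexity(model, data):.3f}")
```

On the random placeholder data the perplexity should hover around 4, the uniform baseline for a four-symbol alphabet; real gene sequences are where a useful model pushes it lower. The same perplexity definition applies to the N-gram baselines the abstract mentions.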
Related papers
- HybriDNA: A Hybrid Transformer-Mamba2 Long-Range DNA Language Model [70.69095062674944]
We propose HybriDNA, a decoder-only DNA language model that incorporates a hybrid Transformer-Mamba2 architecture.
This hybrid design enables HybriDNA to efficiently process DNA sequences up to 131kb in length with single-nucleotide resolution.
HybriDNA achieves state-of-the-art performance across 33 DNA understanding datasets curated from the BEND, GUE, and LRB benchmarks.
arXiv Detail & Related papers (2025-02-15T14:23:43Z)
- Model Decides How to Tokenize: Adaptive DNA Sequence Tokenization with MxDNA [44.630039477717624]
MxDNA is a novel framework in which the model autonomously learns an effective DNA tokenization strategy through gradient descent.
We show that MxDNA learns a unique tokenization strategy, distinct from those of previous methods, and captures genomic functionalities at the token level during self-supervised pretraining. (A fixed k-mer tokenization baseline, which such learned tokenizers move beyond, is sketched after this list.)
arXiv Detail & Related papers (2024-12-18T10:55:43Z)
- Can linguists better understand DNA? [0.0]
This study addresses the existence of capability transfer between natural language and gene sequences/languages.
We constructed two analogous tasks: DNA-pair classification (DNA sequence similarity) and DNA-protein-pair classification (gene coding determination).
These tasks were designed to validate the transferability of capabilities from natural language to gene sequences.
arXiv Detail & Related papers (2024-12-10T17:06:33Z)
- DNAHLM -- DNA sequence and Human Language mixed large language Model [0.0]
This paper introduces a model pre-trained on the GPT-2 network, combining DNA sequences and English text.
We then convert classification and other downstream tasks into Alpaca format instruction data, and perform instruction fine-tuning.
The model has demonstrated its effectiveness in DNA related zero-shot prediction and multitask application.
arXiv Detail & Related papers (2024-10-22T11:51:09Z)
- A Benchmark Dataset for Multimodal Prediction of Enzymatic Function Coupling DNA Sequences and Natural Language [3.384797724820242]
Predicting gene function from its DNA sequence is a fundamental challenge in biology.
Deep learning models have been proposed to embed DNA sequences and predict their enzymatic function.
Much of the scientific community's knowledge of biological function is not represented in categorical labels.
arXiv Detail & Related papers (2024-07-21T19:27:43Z)
- VQDNA: Unleashing the Power of Vector Quantization for Multi-Species Genomic Sequence Modeling [60.91599380893732]
VQDNA is a general-purpose framework that renovates genome tokenization from the perspective of genome vocabulary learning.
By leveraging vector-quantized codebooks as learnable vocabulary, VQDNA can adaptively tokenize genomes into pattern-aware embeddings.
arXiv Detail & Related papers (2024-05-13T20:15:03Z)
- Efficient and Scalable Fine-Tune of Language Models for Genome Understanding [49.606093223945734]
We present Lingo: Language prefix fIne-tuning for GenOmes.
Unlike DNA foundation models, Lingo strategically leverages natural language foundation models' contextual cues.
Lingo further accommodates numerous downstream fine-tuning tasks via an adaptive rank sampling method.
arXiv Detail & Related papers (2024-02-12T21:40:45Z)
- BEND: Benchmarking DNA Language Models on biologically meaningful tasks [7.005668635562045]
We introduce BEND, a Benchmark for DNA language models, featuring a collection of realistic and biologically meaningful downstream tasks.
We find that embeddings from current DNA LMs can approach performance of expert methods on some tasks, but only capture limited information about long-range features.
arXiv Detail & Related papers (2023-11-21T12:34:00Z)
- HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution [76.97231739317259]
We present HyenaDNA, a genomic foundation model pretrained on the human reference genome with context lengths of up to 1 million tokens at the single nucleotide-level.
On fine-tuned benchmarks from the Nucleotide Transformer, HyenaDNA reaches state-of-the-art (SotA) on 12 of 18 datasets using a model with orders of magnitude fewer parameters and less pretraining data.
arXiv Detail & Related papers (2023-06-27T20:46:34Z)
- Efficient Automation of Neural Network Design: A Survey on Differentiable Neural Architecture Search [70.31239620427526]
Differentiable Neural Architecture Search (DNAS) has rapidly established itself as the leading approach for automating the discovery of deep neural network architectures.
This rise is mainly due to the popularity of DARTS, one of the first major DNAS methods.
In this comprehensive survey, we focus specifically on DNAS and review recent approaches in this field.
arXiv Detail & Related papers (2023-04-11T13:15:29Z)
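Several of the related papers above (notably MxDNA and VQDNA) rework how DNA is tokenized before modeling. For context, here is a minimal sketch of the conventional fixed k-mer tokenization baseline that such learned tokenizers aim to improve on; the choice of k = 6, the overlapping stride, and the helper names are illustrative assumptions rather than details taken from any listed paper.

```python
# Illustrative sketch of a fixed k-mer DNA tokenizer, the conventional baseline
# that learned tokenizers (e.g., MxDNA, VQDNA above) aim to improve on.
# The k = 6 and overlapping-window choices are assumptions for this example.
from itertools import product


def build_kmer_vocab(k: int = 6) -> dict:
    """Enumerate all 4**k k-mers over {A, C, G, T} and assign integer ids."""
    return {"".join(kmer): i for i, kmer in enumerate(product("ACGT", repeat=k))}


def tokenize(seq: str, vocab: dict, k: int = 6, stride: int = 1) -> list:
    """Slide a width-k window over the sequence and map each k-mer to its id.

    stride=1 yields overlapping k-mers (DNABert-style); stride=k yields
    disjoint, non-overlapping k-mers.
    """
    return [vocab[seq[i:i + k]] for i in range(0, len(seq) - k + 1, stride)]


if __name__ == "__main__":
    vocab = build_kmer_vocab(k=6)            # 4096 possible 6-mers
    ids = tokenize("ATGCGTACGTTAGC", vocab)  # placeholder sequence
    print(len(vocab), ids[:5])
```

The learned approaches above replace this fixed, hand-picked vocabulary: MxDNA lets the model discover its own tokenization during training, while VQDNA learns a vector-quantized codebook of pattern-aware embeddings.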