Generative Language Models on Nucleotide Sequences of Human Genes
- URL: http://arxiv.org/abs/2307.10634v1
- Date: Thu, 20 Jul 2023 06:59:02 GMT
- Title: Generative Language Models on Nucleotide Sequences of Human Genes
- Authors: Musa Nuri Ihtiyar and Arzucan Ozgur
- Abstract summary: This study focuses on developing an autoregressive generative language model like GPT-3 for DNA sequences.
Because working with whole DNA sequences is challenging without substantial computational resources, we decided to carry out our study on a smaller scale.
We systematically examined this almost entirely unexplored problem and observed that RNNs performed best.
We also observed how essential it is to evaluate with real-life tasks beyond classical metrics such as perplexity.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Language models, primarily transformer-based ones, have achieved enormous
success in NLP; studies like BERT in NLU and works such as GPT-3 in NLG have been
particularly influential. DNA sequences are structurally very close to natural language,
and discriminative models such as DNABert already exist in the DNA-related bioinformatics
domain, yet, to the best of our knowledge, the generative side of the coin remains largely
unexplored. Consequently, we focused on developing an autoregressive generative language
model, in the spirit of GPT-3, for DNA sequences. Because working with whole DNA sequences
is challenging without substantial computational resources, we carried out our study on a
smaller scale, focusing on nucleotide sequences of human genes, i.e., unique parts of DNA
with specific functionalities, instead of whole DNA. This decision did not change the
problem structure much, since both DNA and genes can be viewed as 1D sequences over four
different nucleotides without losing much information or oversimplifying. First, we
systematically examined this almost entirely unexplored problem and observed that RNNs
performed best, while simple techniques such as N-grams were also promising. Another
benefit was learning how to work with generative models on languages we do not understand,
unlike natural language. We also observed how essential it is to evaluate with real-life
tasks beyond classical metrics such as perplexity. Furthermore, we examined whether the
data-hungry nature of these models changes when the language has a minimal vocabulary,
here of size four owing to the four types of nucleotides, the motivation being that such a
language might make the problem easier. However, we observed that it did not noticeably
reduce the amount of data needed.
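The abstract treats gene sequences as a four-symbol language and reports that autoregressive RNNs performed best under perplexity. As a rough illustration of that setup (not the authors' implementation), the sketch below trains a small character-level LSTM over the {A, C, G, T} vocabulary with next-nucleotide prediction and reports per-nucleotide perplexity. The PyTorch framework, all hyperparameters, and the random placeholder "gene" sequences are assumptions made for this example.

```python
# Illustrative sketch, not the authors' implementation: a character-level
# autoregressive LSTM language model over the four-nucleotide vocabulary
# {A, C, G, T}, trained by next-nucleotide prediction and evaluated with
# per-nucleotide perplexity. PyTorch, all hyperparameters, and the random
# placeholder "gene" sequences are assumptions made for this example.
import math
import random

import torch
import torch.nn as nn

VOCAB = ["A", "C", "G", "T"]
STOI = {ch: i for i, ch in enumerate(VOCAB)}


def encode(seq: str) -> torch.Tensor:
    """Map a nucleotide string to a 1D tensor of integer token ids."""
    return torch.tensor([STOI[ch] for ch in seq], dtype=torch.long)


class NucleotideLM(nn.Module):
    """Embedding -> LSTM -> linear head over the 4-symbol vocabulary."""

    def __init__(self, vocab_size: int = 4, embed_dim: int = 32, hidden_dim: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.lstm(self.embed(x))
        return self.head(out)  # (batch, seq_len, vocab_size)


def perplexity(model: nn.Module, seqs) -> float:
    """exp(mean next-nucleotide cross-entropy) over a list of sequences."""
    loss_fn = nn.CrossEntropyLoss()
    losses = []
    model.eval()
    with torch.no_grad():
        for s in seqs:
            ids = encode(s).unsqueeze(0)    # (1, len)
            logits = model(ids[:, :-1])     # predict nucleotides 2..len
            losses.append(loss_fn(logits.reshape(-1, 4), ids[:, 1:].reshape(-1)).item())
    return math.exp(sum(losses) / len(losses))


if __name__ == "__main__":
    # Placeholder data: random 200-nucleotide strings stand in for the
    # curated human gene sequences used in the paper.
    random.seed(0)
    data = ["".join(random.choices(VOCAB, k=200)) for _ in range(64)]

    model = NucleotideLM()
    optim = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    for epoch in range(3):  # tiny loop, for illustration only
        model.train()
        for s in data:
            ids = encode(s).unsqueeze(0)
            logits = model(ids[:, :-1])
            loss = loss_fn(logits.reshape(-1, 4), ids[:, 1:].reshape(-1))
            optim.zero_grad()
            loss.backward()
            optim.step()
        print(f"epoch {epoch}: perplexity = {perplexity(model, data):.3f}")
```

On the random placeholder data the perplexity should hover around 4, the uniform baseline for a four-symbol alphabet; real gene sequences are where a useful model pushes it lower. The same perplexity definition applies to the N-gram baselines the abstract mentions.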
Related papers
- HybriDNA: A Hybrid Transformer-Mamba2 Long-Range DNA Language Model [70.69095062674944]
We propose HybriDNA, a decoder-only DNA language model that incorporates a hybrid Transformer-Mamba2 architecture.
This hybrid design enables HybriDNA to efficiently process DNA sequences up to 131kb in length with single-nucleotide resolution.
HybriDNA achieves state-of-the-art performance across 33 DNA understanding datasets curated from the BEND, GUE, and LRB benchmarks.
arXiv Detail & Related papers (2025-02-15T14:23:43Z)
- Model Decides How to Tokenize: Adaptive DNA Sequence Tokenization with MxDNA [44.630039477717624]
MxDNA is a novel framework in which the model autonomously learns an effective DNA tokenization strategy through gradient descent.
We show that MxDNA learns a unique tokenization strategy, distinct from those of previous methods, and captures genomic functionalities at the token level during self-supervised pretraining. (A fixed k-mer tokenization baseline, which such learned tokenizers move beyond, is sketched after this list.)
arXiv Detail & Related papers (2024-12-18T10:55:43Z)
- Can linguists better understand DNA? [0.0]
This study addresses the existence of capability transfer between natural language and gene sequences/languages.
We constructed two analogous tasks: DNA-pair classification (DNA sequence similarity) and DNA-protein-pair classification (gene coding determination).
These tasks were designed to validate the transferability of capabilities from natural language to gene sequences.
arXiv Detail & Related papers (2024-12-10T17:06:33Z)
- DNAHLM -- DNA sequence and Human Language mixed large language Model [0.0]
This paper introduces a model pre-trained on the GPT-2 network, combining DNA sequences and English text.
We then convert classification and other downstream tasks into Alpaca format instruction data, and perform instruction fine-tuning.
The model has demonstrated its effectiveness in DNA related zero-shot prediction and multitask application.
arXiv Detail & Related papers (2024-10-22T11:51:09Z)
- A Benchmark Dataset for Multimodal Prediction of Enzymatic Function Coupling DNA Sequences and Natural Language [3.384797724820242]
Predicting gene function from its DNA sequence is a fundamental challenge in biology.
Deep learning models have been proposed to embed DNA sequences and predict their enzymatic function.
Much of the scientific community's knowledge of biological function is not represented in categorical labels.
arXiv Detail & Related papers (2024-07-21T19:27:43Z)
- VQDNA: Unleashing the Power of Vector Quantization for Multi-Species Genomic Sequence Modeling [60.91599380893732]
VQDNA is a general-purpose framework that renovates genome tokenization from the perspective of genome vocabulary learning.
By leveraging vector-quantized codebooks as learnable vocabulary, VQDNA can adaptively tokenize genomes into pattern-aware embeddings.
arXiv Detail & Related papers (2024-05-13T20:15:03Z)
- Efficient and Scalable Fine-Tune of Language Models for Genome Understanding [49.606093223945734]
We present Lingo: Language prefix fIne-tuning for GenOmes.
Unlike DNA foundation models, Lingo strategically leverages natural language foundation models' contextual cues.
Lingo further accommodates numerous downstream fine-tuning tasks via an adaptive rank sampling method.
arXiv Detail & Related papers (2024-02-12T21:40:45Z)
- BEND: Benchmarking DNA Language Models on biologically meaningful tasks [7.005668635562045]
We introduce BEND, a Benchmark for DNA language models, featuring a collection of realistic and biologically meaningful downstream tasks.
We find that embeddings from current DNA LMs can approach performance of expert methods on some tasks, but only capture limited information about long-range features.
arXiv Detail & Related papers (2023-11-21T12:34:00Z)
- HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution [76.97231739317259]
We present HyenaDNA, a genomic foundation model pretrained on the human reference genome with context lengths of up to 1 million tokens at the single nucleotide-level.
On fine-tuned benchmarks from the Nucleotide Transformer, HyenaDNA reaches state-of-the-art (SotA) on 12 of 18 datasets using a model with orders of magnitude fewer parameters and less pretraining data.
arXiv Detail & Related papers (2023-06-27T20:46:34Z)
- Efficient Automation of Neural Network Design: A Survey on Differentiable Neural Architecture Search [70.31239620427526]
Differentiable Neural Architecture Search (DNAS) has rapidly established itself as the leading approach for automating the discovery of deep neural network architectures.
This rise is mainly due to the popularity of DARTS, one of the first major DNAS methods.
In this comprehensive survey, we focus specifically on DNAS and review recent approaches in this field.
arXiv Detail & Related papers (2023-04-11T13:15:29Z)
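Several of the related papers above (notably MxDNA and VQDNA) rework how DNA is tokenized before modeling. For context, here is a minimal sketch of the conventional fixed k-mer tokenization baseline that such learned tokenizers aim to improve on; the choice of k = 6, the overlapping stride, and the helper names are illustrative assumptions rather than details taken from any listed paper.

```python
# Illustrative sketch of a fixed k-mer DNA tokenizer, the conventional baseline
# that learned tokenizers (e.g., MxDNA, VQDNA above) aim to improve on.
# The k = 6 and overlapping-window choices are assumptions for this example.
from itertools import product


def build_kmer_vocab(k: int = 6) -> dict:
    """Enumerate all 4**k k-mers over {A, C, G, T} and assign integer ids."""
    return {"".join(kmer): i for i, kmer in enumerate(product("ACGT", repeat=k))}


def tokenize(seq: str, vocab: dict, k: int = 6, stride: int = 1) -> list:
    """Slide a width-k window over the sequence and map each k-mer to its id.

    stride=1 yields overlapping k-mers (DNABert-style); stride=k yields
    disjoint, non-overlapping k-mers.
    """
    return [vocab[seq[i:i + k]] for i in range(0, len(seq) - k + 1, stride)]


if __name__ == "__main__":
    vocab = build_kmer_vocab(k=6)            # 4096 possible 6-mers
    ids = tokenize("ATGCGTACGTTAGC", vocab)  # placeholder sequence
    print(len(vocab), ids[:5])
```

The learned approaches above replace this fixed, hand-picked vocabulary: MxDNA lets the model discover its own tokenization during training, while VQDNA learns a vector-quantized codebook of pattern-aware embeddings.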