DNAGPT: A Generalized Pre-trained Tool for Versatile DNA Sequence
Analysis Tasks
- URL: http://arxiv.org/abs/2307.05628v3
- Date: Wed, 30 Aug 2023 20:16:55 GMT
- Title: DNAGPT: A Generalized Pre-trained Tool for Versatile DNA Sequence
Analysis Tasks
- Authors: Daoan Zhang, Weitong Zhang, Yu Zhao, Jianguo Zhang, Bing He, Chenchen
Qin, Jianhua Yao
- Abstract summary: DNAGPT is a generalized DNA pre-training model trained on over 200 billion base pairs from all mammals.
By enhancing the classic GPT model with a binary classification task, a numerical regression task, and a comprehensive token language, DNAGPT can handle versatile DNA analysis tasks.
- Score: 14.931476374660944
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-trained large language models demonstrate potential in extracting
information from DNA sequences, yet adapting to a variety of tasks and data
modalities remains a challenge. To address this, we propose DNAGPT, a
generalized DNA pre-training model trained on over 200 billion base pairs from
all mammals. By enhancing the classic GPT model with a binary classification
task (DNA sequence order), a numerical regression task (guanine-cytosine
content prediction), and a comprehensive token language, DNAGPT can handle
versatile DNA analysis tasks while processing both sequence and numerical data.
Our evaluation of genomic signal and region recognition, mRNA abundance
regression, and artificial genomes generation tasks demonstrates DNAGPT's
superior performance compared to existing models designed for specific
downstream tasks, benefiting from pre-training using the newly designed model
structure.
Related papers
- DNAHLM -- DNA sequence and Human Language mixed large language Model [0.0]
This paper introduces a pre-trained model trained on the GPT-2 network, combining DNA sequences and English text.
We then convert classification and other downstream tasks into Alpaca format instruction data, and perform instruction fine-tuning.
The model has demonstrated its effectiveness in DNA related zero-shot prediction and multitask application.
arXiv Detail & Related papers (2024-10-22T11:51:09Z) - A Benchmark Dataset for Multimodal Prediction of Enzymatic Function Coupling DNA Sequences and Natural Language [3.384797724820242]
Predicting gene function from its DNA sequence is a fundamental challenge in biology.
Deep learning models have been proposed to embed DNA sequences and predict their enzymatic function.
Much of the scientific community's knowledge of biological function is not represented in categorical labels.
arXiv Detail & Related papers (2024-07-21T19:27:43Z) - Generating Multi-Modal and Multi-Attribute Single-Cell Counts with CFGen [76.02070962797794]
We present Cell Flow for Generation, a flow-based conditional generative model for multi-modal single-cell counts.
Our results suggest improved recovery of crucial biological data characteristics while accounting for novel generative tasks.
arXiv Detail & Related papers (2024-07-16T14:05:03Z) - Semantically Rich Local Dataset Generation for Explainable AI in Genomics [0.716879432974126]
Black box deep learning models trained on genomic sequences excel at predicting the outcomes of different gene regulatory mechanisms.
We propose using Genetic Programming to generate datasets by evolving perturbations in sequences that contribute to their semantic diversity.
arXiv Detail & Related papers (2024-07-03T10:31:30Z) - VQDNA: Unleashing the Power of Vector Quantization for Multi-Species Genomic Sequence Modeling [60.91599380893732]
VQDNA is a general-purpose framework that renovates genome tokenization from the perspective of genome vocabulary learning.
By leveraging vector-quantized codebooks as learnable vocabulary, VQDNA can adaptively tokenize genomes into pattern-aware embeddings.
arXiv Detail & Related papers (2024-05-13T20:15:03Z) - Efficient and Scalable Fine-Tune of Language Models for Genome
Understanding [49.606093223945734]
We present textscLingo: textscLanguage prefix ftextscIne-tuning for textscGentextscOmes.
Unlike DNA foundation models, textscLingo strategically leverages natural language foundation models' contextual cues.
textscLingo further accommodates numerous downstream fine-tune tasks by an adaptive rank sampling method.
arXiv Detail & Related papers (2024-02-12T21:40:45Z) - BEND: Benchmarking DNA Language Models on biologically meaningful tasks [7.005668635562045]
We introduce BEND, a Benchmark for DNA language models, featuring a collection of realistic and biologically meaningful downstream tasks.
We find that embeddings from current DNA LMs can approach performance of expert methods on some tasks, but only capture limited information about long-range features.
arXiv Detail & Related papers (2023-11-21T12:34:00Z) - HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide
Resolution [76.97231739317259]
We present HyenaDNA, a genomic foundation model pretrained on the human reference genome with context lengths of up to 1 million tokens at the single nucleotide-level.
On fine-tuned benchmarks from the Nucleotide Transformer, HyenaDNA reaches state-of-the-art (SotA) on 12 of 18 datasets using a model with orders of magnitude less parameters and pretraining data.
arXiv Detail & Related papers (2023-06-27T20:46:34Z) - Reprogramming Pretrained Language Models for Antibody Sequence Infilling [72.13295049594585]
Computational design of antibodies involves generating novel and diverse sequences, while maintaining structural consistency.
Recent deep learning models have shown impressive results, however the limited number of known antibody sequence/structure pairs frequently leads to degraded performance.
In our work we address this challenge by leveraging Model Reprogramming (MR), which repurposes pretrained models on a source language to adapt to the tasks that are in a different language and have scarce data.
arXiv Detail & Related papers (2022-10-05T20:44:55Z) - Epigenomic language models powered by Cerebras [0.0]
Epigenomic BERT (or EBERT) learns representations based on both DNA sequence and paired epigenetic state inputs.
We show EBERT's transfer learning potential by demonstrating strong performance on a cell type-specific transcription factor binding prediction task.
Our fine-tuned model exceeds state of the art performance on 4 of 13 evaluation datasets from ENCODE-DREAM benchmarks and earns an overall rank of 3rd on the challenge leaderboard.
arXiv Detail & Related papers (2021-12-14T17:23:42Z) - Multi-modal Self-supervised Pre-training for Regulatory Genome Across
Cell Types [75.65676405302105]
We propose a simple yet effective approach for pre-training genome data in a multi-modal and self-supervised manner, which we call GeneBERT.
We pre-train our model on the ATAC-seq dataset with 17 million genome sequences.
arXiv Detail & Related papers (2021-10-11T12:48:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.