Omni-DNA: A Unified Genomic Foundation Model for Cross-Modal and Multi-Task Learning
- URL: http://arxiv.org/abs/2502.03499v1
- Date: Wed, 05 Feb 2025 09:20:52 GMT
- Title: Omni-DNA: A Unified Genomic Foundation Model for Cross-Modal and Multi-Task Learning
- Authors: Zehui Li, Vallijah Subasri, Yifei Shen, Dongsheng Li, Yiren Zhao, Guy-Bart Stan, Caihua Shan,
- Abstract summary: We introduce Omni-DNA, a family of cross-modal multi-task models ranging from 20 million to 1 billion parameters.
Our approach consists of two stages: (i) pretraining on DNA sequences with next token prediction objective, and (ii) expanding the multi-modal task-specific tokens and finetuning for multiple downstream tasks simultaneously.
When evaluated on the Nucleotide Transformer and GB benchmarks, Omni-DNA achieves state-of-the-art performance on 18 out of 26 tasks.
- Score: 30.762037633773257
- License:
- Abstract: Large Language Models (LLMs) demonstrate remarkable generalizability across diverse tasks, yet genomic foundation models (GFMs) still require separate finetuning for each downstream application, creating significant overhead as model sizes grow. Moreover, existing GFMs are constrained by rigid output formats, limiting their applicability to various genomic tasks. In this work, we revisit the transformer-based auto-regressive models and introduce Omni-DNA, a family of cross-modal multi-task models ranging from 20 million to 1 billion parameters. Our approach consists of two stages: (i) pretraining on DNA sequences with next token prediction objective, and (ii) expanding the multi-modal task-specific tokens and finetuning for multiple downstream tasks simultaneously. When evaluated on the Nucleotide Transformer and GB benchmarks, Omni-DNA achieves state-of-the-art performance on 18 out of 26 tasks. Through multi-task finetuning, Omni-DNA addresses 10 acetylation and methylation tasks at once, surpassing models trained on each task individually. Finally, we design two complex genomic tasks, DNA2Function and Needle-in-DNA, which map DNA sequences to textual functional descriptions and images, respectively, indicating Omni-DNA's cross-modal capabilities to broaden the scope of genomic applications. All the models are available through https://huggingface.co/collections/zehui127
Related papers
- Model Evolution Framework with Genetic Algorithm for Multi-Task Reinforcement Learning [85.91908329457081]
Multi-task reinforcement learning employs a single policy to complete various tasks, aiming to develop an agent with generalizability across different scenarios.
Existing approaches typically use a routing network to generate specific routes for each task and reconstruct a set of modules into diverse models to complete multiple tasks simultaneously.
We propose a Model Evolution framework with Genetic Algorithm (MEGA), which enables the model to evolve during training according to the difficulty of the tasks.
arXiv Detail & Related papers (2025-02-19T09:22:34Z) - HybriDNA: A Hybrid Transformer-Mamba2 Long-Range DNA Language Model [70.69095062674944]
We propose HybriDNA, a decoder-only DNA language model that incorporates a hybrid Transformer-Mamba2 architecture.
This hybrid design enables HybriDNA to efficiently process DNA sequences up to 131kb in length with single-nucleotide resolution.
HybriDNA achieves state-of-the-art performance across 33 DNA understanding datasets curated from the BEND, GUE, and LRB benchmarks.
arXiv Detail & Related papers (2025-02-15T14:23:43Z) - Model Decides How to Tokenize: Adaptive DNA Sequence Tokenization with MxDNA [44.630039477717624]
MxDNA is a novel framework where the model autonomously learns an effective DNA tokenization strategy through gradient decent.
We show that MxDNA learns unique tokenization strategy distinct to those of previous methods and captures genomic functionalities at a token level during self-supervised pretraining.
arXiv Detail & Related papers (2024-12-18T10:55:43Z) - DART-Eval: A Comprehensive DNA Language Model Evaluation Benchmark on Regulatory DNA [2.543784712990392]
Large genomic DNA language models (DNALMs) aim to learn generalizable representations of diverse DNA elements.
Our benchmarks target biologically meaningful downstream tasks such as functional sequence feature discovery, predicting cell-type specific regulatory activity, and counterfactual prediction of the impacts of genetic variants.
arXiv Detail & Related papers (2024-12-06T21:23:35Z) - MAMMAL -- Molecular Aligned Multi-Modal Architecture and Language [0.24434823694833652]
MAMMAL is a versatile multi-task multi-align foundation model that learns from large-scale biological datasets.
We introduce a prompt syntax that supports a wide range of classification, regression, and generation tasks.
We evaluate the model on 11 diverse downstream tasks spanning different steps within a typical drug discovery pipeline.
arXiv Detail & Related papers (2024-10-28T20:45:52Z) - Efficient and Scalable Fine-Tune of Language Models for Genome
Understanding [49.606093223945734]
We present textscLingo: textscLanguage prefix ftextscIne-tuning for textscGentextscOmes.
Unlike DNA foundation models, textscLingo strategically leverages natural language foundation models' contextual cues.
textscLingo further accommodates numerous downstream fine-tune tasks by an adaptive rank sampling method.
arXiv Detail & Related papers (2024-02-12T21:40:45Z) - BEND: Benchmarking DNA Language Models on biologically meaningful tasks [7.005668635562045]
We introduce BEND, a Benchmark for DNA language models, featuring a collection of realistic and biologically meaningful downstream tasks.
We find that embeddings from current DNA LMs can approach performance of expert methods on some tasks, but only capture limited information about long-range features.
arXiv Detail & Related papers (2023-11-21T12:34:00Z) - DNAGPT: A Generalized Pre-trained Tool for Versatile DNA Sequence
Analysis Tasks [14.931476374660944]
DNAGPT is a generalized DNA pre-training model trained on over 200 billion base pairs from all mammals.
By enhancing the classic GPT model with a binary classification task, a numerical regression task, and a comprehensive token language, DNAGPT can handle versatile DNA analysis tasks.
arXiv Detail & Related papers (2023-07-11T06:30:43Z) - HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide
Resolution [76.97231739317259]
We present HyenaDNA, a genomic foundation model pretrained on the human reference genome with context lengths of up to 1 million tokens at the single nucleotide-level.
On fine-tuned benchmarks from the Nucleotide Transformer, HyenaDNA reaches state-of-the-art (SotA) on 12 of 18 datasets using a model with orders of magnitude less parameters and pretraining data.
arXiv Detail & Related papers (2023-06-27T20:46:34Z) - OFASys: A Multi-Modal Multi-Task Learning System for Building Generalist
Models [72.8156832931841]
Generalist models are capable of performing diverse multi-modal tasks in a task-agnostic way within a single model.
We release a generalist model learning system, OFASys, built on top of a declarative task interface named multi-modal instruction.
arXiv Detail & Related papers (2022-12-08T17:07:09Z) - Deep metric learning improves lab of origin prediction of genetically
engineered plasmids [63.05016513788047]
Genetic engineering attribution (GEA) is the ability to make sequence-lab associations.
We propose a method, based on metric learning, that ranks the most likely labs-of-origin.
We are able to extract key signatures in plasmid sequences for particular labs, allowing for an interpretable examination of the model's outputs.
arXiv Detail & Related papers (2021-11-24T16:29:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.