HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide
Resolution
- URL: http://arxiv.org/abs/2306.15794v2
- Date: Tue, 14 Nov 2023 07:09:04 GMT
- Title: HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide
Resolution
- Authors: Eric Nguyen, Michael Poli, Marjan Faizi, Armin Thomas, Callum
Birch-Sykes, Michael Wornow, Aman Patel, Clayton Rabideau, Stefano Massaroli,
Yoshua Bengio, Stefano Ermon, Stephen A. Baccus, Chris Ré
- Abstract summary: We present HyenaDNA, a genomic foundation model pretrained on the human reference genome with context lengths of up to 1 million tokens at the single-nucleotide level.
On fine-tuned benchmarks from the Nucleotide Transformer, HyenaDNA reaches state-of-the-art (SotA) on 12 of 18 datasets using a model with orders of magnitude fewer parameters and less pretraining data.
- Score: 76.97231739317259
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Genomic (DNA) sequences encode an enormous amount of information for gene
regulation and protein synthesis. Similar to natural language models,
researchers have proposed foundation models in genomics to learn generalizable
features from unlabeled genome data that can then be fine-tuned for downstream
tasks such as identifying regulatory elements. Due to the quadratic scaling of
attention, previous Transformer-based genomic models have used 512 to 4k tokens
as context (<0.001% of the human genome), significantly limiting the modeling
of long-range interactions in DNA. In addition, these methods rely on
tokenizers or fixed k-mers to aggregate meaningful DNA units, losing single
nucleotide resolution where subtle genetic variations can completely alter
protein function via single nucleotide polymorphisms (SNPs). Recently, Hyena, a
large language model based on implicit convolutions, was shown to match
attention in quality while allowing longer context lengths and lower time
complexity. Leveraging Hyena's new long-range capabilities, we present
HyenaDNA, a genomic foundation model pretrained on the human reference genome
with context lengths of up to 1 million tokens at the single-nucleotide level -
an up to 500x increase over previous dense attention-based models. HyenaDNA
scales sub-quadratically in sequence length (training up to 160x faster than
Transformer), uses single nucleotide tokens, and has full global context at
each layer. We explore what longer context enables - including the first use of
in-context learning in genomics. On fine-tuned benchmarks from the Nucleotide
Transformer, HyenaDNA reaches state-of-the-art (SotA) on 12 of 18 datasets
using a model with orders of magnitude fewer parameters and less pretraining data. On
the GenomicBenchmarks, HyenaDNA surpasses SotA on 7 of 8 datasets, by an average of
+10 accuracy points. Code at https://github.com/HazyResearch/hyena-dna.
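
To make the tokenization point in the abstract concrete, here is a minimal Python sketch (not taken from the HyenaDNA codebase) contrasting a fixed k-mer vocabulary with the character-level, single-nucleotide tokens HyenaDNA uses; the helper names kmer_tokenize and char_tokenize are illustrative only.

# Minimal sketch: k-mer tokenization vs. single-nucleotide (character-level)
# tokenization of DNA. Helper names are illustrative, not from HyenaDNA.

def kmer_tokenize(seq: str, k: int = 6) -> list[str]:
    """Split a DNA sequence into non-overlapping k-mers (fixed-vocabulary style)."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]

# Character-level vocabulary: one token per nucleotide, plus an unknown symbol.
CHAR_VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4}

def char_tokenize(seq: str) -> list[int]:
    """Map each nucleotide to its own token id (single-nucleotide resolution)."""
    return [CHAR_VOCAB.get(base, CHAR_VOCAB["N"]) for base in seq]

reference = "ACGTACGTACGT"
variant   = "ACGTACTTACGT"   # single nucleotide polymorphism at position 6 (G -> T)

# With 6-mers, the single base change swaps out an entire vocabulary item ...
print(kmer_tokenize(reference))  # ['ACGTAC', 'GTACGT']
print(kmer_tokenize(variant))    # ['ACGTAC', 'TTACGT']

# ... while character-level tokens differ at exactly one position,
# which is what "single nucleotide resolution" refers to in the abstract.
diffs = [i for i, (a, b) in enumerate(zip(char_tokenize(reference),
                                          char_tokenize(variant))) if a != b]
print(diffs)  # [6]

The k-mer view maps a single-base change onto a different coarse token, whereas the character-level view localizes it to one position, which is what makes SNP-level effects directly addressable by the model.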
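The abstract also describes pretraining on the human reference genome at single-nucleotide resolution. The following hedged sketch shows the generic causal next-nucleotide objective this implies, with a tiny GRU standing in for the actual Hyena-based architecture; class and variable names such as TinyCausalLM are hypothetical.

# Hedged sketch of the pretraining objective implied by the abstract:
# causal next-nucleotide prediction over single-nucleotide tokens.
# The GRU is a stand-in sequence mixer, not the HyenaDNA architecture.
import torch
import torch.nn as nn

VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4}

class TinyCausalLM(nn.Module):
    def __init__(self, vocab_size: int = 5, d_model: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.rnn = nn.GRU(d_model, d_model, batch_first=True)  # stand-in mixer
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        h, _ = self.rnn(self.embed(tokens))
        return self.head(h)                      # (batch, L, vocab) logits

seq = "ACGTACGTTGCA" * 8                         # toy stretch of genome
ids = torch.tensor([[VOCAB[b] for b in seq]])    # (1, L) single-nucleotide tokens
inputs, targets = ids[:, :-1], ids[:, 1:]        # predict the next nucleotide

model = TinyCausalLM()
logits = model(inputs)
loss = nn.functional.cross_entropy(logits.reshape(-1, 5), targets.reshape(-1))
loss.backward()                                  # one pretraining step (optimizer omitted)
print(float(loss))

In HyenaDNA itself the sequence mixer is the Hyena operator rather than a recurrent layer, and the context window extends up to 1 million tokens.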
Related papers
- VQDNA: Unleashing the Power of Vector Quantization for Multi-Species Genomic Sequence Modeling [60.91599380893732]
VQDNA is a general-purpose framework that renovates genome tokenization from the perspective of genome vocabulary learning.
By leveraging vector-quantized codebooks as learnable vocabulary, VQDNA can adaptively tokenize genomes into pattern-aware embeddings.
arXiv Detail & Related papers (2024-05-13T20:15:03Z)
- Efficient and Scalable Fine-Tune of Language Models for Genome Understanding [49.606093223945734]
We present Lingo: Language prefix fIne-tuning for GenOmes.
Unlike DNA foundation models, Lingo strategically leverages natural language foundation models' contextual cues.
Lingo further accommodates numerous downstream fine-tuning tasks via an adaptive rank sampling method.
arXiv Detail & Related papers (2024-02-12T21:40:45Z)
- BEND: Benchmarking DNA Language Models on biologically meaningful tasks [7.005668635562045]
We introduce BEND, a Benchmark for DNA language models, featuring a collection of realistic and biologically meaningful downstream tasks.
We find that embeddings from current DNA LMs can approach performance of expert methods on some tasks, but only capture limited information about long-range features.
arXiv Detail & Related papers (2023-11-21T12:34:00Z)
- scHyena: Foundation Model for Full-Length Single-Cell RNA-Seq Analysis in Brain [46.39828178736219]
We introduce scHyena, a foundation model designed to address these challenges and enhance the accuracy of scRNA-seq analysis in the brain.
scHyena is equipped with a linear adaptor layer, the positional encoding via gene-embedding, and a bidirectional Hyena operator.
This enables us to process full-length scRNA-seq data without losing any information from the raw data.
arXiv Detail & Related papers (2023-10-04T10:30:08Z)
- Embed-Search-Align: DNA Sequence Alignment using Transformer Models [2.48439258515764]
We bridge the gap by framing the sequence alignment task for Transformer models as an "Embed-Search-Align" task.
A novel Reference-Free DNA Embedding model generates embeddings of reads and reference fragments, which are projected into a shared vector space.
DNA-ESA is 99% accurate when aligning 250-nucleotide reads onto the human genome (about 3 Gbp), rivaling conventional methods such as Bowtie and BWA-Mem.
arXiv Detail & Related papers (2023-09-20T06:30:39Z)
- DNAGPT: A Generalized Pre-trained Tool for Versatile DNA Sequence Analysis Tasks [14.931476374660944]
DNAGPT is a generalized DNA pre-training model trained on over 200 billion base pairs from all mammals.
By enhancing the classic GPT model with a binary classification task, a numerical regression task, and a comprehensive token language, DNAGPT can handle versatile DNA analysis tasks.
arXiv Detail & Related papers (2023-07-11T06:30:43Z)
- Hyena Hierarchy: Towards Larger Convolutional Language Models [115.82857881546089]
Hyena is a subquadratic drop-in replacement for attention constructed by interleaving implicitly parametrized long convolutions and data-controlled gating; a minimal sketch of this style of operator appears after this list.
In recall and reasoning tasks on sequences of thousands to hundreds of thousands of tokens, Hyena improves accuracy by more than 50 points over operators relying on state-spaces and other implicit and explicit methods.
arXiv Detail & Related papers (2023-02-21T18:29:25Z)
- Deep metric learning improves lab of origin prediction of genetically engineered plasmids [63.05016513788047]
Genetic engineering attribution (GEA) is the ability to make sequence-lab associations.
We propose a method, based on metric learning, that ranks the most likely labs-of-origin.
We are able to extract key signatures in plasmid sequences for particular labs, allowing for an interpretable examination of the model's outputs.
arXiv Detail & Related papers (2021-11-24T16:29:03Z)
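
As flagged in the Hyena Hierarchy entry above, HyenaDNA's backbone replaces attention with implicitly parametrized long convolutions plus data-controlled gating. The sketch below is a toy version of that idea under stated assumptions, not the reference implementation: an MLP generates the convolution filter from positions, the convolution is applied with an FFT in O(L log L), and a sigmoid gate modulates the output. The real operator's depthwise structure, filter parametrization, and recurrence order are omitted, and names like ToyHyenaBlock are hypothetical.

# Hedged sketch of a Hyena-style block: an implicitly parametrized long
# convolution (applied here with an FFT) combined with data-controlled gating.
# Illustration of the idea only, not the HyenaDNA reference implementation.
import torch
import torch.nn as nn

class ImplicitFilter(nn.Module):
    """Generates a length-L convolution filter from positional features,
    so the filter is parametrized implicitly rather than stored explicitly."""

    def __init__(self, d_model: int, hidden: int = 32):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(1, hidden), nn.GELU(), nn.Linear(hidden, d_model)
        )

    def forward(self, seq_len: int) -> torch.Tensor:
        t = torch.linspace(0, 1, seq_len).unsqueeze(-1)   # (L, 1) positions
        return self.mlp(t)                                 # (L, d_model) filter

def fft_causal_conv(u: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
    """Causal long convolution via FFT: O(L log L) in sequence length L.
    u: (batch, L, d_model) inputs, h: (L, d_model) filter."""
    L = u.shape[1]
    U = torch.fft.rfft(u, n=2 * L, dim=1)
    H = torch.fft.rfft(h, n=2 * L, dim=0)
    return torch.fft.irfft(U * H.unsqueeze(0), n=2 * L, dim=1)[:, :L]

class ToyHyenaBlock(nn.Module):
    """One gated long-convolution layer: y = gate(x) * conv(value(x))."""

    def __init__(self, d_model: int):
        super().__init__()
        self.gate = nn.Linear(d_model, d_model)
        self.value = nn.Linear(d_model, d_model)
        self.filter = ImplicitFilter(d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, L, d_model)
        h = self.filter(x.shape[1])                       # implicit filter
        y = fft_causal_conv(self.value(x), h)             # global context, subquadratic
        return self.out(torch.sigmoid(self.gate(x)) * y)  # data-controlled gating

# Usage: single-nucleotide token embeddings over a long context window.
x = torch.randn(2, 4096, 64)          # (batch, sequence length, model width)
print(ToyHyenaBlock(64)(x).shape)     # torch.Size([2, 4096, 64])

The FFT step is what lets each layer keep full global context at every position while scaling sub-quadratically in sequence length, the property the HyenaDNA abstract credits for training up to 160x faster than a Transformer at long contexts.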
This list is automatically generated from the titles and abstracts of the papers in this site.