HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide
  Resolution
- URL: http://arxiv.org/abs/2306.15794v2
- Date: Tue, 14 Nov 2023 07:09:04 GMT
- Title: HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide
  Resolution
- Authors: Eric Nguyen, Michael Poli, Marjan Faizi, Armin Thomas, Callum
  Birch-Sykes, Michael Wornow, Aman Patel, Clayton Rabideau, Stefano Massaroli,
  Yoshua Bengio, Stefano Ermon, Stephen A. Baccus, Chris Ré
- Abstract summary: We present HyenaDNA, a genomic foundation model pretrained on the human reference genome with context lengths of up to 1 million tokens at the single-nucleotide level.
On fine-tuned benchmarks from the Nucleotide Transformer, HyenaDNA reaches state-of-the-art (SotA) on 12 of 18 datasets using a model with orders of magnitude fewer parameters and less pretraining data.
- Score: 76.97231739317259
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Genomic (DNA) sequences encode an enormous amount of information for gene
regulation and protein synthesis. Similar to natural language models,
researchers have proposed foundation models in genomics to learn generalizable
features from unlabeled genome data that can then be fine-tuned for downstream
tasks such as identifying regulatory elements. Due to the quadratic scaling of
attention, previous Transformer-based genomic models have used 512 to 4k tokens
as context (<0.001% of the human genome), significantly limiting the modeling
of long-range interactions in DNA. In addition, these methods rely on
tokenizers or fixed k-mers to aggregate meaningful DNA units, losing single
nucleotide resolution where subtle genetic variations can completely alter
protein function via single nucleotide polymorphisms (SNPs). Recently, Hyena, a
large language model based on implicit convolutions, was shown to match
attention in quality while allowing longer context lengths and lower time
complexity. Leveraging Hyena's new long-range capabilities, we present
HyenaDNA, a genomic foundation model pretrained on the human reference genome
with context lengths of up to 1 million tokens at the single-nucleotide level -
an up to 500x increase over previous dense attention-based models. HyenaDNA
scales sub-quadratically in sequence length (training up to 160x faster than
Transformer), uses single nucleotide tokens, and has full global context at
each layer. We explore what longer context enables - including the first use of
in-context learning in genomics. On fine-tuned benchmarks from the Nucleotide
Transformer, HyenaDNA reaches state-of-the-art (SotA) on 12 of 18 datasets
using a model with orders of magnitude fewer parameters and less pretraining data. On
the GenomicBenchmarks, HyenaDNA surpasses SotA on 7 of 8 datasets on average by
+10 accuracy points. Code at https://github.com/HazyResearch/hyena-dna.
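
The abstract's contrast between fixed k-mers and single nucleotide tokens, and its context-coverage figure, can be made concrete with a small sketch. Everything here is an assumption for illustration: the vocabulary `VOCAB`, the helper names, and the approximate genome length `HUMAN_GENOME_BP` are hypothetical and are not taken from the HyenaDNA codebase.

```python
# Illustrative sketch only: vocabulary, helper names, and genome size are
# assumptions, not taken from the HyenaDNA codebase.

# Hypothetical single-nucleotide vocabulary (one token per base).
VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4}

# Approximate length of the human reference genome in base pairs (assumed).
HUMAN_GENOME_BP = 3.1e9


def tokenize_nucleotides(seq):
    """Map each base to one token id, preserving single-nucleotide resolution."""
    return [VOCAB[base] for base in seq.upper()]


def tokenize_kmers(seq, k=6):
    """Split a sequence into non-overlapping fixed k-mers (last piece may be shorter)."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq), k)]


def count_differences(tokens_a, tokens_b):
    """Count positions where two equal-length token lists disagree."""
    return sum(a != b for a, b in zip(tokens_a, tokens_b))


def context_fraction_percent(context_bp, genome_bp=HUMAN_GENOME_BP):
    """Percentage of the genome covered by a context window of context_bp bases."""
    return 100.0 * context_bp / genome_bp


if __name__ == "__main__":
    reference = "ACGTACGTACGT"
    variant = "ACGTACCTACGT"  # one substitution (G -> C) at index 6

    # Single-nucleotide tokens: exactly one token id changes, at the SNP position.
    print(count_differences(tokenize_nucleotides(reference),
                            tokenize_nucleotides(variant)))

    # 6-mers: the sequence is 6x shorter, but the SNP replaces a whole k-mer token.
    print(tokenize_kmers(reference), tokenize_kmers(variant))

    # Context coverage for a 4k-token and a 1M-token window at one base per token.
    for context in (4096, 1_000_000):
        print(context, f"{context_fraction_percent(context):.6f}%")
```

With one token per base, a SNP shows up as a single changed token at its exact position, at the cost of a sequence k times longer than a k-mer encoding, which is why long context lengths matter for this tokenization.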
 
      
Related papers
        - Learning Genomic Structure from $k$-mers [2.07180164747172]
 We present a method for analyzing read data using contrastive learning.
An encoder model is trained to produce embeddings that cluster together sequences from the same genomic region.
The model can also be trained fully self-supervised on read data, enabling analysis without the need to construct a full genome assembly.
 arXiv  Detail & Related papers  (2025-05-22T13:46:18Z)
- HybriDNA: A Hybrid Transformer-Mamba2 Long-Range DNA Language Model [70.69095062674944]
 We propose HybriDNA, a decoder-only DNA language model that incorporates a hybrid Transformer-Mamba2 architecture.
This hybrid design enables HybriDNA to efficiently process DNA sequences up to 131kb in length with single-nucleotide resolution.
HybriDNA achieves state-of-the-art performance across 33 DNA understanding datasets curated from the BEND, GUE, and LRB benchmarks.
 arXiv  Detail & Related papers  (2025-02-15T14:23:43Z)
- GENERator: A Long-Context Generative Genomic Foundation Model [66.46537421135996]
 We present GENERator, a generative genomic foundation model featuring a context length of 98k base pairs (bp) and 1.2B parameters.
Trained on an expansive dataset comprising 386B bp of DNA, the GENERator demonstrates state-of-the-art performance across both established and newly proposed benchmarks.
It also shows significant promise in sequence optimization, particularly through the prompt-responsive generation of enhancer sequences with specific activity profiles.
 arXiv  Detail & Related papers  (2025-02-11T05:39:49Z)
- Model Decides How to Tokenize: Adaptive DNA Sequence Tokenization with MxDNA [44.630039477717624]
 MxDNA is a novel framework where the model autonomously learns an effective DNA tokenization strategy through gradient descent.
We show that MxDNA learns a unique tokenization strategy distinct from those of previous methods and captures genomic functionalities at a token level during self-supervised pretraining.
 arXiv  Detail & Related papers  (2024-12-18T10:55:43Z)
- VQDNA: Unleashing the Power of Vector Quantization for Multi-Species Genomic Sequence Modeling [60.91599380893732]
 VQDNA is a general-purpose framework that renovates genome tokenization from the perspective of genome vocabulary learning.
By leveraging vector-quantized codebooks as learnable vocabulary, VQDNA can adaptively tokenize genomes into pattern-aware embeddings.
 arXiv  Detail & Related papers  (2024-05-13T20:15:03Z)
- Efficient and Scalable Fine-Tune of Language Models for Genome
  Understanding [49.606093223945734]
 We present Lingo: Language prefix fIne-tuning for GenOmes.
Unlike DNA foundation models, Lingo strategically leverages natural language foundation models' contextual cues.
Lingo further accommodates numerous downstream fine-tune tasks by an adaptive rank sampling method.
 arXiv  Detail & Related papers  (2024-02-12T21:40:45Z)
- BEND: Benchmarking DNA Language Models on biologically meaningful tasks [7.005668635562045]
 We introduce BEND, a Benchmark for DNA language models, featuring a collection of realistic and biologically meaningful downstream tasks.
We find that embeddings from current DNA LMs can approach performance of expert methods on some tasks, but only capture limited information about long-range features.
 arXiv  Detail & Related papers  (2023-11-21T12:34:00Z)
- scHyena: Foundation Model for Full-Length Single-Cell RNA-Seq Analysis
  in Brain [46.39828178736219]
 We introduce scHyena, a foundation model designed to address these challenges and enhance the accuracy of scRNA-seq analysis in the brain.
scHyena is equipped with a linear adaptor layer, the positional encoding via gene-embedding, and a bidirectional Hyena operator.
This enables us to process full-length scRNA-seq data without losing any information from the raw data.
 arXiv  Detail & Related papers  (2023-10-04T10:30:08Z)
- Embed-Search-Align: DNA Sequence Alignment using Transformer Models [2.48439258515764]
 We bridge the gap by framing the sequence alignment task for Transformer models as an "Embed-Search-Align" task.
A novel Reference-Free DNA Embedding model generates embeddings of reads and reference fragments, which are projected into a shared vector space.
DNA-ESA is 99% accurate when aligning 250-length reads onto a human genome (3 Gb), rivaling conventional methods such as Bowtie and BWA-Mem.
 arXiv  Detail & Related papers  (2023-09-20T06:30:39Z)
- DNAGPT: A Generalized Pre-trained Tool for Versatile DNA Sequence
  Analysis Tasks [14.931476374660944]
 DNAGPT is a generalized DNA pre-training model trained on over 200 billion base pairs from all mammals.
By enhancing the classic GPT model with a binary classification task, a numerical regression task, and a comprehensive token language, DNAGPT can handle versatile DNA analysis tasks.
 arXiv  Detail & Related papers  (2023-07-11T06:30:43Z)
- Hyena Hierarchy: Towards Larger Convolutional Language Models [115.82857881546089]
 Hyena is a subquadratic drop-in replacement for attention constructed by interleaving implicitly parametrized long convolutions and data-controlled gating.
In recall and reasoning tasks on sequences of thousands to hundreds of thousands of tokens, Hyena improves accuracy by more than 50 points over operators relying on state-spaces and other implicit and explicit methods.
 arXiv  Detail & Related papers  (2023-02-21T18:29:25Z)
- Deep metric learning improves lab of origin prediction of genetically
  engineered plasmids [63.05016513788047]
 Genetic engineering attribution (GEA) is the ability to make sequence-lab associations.
We propose a method, based on metric learning, that ranks the most likely labs-of-origin.
We are able to extract key signatures in plasmid sequences for particular labs, allowing for an interpretable examination of the model's outputs.
 arXiv  Detail & Related papers  (2021-11-24T16:29:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
       
     