Dy-mer: An Explainable DNA Sequence Representation Scheme using Sparse Recovery
- URL: http://arxiv.org/abs/2407.12051v1
- Date: Sat, 6 Jul 2024 15:08:31 GMT
- Title: Dy-mer: An Explainable DNA Sequence Representation Scheme using Sparse Recovery
- Authors: Zhiyuan Peng, Yuanbo Tang, Yang Li
- Abstract summary: Dy-mer is an explainable and robust representation scheme based on sparse recovery.
It achieves state-of-the-art performance in DNA promoter classification, yielding a remarkable 13% increase in accuracy.
- Score: 6.733319363951907
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: DNA sequences encode vital genetic and biological information, yet these variable-length sequences cannot serve directly as input to common data mining algorithms. Hence, various representation schemes have been developed to transform DNA sequences into fixed-length numerical representations. However, these schemes face difficulties in learning high-quality representations due to the complexity and sparsity of DNA data. Additionally, DNA sequences are inherently noisy because of mutations. While several existing schemes are effective, they often lack semantic structure, making it difficult for biologists to validate and leverage the results. To address these challenges, we propose Dy-mer, an explainable and robust DNA representation scheme based on sparse recovery. Leveraging the underlying semantic structure of DNA, we modify traditional sparse recovery to capture recurring patterns indicative of biological functions: frequent K-mers serve as basis vectors, and each DNA sequence is reconstructed by simple concatenation of them. Experimental results demonstrate that Dy-mer achieves state-of-the-art performance in DNA promoter classification, yielding a remarkable 13% increase in accuracy. Moreover, its inherent explainability facilitates DNA clustering and motif detection, enhancing its utility in biological research.
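The core mechanism described above, frequent K-mers acting as basis vectors with each sequence reconstructed by concatenating them, can be made concrete with a short sketch. This is a toy illustration under our own assumptions (a fixed K, non-overlapping K-mers, a plain frequency cutoff in place of the paper's sparse-recovery optimization); all function names here are hypothetical, not the authors' code.

```python
# Toy sketch of the K-mer dictionary idea behind Dy-mer (not the authors' code).
# Assumptions: fixed K, non-overlapping K-mers, and a simple frequency cutoff;
# the paper's actual sparse-recovery formulation is more involved.
from collections import Counter

def chop(seq: str, k: int) -> list[str]:
    """Split a sequence into non-overlapping K-mers (a tail shorter than k is dropped)."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]

def build_kmer_basis(corpus: list[str], k: int, top_m: int) -> list[str]:
    """Use the top_m most frequent K-mers in the corpus as the basis."""
    counts = Counter(kmer for seq in corpus for kmer in chop(seq, k))
    return [kmer for kmer, _ in counts.most_common(top_m)]

def encode(seq: str, basis: list[str], k: int) -> list[int]:
    """Fixed-length sparse code: how often each basis K-mer occurs in the sequence."""
    index = {kmer: i for i, kmer in enumerate(basis)}
    code = [0] * len(basis)
    for kmer in chop(seq, k):
        if kmer in index:
            code[index[kmer]] += 1
    return code

corpus = ["ATGCGTATGC", "ATGCATGCGT", "GTATGCGTAT"]
basis = build_kmer_basis(corpus, k=2, top_m=4)
print(basis, encode(corpus[0], basis, k=2))
```

Because each dimension of the resulting code corresponds to a concrete K-mer, a biologist can read off which subsequences drive a downstream result, which is the explainability the abstract emphasizes.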
Related papers
- A Benchmark Dataset for Multimodal Prediction of Enzymatic Function Coupling DNA Sequences and Natural Language [3.384797724820242]
Predicting gene function from its DNA sequence is a fundamental challenge in biology.
Deep learning models have been proposed to embed DNA sequences and predict their enzymatic function.
Much of the scientific community's knowledge of biological function is not represented in categorical labels.
arXiv Detail & Related papers (2024-07-21T19:27:43Z)
- Semantically Rich Local Dataset Generation for Explainable AI in Genomics [0.716879432974126]
Black box deep learning models trained on genomic sequences excel at predicting the outcomes of different gene regulatory mechanisms.
We propose using Genetic Programming to generate datasets by evolving perturbations in sequences that contribute to their semantic diversity.
arXiv Detail & Related papers (2024-07-03T10:31:30Z)
- VQDNA: Unleashing the Power of Vector Quantization for Multi-Species Genomic Sequence Modeling [60.91599380893732]
VQDNA is a general-purpose framework that renovates genome tokenization from the perspective of genome vocabulary learning.
By leveraging vector-quantized codebooks as learnable vocabulary, VQDNA can adaptively tokenize genomes into pattern-aware embeddings.
arXiv Detail & Related papers (2024-05-13T20:15:03Z)
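As a reading aid for the VQDNA summary above, here is a toy version of the vector-quantization step it describes: each K-mer is encoded, then snapped to its nearest codebook vector, and the codebook index becomes the token. The random codebook, the stand-in per-base embeddings, and all names are our assumptions; in the paper the codebook is a learnable vocabulary trained end-to-end.

```python
# Toy vector-quantized tokenization in the spirit of VQDNA (not the paper's model).
import numpy as np

rng = np.random.default_rng(0)
CODEBOOK = rng.normal(size=(16, 8))             # 16 code vectors (random stand-ins for learned ones)
BASE = {b: rng.normal(size=8) for b in "ACGT"}  # stand-in per-base embeddings

def tokenize(seq: str, k: int = 4) -> list[int]:
    """Map each non-overlapping K-mer to the index of its nearest codebook vector."""
    tokens = []
    for i in range(0, len(seq) - k + 1, k):
        z = np.mean([BASE[b] for b in seq[i:i + k]], axis=0)                  # encode the K-mer
        tokens.append(int(np.argmin(np.linalg.norm(CODEBOOK - z, axis=1))))   # VQ lookup
    return tokens

print(tokenize("ATGCGTATGCAT"))  # e.g. three token ids in [0, 16)
```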
- Efficient and Scalable Fine-Tune of Language Models for Genome Understanding [49.606093223945734]
We present LINGO: Language prefix fIne-tuning for GenOmes.
Unlike DNA foundation models, LINGO strategically leverages natural language foundation models' contextual cues.
LINGO further accommodates numerous downstream fine-tuning tasks via an adaptive rank sampling method.
arXiv Detail & Related papers (2024-02-12T21:40:45Z)
- BEND: Benchmarking DNA Language Models on biologically meaningful tasks [7.005668635562045]
We introduce BEND, a Benchmark for DNA language models, featuring a collection of realistic and biologically meaningful downstream tasks.
We find that embeddings from current DNA LMs can approach performance of expert methods on some tasks, but only capture limited information about long-range features.
arXiv Detail & Related papers (2023-11-21T12:34:00Z)
- Embed-Search-Align: DNA Sequence Alignment using Transformer Models [2.48439258515764]
We bridge the gap by framing the sequence alignment task for Transformer models as an "Embed-Search-Align" task.
A novel Reference-Free DNA Embedding model generates embeddings of reads and reference fragments, which are projected into a shared vector space.
DNA-ESA is 99% accurate when aligning reads of length 250 onto a human genome (~3 Gb), rivaling conventional methods such as Bowtie and BWA-MEM.
arXiv Detail & Related papers (2023-09-20T06:30:39Z)
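The "Embed-Search-Align" recipe summarized above reduces to three steps: embed reference fragments and reads into one vector space, search for the read's nearest fragment, and report that fragment's offset as the alignment. A toy sketch under our own assumptions, with a hashed K-mer-count encoder standing in for the paper's Transformer embedding model:

```python
# Toy embed-search-align pipeline (a stand-in encoder, not the DNA-ESA model).
import numpy as np

def embed(seq: str, k: int = 3, dim: int = 64) -> np.ndarray:
    """Stand-in encoder: hashed K-mer counts, L2-normalized (consistent within one run)."""
    v = np.zeros(dim)
    for i in range(len(seq) - k + 1):
        v[hash(seq[i:i + k]) % dim] += 1.0
    return v / np.linalg.norm(v)

reference = "ATGCGTATGCATTAGCCGTAACGTTGCA" * 4
fragments = [(i, reference[i:i + 24]) for i in range(0, len(reference) - 23, 12)]
index = np.stack([embed(frag) for _, frag in fragments])  # the searchable index

def align(read: str) -> int:
    """Return the reference offset of the fragment nearest to the read's embedding."""
    scores = index @ embed(read)  # cosine similarity, since all vectors are unit-norm
    return fragments[int(np.argmax(scores))][0]

print(align(reference[12:36]))  # expected: 12
```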
- Fast and Functional Structured Data Generators Rooted in Out-of-Equilibrium Physics [62.997667081978825]
We address the challenge of using energy-based models to produce high-quality, label-specific data in structured datasets.
Traditional training methods encounter difficulties due to inefficient Markov chain Monte Carlo mixing.
We use a novel training algorithm that exploits non-equilibrium effects.
arXiv Detail & Related papers (2023-07-13T15:08:44Z)
- DNAGPT: A Generalized Pre-trained Tool for Versatile DNA Sequence Analysis Tasks [14.931476374660944]
DNAGPT is a generalized DNA pre-training model trained on over 200 billion base pairs from all mammals.
By enhancing the classic GPT model with a binary classification task, a numerical regression task, and a comprehensive token language, DNAGPT can handle versatile DNA analysis tasks.
arXiv Detail & Related papers (2023-07-11T06:30:43Z)
- HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution [76.97231739317259]
We present HyenaDNA, a genomic foundation model pretrained on the human reference genome with context lengths of up to 1 million tokens at the single-nucleotide level.
On fine-tuned benchmarks from the Nucleotide Transformer, HyenaDNA reaches state-of-the-art (SotA) on 12 of 18 datasets using a model with orders of magnitude fewer parameters and less pretraining data.
arXiv Detail & Related papers (2023-06-27T20:46:34Z)
- DynGFN: Towards Bayesian Inference of Gene Regulatory Networks with GFlowNets [81.75973217676986]
Gene regulatory networks (GRN) describe interactions between genes and their products that control gene expression and cellular function.
Existing methods focus either on (1) identifying cyclic structure from dynamics or on (2) learning complex Bayesian posteriors over DAGs, but not both.
In this paper we leverage the fact that it is possible to estimate the "velocity" of gene expression with RNA velocity techniques to develop an approach that addresses both challenges.
arXiv Detail & Related papers (2023-02-08T16:36:40Z)
- Graph Neural Networks for Microbial Genome Recovery [64.91162205624848]
We propose to use Graph Neural Networks (GNNs) to leverage the assembly graph when learning contig representations for metagenomic binning.
Our method, VaeG-Bin, combines variational autoencoders for learning latent representations of the individual contigs, with GNNs for refining these representations by taking into account the neighborhood structure of the contigs in the assembly graph.
arXiv Detail & Related papers (2022-04-26T12:49:51Z)
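To make the VaeG-Bin summary above concrete, here is a single toy refinement step of the kind it describes: per-contig latent vectors (random stand-ins for the VAE latents) are averaged with their assembly-graph neighbors, i.e., one mean-aggregation GNN round. This is our own sketch, not the paper's architecture.

```python
# Toy one-step graph smoothing of contig latents (our sketch, not VaeG-Bin's code).
import numpy as np

rng = np.random.default_rng(2)
latents = rng.normal(size=(5, 8))  # stand-in VAE latent vector per contig
edges = [(0, 1), (1, 2), (3, 4)]   # toy assembly-graph adjacency

def refine(z: np.ndarray, edges: list[tuple[int, int]]) -> np.ndarray:
    """One mean-aggregation message-passing step over the assembly graph."""
    neighbors = {i: [i] for i in range(len(z))}  # self-loop keeps each contig's own latent
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)
    return np.stack([z[neighbors[i]].mean(axis=0) for i in range(len(z))])

print(refine(latents, edges).shape)  # (5, 8): same contigs, graph-smoothed representations
```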
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.