Efficient approximation of DNA hybridisation using deep learning
- URL: http://arxiv.org/abs/2102.10131v1
- Date: Fri, 19 Feb 2021 19:23:49 GMT
- Title: Efficient approximation of DNA hybridisation using deep learning
- Authors: David Buterez
- Abstract summary: We present the first comprehensive study of machine learning methods applied to the task of predicting DNA hybridisation.
We introduce a synthetic hybridisation dataset of over 2.5 million data points, enabling the use of a wide range of machine learning algorithms.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deoxyribonucleic acid (DNA) has shown great promise in enabling computational
applications, most notably in the fields of DNA data storage and DNA computing.
The former exploits the natural properties of DNA, such as high storage density
and longevity, for the archival of digital information, while the latter aims
to use the interactivity of DNA to encode computations. Recently, the two
paradigms were jointly used to formulate the near-data processing concept for
DNA databases, where the computations are performed directly on the stored
data. The fundamental, low-level operation that DNA naturally possesses is that
of hybridisation, also called annealing, of complementary sequences.
Information is encoded as DNA strands, which will naturally bind in solution,
thus enabling search and pattern-matching capabilities. Being able to control
and predict the process of hybridisation is crucial for the ambitious future of
the so-called Hybrid Molecular-Electronic Computing. Current tools are,
however, limited in terms of throughput and applicability to large-scale
problems.
In this work, we present the first comprehensive study of machine learning
methods applied to the task of predicting DNA hybridisation. For this purpose,
we introduce a synthetic hybridisation dataset of over 2.5 million data points,
enabling the use of a wide range of machine learning algorithms, including the
latest in deep learning. Depending on the hardware, the proposed models provide
a reduction in inference time ranging from one to over two orders of magnitude
compared to the state-of-the-art, while retaining high fidelity. We then
discuss the integration of our methods in modern, scalable workflows. The
implementation is available at:
https://github.com/davidbuterez/dna-hyb-deep-learning
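The abstract describes hybridisation as the binding of complementary strands, which is what makes search and pattern matching possible. As a rough illustration of that idea only, the sketch below computes a reverse complement and a naive base-pairing score for two equal-length strands. The function names `reverse_complement` and `complementarity` are hypothetical and are not taken from the paper or its repository. The score ignores thermodynamics (temperature, salt concentration, secondary structure), so it is not the predictor the paper proposes.

```python
# Toy illustration only: this is NOT the paper's method. Real hybridisation
# depends on thermodynamics, which is not modelled here.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(strand: str) -> str:
    """Return the reverse complement of a DNA strand written 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(strand.upper()))


def complementarity(strand_a: str, strand_b: str) -> float:
    """Fraction of positions at which strand_b base-pairs with strand_a
    when the two equal-length strands are aligned antiparallel."""
    if len(strand_a) != len(strand_b):
        raise ValueError("strands must have equal length")
    if not strand_a:
        raise ValueError("strands must be non-empty")
    target = reverse_complement(strand_a)
    matches = sum(x == y for x, y in zip(target, strand_b.upper()))
    return matches / len(strand_a)


query = "ATGCGT"
perfect_partner = reverse_complement(query)             # fully complementary
print(perfect_partner, complementarity(query, perfect_partner))
print(complementarity(query, "ACGCAA"))                  # one mismatch
```

A score of 1.0 indicates perfect Watson-Crick complementarity. Whether two strands actually anneal in solution is the harder question that thermodynamic tools, and the learned models in this work, aim to answer.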
Related papers
- HybriDNA: A Hybrid Transformer-Mamba2 Long-Range DNA Language Model [70.69095062674944]
We propose HybriDNA, a decoder-only DNA language model that incorporates a hybrid Transformer-Mamba2 architecture.
This hybrid design enables HybriDNA to efficiently process DNA sequences up to 131kb in length with single-nucleotide resolution.
HybriDNA achieves state-of-the-art performance across 33 DNA understanding datasets curated from the BEND, GUE, and LRB benchmarks.
arXiv Detail & Related papers (2025-02-15T14:23:43Z)
- Model Decides How to Tokenize: Adaptive DNA Sequence Tokenization with MxDNA [44.630039477717624]
MxDNA is a novel framework in which the model autonomously learns an effective DNA tokenization strategy through gradient descent.
We show that MxDNA learns a unique tokenization strategy, distinct from those of previous methods, and captures genomic functionalities at the token level during self-supervised pretraining.
arXiv Detail & Related papers (2024-12-18T10:55:43Z)
- SemAI: Semantic Artificial Intelligence-enhanced DNA storage for Internet-of-Things [9.858497777817522]
This paper introduces a Semantic Artificial Intelligence-enhanced DNA storage (SemAI-DNA) paradigm, distinguishing itself from prevalent deep learning-based methodologies.
Numerical results demonstrate SemAI-DNA's efficacy, attaining a 2.61 dB gain in Peak Signal-to-Noise Ratio (PSNR) and a 0.13 improvement in Structural Similarity Index (SSIM) over conventional deep learning-based approaches.
arXiv Detail & Related papers (2024-09-18T12:21:58Z)
- A Benchmark Dataset for Multimodal Prediction of Enzymatic Function Coupling DNA Sequences and Natural Language [3.384797724820242]
Predicting gene function from its DNA sequence is a fundamental challenge in biology.
Deep learning models have been proposed to embed DNA sequences and predict their enzymatic function.
Much of the scientific community's knowledge of biological function is not represented in categorical labels.
arXiv Detail & Related papers (2024-07-21T19:27:43Z)
- BEND: Benchmarking DNA Language Models on biologically meaningful tasks [7.005668635562045]
We introduce BEND, a Benchmark for DNA language models, featuring a collection of realistic and biologically meaningful downstream tasks.
We find that embeddings from current DNA LMs can approach the performance of expert methods on some tasks, but capture only limited information about long-range features.
arXiv Detail & Related papers (2023-11-21T12:34:00Z)
- HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution [76.97231739317259]
We present HyenaDNA, a genomic foundation model pretrained on the human reference genome with context lengths of up to 1 million tokens at the single-nucleotide level.
On fine-tuned benchmarks from the Nucleotide Transformer, HyenaDNA reaches state-of-the-art (SotA) on 12 of 18 datasets using a model with orders of magnitude fewer parameters and less pretraining data.
arXiv Detail & Related papers (2023-06-27T20:46:34Z)
- Efficient Automation of Neural Network Design: A Survey on Differentiable Neural Architecture Search [70.31239620427526]
Differentiable Neural Architecture Search (DNAS) has rapidly become the trending approach for automating the discovery of deep neural network architectures.
This rise is mainly due to the popularity of DARTS, one of the first major DNAS methods.
In this comprehensive survey, we focus specifically on DNAS and review recent approaches in this field.
arXiv Detail & Related papers (2023-04-11T13:15:29Z)
- Deep metric learning improves lab of origin prediction of genetically engineered plasmids [63.05016513788047]
Genetic engineering attribution (GEA) is the ability to make sequence-lab associations.
We propose a method, based on metric learning, that ranks the most likely labs-of-origin.
We are able to extract key signatures in plasmid sequences for particular labs, allowing for an interpretable examination of the model's outputs.
arXiv Detail & Related papers (2021-11-24T16:29:03Z)
- Deep DNA Storage: Scalable and Robust DNA Storage via Coding Theory and Deep Learning [49.3231734733112]
We show a modular and holistic approach that combines Deep Neural Networks (DNN) trained on simulated data, Tensor-Product (TP) based Error-Correcting Codes (ECC), and a safety margin into a single coherent pipeline.
Our work improves upon the current leading solutions with up to a 3200x increase in speed and a 40% improvement in accuracy, and offers a code rate of 1.6 bits per base in a high-noise regime.
arXiv Detail & Related papers (2021-08-31T18:21:20Z)