Efficient approximation of DNA hybridisation using deep learning
- URL: http://arxiv.org/abs/2102.10131v1
- Date: Fri, 19 Feb 2021 19:23:49 GMT
- Title: Efficient approximation of DNA hybridisation using deep learning
- Authors: David Buterez
- Abstract summary: We present the first comprehensive study of machine learning methods applied to the task of predicting DNA hybridisation.
We introduce a synthetic hybridisation dataset of over 2.5 million data points, enabling the use of a wide range of machine learning algorithms.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deoxyribonucleic acid (DNA) has shown great promise in enabling computational
applications, most notably in the fields of DNA data storage and DNA computing.
The former exploits the natural properties of DNA, such as high storage density
and longevity, for the archival of digital information, while the latter aims
to use the interactivity of DNA to encode computations. Recently, the two
paradigms were jointly used to formulate the near-data processing concept for
DNA databases, where the computations are performed directly on the stored
data. The fundamental, low-level operation that DNA naturally possesses is that
of hybridisation, also called annealing, of complementary sequences.
Information is encoded as DNA strands, which will naturally bind in solution,
thus enabling search and pattern-matching capabilities. Being able to control
and predict the process of hybridisation is crucial for the ambitious future of
the so-called Hybrid Molecular-Electronic Computing. Current tools are,
however, limited in terms of throughput and applicability to large-scale
problems.
In this work, we present the first comprehensive study of machine learning
methods applied to the task of predicting DNA hybridisation. For this purpose,
we introduce a synthetic hybridisation dataset of over 2.5 million data points,
enabling the use of a wide range of machine learning algorithms, including the
latest in deep learning. Depending on the hardware, the proposed models provide
a reduction in inference time ranging from one to over two orders of magnitude
compared to the state-of-the-art, while retaining high fidelity. We then
discuss the integration of our methods in modern, scalable workflows. The
implementation is available at:
https://github.com/davidbuterez/dna-hyb-deep-learning
Related papers
- Weighted Diversified Sampling for Efficient Data-Driven Single-Cell Gene-Gene Interaction Discovery [56.622854875204645]
We present an innovative approach utilizing data-driven computational tools, leveraging an advanced Transformer model, to unearth gene-gene interactions.
A novel weighted diversified sampling algorithm computes the diversity score of each data sample in just two passes of the dataset.
arXiv Detail & Related papers (2024-10-21T03:35:23Z)
- SemAI: Semantic Artificial Intelligence-enhanced DNA storage for Internet-of-Things [9.858497777817522]
This paper introduces a Semantic Artificial Intelligence-enhanced DNA storage (SemAI-DNA) paradigm, distinguishing itself from prevalent deep learning-based methodologies.
Numerical results demonstrate the SemAI-DNA's efficacy, attaining 2.61 dB Peak Signal-to-Noise Ratio (PSNR) gain and 0.13 improvement in Structural Similarity Index (SSIM) over conventional deep learning-based approaches.
arXiv Detail & Related papers (2024-09-18T12:21:58Z)
- A Benchmark Dataset for Multimodal Prediction of Enzymatic Function Coupling DNA Sequences and Natural Language [3.384797724820242]
Predicting gene function from its DNA sequence is a fundamental challenge in biology.
Deep learning models have been proposed to embed DNA sequences and predict their enzymatic function.
Much of the scientific community's knowledge of biological function is not represented in categorical labels.
arXiv Detail & Related papers (2024-07-21T19:27:43Z)
- BEND: Benchmarking DNA Language Models on biologically meaningful tasks [7.005668635562045]
We introduce BEND, a Benchmark for DNA language models, featuring a collection of realistic and biologically meaningful downstream tasks.
We find that embeddings from current DNA LMs can approach performance of expert methods on some tasks, but only capture limited information about long-range features.
arXiv Detail & Related papers (2023-11-21T12:34:00Z)
- HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution [76.97231739317259]
We present HyenaDNA, a genomic foundation model pretrained on the human reference genome with context lengths of up to 1 million tokens at the single nucleotide-level.
On fine-tuned benchmarks from the Nucleotide Transformer, HyenaDNA reaches state-of-the-art (SotA) on 12 of 18 datasets using a model with orders of magnitude fewer parameters and less pretraining data.
arXiv Detail & Related papers (2023-06-27T20:46:34Z)
- Efficient Automation of Neural Network Design: A Survey on Differentiable Neural Architecture Search [70.31239620427526]
Differentiable Neural Architecture Search (DNAS) rapidly imposed itself as the trending approach to automate the discovery of deep neural network architectures.
This rise is mainly due to the popularity of DARTS, one of the first major DNAS methods.
In this comprehensive survey, we focus specifically on DNAS and review recent approaches in this field.
arXiv Detail & Related papers (2023-04-11T13:15:29Z)
- Deep Squared Euclidean Approximation to the Levenshtein Distance for DNA Storage [4.447467536572626]
Levenshtein distance is the most suitable metric on the similarity between two DNA sequences.
We propose a novel deep squared Euclidean embedding for DNA sequences using Siamese neural network, squared Euclidean embedding, and chi-squared regression.
arXiv Detail & Related papers (2022-07-11T07:59:36Z)
- SemanticCAP: Chromatin Accessibility Prediction Enhanced by Features Learning from a Language Model [3.0643865202019698]
We propose a new solution named SemanticCAP to identify accessible regions of the genome.
It introduces a gene language model which models the context of gene sequences, thus being able to provide an effective representation of gene sequences.
Compared with other systems under public benchmarks, our model proved to have better performance.
arXiv Detail & Related papers (2022-04-05T11:47:58Z)
- Deep metric learning improves lab of origin prediction of genetically engineered plasmids [63.05016513788047]
Genetic engineering attribution (GEA) is the ability to make sequence-lab associations.
We propose a method, based on metric learning, that ranks the most likely labs-of-origin.
We are able to extract key signatures in plasmid sequences for particular labs, allowing for an interpretable examination of the model's outputs.
arXiv Detail & Related papers (2021-11-24T16:29:03Z)
- Deep DNA Storage: Scalable and Robust DNA Storage via Coding Theory and Deep Learning [49.3231734733112]
We show a modular and holistic approach that combines Deep Neural Networks (DNN) trained on simulated data, Tensor-Product (TP) based Error-Correcting Codes (ECC) and a safety margin into a single coherent pipeline.
Our work improves upon the current leading solutions with up to a 3200x increase in speed and a 40% improvement in accuracy, and offers a code rate of 1.6 bits per base in a high noise regime.
arXiv Detail & Related papers (2021-08-31T18:21:20Z)
- One-step regression and classification with crosspoint resistive memory arrays [62.997667081978825]
High speed, low energy computing machines are in demand to enable real-time artificial intelligence at the edge.
One-step learning is supported by simulations of the prediction of the cost of a house in Boston and the training of a 2-layer neural network for MNIST digit recognition.
Results are all obtained in one computational step, thanks to the physical, parallel, and analog computing within the crosspoint array.
arXiv Detail & Related papers (2020-05-05T08:00:07Z)