Efficient approximation of DNA hybridisation using deep learning
- URL: http://arxiv.org/abs/2102.10131v1
- Date: Fri, 19 Feb 2021 19:23:49 GMT
- Title: Efficient approximation of DNA hybridisation using deep learning
- Authors: David Buterez
- Abstract summary: We present the first comprehensive study of machine learning methods applied to the task of predicting DNA hybridisation.
We introduce a synthetic hybridisation dataset of over 2.5 million data points, enabling the use of a wide range of machine learning algorithms.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deoxyribonucleic acid (DNA) has shown great promise in enabling computational
applications, most notably in the fields of DNA data storage and DNA computing.
The former exploits the natural properties of DNA, such as high storage density
and longevity, for the archival of digital information, while the latter aims
to use the interactivity of DNA to encode computations. Recently, the two
paradigms were jointly used to formulate the near-data processing concept for
DNA databases, where the computations are performed directly on the stored
data. The fundamental, low-level operation that DNA naturally possesses is that
of hybridisation, also called annealing, of complementary sequences.
Information is encoded as DNA strands, which will naturally bind in solution,
thus enabling search and pattern-matching capabilities. Being able to control
and predict the process of hybridisation is crucial for the ambitious future of
the so-called Hybrid Molecular-Electronic Computing. Current tools are,
however, limited in terms of throughput and applicability to large-scale
problems.
In this work, we present the first comprehensive study of machine learning
methods applied to the task of predicting DNA hybridisation. For this purpose,
we introduce a synthetic hybridisation dataset of over 2.5 million data points,
enabling the use of a wide range of machine learning algorithms, including the
latest in deep learning. Depending on the hardware, the proposed models provide
a reduction in inference time ranging from one to over two orders of magnitude
compared to the state-of-the-art, while retaining high fidelity. We then
discuss the integration of our methods in modern, scalable workflows. The
implementation is available at:
https://github.com/davidbuterez/dna-hyb-deep-learning
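The hybridisation mechanism described above can be illustrated with a toy baseline. This is a hedged sketch, not the paper's implementation: it assumes equal-length strands and uses an arbitrary match-fraction threshold, whereas the paper's models learn hybridisation behaviour from over 2.5 million data points and real predictors must account for thermodynamics and partial alignments.

```python
# Illustrative sketch (not the paper's method): a naive baseline that treats
# two strands as likely to hybridise when one is close to the reverse
# complement of the other. The 0.8 threshold is a hypothetical illustration.

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))

def hybridisation_score(a: str, b: str) -> float:
    """Fraction of positions where `b` matches the reverse complement of `a`.

    Assumes equal-length strands; real predictors must also model
    thermodynamics, strand offsets, and partial alignments.
    """
    rc = reverse_complement(a)
    matches = sum(x == y for x, y in zip(rc, b))
    return matches / len(a)

def likely_hybridises(a: str, b: str, threshold: float = 0.8) -> bool:
    """Naive yes/no call; the threshold value is an arbitrary assumption."""
    return hybridisation_score(a, b) >= threshold
```

For example, `hybridisation_score("ATGC", "GCAT")` is 1.0, since `GCAT` is the exact reverse complement of `ATGC`.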
Related papers
- A Benchmark Dataset for Multimodal Prediction of Enzymatic Function Coupling DNA Sequences and Natural Language [3.384797724820242]
Predicting gene function from its DNA sequence is a fundamental challenge in biology.
Deep learning models have been proposed to embed DNA sequences and predict their enzymatic function.
Much of the scientific community's knowledge of biological function is not represented in categorical labels.
arXiv Detail & Related papers (2024-07-21T19:27:43Z)
- DNABERT-S: Learning Species-Aware DNA Embedding with Genome Foundation Models [8.159258510270243]

We introduce DNABERT-S, a genome foundation model that specializes in creating species-aware DNA embeddings.
We introduce MI-Mix, a contrastive objective that mixes the hidden representations of DNA sequences at randomly selected layers and trains the model to recognize and differentiate these mixed proportions at the output layer.
Empirical results on 18 diverse datasets showed DNABERT-S's remarkable performance.
arXiv Detail & Related papers (2024-02-13T20:21:29Z)
- BEND: Benchmarking DNA Language Models on biologically meaningful tasks [7.005668635562045]
We introduce BEND, a Benchmark for DNA language models, featuring a collection of realistic and biologically meaningful downstream tasks.
We find that embeddings from current DNA LMs can approach performance of expert methods on some tasks, but only capture limited information about long-range features.
arXiv Detail & Related papers (2023-11-21T12:34:00Z)
- HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution [76.97231739317259]
We present HyenaDNA, a genomic foundation model pretrained on the human reference genome with context lengths of up to 1 million tokens at the single nucleotide-level.
On fine-tuned benchmarks from the Nucleotide Transformer, HyenaDNA reaches state-of-the-art (SotA) on 12 of 18 datasets using a model with orders of magnitude fewer parameters and less pretraining data.
arXiv Detail & Related papers (2023-06-27T20:46:34Z)
- Efficient Automation of Neural Network Design: A Survey on Differentiable Neural Architecture Search [70.31239620427526]
Differentiable Neural Architecture Search (DNAS) has rapidly established itself as the leading approach to automating the discovery of deep neural network architectures.
This rise is mainly due to the popularity of DARTS, one of the first major DNAS methods.
In this comprehensive survey, we focus specifically on DNAS and review recent approaches in this field.
arXiv Detail & Related papers (2023-04-11T13:15:29Z)
- Deep Squared Euclidean Approximation to the Levenshtein Distance for DNA Storage [4.447467536572626]
Levenshtein distance is the most suitable metric for measuring the similarity between two DNA sequences.
We propose a novel deep squared Euclidean embedding for DNA sequences using a Siamese neural network, squared Euclidean embedding, and chi-squared regression.
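For reference, the quantity this embedding approximates is the standard dynamic-programming edit distance. The sketch below is the textbook algorithm, not the paper's network:

```python
# Textbook Levenshtein (edit) distance via dynamic programming, using a
# rolling one-row table. Counts the minimum number of insertions, deletions,
# and substitutions needed to turn `s` into `t`.

def levenshtein(s: str, t: str) -> int:
    """Edit distance between two sequences, e.g. DNA strands."""
    prev = list(range(len(t) + 1))  # distances from "" to prefixes of t
    for i, cs in enumerate(s, start=1):
        curr = [i]  # distance from s[:i] to ""
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]
```

For example, `levenshtein("ACGT", "AGT")` is 1 (one deletion). The dynamic program runs in O(|s|·|t|) time, which is the bottleneck the learned embedding is designed to avoid.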
arXiv Detail & Related papers (2022-07-11T07:59:36Z)
- SemanticCAP: Chromatin Accessibility Prediction Enhanced by Features Learning from a Language Model [3.0643865202019698]
We propose a new solution named SemanticCAP to identify accessible regions of the genome.
It introduces a gene language model that captures the context of gene sequences, thereby providing an effective representation of them.
Compared with other systems on public benchmarks, our model achieves better performance.
arXiv Detail & Related papers (2022-04-05T11:47:58Z)
- Deep metric learning improves lab of origin prediction of genetically engineered plasmids [63.05016513788047]
Genetic engineering attribution (GEA) is the ability to make sequence-lab associations.
We propose a method, based on metric learning, that ranks the most likely labs-of-origin.
We are able to extract key signatures in plasmid sequences for particular labs, allowing for an interpretable examination of the model's outputs.
arXiv Detail & Related papers (2021-11-24T16:29:03Z)
- Deep DNA Storage: Scalable and Robust DNA Storage via Coding Theory and Deep Learning [49.3231734733112]
We show a modular and holistic approach that combines Deep Neural Networks (DNN) trained on simulated data, Tensor-Product (TP) based Error-Correcting Codes (ECC), and a safety margin into a single coherent pipeline.
Our work improves upon the current leading solutions with up to a 3200x increase in speed and a 40% improvement in accuracy, and offers a code rate of 1.6 bits per base in a high-noise regime.
arXiv Detail & Related papers (2021-08-31T18:21:20Z)
- Deep Learning of High-Order Interactions for Protein Interface Prediction [58.164371994210406]
We propose to formulate the protein interface prediction as a 2D dense prediction problem.
We represent proteins as graphs and employ graph neural networks to learn node features.
We incorporate high-order pairwise interactions to generate a 3D tensor containing different pairwise interactions.
arXiv Detail & Related papers (2020-07-18T05:39:35Z)
- One-step regression and classification with crosspoint resistive memory arrays [62.997667081978825]
High speed, low energy computing machines are in demand to enable real-time artificial intelligence at the edge.
One-step learning is demonstrated in simulations of Boston house-price prediction and the training of a 2-layer neural network for MNIST digit recognition.
Results are all obtained in one computational step, thanks to the physical, parallel, and analog computing within the crosspoint array.
arXiv Detail & Related papers (2020-05-05T08:00:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.