DNA data storage, sequencing data-carrying DNA
- URL: http://arxiv.org/abs/2205.05488v1
- Date: Wed, 11 May 2022 13:31:57 GMT
- Title: DNA data storage, sequencing data-carrying DNA
- Authors: Jasmine Quah, Omer Sella, Thomas Heinis
- Abstract summary: We study accuracy trade-offs between deep model size and error correcting codes.
We show that, starting with a model size of 107MB, the reduced accuracy from model compression can be compensated by using simple error correcting codes.
- Score: 2.4493299476776778
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: DNA is a leading candidate as the next archival storage media due to its
density, durability and sustainability. To read (and write) data DNA storage
exploits technology that has been developed over decades to sequence naturally
occurring DNA in the life sciences. To achieve higher accuracy for previously
unseen, biological DNA, sequencing relies on extending and training deep
machine learning models known as basecallers. This growth in model complexity
requires substantial resources, both computational and data sets. It also
eliminates the possibility of a compact read head for DNA as a storage medium.
We argue that we need to depart from blindly using sequencing models from the
life sciences for DNA data storage. The difference is striking: for life
science applications we have no control over the DNA; in the case of DNA data
storage, however, we control how it is written, as well as the particular write
head. More specifically, data-carrying DNA can be modulated and embedded with
alignment markers and error correcting codes to guarantee higher fidelity and
to carry out some of the work that the machine learning models perform.
In this paper, we study accuracy trade-offs between deep model size and error
correcting codes. We show that, starting with a model size of 107MB, the
reduced accuracy from model compression can be compensated by using simple
error correcting codes in the DNA sequences. In our experiments, we show that a
substantial reduction in the size of the model does not incur an undue penalty
on the error correcting codes used, therefore paving the way for a portable
read head for data-carrying DNA. Crucially, we show that through the joint use
of model compression and error correcting codes, we achieve a higher read
accuracy than without compression and error correcting codes.
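The idea of letting simple error correcting codes in the DNA itself absorb the accuracy lost to model compression can be illustrated with a minimal sketch. The code below uses a quaternary repetition code with majority-vote decoding, a deliberately simple stand-in for the codes the paper evaluates (which it does not specify here); the sequences and function names are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch: protecting data-carrying DNA with a rate-1/3
# repetition code over the quaternary alphabet {A, C, G, T}.
# A simple stand-in for the "simple error correcting codes" the
# abstract refers to; not the paper's actual scheme.

from collections import Counter


def encode(seq: str, r: int = 3) -> str:
    """Repeat each base r times (rate 1/r repetition code)."""
    return "".join(base * r for base in seq)


def decode(coded: str, r: int = 3) -> str:
    """Majority-vote each block of r bases; corrects up to
    (r - 1) // 2 substitution errors per block."""
    out = []
    for i in range(0, len(coded), r):
        block = coded[i:i + r]
        out.append(Counter(block).most_common(1)[0][0])
    return "".join(out)


payload = "ACGTAC"
coded = encode(payload)            # "AAACCCGGGTTTAAACCC"
# Simulate a single substitution error from a compressed basecaller:
noisy = coded[:4] + "T" + coded[5:]
assert decode(noisy) == payload    # the code corrects the miscall
```

The trade-off the paper studies is then between the redundancy such a code adds (here, 3x) and how much basecaller model size can be cut while keeping end-to-end read accuracy.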
Related papers
- VQDNA: Unleashing the Power of Vector Quantization for Multi-Species Genomic Sequence Modeling [60.91599380893732]
VQDNA is a general-purpose framework that renovates genome tokenization from the perspective of genome vocabulary learning.
By leveraging vector-quantized codebooks as learnable vocabulary, VQDNA can adaptively tokenize genomes into pattern-aware embeddings.
arXiv Detail & Related papers (2024-05-13T20:15:03Z)
- BEND: Benchmarking DNA Language Models on biologically meaningful tasks [7.005668635562045]
We introduce BEND, a Benchmark for DNA language models, featuring a collection of realistic and biologically meaningful downstream tasks.
We find that embeddings from current DNA LMs can approach performance of expert methods on some tasks, but only capture limited information about long-range features.
arXiv Detail & Related papers (2023-11-21T12:34:00Z)
- Implicit Neural Multiple Description for DNA-based data storage [6.423239719448169]
DNA exhibits remarkable potential as a data storage solution due to its impressive storage density and long-term stability.
However, developing this novel medium comes with its own set of challenges, particularly in addressing errors arising from storage and biological manipulations.
We have pioneered a novel compression scheme and a cutting-edge Multiple Description Coding (MDC) technique utilizing neural networks for DNA data storage.
arXiv Detail & Related papers (2023-09-13T13:42:52Z)
- HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution [76.97231739317259]
We present HyenaDNA, a genomic foundation model pretrained on the human reference genome with context lengths of up to 1 million tokens at the single nucleotide-level.
On fine-tuned benchmarks from the Nucleotide Transformer, HyenaDNA reaches state-of-the-art (SotA) on 12 of 18 datasets using a model with orders of magnitude fewer parameters and less pretraining data.
arXiv Detail & Related papers (2023-06-27T20:46:34Z)
- Validation tests of GBS quantum computers give evidence for quantum advantage with a decoherent target [62.997667081978825]
We use positive-P phase-space simulations of grouped count probabilities as a fingerprint for verifying multi-mode data.
We show how one can disprove faked data, and apply this to a classical count algorithm.
arXiv Detail & Related papers (2022-11-07T12:00:45Z)
- Image Storage on Synthetic DNA Using Autoencoders [6.096779295981377]
This paper presents some results on lossy image compression methods based on convolutional autoencoders adapted to DNA data storage.
The model architectures presented here have been designed to efficiently compress images, encode them into a quaternary code, and finally store them into synthetic DNA molecules.
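The quaternary encoding step mentioned above maps binary data onto the four DNA bases. A minimal sketch of such a mapping at 2 bits per base follows; the base ordering and function names are illustrative assumptions, not the paper's actual codec.

```python
# Hypothetical sketch of a quaternary code: 2 bits per base,
# with the (assumed) mapping 00 -> A, 01 -> C, 10 -> G, 11 -> T.

BASES = "ACGT"


def bytes_to_dna(data: bytes) -> str:
    """Encode each byte as four bases, most significant pair first."""
    out = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            out.append(BASES[(byte >> shift) & 0b11])
    return "".join(out)


def dna_to_bytes(seq: str) -> bytes:
    """Inverse mapping: pack each run of four bases back into a byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for base in seq[i:i + 4]:
            byte = (byte << 2) | BASES.index(base)
        out.append(byte)
    return bytes(out)


assert dna_to_bytes(bytes_to_dna(b"hi")) == b"hi"  # round-trip check
```

In practice such mappings also enforce biochemical constraints (e.g. limiting homopolymer runs), which this sketch omits.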
arXiv Detail & Related papers (2022-03-18T14:17:48Z)
- Deep metric learning improves lab of origin prediction of genetically engineered plasmids [63.05016513788047]
Genetic engineering attribution (GEA) is the ability to make sequence-lab associations.
We propose a method, based on metric learning, that ranks the most likely labs-of-origin.
We are able to extract key signatures in plasmid sequences for particular labs, allowing for an interpretable examination of the model's outputs.
arXiv Detail & Related papers (2021-11-24T16:29:03Z)
- Single-Read Reconstruction for DNA Data Storage Using Transformers [0.0]
We propose a novel approach for single-read reconstruction using an encoder-decoder Transformer architecture for DNA based data storage.
Our model achieves lower error rates when reconstructing the original data from a single read of each DNA strand.
This is the first demonstration of using deep learning models for single-read reconstruction in DNA based storage.
arXiv Detail & Related papers (2021-09-12T10:01:59Z)
- Deep DNA Storage: Scalable and Robust DNA Storage via Coding Theory and Deep Learning [49.3231734733112]
We show a modular and holistic approach that combines Deep Neural Networks (DNN) trained on simulated data, Tensor-Product (TP) based Error-Correcting Codes (ECC) and a safety margin into a single coherent pipeline.
Our work improves upon the current leading solutions with up to a 3,200x increase in speed and a 40% improvement in accuracy, and offers a code rate of 1.6 bits per base in a high-noise regime.
arXiv Detail & Related papers (2021-08-31T18:21:20Z)
- Efficient approximation of DNA hybridisation using deep learning [0.0]
We present the first comprehensive study of machine learning methods applied to the task of predicting DNA hybridisation.
We introduce a synthetic hybridisation dataset of over 2.5 million data points, enabling the use of a wide range of machine learning algorithms.
arXiv Detail & Related papers (2021-02-19T19:23:49Z)
- A Systematic Approach to Featurization for Cancer Drug Sensitivity Predictions with Deep Learning [49.86828302591469]
We train >35,000 neural network models, sweeping over common featurization techniques.
We found the RNA-seq to be highly redundant and informative even with subsets larger than 128 features.
arXiv Detail & Related papers (2020-04-30T20:42:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.