Single-Read Reconstruction for DNA Data Storage Using Transformers
- URL: http://arxiv.org/abs/2109.05478v1
- Date: Sun, 12 Sep 2021 10:01:59 GMT
- Title: Single-Read Reconstruction for DNA Data Storage Using Transformers
- Authors: Yotam Nahum, Eyar Ben-Tolila, Leon Anavy
- Abstract summary: We propose a novel approach for single-read reconstruction using an encoder-decoder Transformer architecture for DNA based data storage.
Our model achieves lower error rates when reconstructing the original data from a single read of each DNA strand.
This is the first demonstration of using deep learning models for single-read reconstruction in DNA based storage.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: As the global need for large-scale data storage is rising exponentially,
existing storage technologies are approaching their theoretical and functional
limits in terms of density and energy consumption, making DNA based storage a
potential solution for the future of data storage. Several studies introduced
DNA based storage systems with high information density (petabytes/gram).
However, DNA synthesis and sequencing technologies yield erroneous outputs.
Algorithmic approaches for correcting these errors depend on reading multiple
copies of each sequence and result in excessive reading costs. The
unprecedented success of Transformers as a deep learning architecture for
language modeling has led to their repurposing for solving a variety of tasks
across various domains. In this work, we propose a novel approach for
single-read reconstruction using an encoder-decoder Transformer architecture
for DNA based data storage. We address the error correction process as a
self-supervised sequence-to-sequence task and use synthetic noise injection to
train the model using only the decoded reads. Our approach exploits the
inherent redundancy of each decoded file to learn its underlying structure. To
demonstrate our proposed approach, we encode text, image and code-script files
to DNA, produce errors with a high-fidelity error simulator, and reconstruct the
original files from the noisy reads. Our model achieves lower error rates when
reconstructing the original data from a single read of each DNA strand compared
to state-of-the-art algorithms using 2-3 copies. This is the first
demonstration of using deep learning models for single-read reconstruction in
DNA based storage, which allows for a reduction in the overall cost of the
process. We show that this approach is applicable to various domains and can
be generalized to new domains as well.
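The training recipe described above lends itself to a compact illustration. Below is a minimal, hypothetical PyTorch sketch of the self-supervised idea: inject synthetic substitution/insertion/deletion noise into the decoded reads, then train a small encoder-decoder Transformer to map each noisy read back to its clean strand. All names, error rates, and model sizes are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the self-supervised recipe described above.
import random
import torch
import torch.nn as nn

BASES = "ACGT"
PAD, BOS, EOS = 0, 1, 2                       # special token ids (assumed)
STOI = {b: i + 3 for i, b in enumerate(BASES)}
VOCAB = 7

def inject_noise(seq, p_sub=0.01, p_ins=0.01, p_del=0.01):
    """Apply i.i.d. substitutions, insertions, and deletions to a DNA string."""
    out = []
    for base in seq:
        r = random.random()
        if r < p_del:                         # deletion: drop the base
            continue
        if r < p_del + p_sub:                 # substitution: another base
            base = random.choice([b for b in BASES if b != base])
        out.append(base)
        if random.random() < p_ins:           # insertion after this base
            out.append(random.choice(BASES))
    return "".join(out)

def encode(seq, max_len=128):
    ids = [BOS] + [STOI[b] for b in seq] + [EOS]
    return torch.tensor(ids + [PAD] * (max_len - len(ids)))

class ReadReconstructor(nn.Module):
    """Small encoder-decoder Transformer mapping noisy reads to clean strands."""
    def __init__(self, d_model=128, nhead=4, layers=3):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, d_model)
        self.tf = nn.Transformer(d_model, nhead, layers, layers,
                                 batch_first=True)
        self.out = nn.Linear(d_model, VOCAB)

    def forward(self, src, tgt):
        mask = self.tf.generate_square_subsequent_mask(tgt.size(1))
        h = self.tf(self.emb(src), self.emb(tgt), tgt_mask=mask)
        return self.out(h)

# Self-supervision: the clean decoded strand is the target, and a freshly
# corrupted copy of it is the input, so no extra reads are ever needed.
model = ReadReconstructor()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss(ignore_index=PAD)
clean = "ACGTACGTTGCAACGG" * 4                # stand-in for one decoded strand
src = encode(inject_noise(clean)).unsqueeze(0)
tgt = encode(clean).unsqueeze(0)
logits = model(src, tgt[:, :-1])              # teacher forcing
loss = loss_fn(logits.reshape(-1, VOCAB), tgt[:, 1:].reshape(-1))
loss.backward()
opt.step()
```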
Related papers
- Learning Structurally Stabilized Representations for Multi-modal Lossless DNA Storage [32.00500955709341]
Reed-Solomon coded single-stranded representation learning (RSRL) is a novel end-to-end model for learning representations for DNA storage.
In contrast to existing learning-based methods, the proposed RSRL is inspired by both error-correction and structural biology.
The experimental results obtained demonstrate that RSRL can store diverse types of data with much higher information density and durability and much lower error rates (the Reed-Solomon layer is sketched below).
arXiv Detail & Related papers (2024-07-17T06:31:49Z)
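As background for the RSRL entry above, the snippet below demonstrates the Reed-Solomon layer it is named after, using the third-party reedsolo package (pip install reedsolo). This is a generic RS round trip under assumed parameters, not the paper's model.

```python
# Generic Reed-Solomon demo; not RSRL itself, only the error-correcting
# layer it is named after.
from reedsolo import RSCodec

rsc = RSCodec(10)                    # 10 parity bytes -> corrects up to 5 byte errors
payload = b"DNA data storage"
codeword = rsc.encode(payload)       # payload followed by parity bytes

corrupted = bytearray(codeword)
corrupted[0] ^= 0xFF                 # corrupt two bytes, as a noisy channel might
corrupted[5] ^= 0xFF

# Recent reedsolo versions return (message, full_codeword, error_positions).
decoded, _, _ = rsc.decode(bytes(corrupted))
assert bytes(decoded) == payload     # both errors corrected
```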
- Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration [54.897493351694195]
We propose a novel parallel decoding approach, namely "hidden transfer", which decodes multiple successive tokens simultaneously in a single forward pass.
In terms of acceleration metrics, we outperform all the single-model acceleration techniques, including Medusa and Self-Speculative decoding.
arXiv Detail & Related papers (2024-04-18T09:17:06Z)
- GEC-DePenD: Non-Autoregressive Grammatical Error Correction with Decoupled Permutation and Decoding [52.14832976759585]
Grammatical error correction (GEC) is an important NLP task that is usually solved with autoregressive sequence-to-sequence models.
We propose a novel non-autoregressive approach to GEC that decouples the architecture into a permutation network and a decoder network.
We show that the resulting network improves over previously known non-autoregressive methods for GEC.
arXiv Detail & Related papers (2023-11-14T14:24:36Z)
- Embed-Search-Align: DNA Sequence Alignment using Transformer Models [2.48439258515764]
We bridge the gap by framing the sequence alignment task for Transformer models as an "Embed-Search-Align" task.
A novel Reference-Free DNA Embedding model generates embeddings of reads and reference fragments, which are projected into a shared vector space.
DNA-ESA is 99% accurate when aligning reads of length 250 onto a human genome (3 Gb), rivaling conventional methods such as Bowtie and BWA-Mem (see the sketch below).
arXiv Detail & Related papers (2023-09-20T06:30:39Z)
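A toy rendering of the "Embed-Search-Align" idea from the entry above: reads and reference fragments are embedded into a shared vector space and alignment becomes nearest-neighbor retrieval. The 3-mer count embedding below is a stand-in assumption for the paper's learned Reference-Free DNA Embedding model.

```python
import numpy as np
from itertools import product

KMERS = ["".join(p) for p in product("ACGT", repeat=3)]   # 64 features

def embed(seq):
    """Embed a DNA string as a normalized vector of overlapping 3-mer counts."""
    counts = {k: 0 for k in KMERS}
    for i in range(len(seq) - 2):
        counts[seq[i:i + 3]] += 1
    v = np.array([counts[k] for k in KMERS], dtype=float)
    n = np.linalg.norm(v)
    return v / n if n else v

rng = np.random.default_rng(0)
reference = "".join(rng.choice(list("ACGT"), size=500))    # toy reference
fragments = [reference[i:i + 50] for i in range(0, len(reference) - 50, 25)]
index = np.stack([embed(f) for f in fragments])            # the "search" index

read = fragments[7][5:45]                                  # a read from fragment 7
scores = index @ embed(read)                               # cosine similarities
print(int(np.argmax(scores)))                              # retrieves fragment 7
```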
- Implicit Neural Multiple Description for DNA-based data storage [6.423239719448169]
DNA exhibits remarkable potential as a data storage solution due to its impressive storage density and long-term stability.
However, developing this novel medium comes with its own set of challenges, particularly in addressing errors arising from storage and biological manipulations.
We have pioneered a novel compression scheme and a cutting-edge Multiple Description Coding (MDC) technique utilizing neural networks for DNA data storage (the MDC idea is sketched below).
arXiv Detail & Related papers (2023-09-13T13:42:52Z)
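To unpack the MDC idea from the entry above: a signal is split into descriptions that are each independently decodable, so losing one still allows an approximate reconstruction. The even/odd split below is the simplest possible description scheme, chosen for illustration; the paper's neural compression is not shown.

```python
import numpy as np

signal = np.sin(np.linspace(0, 4 * np.pi, 64))
desc_even, desc_odd = signal[0::2], signal[1::2]   # two descriptions

# If only the even description survives, interpolate the missing samples.
approx = np.empty_like(signal)
approx[0::2] = desc_even
approx[1::2] = (desc_even + np.roll(desc_even, -1)) / 2
print(float(np.abs(approx - signal).max()))        # worst-case gap from one lost description
```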
- Autoregressive Search Engines: Generating Substrings as Document Identifiers [53.0729058170278]
Autoregressive language models are emerging as the de-facto standard for generating answers.
Previous work has explored ways to partition the search space into hierarchical structures.
In this work we propose an alternative that doesn't force any structure in the search space: using all n-grams in a passage as its possible identifiers (see the sketch below).
arXiv Detail & Related papers (2022-04-22T10:45:01Z)
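The substring-as-identifier idea from the entry above can be shown in a few lines: every n-gram of a passage is indexed, so generating any one of them identifies the passage. Illustrative code under an assumed n=3, not the paper's retrieval system.

```python
def ngram_index(passages, n=3):
    """Map every word n-gram to the set of passages containing it."""
    index = {}
    for pid, text in enumerate(passages):
        words = text.lower().split()
        for i in range(len(words) - n + 1):
            index.setdefault(" ".join(words[i:i + n]), set()).add(pid)
    return index

passages = [
    "dna synthesis and sequencing technologies yield erroneous outputs",
    "transformers are a deep learning architecture for language modeling",
]
idx = ngram_index(passages)
print(idx["deep learning architecture"])   # {1}: the generated n-gram
                                           # acts as the document identifier
```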
- Image Storage on Synthetic DNA Using Autoencoders [6.096779295981377]
This paper presents some results on lossy image compression methods based on convolutional autoencoders adapted to DNA data storage.
The model architectures presented here have been designed to efficiently compress images, encode them into a quaternary code, and finally store them in synthetic DNA molecules (the quaternary step is sketched below).
arXiv Detail & Related papers (2022-03-18T14:17:48Z)
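The quaternary step mentioned above has a standard textbook form: each pair of bits maps to one of the four DNA bases, giving 2 bits per base. The sketch below shows only this bytes-to-DNA transcoding; the autoencoder compression is omitted.

```python
# Bytes <-> DNA transcoding at 2 bits per base; the autoencoder is omitted.
BITS_TO_BASE = {0b00: "A", 0b01: "C", 0b10: "G", 0b11: "T"}
BASE_TO_BITS = {b: v for v, b in BITS_TO_BASE.items()}

def bytes_to_dna(data: bytes) -> str:
    return "".join(BITS_TO_BASE[(byte >> shift) & 0b11]
                   for byte in data for shift in (6, 4, 2, 0))

def dna_to_bytes(strand: str) -> bytes:
    out = bytearray()
    for i in range(0, len(strand), 4):
        byte = 0
        for base in strand[i:i + 4]:
            byte = (byte << 2) | BASE_TO_BITS[base]
        out.append(byte)
    return bytes(out)

strand = bytes_to_dna(b"img")
assert dna_to_bytes(strand) == b"img"      # round trip
print(strand)                              # 12 bases: "CGGCCGTCCGCT"
```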
- COIN++: Data Agnostic Neural Compression [55.27113889737545]
COIN++ is a neural compression framework that seamlessly handles a wide range of data modalities.
We demonstrate the effectiveness of our method by compressing various data modalities.
arXiv Detail & Related papers (2022-01-30T20:12:04Z)
- Deep DNA Storage: Scalable and Robust DNA Storage via Coding Theory and Deep Learning [49.3231734733112]
We show a modular and holistic approach that combines Deep Neural Networks (DNN) trained on simulated data, Tensor-Product (TP) based Error-Correcting Codes (ECC), and a safety margin into a single coherent pipeline.
Our work improves upon the current leading solutions by up to a 3200x increase in speed and a 40% improvement in accuracy, and offers a code rate of 1.6 bits per base in a high-noise regime (a DNA base carries at most log2(4) = 2 bits, so this leaves 0.4 bits per base, or 20% of capacity, for redundancy).
arXiv Detail & Related papers (2021-08-31T18:21:20Z)
- StreaMRAK a Streaming Multi-Resolution Adaptive Kernel Algorithm [60.61943386819384]
Existing implementations of kernel ridge regression (KRR) require that all the data be stored in main memory.
We propose StreaMRAK - a streaming version of KRR.
We present a showcase study on two synthetic problems and the prediction of the trajectory of a double pendulum (a plain KRR baseline is sketched below).
arXiv Detail & Related papers (2021-08-23T21:03:09Z)
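For context on what StreaMRAK streams: below is a minimal in-memory kernel ridge regression with a Gaussian kernel, whose full n-by-n kernel solve is exactly the memory bottleneck the paper targets. Parameters and data are illustrative.

```python
import numpy as np

def krr_fit(X, y, gamma=1.0, lam=1e-3):
    """Solve (K + lam*I) alpha = y for a Gaussian kernel K."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-gamma * sq)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def krr_predict(X_train, alpha, X_new, gamma=1.0):
    sq = ((X_new[:, None, :] - X_train[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq) @ alpha

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)
alpha = krr_fit(X, y)                             # needs the full 200x200 kernel
print(krr_predict(X, alpha, np.array([[0.5]])))   # close to sin(0.5) ~ 0.48
```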
- Recurrence-free unconstrained handwritten text recognition using gated fully convolutional network [2.277447144331876]
Unconstrained handwritten text recognition is a major step in most document analysis tasks.
One alternative to using LSTM cells is to compensate for the loss of long-term memory with a heavy use of convolutional layers.
We present a Gated Fully Convolutional Network architecture that is a recurrence-free alternative to the well-known CNN+LSTM architectures (the gating idea is sketched below).
arXiv Detail & Related papers (2020-12-09T10:30:13Z)
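The gating mechanism from the entry above is commonly realized as a gated linear unit over convolution outputs; the layer below sketches that idea under assumed shapes, not the paper's full architecture.

```python
import torch
import torch.nn as nn

class GatedConv1d(nn.Module):
    """Gated linear unit (GLU) convolution: one half of the channels
    modulates the other, replacing recurrence with gated convolutions."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        # Produce 2x channels, then split into content and gate halves.
        self.conv = nn.Conv1d(channels, 2 * channels, kernel_size,
                              padding=kernel_size // 2)

    def forward(self, x):                  # x: (batch, channels, width)
        content, gate = self.conv(x).chunk(2, dim=1)
        return content * torch.sigmoid(gate)

x = torch.randn(8, 64, 100)                # e.g. a feature map over a text line
print(GatedConv1d(64)(x).shape)            # torch.Size([8, 64, 100])
```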
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.