SCONE: A Practical, Constraint-Aware Plug-in for Latent Encoding in Learned DNA Storage
- URL: http://arxiv.org/abs/2602.06157v1
- Date: Thu, 05 Feb 2026 19:54:13 GMT
- Title: SCONE: A Practical, Constraint-Aware Plug-in for Latent Encoding in Learned DNA Storage
- Authors: Cihan Ruan, Lebin Zhou, Rongduo Han, Linyi Han, Bingqing Zhao, Chenchen Zhu, Wei Jiang, Wei Wang, Nam Ling
- Abstract summary: SCONE is a plug-in module that collapses latent compression and DNA encoding into a single step. It performs quaternary arithmetic coding directly on the latent space in DNA bases. The design preserves full reversibility and exploits the hyperprior model's learned priors without modification.
- Score: 11.92900213512492
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: DNA storage has matured from concept to practical stage, yet its integration with neural compression pipelines remains inefficient. Early DNA encoders applied redundancy-heavy constraint layers atop raw binary data - workable but primitive. Recent neural codecs compress data into learned latent representations with rich statistical structure, yet still convert these latents to DNA via naive binary-to-quaternary transcoding, discarding the entropy model's optimization. This mismatch undermines compression efficiency and complicates the encoding stack. We present SCONE, a plug-in module that collapses latent compression and DNA encoding into a single step. SCONE performs quaternary arithmetic coding directly on the latent space in DNA bases. Its Constraint-Aware Adaptive Coding module dynamically steers the entropy encoder's learned probability distribution to enforce biochemical constraints - Guanine-Cytosine (GC) balance and homopolymer suppression - deterministically during encoding, eliminating post-hoc correction. The design preserves full reversibility and exploits the hyperprior model's learned priors without modification. Experiments show SCONE achieves near-perfect constraint satisfaction with negligible computational overhead (<2% latency), establishing a latent-agnostic interface for end-to-end DNA-compatible learned codecs.
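To make the mechanism concrete, the sketch below shows one way constraint-aware probability steering could work: before each quaternary symbol is entropy-coded, the model's next-base distribution is reshaped so that homopolymer-extending bases are forbidden and GC content is biased toward a target. Everything here (function names, `max_run`, `gc_target`, the multiplicative bias rule) is an illustrative assumption, not SCONE's published design:

```python
# Hypothetical sketch of constraint-aware steering of a learned symbol
# distribution before quaternary arithmetic coding. Names and parameters
# are assumptions for illustration, not SCONE's actual interface.
import numpy as np

BASES = "ACGT"
GC_IDX = {1, 2}  # indices of C and G in BASES

def steer_probs(p, history, max_run=3, gc_target=0.5, gc_strength=2.0):
    """Reshape the entropy model's next-base distribution p (length 4) so the
    encoded sequence satisfies biochemical constraints by construction."""
    p = np.asarray(p, dtype=np.float64).copy()

    # Homopolymer suppression: forbid extending a run past max_run bases.
    if len(history) >= max_run and len(set(history[-max_run:])) == 1:
        p[BASES.index(history[-1])] = 0.0

    # GC balance: multiplicatively bias toward the target GC fraction.
    if history:
        gc_frac = sum(b in "GC" for b in history) / len(history)
        bias = gc_strength ** (gc_target - gc_frac)  # >1 when GC-poor so far
        for i in range(4):
            p[i] *= bias if i in GC_IDX else 1.0 / bias

    # Renormalize. The decoder replays the same deterministic rule on its
    # reconstructed history, so the arithmetic code remains fully reversible.
    return p / p.sum()

# Toy usage: draw bases from a uniform model under the steered distribution.
rng = np.random.default_rng(0)
history = []
for _ in range(30):
    q = steer_probs(np.full(4, 0.25), history)
    history.append(BASES[rng.choice(4, p=q)])
print("".join(history))  # no runs longer than 3, GC fraction near 0.5
```

Because the reweighting depends only on symbols already available to both sides, encoder and decoder stay in lockstep, which is what allows post-hoc correction to be eliminated.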
Related papers
- DNATokenizer: A GPU-First Byte-to-Identifier Tokenizer for High-Throughput DNA Language Models [0.0]
DNATok is a GPU-first tokenization system that replaces general-purpose string processing with byte lookup table (LUT)-based identifier streaming and an overlapped host-to-device (H2D)/compute pipeline. DNATok achieves 84-95x higher encoding throughput than optimized Hugging Face baselines and up to 1.9x higher H2D throughput.
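The byte-LUT idea itself is simple enough to sketch on the CPU with NumPy (the real system is GPU-first and overlaps H2D copies with compute; the vocabulary below is a made-up assumption):

```python
# CPU/NumPy illustration of byte-LUT tokenization. A 256-entry table maps
# each input byte straight to a token id in one vectorized gather.
import numpy as np

UNK = 4  # hypothetical id for any non-ACGT byte
lut = np.full(256, UNK, dtype=np.int32)
for token_id, byte in enumerate(b"ACGT"):
    lut[byte] = token_id

def tokenize(seq: bytes) -> np.ndarray:
    # One table gather replaces general-purpose string processing.
    return lut[np.frombuffer(seq, dtype=np.uint8)]

print(tokenize(b"ACGTN"))  # [0 1 2 3 4]
```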
arXiv Detail & Related papers (2026-01-09T05:08:17Z)
- Near-Lossless Model Compression Enables Longer Context Inference in DNA Large Language Models [8.059385582452112]
FOCUS (Feature-Oriented Compression for Ultra-long Self-attention) is a progressive context-compression module that can be plugged into pretrained DNA LLMs. On held-out human chromosomes, FOCUS achieves near-lossless fidelity. Compared to a baseline without compression, FOCUS reduces KV-cache memory and converts effective inference scaling from O(N^2) to near-linear O(N).
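As a generic illustration of why compressing the KV cache restores near-linear scaling (this is a naive pooling stand-in, not the FOCUS method), consider capping the cache at a fixed budget:

```python
# Naive KV-cache pooling: average the sequence axis down to a fixed budget
# so per-step attention cost stops growing with context length. A generic
# stand-in for illustration only, not the FOCUS compression scheme.
import torch

def compress_kv(k: torch.Tensor, v: torch.Tensor, budget: int = 256):
    """k, v: (seq_len, dim) tensors. Returns pooled (budget, dim) tensors."""
    seq_len, dim = k.shape
    if seq_len <= budget:
        return k, v
    pad = (-seq_len) % budget  # right-pad so seq_len divides evenly
    if pad:
        k = torch.cat([k, k[-1:].expand(pad, dim)])
        v = torch.cat([v, v[-1:].expand(pad, dim)])
    # Average each contiguous chunk of the sequence into a single slot.
    return (k.view(budget, -1, dim).mean(1),
            v.view(budget, -1, dim).mean(1))

k, v = torch.randn(100_000, 64), torch.randn(100_000, 64)
ck, cv = compress_kv(k, v)
print(ck.shape, cv.shape)  # torch.Size([256, 64]) torch.Size([256, 64])
```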
arXiv Detail & Related papers (2025-11-18T17:29:39Z)
- OmniSAT: Compact Action Token, Faster Auto Regression [70.70037017501357]
We introduce an Omni Swift Action Tokenizer, which learns a compact, transferable action representation. The resulting discrete tokenization shortens the training sequence by 6.8x and lowers the target entropy.
arXiv Detail & Related papers (2025-10-08T03:55:24Z)
- NEURODNAAI: Neural pipeline approaches for the advancing dna-based information storage as a sustainable digital medium using deep learning framework [0.17398560678845074]
NeuroDNAAI encodes binary data streams into symbolic DNA sequences, transmits them through a noisy channel with substitutions, insertions, and deletions, and reconstructs them with high fidelity. By unifying theory, workflow, and simulation into one pipeline, NeuroDNAAI enables scalable, biologically valid archival DNA storage.
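The noisy channel in question is easy to simulate; the toy model below applies substitutions, insertions, and deletions at illustrative rates (the rates and function are assumptions, not the paper's settings):

```python
# Toy IDS (insertion/deletion/substitution) channel for DNA sequences.
# Error rates are illustrative assumptions, not NeuroDNAAI's settings.
import random

def noisy_channel(seq: str, p_sub=0.01, p_ins=0.005, p_del=0.005, seed=None):
    rng = random.Random(seed)
    out = []
    for base in seq:
        r = rng.random()
        if r < p_del:
            continue  # deletion: the base is lost
        if r < p_del + p_sub:
            base = rng.choice("ACGT".replace(base, ""))  # substitution
        out.append(base)
        if rng.random() < p_ins:
            out.append(rng.choice("ACGT"))  # insertion after this base
    return "".join(out)

print(noisy_channel("ACGT" * 10, seed=42))
```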
arXiv Detail & Related papers (2025-10-02T15:11:04Z)
- Exploiting Discriminative Codebook Prior for Autoregressive Image Generation [54.14166700058777]
Token-based autoregressive image generation systems first tokenize images into sequences of token indices with a codebook, and then model these sequences in an autoregressive paradigm. While autoregressive generative models are trained only on index values, the prior encoded in the codebook, which contains rich token similarity information, is not exploited. Recent studies have attempted to incorporate this prior by performing naive k-means clustering on the tokens, helping to facilitate the training of generative models with a reduced codebook. We propose the Discriminative Codebook Prior Extractor (DCPE) as an alternative to k-means.
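The naive k-means baseline that DCPE is proposed to replace can be sketched directly (the codebook here is random stand-in data; a real one would come from a trained VQ tokenizer):

```python
# Sketch of the naive k-means baseline: cluster codebook embeddings into a
# reduced codebook, then remap token indices to cluster ids. The codebook
# below is random stand-in data, not a trained tokenizer's.
import numpy as np
from sklearn.cluster import KMeans

codebook = np.random.default_rng(0).normal(size=(1024, 256))  # 1024 tokens
km = KMeans(n_clusters=128, n_init=10, random_state=0).fit(codebook)

remap = km.labels_                    # original token id -> reduced id
token_seq = np.array([3, 512, 900])   # a toy token-index sequence
print(remap[token_seq])               # same sequence over the smaller codebook
```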
arXiv Detail & Related papers (2025-08-14T15:00:00Z)
- Accelerating Error Correction Code Transformers [56.75773430667148]
We introduce a novel acceleration method for transformer-based decoders.
We achieve a 90% compression ratio and reduce arithmetic operation energy consumption by at least 224 times on modern hardware.
arXiv Detail & Related papers (2024-10-08T11:07:55Z)
- Implicit Neural Multiple Description for DNA-based data storage [6.423239719448169]
DNA exhibits remarkable potential as a data storage solution due to its impressive storage density and long-term stability.
However, developing this novel medium comes with its own set of challenges, particularly in addressing errors arising from storage and biological manipulations.
We have pioneered a novel compression scheme and a cutting-edge Multiple Description Coding (MDC) technique utilizing neural networks for DNA data storage.
arXiv Detail & Related papers (2023-09-13T13:42:52Z)
- Deep Quantum Error Correction [73.54643419792453]
Quantum error correction codes (QECC) are a key component for realizing the potential of quantum computing.
In this work, we efficiently train novel end-to-end deep quantum error decoders.
The proposed method demonstrates the power of neural decoders for QECC by achieving state-of-the-art accuracy.
arXiv Detail & Related papers (2023-01-27T08:16:26Z)
- COIN++: Data Agnostic Neural Compression [55.27113889737545]
COIN++ is a neural compression framework that seamlessly handles a wide range of data modalities.
We demonstrate the effectiveness of our method by compressing various data modalities.
arXiv Detail & Related papers (2022-01-30T20:12:04Z)
- Deep DNA Storage: Scalable and Robust DNA Storage via Coding Theory and Deep Learning [49.3231734733112]
We show a modular and holistic approach that combines Deep Neural Networks (DNN) trained on simulated data, Tensor-Product (TP) based Error-Correcting Codes (ECC), and a safety margin into a single coherent pipeline.
Our work improves upon the current leading solutions by up to a 3200x increase in speed and a 40% improvement in accuracy, and offers a code rate of 1.6 bits per base in a high noise regime.
arXiv Detail & Related papers (2021-08-31T18:21:20Z)
- Efficient approximation of DNA hybridisation using deep learning [0.0]
We present the first comprehensive study of machine learning methods applied to the task of predicting DNA hybridisation.
We introduce a synthetic hybridisation dataset of over 2.5 million data points, enabling the use of a wide range of machine learning algorithms.
arXiv Detail & Related papers (2021-02-19T19:23:49Z)