SCONE: A Practical, Constraint-Aware Plug-in for Latent Encoding in Learned DNA Storage
- URL: http://arxiv.org/abs/2602.06157v1
- Date: Thu, 05 Feb 2026 19:54:13 GMT
- Title: SCONE: A Practical, Constraint-Aware Plug-in for Latent Encoding in Learned DNA Storage
- Authors: Cihan Ruan, Lebin Zhou, Rongduo Han, Linyi Han, Bingqing Zhao, Chenchen Zhu, Wei Jiang, Wei Wang, Nam Ling
- Abstract summary: SCONE is a plug-in module that collapses latent compression and DNA encoding into a single step. It performs quaternary arithmetic coding directly on the latent space in DNA bases. The design preserves full reversibility and exploits the hyperprior model's learned priors without modification.
- Score: 11.92900213512492
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: DNA storage has matured from concept to practical stage, yet its integration with neural compression pipelines remains inefficient. Early DNA encoders applied redundancy-heavy constraint layers atop raw binary data - workable but primitive. Recent neural codecs compress data into learned latent representations with rich statistical structure, yet still convert these latents to DNA via naive binary-to-quaternary transcoding, discarding the entropy model's optimization. This mismatch undermines compression efficiency and complicates the encoding stack. We present SCONE, a plug-in module that collapses latent compression and DNA encoding into a single step. SCONE performs quaternary arithmetic coding directly on the latent space in DNA bases. Its Constraint-Aware Adaptive Coding module dynamically steers the entropy encoder's learned probability distribution to enforce biochemical constraints - Guanine-Cytosine (GC) balance and homopolymer suppression - deterministically during encoding, eliminating post-hoc correction. The design preserves full reversibility and exploits the hyperprior model's learned priors without modification. Experiments show SCONE achieves near-perfect constraint satisfaction with negligible computational overhead (<2% latency), establishing a latent-agnostic interface for end-to-end DNA-compatible learned codecs.
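To make the mechanism concrete, the sketch below shows one way constraint-aware probability steering could work: before each quaternary symbol is entropy-coded, the model's next-base distribution is reshaped so that homopolymer-extending bases are forbidden and GC content is biased toward a target. Everything here (function names, `max_run`, `gc_target`, the multiplicative bias rule) is an illustrative assumption, not SCONE's published design:

```python
# Hypothetical sketch of constraint-aware steering of a learned symbol
# distribution before quaternary arithmetic coding. Names and parameters
# are assumptions for illustration, not SCONE's actual interface.
import numpy as np

BASES = "ACGT"
GC_IDX = {1, 2}  # indices of C and G in BASES

def steer_probs(p, history, max_run=3, gc_target=0.5, gc_strength=2.0):
    """Reshape the entropy model's next-base distribution p (length 4) so the
    encoded sequence satisfies biochemical constraints by construction."""
    p = np.asarray(p, dtype=np.float64).copy()

    # Homopolymer suppression: forbid extending a run past max_run bases.
    if len(history) >= max_run and len(set(history[-max_run:])) == 1:
        p[BASES.index(history[-1])] = 0.0

    # GC balance: multiplicatively bias toward the target GC fraction.
    if history:
        gc_frac = sum(b in "GC" for b in history) / len(history)
        bias = gc_strength ** (gc_target - gc_frac)  # >1 when GC-poor so far
        for i in range(4):
            p[i] *= bias if i in GC_IDX else 1.0 / bias

    # Renormalize. The decoder replays the same deterministic rule on its
    # reconstructed history, so the arithmetic code remains fully reversible.
    return p / p.sum()

# Toy usage: draw bases from a uniform model under the steered distribution.
rng = np.random.default_rng(0)
history = []
for _ in range(30):
    q = steer_probs(np.full(4, 0.25), history)
    history.append(BASES[rng.choice(4, p=q)])
print("".join(history))  # no runs longer than 3, GC fraction near 0.5
```

Because the reweighting depends only on symbols already available to both sides, encoder and decoder stay in lockstep, which is what allows post-hoc correction to be eliminated.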
Related papers
- DNATokenizer: A GPU-First Byte-to-Identifier Tokenizer for High-Throughput DNA Language Models [0.0]
DNATok is a GPU-first tokenization system that replaces general-purpose string processing with byte lookup table (LUT)-based identifier streaming and an overlapped host-to-device (H2D)/compute pipeline. DNATok achieves 84-95x higher encoding throughput than optimized Hugging Face baselines and up to 1.9x higher H2D throughput.
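The byte-LUT idea itself is simple enough to sketch on the CPU with NumPy (the real system is GPU-first and overlaps H2D copies with compute; the vocabulary below is a made-up assumption):

```python
# CPU/NumPy illustration of byte-LUT tokenization. A 256-entry table maps
# each input byte straight to a token id in one vectorized gather.
import numpy as np

UNK = 4  # hypothetical id for any non-ACGT byte
lut = np.full(256, UNK, dtype=np.int32)
for token_id, byte in enumerate(b"ACGT"):
    lut[byte] = token_id

def tokenize(seq: bytes) -> np.ndarray:
    # One table gather replaces general-purpose string processing.
    return lut[np.frombuffer(seq, dtype=np.uint8)]

print(tokenize(b"ACGTN"))  # [0 1 2 3 4]
```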
arXiv Detail & Related papers (2026-01-09T05:08:17Z)
- Near-Lossless Model Compression Enables Longer Context Inference in DNA Large Language Models [8.059385582452112]
FOCUS (Feature-Oriented Compression for Ultra-long Self-attention) is a progressive context-compression module that can be plugged into pretrained DNA LLMs. On held-out human chromosomes, FOCUS achieves near-lossless fidelity. Compared to a baseline without compression, FOCUS reduces KV-cache memory and converts effective inference scaling from O(N^2) to near-linear O(N).
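As a generic illustration of why compressing the KV cache restores near-linear scaling (this is a naive pooling stand-in, not the FOCUS method), consider capping the cache at a fixed budget:

```python
# Naive KV-cache pooling: average the sequence axis down to a fixed budget
# so per-step attention cost stops growing with context length. A generic
# stand-in for illustration only, not the FOCUS compression scheme.
import torch

def compress_kv(k: torch.Tensor, v: torch.Tensor, budget: int = 256):
    """k, v: (seq_len, dim) tensors. Returns pooled (budget, dim) tensors."""
    seq_len, dim = k.shape
    if seq_len <= budget:
        return k, v
    pad = (-seq_len) % budget  # right-pad so seq_len divides evenly
    if pad:
        k = torch.cat([k, k[-1:].expand(pad, dim)])
        v = torch.cat([v, v[-1:].expand(pad, dim)])
    # Average each contiguous chunk of the sequence into a single slot.
    return (k.view(budget, -1, dim).mean(1),
            v.view(budget, -1, dim).mean(1))

k, v = torch.randn(100_000, 64), torch.randn(100_000, 64)
ck, cv = compress_kv(k, v)
print(ck.shape, cv.shape)  # torch.Size([256, 64]) torch.Size([256, 64])
```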
arXiv Detail & Related papers (2025-11-18T17:29:39Z)
- OmniSAT: Compact Action Token, Faster Auto Regression [70.70037017501357]
We introduce an Omni Swift Action Tokenizer, which learns a compact, transferable action representation. The resulting discrete tokenization shortens the training sequence by 6.8x and lowers the target entropy.
arXiv Detail & Related papers (2025-10-08T03:55:24Z)
- NEURODNAAI: Neural pipeline approaches for the advancing dna-based information storage as a sustainable digital medium using deep learning framework [0.17398560678845074]
NeuroDNAAI encodes binary data streams into symbolic DNA sequences, transmits them through a noisy channel with substitutions, insertions, and deletions, and reconstructs them with high fidelity. By unifying theory, workflow, and simulation into one pipeline, NeuroDNAAI enables scalable, biologically valid archival DNA storage.
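The noisy channel in question is easy to simulate; the toy model below applies substitutions, insertions, and deletions at illustrative rates (the rates and function are assumptions, not the paper's settings):

```python
# Toy IDS (insertion/deletion/substitution) channel for DNA sequences.
# Error rates are illustrative assumptions, not NeuroDNAAI's settings.
import random

def noisy_channel(seq: str, p_sub=0.01, p_ins=0.005, p_del=0.005, seed=None):
    rng = random.Random(seed)
    out = []
    for base in seq:
        r = rng.random()
        if r < p_del:
            continue  # deletion: the base is lost
        if r < p_del + p_sub:
            base = rng.choice("ACGT".replace(base, ""))  # substitution
        out.append(base)
        if rng.random() < p_ins:
            out.append(rng.choice("ACGT"))  # insertion after this base
    return "".join(out)

print(noisy_channel("ACGT" * 10, seed=42))
```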
arXiv Detail & Related papers (2025-10-02T15:11:04Z)
- Exploiting Discriminative Codebook Prior for Autoregressive Image Generation [54.14166700058777]
Token-based autoregressive image generation systems first tokenize images into sequences of token indices with a codebook, and then model these sequences in an autoregressive paradigm. While autoregressive generative models are trained only on index values, the prior encoded in the codebook, which contains rich token similarity information, is not exploited. Recent studies have attempted to incorporate this prior by performing naive k-means clustering on the tokens, helping to facilitate the training of generative models with a reduced codebook. We propose the Discriminative Codebook Prior Extractor (DCPE) as an alternative to k-means.
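The naive k-means baseline that DCPE is proposed to replace can be sketched directly (the codebook here is random stand-in data; a real one would come from a trained VQ tokenizer):

```python
# Sketch of the naive k-means baseline: cluster codebook embeddings into a
# reduced codebook, then remap token indices to cluster ids. The codebook
# below is random stand-in data, not a trained tokenizer's.
import numpy as np
from sklearn.cluster import KMeans

codebook = np.random.default_rng(0).normal(size=(1024, 256))  # 1024 tokens
km = KMeans(n_clusters=128, n_init=10, random_state=0).fit(codebook)

remap = km.labels_                    # original token id -> reduced id
token_seq = np.array([3, 512, 900])   # a toy token-index sequence
print(remap[token_seq])               # same sequence over the smaller codebook
```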
arXiv Detail & Related papers (2025-08-14T15:00:00Z)
- Accelerating Error Correction Code Transformers [56.75773430667148]
We introduce a novel acceleration method for transformer-based decoders.
We achieve a 90% compression ratio and reduce arithmetic operation energy consumption by at least 224 times on modern hardware.
arXiv Detail & Related papers (2024-10-08T11:07:55Z)
- Implicit Neural Multiple Description for DNA-based data storage [6.423239719448169]
DNA exhibits remarkable potential as a data storage solution due to its impressive storage density and long-term stability.
However, developing this novel medium comes with its own set of challenges, particularly in addressing errors arising from storage and biological manipulations.
We have pioneered a novel compression scheme and a cutting-edge Multiple Description Coding (MDC) technique utilizing neural networks for DNA data storage.
arXiv Detail & Related papers (2023-09-13T13:42:52Z)
- Deep Quantum Error Correction [73.54643419792453]
Quantum error correction codes (QECC) are a key component for realizing the potential of quantum computing.
In this work, we efficiently train novel end-to-end deep quantum error decoders.
The proposed method demonstrates the power of neural decoders for QECC by achieving state-of-the-art accuracy.
arXiv Detail & Related papers (2023-01-27T08:16:26Z)
- COIN++: Data Agnostic Neural Compression [55.27113889737545]
COIN++ is a neural compression framework that seamlessly handles a wide range of data modalities.
We demonstrate the effectiveness of our method by compressing various data modalities.
arXiv Detail & Related papers (2022-01-30T20:12:04Z)
- Deep DNA Storage: Scalable and Robust DNA Storage via Coding Theory and Deep Learning [49.3231734733112]
We show a modular and holistic approach that combines Deep Neural Networks (DNN) trained on simulated data, Tensor-Product (TP) based Error-Correcting Codes (ECC), and a safety margin into a single coherent pipeline.
Our work improves upon the current leading solutions by up to a 3200x increase in speed and a 40% improvement in accuracy, and offers a code rate of 1.6 bits per base in a high noise regime.
arXiv Detail & Related papers (2021-08-31T18:21:20Z)
- Efficient approximation of DNA hybridisation using deep learning [0.0]
We present the first comprehensive study of machine learning methods applied to the task of predicting DNA hybridisation.
We introduce a synthetic hybridisation dataset of over 2.5 million data points, enabling the use of a wide range of machine learning algorithms.
arXiv Detail & Related papers (2021-02-19T19:23:49Z)