Trace Reconstruction with Language Models
- URL: http://arxiv.org/abs/2507.12927v1
- Date: Thu, 17 Jul 2025 09:08:41 GMT
- Title: Trace Reconstruction with Language Models
- Authors: Franziska Weindel, Michael Girsch, Reinhard Heckel
- Abstract summary: We propose TReconLM, which leverages language models trained on next-token prediction for trace reconstruction. We pretrain language models on synthetic data and fine-tune on real-world data to adapt to technology-specific error patterns. TReconLM outperforms state-of-the-art trace reconstruction algorithms, including prior deep learning approaches.
- Score: 18.61974847244797
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The general trace reconstruction problem seeks to recover an original sequence from its noisy copies independently corrupted by deletions, insertions, and substitutions. This problem arises in applications such as DNA data storage, a promising storage medium due to its high information density and longevity. However, errors introduced during DNA synthesis, storage, and sequencing require correction through algorithms and codes, with trace reconstruction often used as part of the data retrieval process. In this work, we propose TReconLM, which leverages language models trained on next-token prediction for trace reconstruction. We pretrain language models on synthetic data and fine-tune on real-world data to adapt to technology-specific error patterns. TReconLM outperforms state-of-the-art trace reconstruction algorithms, including prior deep learning approaches, recovering a substantially higher fraction of sequences without error.
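To make the setting concrete, here is a minimal sketch of the insertion-deletion-substitution (IDS) channel the abstract describes, together with one way to frame the traces as a next-token-prediction input. The trace separator and prompt format are illustrative assumptions, not TReconLM's actual input encoding.

```python
import random

def ids_channel(seq, p_del=0.02, p_ins=0.02, p_sub=0.02, alphabet="ACGT"):
    """Corrupt a sequence with i.i.d. deletions, insertions, and substitutions."""
    out = []
    for ch in seq:
        if random.random() < p_ins:      # insert a random symbol before ch
            out.append(random.choice(alphabet))
        if random.random() < p_del:      # drop ch entirely
            continue
        if random.random() < p_sub:      # replace ch with a different symbol
            out.append(random.choice([c for c in alphabet if c != ch]))
        else:
            out.append(ch)
    return "".join(out)

# Frame trace reconstruction as next-token prediction: the model is shown the
# noisy traces and trained to emit the original sequence token by token.
original = "".join(random.choice("ACGT") for _ in range(20))
traces = [ids_channel(original) for _ in range(5)]
prompt = "|".join(traces) + "="   # hypothetical separator format
print("target :", original)
print("prompt :", prompt)
```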
Related papers
- Under-Sampled High-Dimensional Data Recovery via Symbiotic Multi-Prior Tensor Reconstruction [10.666965599523754]
This work proposes a tensor reconstruction method integrating multiple priors to exploit the inherent structure of the data. Specifically, the method combines a learnable decomposition to enforce low-rank constraints on the reconstructed data, a pre-trained convolutional neural network for smoothing and denoising, and block-matching and 3D filtering (BM3D) regularization. Experiments on color image, hyperspectral image, and grayscale video datasets demonstrate the superiority of our method in extreme cases.
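As a hedged illustration of the low-rank prior alone (a truncated-SVD projection standing in for the paper's learnable decomposition, with the CNN and BM3D priors omitted), a toy NumPy recovery loop might look like this:

```python
import numpy as np

def low_rank_prior(X, rank):
    """Project onto the best rank-`rank` approximation via truncated SVD,
    a stand-in for the learnable low-rank decomposition."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank]

# Under-sampled recovery toy: observe 30% of a rank-3 matrix, then alternate
# the low-rank projection with consistency on the observed entries.
rng = np.random.default_rng(0)
X_true = rng.standard_normal((50, 3)) @ rng.standard_normal((3, 50))
mask = rng.random(X_true.shape) < 0.3
X = np.where(mask, X_true, 0.0)
for _ in range(100):
    X = low_rank_prior(X, rank=3)
    X[mask] = X_true[mask]            # keep observed entries fixed
print("relative error:", np.linalg.norm(X - X_true) / np.linalg.norm(X_true))
```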
arXiv Detail & Related papers (2025-04-08T12:55:18Z)
- Re-Visible Dual-Domain Self-Supervised Deep Unfolding Network for MRI Reconstruction [48.30341580103962]
We propose a novel re-visible dual-domain self-supervised deep unfolding network to address these issues. We design a deep unfolding network based on the Chambolle and Pock Proximal Point Algorithm (DUN-CP-PPA) to achieve end-to-end reconstruction. Experiments conducted on the fastMRI and IXI datasets demonstrate that our method significantly outperforms state-of-the-art approaches in terms of reconstruction performance.
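A generic unrolled reconstruction step, with soft-thresholding standing in for the learned modules of DUN-CP-PPA (the paper's actual Chambolle-Pock updates differ), could be sketched as:

```python
import numpy as np

def soft_threshold(x, lam):
    """Proximal operator of lam*||x||_1, standing in for a learned prior."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def unrolled_recon(y, mask, n_iters=20, step=1.0, lam=0.01):
    """Alternate a k-space data-consistency gradient step with a sparsity
    prox -- the generic pattern that deep unfolding networks unroll."""
    x = np.real(np.fft.ifft2(y))                  # zero-filled start
    for _ in range(n_iters):
        residual = mask * (np.fft.fft2(x) - y)    # data-consistency residual
        x = x - step * np.real(np.fft.ifft2(residual))
        x = soft_threshold(x, lam)                # learned module in the paper
    return x

img = np.zeros((64, 64)); img[24:40, 24:40] = 1.0   # sparse toy "image"
mask = np.random.rand(64, 64) < 0.4                 # 40% of k-space sampled
y = mask * np.fft.fft2(img)
print("mean abs error:", np.abs(unrolled_recon(y, mask) - img).mean())
```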
arXiv Detail & Related papers (2025-01-07T12:29:32Z)
- Mitigating the Learning Bias towards Repetition by Self-Contrastive Training for Open-Ended Generation [92.42032403795879]
We show that pretrained language models (LMs) such as GPT-2 still tend to generate repetitive text.
We attribute their overestimation of token-level repetition probabilities to the learning bias.
We find that LMs use longer-range dependencies to predict repetitive tokens than non-repetitive ones, which may be the cause of sentence-level repetition loops.
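A simple proxy for the repetition behavior described above is the fraction of n-grams a text repeats; a sketch (not the paper's metric) is:

```python
def ngram_repetition_rate(tokens, n=4):
    """Fraction of n-grams that already occurred earlier in the text --
    a simple proxy for sentence-level repetition loops."""
    seen, repeated = set(), 0
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    for g in grams:
        if g in seen:
            repeated += 1
        seen.add(g)
    return repeated / max(len(grams), 1)

looping = "the cat sat on the mat the cat sat on the mat".split()
print(ngram_repetition_rate(looping))   # high: the text is in a repetition loop
```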
arXiv Detail & Related papers (2023-07-04T07:53:55Z)
- Making Reconstruction-based Method Great Again for Video Anomaly Detection [64.19326819088563]
Anomaly detection in videos is a significant yet challenging problem.
Existing reconstruction-based methods rely on old-fashioned convolutional autoencoders.
We propose a new autoencoder model for enhanced consecutive frame reconstruction.
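For reference, a minimal convolutional-autoencoder baseline of the kind the paper improves on, scoring frames by reconstruction error, might look like this (architecture and sizes are illustrative):

```python
import torch
import torch.nn as nn

class FrameAutoencoder(nn.Module):
    """Minimal convolutional autoencoder; anomaly score = reconstruction
    error. A generic baseline sketch, not the paper's enhanced model."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.dec(self.enc(x))

def anomaly_score(model, frames):
    """Per-frame mean squared reconstruction error; high error flags anomalies."""
    with torch.no_grad():
        recon = model(frames)
    return ((frames - recon) ** 2).flatten(1).mean(dim=1)

frames = torch.rand(8, 1, 64, 64)   # batch of grayscale video frames
print(anomaly_score(FrameAutoencoder(), frames))
```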
arXiv Detail & Related papers (2023-01-28T01:57:57Z)
- Self-Supervised Training with Autoencoders for Visual Anomaly Detection [61.62861063776813]
We focus on a specific use case in anomaly detection where the distribution of normal samples is supported by a lower-dimensional manifold.
We adapt a self-supervised learning regime that exploits discriminative information during training but focuses on the submanifold of normal examples.
We achieve a new state-of-the-art result on the MVTec AD dataset -- a challenging benchmark for visual anomaly detection in the manufacturing domain.
arXiv Detail & Related papers (2022-06-23T14:16:30Z)
- Gait Cycle Reconstruction and Human Identification from Occluded Sequences [2.198430261120653]
We propose an effective neural network-based model to reconstruct the occluded frames in an input sequence before carrying out gait recognition.
We employ LSTM networks to predict an embedding for each occluded frame both from the forward and the backward directions.
While the LSTMs are trained to minimize the mean-squared loss, the fusion network is trained to optimize the pixel-wise cross-entropy loss between the ground-truth and the reconstructed samples.
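A hedged PyTorch sketch of the bidirectional prediction idea (dimensions, fusion layer, and interface are assumptions, not the paper's architecture):

```python
import torch
import torch.nn as nn

class OcclusionFiller(nn.Module):
    """Predict an embedding for an occluded frame from the forward and
    backward context with two LSTMs, then fuse the two predictions."""
    def __init__(self, dim=128):
        super().__init__()
        self.fwd = nn.LSTM(dim, dim, batch_first=True)
        self.bwd = nn.LSTM(dim, dim, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, before, after):
        # before: frames preceding the gap; after: frames following it
        _, (h_f, _) = self.fwd(before)
        _, (h_b, _) = self.bwd(torch.flip(after, dims=[1]))
        return self.fuse(torch.cat([h_f[-1], h_b[-1]], dim=-1))

before = torch.randn(4, 10, 128)   # 10 visible frame embeddings before the gap
after = torch.randn(4, 10, 128)    # 10 visible frame embeddings after it
pred = OcclusionFiller()(before, after)   # embedding for the missing frame
# Training would minimize MSE between pred and the ground-truth embedding,
# with the fusion/decoder stage trained under pixel-wise cross-entropy.
```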
arXiv Detail & Related papers (2022-06-20T16:04:31Z)
- Single-Read Reconstruction for DNA Data Storage Using Transformers [0.0]
We propose a novel approach for single-read reconstruction using an encoder-decoder Transformer architecture for DNA-based data storage.
Our model achieves lower error rates when reconstructing the original data from a single read of each DNA strand.
This is the first demonstration of using deep learning models for single-read reconstruction in DNA-based storage.
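A minimal encoder-decoder Transformer for mapping one noisy read to the original strand might be sketched as follows; hyperparameters are illustrative and positional encodings are omitted for brevity:

```python
import torch
import torch.nn as nn

class SingleReadReconstructor(nn.Module):
    """Encoder-decoder Transformer mapping one noisy read to the original
    strand (vocabulary: A, C, G, T + pad). Sizes are illustrative."""
    def __init__(self, vocab=5, d_model=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=4,
            num_encoder_layers=2, num_decoder_layers=2, batch_first=True)
        self.out = nn.Linear(d_model, vocab)

    def forward(self, read, target):
        # Teacher forcing: a causal mask keeps the decoder autoregressive.
        tgt_mask = self.transformer.generate_square_subsequent_mask(target.size(1))
        h = self.transformer(self.embed(read), self.embed(target), tgt_mask=tgt_mask)
        return self.out(h)   # per-position logits over the nucleotide vocabulary

read = torch.randint(0, 4, (2, 100))     # two noisy reads of length 100
target = torch.randint(0, 4, (2, 110))   # original strands may differ in length
logits = SingleReadReconstructor()(read, target)
```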
arXiv Detail & Related papers (2021-09-12T10:01:59Z)
- SreaMRAK a Streaming Multi-Resolution Adaptive Kernel Algorithm [60.61943386819384]
Existing implementations of KRR require that all the data be stored in main memory.
We propose StreaMRAK - a streaming version of KRR.
We present a showcase study on two synthetic problems and the prediction of the trajectory of a double pendulum.
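The memory bottleneck is easy to see in a naive exact-KRR implementation, which must materialize the full n x n kernel matrix (the sketch below uses a Gaussian kernel; StreaMRAK's streaming updates are not shown):

```python
import numpy as np

def krr_fit(X, y, lam=1e-3, gamma=1.0):
    """Exact kernel ridge regression: builds the full n x n Gaussian kernel
    matrix, which is what forces all the data into main memory."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-gamma * sq)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def krr_predict(X_train, alpha, X_new, gamma=1.0):
    sq = ((X_new[:, None, :] - X_train[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq) @ alpha

# n = 10^6 samples would need an 8 TB kernel matrix -- the memory wall that
# a streaming method like StreaMRAK is designed to avoid.
X = np.random.randn(500, 2)
y = np.sin(X[:, 0]) + 0.1 * np.random.randn(500)
alpha = krr_fit(X, y)
print(krr_predict(X, alpha, X[:5]), y[:5])
```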
arXiv Detail & Related papers (2021-08-23T21:03:09Z)
- Tail-to-Tail Non-Autoregressive Sequence Prediction for Chinese Grammatical Error Correction [49.25830718574892]
We present a new framework named Tail-to-Tail (TtT) non-autoregressive sequence prediction.
It builds on the observation that most tokens are correct and can be conveyed directly from source to target, while the error positions can be estimated and corrected.
Experimental results on standard datasets, especially on the variable-length datasets, demonstrate the effectiveness of TtT in terms of sentence-level Accuracy, Precision, Recall, and F1-Measure.
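A hedged sketch of the non-autoregressive idea, predicting every output position in parallel from a single encoding (the real TtT model adds a CRF layer and handles variable lengths, which this toy omits):

```python
import torch
import torch.nn as nn

class NonAutoregressiveCorrector(nn.Module):
    """Encode the erroneous sentence once and predict every output token in
    parallel, so correct tokens can simply be copied from source to target."""
    def __init__(self, vocab=10000, d_model=128):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.out = nn.Linear(d_model, vocab)   # per-position token prediction

    def forward(self, tokens):
        return self.out(self.encoder(self.embed(tokens)))

tokens = torch.randint(0, 10000, (2, 32))      # erroneous input sentences
logits = NonAutoregressiveCorrector()(tokens)  # all positions decoded at once
```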
arXiv Detail & Related papers (2021-06-03T05:56:57Z)
- Empirical Error Modeling Improves Robustness of Noisy Neural Sequence Labeling [26.27504889360246]
We propose an empirical error generation approach that employs a sequence-to-sequence model trained to perform translation from error-free to erroneous text.
To overcome the data sparsity problem, which is exacerbated in the case of imperfect textual input, we learned noisy language model-based embeddings.
Our approach outperformed the baseline noise generation and error correction techniques on the erroneous sequence labeling data sets.
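For contrast, the kind of hand-crafted synthetic noiser that serves as the baseline here can be written in a few lines; the learned sequence-to-sequence error model replaces rules like these with errors mined from real data:

```python
import random

def synthetic_noiser(text, p=0.05, alphabet="abcdefghijklmnopqrstuvwxyz "):
    """Baseline synthetic noise: random character substitutions, deletions,
    and insertions applied to clean text."""
    out = []
    for ch in text:
        r = random.random()
        if r < p:            # substitute
            out.append(random.choice(alphabet))
        elif r < 2 * p:      # delete
            continue
        elif r < 3 * p:      # insert
            out.extend([ch, random.choice(alphabet)])
        else:
            out.append(ch)
    return "".join(out)

clean = "the quick brown fox jumps over the lazy dog"
pairs = [(clean, synthetic_noiser(clean)) for _ in range(3)]  # training pairs
```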
arXiv Detail & Related papers (2021-05-25T12:15:45Z)
- Reconstruct Anomaly to Normal: Adversarial Learned and Latent Vector-constrained Autoencoder for Time-series Anomaly Detection [3.727524403726822]
Anomaly detection in time series has been widely researched and has important practical applications.
In recent years, anomaly detection algorithms have mostly been based on deep-learning generative models that use the reconstruction error to detect anomalies.
We propose RAN, based on the idea of Reconstruct Anomalies to Normal, and apply it to unsupervised time series anomaly detection.
arXiv Detail & Related papers (2020-10-14T07:10:55Z)
- Representation Learning for Sequence Data with Deep Autoencoding Predictive Components [96.42805872177067]
We propose a self-supervised representation learning method for sequence data, based on the intuition that useful representations of sequence data should exhibit a simple structure in the latent space.
We encourage this latent structure by maximizing an estimate of predictive information of latent feature sequences, which is the mutual information between past and future windows at each time step.
We demonstrate that our method recovers the latent space of noisy dynamical systems, extracts predictive features for forecasting tasks, and improves automatic speech recognition when used to pretrain the encoder on large amounts of unlabeled data.
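Under a Gaussian assumption, the predictive information between past and future windows has a closed form; the sketch below uses that estimator as an illustration (the paper's estimator differs):

```python
import numpy as np

def gaussian_predictive_info(z, window=4):
    """Mutual information between past and future windows of a latent
    sequence under a Gaussian assumption:
    I(past; future) = 0.5 * (logdet(S_p) + logdet(S_f) - logdet(S_joint))."""
    past = np.stack([z[i:i + window] for i in range(len(z) - 2 * window)])
    future = np.stack([z[i + window:i + 2 * window]
                       for i in range(len(z) - 2 * window)])
    joint = np.hstack([past, future])
    def logdet(a):
        return np.linalg.slogdet(np.cov(a, rowvar=False))[1]
    return 0.5 * (logdet(past) + logdet(future) - logdet(joint))

# A smooth, predictable sequence carries more predictive information
# than white noise.
t = np.linspace(0, 20, 500)
print(gaussian_predictive_info(np.sin(t) + 0.1 * np.random.randn(500)))
print(gaussian_predictive_info(np.random.randn(500)))
```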
arXiv Detail & Related papers (2020-10-07T03:34:01Z)