Related papers: Latent Denoising Makes Good Visual Tokenizers

Latent Denoising Makes Good Visual Tokenizers

URL: http://arxiv.org/abs/2507.15856v1
Date: Mon, 21 Jul 2025 17:59:56 GMT
Title: Latent Denoising Makes Good Visual Tokenizers
Authors: Jiawei Yang, Tianhong Li, Lijie Fan, Yonglong Tian, Yue Wang,
Abstract summary: We introduce the Latent Denoising Tokenizer (l-DeTok), a simple yet effective tokenizer trained to reconstruct clean images from latent embeddings corrupted by interpolative noise and random masking.<n>Our experiments on ImageNet 256x256 demonstrate that our tokenizer consistently outperforms standard tokenizers across six representative generative models.
Score: 20.267773446610377
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Despite their fundamental role, it remains unclear what properties could make visual tokenizers more effective for generative modeling. We observe that modern generative models share a conceptually similar training objective -- reconstructing clean signals from corrupted inputs such as Gaussian noise or masking -- a process we term denoising. Motivated by this insight, we propose aligning tokenizer embeddings directly with the downstream denoising objective, encouraging latent embeddings to be more easily reconstructed even when heavily corrupted. To achieve this, we introduce the Latent Denoising Tokenizer (l-DeTok), a simple yet effective tokenizer trained to reconstruct clean images from latent embeddings corrupted by interpolative noise and random masking. Extensive experiments on ImageNet 256x256 demonstrate that our tokenizer consistently outperforms standard tokenizers across six representative generative models. Our findings highlight denoising as a fundamental design principle for tokenizer development, and we hope it could motivate new perspectives for future tokenizer design.

Related papers

Speculative Jacobi-Denoising Decoding for Accelerating Autoregressive Text-to-image Generation [110.28291466364784]
Speculative Jacobi-Denoising Decoding (SJD2) is a framework that incorporates the denoising process into Jacobi to enable parallel token generation in autoregressive models.<n>Our method introduces a next-clean-token prediction paradigm that enables the pre-trained autoregressive models to accept noise-perturbed token embeddings.
arXiv Detail & Related papers (2025-10-10T04:30:45Z)
Image Tokenizer Needs Post-Training [76.91832192778732]
We propose a novel tokenizer training scheme, focusing on improving latent space construction and decoding respectively.<n>Specifically, we propose a plug-and-play tokenizer training scheme, which significantly enhances the robustness of tokenizer.<n>We further optimize the tokenizer decoder regarding a well-trained generative model to mitigate the distribution difference between generated and reconstructed tokens.
arXiv Detail & Related papers (2025-09-15T21:38:03Z)
Revealing the Implicit Noise-based Imprint of Generative Models [71.94916898756684]
This paper presents a novel framework that leverages noise-based model-specific imprint for the detection task.<n>By aggregating imprints from various generative models, imprints of future models can be extrapolated to expand training data.<n>Our approach achieves state-of-the-art performance across three public benchmarks including GenImage, Synthbuster and Chameleon.
arXiv Detail & Related papers (2025-03-12T12:04:53Z)
"Principal Components" Enable A New Language of Images [79.45806370905775]
We introduce a novel visual tokenization framework that embeds a provable PCA-like structure into the latent token space.<n>Our approach achieves state-of-the-art reconstruction performance and enables better interpretability to align with the human vision system.
arXiv Detail & Related papers (2025-03-11T17:59:41Z)
Robust Latent Matters: Boosting Image Generation with Sampling Error Synthesis [57.7367843129838]
Recent image generation schemes typically capture image distribution in a pre-constructed latent space relying on a frozen image tokenizer.<n>We propose a novel plug-and-play tokenizer training scheme to facilitate latent space construction.
arXiv Detail & Related papers (2025-03-11T12:09:11Z)
MIGA: Mutual Information-Guided Attack on Denoising Models for Semantic Manipulation [39.12448251986432]
We propose Mutual Information-Guided Attack (MIGA) to directly attack deep denoising models.<n>MIGA strategically disrupts denoising models' ability to preserve semantic content via adversarial perturbations.<n>Our findings suggest that denoising models are not always robust and can introduce security risks in real-world applications.
arXiv Detail & Related papers (2025-03-10T06:26:34Z)
Improving Generative Pre-Training: An In-depth Study of Masked Image Modeling and Denoising Models [34.02500148392666]
We investigate the impact of additive noise on pre-training deep networks.<n>We find three critical conditions: corruption and restoration must be applied within the encoder, noise must be introduced in the feature space, and an explicit disentanglement between noised and masked tokens is necessary.
arXiv Detail & Related papers (2024-12-26T07:47:20Z)
Reconstruct-and-Generate Diffusion Model for Detail-Preserving Image Denoising [16.43285056788183]
We propose a novel approach called the Reconstruct-and-Generate Diffusion Model (RnG) Our method leverages a reconstructive denoising network to recover the majority of the underlying clean signal. It employs a diffusion algorithm to generate residual high-frequency details, thereby enhancing visual quality.
arXiv Detail & Related papers (2023-09-19T16:01:20Z)
Masked Image Training for Generalizable Deep Image Denoising [53.03126421917465]
We present a novel approach to enhance the generalization performance of denoising networks. Our method involves masking random pixels of the input image and reconstructing the missing information during training. Our approach exhibits better generalization ability than other deep learning models and is directly applicable to real-world scenarios.
arXiv Detail & Related papers (2023-03-23T09:33:44Z)
CurvPnP: Plug-and-play Blind Image Restoration with Deep Curvature Denoiser [7.442030347967277]
existing plug-and-play image restoration methods are designed for non-blind denoising. We propose a novel framework with blind prior, which can deal with more complicated image restoration problems in the real world. Our model is shown to be able to recover the fine image details tiny structures even when the noise level is different.
arXiv Detail & Related papers (2022-11-14T11:30:24Z)
Dual Adversarial Network: Toward Real-world Noise Removal and Noise Generation [52.75909685172843]
Real-world image noise removal is a long-standing yet very challenging task in computer vision. We propose a novel unified framework to deal with the noise removal and noise generation tasks. Our method learns the joint distribution of the clean-noisy image pairs.
arXiv Detail & Related papers (2020-07-12T09:16:06Z)
Reconstructing the Noise Manifold for Image Denoising [56.562855317536396]
We introduce the idea of a cGAN which explicitly leverages structure in the image noise space. By learning directly a low dimensional manifold of the image noise, the generator promotes the removal from the noisy image only that information which spans this manifold. Based on our experiments, our model substantially outperforms existing state-of-the-art architectures.
arXiv Detail & Related papers (2020-02-11T00:31:31Z)

This list is automatically generated from the titles and abstracts of the papers in this site.