SSDD: Single-Step Diffusion Decoder for Efficient Image Tokenization
- URL: http://arxiv.org/abs/2510.04961v1
- Date: Mon, 06 Oct 2025 15:57:31 GMT
- Title: SSDD: Single-Step Diffusion Decoder for Efficient Image Tokenization
- Authors: Théophane Vallaeys, Jakob Verbeek, Matthieu Cord
- Abstract summary: We introduce a new pixel diffusion decoder architecture for improved scaling and training stability. We use distillation to replicate the performance of the diffusion decoder in an efficient single-step decoder. This makes SSDD the first diffusion decoder optimized for single-step reconstruction trained without adversarial losses.
- Score: 56.12853087022071
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Tokenizers are a key component of state-of-the-art generative image models, extracting the most important features from the signal while reducing data dimension and redundancy. Most current tokenizers are based on KL-regularized variational autoencoders (KL-VAE), trained with reconstruction, perceptual, and adversarial losses. Diffusion decoders have been proposed as a more principled alternative to model the distribution over images conditioned on the latent. However, matching the performance of KL-VAE still requires adversarial losses and incurs higher decoding time due to iterative sampling. To address these limitations, we introduce a new pixel diffusion decoder architecture for improved scaling and training stability, benefiting from transformer components and GAN-free training. We use distillation to replicate the performance of the diffusion decoder in an efficient single-step decoder. This makes SSDD the first diffusion decoder optimized for single-step reconstruction trained without adversarial losses, reaching higher reconstruction quality and faster sampling than KL-VAE. In particular, SSDD improves reconstruction FID from $0.87$ to $0.50$ with $1.4\times$ higher throughput and preserves the generation quality of DiTs with $3.8\times$ faster sampling. As such, SSDD can be used as a drop-in replacement for KL-VAE, and for building higher-quality and faster generative models.
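To make the decoding scheme described in the abstract concrete, below is a minimal, hypothetical sketch of a latent-conditioned single-step decoder and a distillation step against a multi-step diffusion-decoder teacher. All names, shapes, and the toy ConvNet are assumptions for illustration; the actual SSDD architecture (transformer-based) and its distillation objective are defined in the paper.

```python
# Minimal sketch, assuming an 8x-downsampling tokenizer with 4 latent channels.
# The toy ConvNet stands in for SSDD's actual transformer-based decoder, and
# `teacher_decode` stands in for a frozen multi-step diffusion decoder.
import torch
import torch.nn as nn

class SingleStepDecoder(nn.Module):
    """Maps a noise sample, conditioned on a latent code, to an image in one forward pass."""
    def __init__(self, latent_ch=4, img_ch=3, hidden=64):
        super().__init__()
        # Upsample the latent to pixel resolution and project it to feature space.
        self.latent_proj = nn.Sequential(
            nn.Upsample(scale_factor=8, mode="nearest"),
            nn.Conv2d(latent_ch, hidden, 3, padding=1),
        )
        self.net = nn.Sequential(
            nn.Conv2d(hidden + img_ch, hidden, 3, padding=1), nn.SiLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.SiLU(),
            nn.Conv2d(hidden, img_ch, 3, padding=1),
        )

    def forward(self, noise, latent):
        cond = self.latent_proj(latent)
        return self.net(torch.cat([noise, cond], dim=1))

def distillation_step(student, teacher_decode, latent):
    """One training step: regress the teacher's reconstruction with a single student pass."""
    b, _, h, w = latent.shape
    noise = torch.randn(b, 3, h * 8, w * 8, device=latent.device)
    with torch.no_grad():
        target = teacher_decode(latent)   # iterative (multi-step) sampling: slow but accurate
    pred = student(noise, latent)         # single forward pass: fast
    return torch.mean((pred - target) ** 2)
```

One design note: conditioning the student on the latent at full resolution keeps it a drop-in replacement for a KL-VAE decoder, since it consumes the same latents and emits pixels in a single call.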
Related papers
- MeanFlow Transformers with Representation Autoencoders [71.45823902973349]
MeanFlow (MF) is a diffusion-motivated generative model that enables efficient few-step generation by learning long jumps directly from noise to data. We develop an efficient training and sampling scheme for MF in the latent space of a Representation Autoencoder (RAE). We achieve a 1-step FID of 2.03, outperforming vanilla MF's 3.43, while reducing sampling GFLOPS by 38% and total training cost by 83% on ImageNet 256.
arXiv Detail & Related papers (2025-11-17T06:17:08Z)
- Steering One-Step Diffusion Model with Fidelity-Rich Decoder for Fast Image Compression [36.10674664089876]
SODEC is a novel single-step diffusion-based image compression model. It addresses the fidelity degradation that results from over-reliance on generative priors. It significantly outperforms existing methods, achieving superior rate-distortion-perception performance.
arXiv Detail & Related papers (2025-08-07T02:24:03Z)
- Fine-structure Preserved Real-world Image Super-resolution via Transfer VAE Training [14.058527210122831]
We propose a Transfer VAE Training (TVT) strategy to transfer the 8$\times$ downsampled VAE into a 4$\times$ one while adapting to the pre-trained UNet. The TVT strategy aligns the new encoder-decoder pair with the original VAE latent space while enhancing image fine details.
arXiv Detail & Related papers (2025-07-27T14:11:29Z)
- DDT: Decoupled Diffusion Transformer [51.84206763079382]
Diffusion transformers encode noisy inputs to extract the semantic component and decode the higher-frequency component with identical modules. We propose the Decoupled Diffusion Transformer (DDT) to decouple these roles.
arXiv Detail & Related papers (2025-04-08T07:17:45Z)
- Rethinking Video Tokenization: A Conditioned Diffusion-based Approach [58.164354605550194]
A new tokenizer, the Conditioned Diffusion-based Tokenizer (CDT), replaces the GAN-based decoder with a conditional diffusion model. It is trained from scratch using only a basic MSE diffusion loss for reconstruction, along with a KL term and an LPIPS perceptual loss (a sketch of this loss composition follows this entry). Even a scaled-down version of CDT ($3\times$ inference speedup) still performs comparably with top baselines.
arXiv Detail & Related papers (2025-03-05T17:59:19Z)
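For context on the loss composition mentioned in the entry above, a simplified, hypothetical sketch is given below. The weights and the exact diffusion parameterization are illustrative, and `perceptual_fn` stands in for an LPIPS network.

```python
# Simplified sketch of an MSE-diffusion + KL + perceptual objective (assumptions:
# illustrative loss weights; `perceptual_fn` is a caller-supplied LPIPS-style metric).
import torch
import torch.nn.functional as F

def tokenizer_loss(x, x_recon, mu, logvar, perceptual_fn,
                   w_kl=1e-6, w_perc=1.0):
    # MSE reconstruction term between the denoised output and the clean image.
    mse = F.mse_loss(x_recon, x)
    # KL term keeping the encoder's posterior close to a standard normal.
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    # Perceptual (LPIPS-style) term supplied by the caller.
    perc = perceptual_fn(x_recon, x).mean()
    return mse + w_kl * kl + w_perc * perc
```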
- Adversarial Diffusion Compression for Real-World Image Super-Resolution [16.496532580598007]
Real-world image super-resolution (Real-ISR) aims to reconstruct high-resolution images from degraded low-resolution inputs. One-step diffusion networks like OSEDiff and S3Diff avoid costly iterative sampling but still incur high computational costs. This paper proposes a novel Real-ISR method, AdcSR, by distilling the one-step diffusion network OSEDiff into a streamlined diffusion-GAN model.
arXiv Detail & Related papers (2024-11-20T15:13:36Z)
- Epsilon-VAE: Denoising as Visual Decoding [61.29255979767292]
We propose denoising as decoding, shifting from single-step reconstruction to iterative refinement. Specifically, we replace the decoder with a diffusion process that iteratively refines noise to recover the original image (a minimal decoding-loop sketch follows this entry). By adopting iterative reconstruction through diffusion, our autoencoder, namely Epsilon-VAE, achieves high reconstruction quality.
arXiv Detail & Related papers (2024-10-05T08:27:53Z)
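As referenced in the Epsilon-VAE entry above, "denoising as decoding" amounts to running a short sampling loop conditioned on the latent. The sketch below uses a generic flow-matching-style Euler sampler with a hypothetical `velocity_model`; the paper's actual parameterization and sampler may differ.

```python
# Generic sketch of iterative decoding from noise, conditioned on a latent.
# `velocity_model(x, t, latent)` is a hypothetical stand-in for the learned decoder.
import torch

@torch.no_grad()
def diffusion_decode(velocity_model, latent, steps=8, img_shape=(3, 256, 256)):
    b = latent.shape[0]
    x = torch.randn(b, *img_shape, device=latent.device)  # start from pure noise at t=1
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((b,), 1.0 - i * dt, device=latent.device)
        v = velocity_model(x, t, latent)  # predicted velocity toward the data
        x = x - dt * v                    # Euler step from noise (t=1) toward the image (t=0)
    return x
```

Setting `steps=1` collapses this loop into the single-pass regime that SSDD targets through distillation.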
- Reducing Redundancy in the Bottleneck Representation of the Autoencoders [98.78384185493624]
Autoencoders are a type of unsupervised neural network that can be used to solve various tasks.
We propose a scheme to explicitly penalize feature redundancies in the bottleneck representation.
We tested our approach across different tasks: dimensionality reduction using three different datasets, image compression using the MNIST dataset, and image denoising using Fashion-MNIST (a sketch of such a redundancy penalty follows this entry).
arXiv Detail & Related papers (2022-02-09T18:48:02Z)
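The last entry's idea of explicitly penalizing feature redundancy in the bottleneck can be illustrated with an off-diagonal covariance penalty. This is a hypothetical instantiation; the paper's exact formulation may differ.

```python
# Hypothetical redundancy penalty: discourage correlation between bottleneck features.
import torch

def redundancy_penalty(z):
    """z: (batch, features) bottleneck activations."""
    z = z - z.mean(dim=0, keepdim=True)              # center each feature
    cov = (z.T @ z) / (z.shape[0] - 1)               # feature-by-feature covariance
    off_diag = cov - torch.diag(torch.diag(cov))     # keep only cross-feature terms
    return (off_diag ** 2).sum()                     # penalize redundancy

# Usage: total_loss = reconstruction_loss + lam * redundancy_penalty(z)
```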