Related papers: Denoising Vision Transformer Autoencoder with Spectral Self-Regularization

URL: http://arxiv.org/abs/2511.12633v1
Date: Sun, 16 Nov 2025 15:00:32 GMT
Title: Denoising Vision Transformer Autoencoder with Spectral Self-Regularization
Authors: Xunzhi Xiang, Xingye Tian, Guiyu Zhang, Yabo Chen, Shaofeng Zhang, Xuebo Wang, Xin Tao, Qi Fan
Abstract summary: We show that redundant high-frequency components in high-dimensional latent spaces hinder the training convergence of diffusion models. We propose a spectral self-regularization strategy to suppress redundant high-frequency noise while simultaneously preserving reconstruction quality. The resulting Denoising-VAE, a ViT-based autoencoder, produces cleaner, lower-noise latents, leading to improved generative quality and faster optimization convergence.
Score: 21.85836384863372
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Variational autoencoders (VAEs) typically encode images into a compact latent space, reducing computational cost but introducing an optimization dilemma: a higher-dimensional latent space improves reconstruction fidelity but often hampers generative performance. Recent methods attempt to address this dilemma by regularizing high-dimensional latent spaces using external vision foundation models (VFMs). However, it remains unclear how high-dimensional VAE latents affect the optimization of generative models. To our knowledge, our analysis is the first to reveal that redundant high-frequency components in high-dimensional latent spaces hinder the training convergence of diffusion models and, consequently, degrade generation quality. To alleviate this problem, we propose a spectral self-regularization strategy to suppress redundant high-frequency noise while simultaneously preserving reconstruction quality. The resulting Denoising-VAE, a ViT-based autoencoder that does not rely on VFMs, produces cleaner, lower-noise latents, leading to improved generative quality and faster optimization convergence. We further introduce a spectral alignment strategy to facilitate the optimization of Denoising-VAE-based generative models. Our complete method enables diffusion models to converge approximately 2$\times$ faster than with SD-VAE, while achieving state-of-the-art reconstruction quality (rFID = 0.28, PSNR = 27.26) and competitive generation performance (gFID = 1.82) on the ImageNet 256$\times$256 benchmark.
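To make the notion of "high-frequency components" in a latent concrete, the following is a minimal sketch of one way to measure how much of a latent's spectral energy lies above a frequency cutoff. It is an illustration only and not the paper's analysis or regularizer: the function name `high_freq_energy_ratio`, the `(C, H, W)` latent layout, and the radial `cutoff` of 0.5 are all assumptions made here.

```python
import numpy as np

def high_freq_energy_ratio(latent, cutoff=0.5):
    """Fraction of spectral energy above a radial frequency cutoff.

    latent: array of shape (C, H, W) (assumed layout).
    cutoff: radius as a fraction of the per-axis Nyquist frequency
            (hypothetical choice, not taken from the paper).
    """
    _, h, w = latent.shape
    # Per-channel 2D power spectrum over the spatial axes.
    power = np.abs(np.fft.fft2(latent, axes=(-2, -1))) ** 2
    # Frequency coordinates in cycles per sample, in [-0.5, 0.5).
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    # Radial frequency, scaled so that 1.0 is the per-axis Nyquist frequency.
    radius = np.sqrt(fy ** 2 + fx ** 2) / 0.5
    high = radius > cutoff
    total = power.sum()
    return float(power[:, high].sum() / total) if total > 0 else 0.0

# A constant latent has all its energy at zero frequency;
# a checkerboard latent has all its energy at the highest frequency.
ii, jj = np.indices((8, 8))
constant = np.ones((4, 8, 8))
checker = np.where((ii + jj) % 2 == 0, 1.0, -1.0)[None]
print(high_freq_energy_ratio(constant), high_freq_energy_ratio(checker))
```

Under these assumptions, a lower ratio corresponds to a smoother latent. Comparing the ratio across autoencoders would only be meaningful for latents at the same spatial resolution and with the same cutoff.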

Related papers

Improving Reconstruction of Representation Autoencoder [52.817427902597416]
We propose LV-RAE, a representation autoencoder that augments semantic features with missing low-level information. Our experiments demonstrate that LV-RAE significantly improves reconstruction fidelity while preserving the semantic abstraction.
arXiv Detail & Related papers (2026-02-09T13:12:35Z)
Latent Harmony: Synergistic Unified UHD Image Restoration via Latent Space Regularization and Controllable Refinement [89.99237142387655]
We introduce LH-VAE, which enhances semantic robustness through visual semantic constraints and progressive degradations. Latent Harmony is a two-stage framework that redefines VAEs for UHD restoration by jointly regularizing the latent space and enforcing high-frequency-aware reconstruction. Experiments show Latent Harmony achieves state-of-the-art performance across UHD and standard-resolution tasks, effectively balancing efficiency, perceptual quality, and reconstruction accuracy.
arXiv Detail & Related papers (2025-10-09T08:54:26Z)
Generative Image Coding with Diffusion Prior [3.127638190046881]
We propose a novel generative coding framework leveraging diffusion priors to enhance compression performance at low bitrates. We show that our method (1) outperforms existing methods in visual fidelity at low bitrates, (2) improves compression performance by up to 79% over H.266/VVC, and (3) offers an efficient solution for AI-generated content while being adaptable to broader content types.
arXiv Detail & Related papers (2025-09-17T07:32:15Z)
Enhancing Variational Autoencoders with Smooth Robust Latent Encoding [54.74721202894622]
Variational Autoencoders (VAEs) have played a key role in scaling up diffusion-based generative models. We introduce Smooth Robust Latent VAE, a novel adversarial training framework that boosts both generation quality and robustness. Experiments show that SRL-VAE improves both generation quality, in image reconstruction and text-guided image editing, and robustness, against Nightshade attacks and image editing attacks.
arXiv Detail & Related papers (2025-04-24T03:17:57Z)
One-Step Diffusion Model for Image Motion-Deblurring [85.76149042561507]
We propose a one-step diffusion model for deblurring (OSDD), a novel framework that reduces the denoising process to a single step. To tackle fidelity loss in diffusion models, we introduce an enhanced variational autoencoder (eVAE), which improves structural restoration. Our method achieves strong performance on both full and no-reference metrics.
arXiv Detail & Related papers (2025-03-09T09:39:57Z)
Improving the Diffusability of Autoencoders [54.920783089085035]
Latent diffusion models have emerged as the leading approach for generating high-quality images and videos. We perform a spectral analysis of modern autoencoders and identify inordinate high-frequency components in their latent spaces. We hypothesize that this high-frequency component interferes with the coarse-to-fine nature of the diffusion synthesis process and hinders the generation quality.
arXiv Detail & Related papers (2025-02-20T18:45:44Z)
Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models [33.519892081718716]
We propose aligning the latent space with pre-trained vision foundation models when training the visual tokenizers. Our proposed VA-VAE significantly expands the reconstruction-generation frontier of latent diffusion models. We build an enhanced DiT baseline with improved training strategies and architecture designs, termed LightningDiT.
arXiv Detail & Related papers (2025-01-02T18:59:40Z)
Epsilon-VAE: Denoising as Visual Decoding [61.29255979767292]
We propose denoising as decoding, shifting from single-step reconstruction to iterative refinement. Specifically, we replace the decoder with a diffusion process that iteratively refines noise to recover the original image. By adopting iterative reconstruction through diffusion, our autoencoder, namely Epsilon-VAE, achieves high reconstruction quality.
arXiv Detail & Related papers (2024-10-05T08:27:53Z)
LiteVAE: Lightweight and Efficient Variational Autoencoders for Latent Diffusion Models [27.795088366122297]
We introduce LiteVAE, a new autoencoder design for latent diffusion models (LDMs). LiteVAE uses the 2D discrete wavelet transform to enhance scalability and computational efficiency over standard variational autoencoders (VAEs) with no sacrifice in output quality.
arXiv Detail & Related papers (2024-05-23T12:06:00Z)
Conditional Denoising Diffusion for Sequential Recommendation [62.127862728308045]
Of two prominent generative models, Generative Adversarial Networks (GANs) and Variational AutoEncoders (VAEs), GANs suffer from unstable optimization, while VAEs are prone to posterior collapse and over-smoothed generations. We present a conditional denoising diffusion model, which includes a sequence encoder, a cross-attentive denoising decoder, and a step-wise diffuser.
arXiv Detail & Related papers (2023-04-22T15:32:59Z)

This list is automatically generated from the titles and abstracts of the papers on this site.