VAE-REPA: Variational Autoencoder Representation Alignment for Efficient Diffusion Training
- URL: http://arxiv.org/abs/2601.17830v1
- Date: Sun, 25 Jan 2026 13:22:38 GMT
- Title: VAE-REPA: Variational Autoencoder Representation Alignment for Efficient Diffusion Training
- Authors: Mengmeng Wang, Dengyang Jiang, Liuzhuozheng Li, Yucheng Lin, Guojiang Shen, Xiangjie Kong, Yong Liu, Guang Dai, Jingdong Wang
- Abstract summary: This paper proposes VAE-REPA, a lightweight intrinsic guidance framework for efficient diffusion training. VAE-REPA aligns the intermediate latent features of diffusion transformers with VAE features via a lightweight projection layer, supervised by a feature alignment loss. Experiments demonstrate that VAE-REPA improves both generation quality and training convergence speed compared to vanilla diffusion transformers.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Denoising-based diffusion transformers, despite their strong generation performance, suffer from inefficient training convergence. Existing methods addressing this issue, such as REPA (relying on external representation encoders) or SRA (requiring dual-model setups), inevitably incur heavy computational overhead during training due to external dependencies. To tackle these challenges, this paper proposes VAE-REPA, a lightweight intrinsic guidance framework for efficient diffusion training. VAE-REPA leverages off-the-shelf pre-trained Variational Autoencoder (VAE) features: their reconstruction property ensures inherent encoding of visual priors such as rich texture details, structural patterns, and basic semantic information. Specifically, VAE-REPA aligns the intermediate latent features of diffusion transformers with VAE features via a lightweight projection layer, supervised by a feature alignment loss. This design accelerates training without extra representation encoders or dual-model maintenance, resulting in a simple yet effective pipeline. Extensive experiments demonstrate that VAE-REPA improves both generation quality and training convergence speed compared to vanilla diffusion transformers, matches or outperforms state-of-the-art acceleration methods, and incurs merely 4% extra GFLOPs with zero additional cost for external guidance models.
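As a rough illustration of the feature-alignment loss described in the abstract, the sketch below projects transformer hidden states into the VAE feature space and scores them by negative mean patch-wise cosine similarity. The function names, shapes, and the exact cosine form are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def alignment_loss(dit_features, vae_features, proj_weight):
    """Negative mean patch-wise cosine similarity between projected
    transformer features and VAE features (one common form of a
    feature-alignment loss; the paper's formulation may differ)."""
    # Project DiT hidden states into the VAE feature dimension.
    projected = dit_features @ proj_weight  # (num_patches, d_vae)
    # L2-normalize both feature sets per patch.
    p = projected / np.linalg.norm(projected, axis=-1, keepdims=True)
    v = vae_features / np.linalg.norm(vae_features, axis=-1, keepdims=True)
    # Average cosine similarity across patches, negated so lower is better.
    return -np.mean(np.sum(p * v, axis=-1))

# Toy example: 16 patches, DiT width 32, VAE feature dim 8.
rng = np.random.default_rng(0)
h = rng.standard_normal((16, 32))   # hypothetical DiT intermediate features
z = rng.standard_normal((16, 8))    # hypothetical VAE encoder features
W = rng.standard_normal((32, 8)) * 0.02  # lightweight projection layer
loss = alignment_loss(h, z, W)
```

In training, this term would be added to the standard diffusion loss and the projection `W` learned jointly; because the VAE is already part of the latent-diffusion pipeline, no external encoder is needed.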
Related papers
- Beyond External Guidance: Unleashing the Semantic Richness Inside Diffusion Transformers for Improved Training [22.94826927321741]
Recent works have shown that guiding diffusion models with external semantic features can significantly accelerate the training of diffusion transformers (DiTs). We propose Self-Transcendence, a method that achieves fast convergence using internal feature supervision only.
arXiv Detail & Related papers (2026-01-12T17:52:11Z) - ScaleDiff: Higher-Resolution Image Synthesis via Efficient and Model-Agnostic Diffusion [7.233066974580282]
Text-to-image diffusion models often exhibit degraded performance when generating images beyond their training resolution. Recent training-free methods can mitigate this limitation, but they often require substantial computation or are incompatible with recent Diffusion Transformer models. We propose ScaleDiff, a model-agnostic and highly efficient framework for extending the resolution of pretrained diffusion models without any additional training.
arXiv Detail & Related papers (2025-10-29T17:17:32Z) - No Other Representation Component Is Needed: Diffusion Transformers Can Provide Representation Guidance by Themselves [59.79343544931784]
Self-Representation Alignment (SRA) is a simple and straightforward method that obtains representation guidance in a self-distillation manner. Experimental results indicate that applying SRA to DiTs and SiTs yields consistent performance improvements.
arXiv Detail & Related papers (2025-05-05T17:58:05Z) - REPA-E: Unlocking VAE for End-to-End Tuning with Latent Diffusion Transformers [52.55041244336767]
Traditional deep-learning wisdom dictates that end-to-end training is often preferable when possible. For latent diffusion transformers, however, it is observed that end-to-end training of both the VAE and the diffusion model using the standard diffusion loss is ineffective. We show that while the diffusion loss is ineffective, end-to-end training can be unlocked through the representation-alignment (REPA) loss.
arXiv Detail & Related papers (2025-04-14T17:59:53Z) - One-Step Diffusion Model for Image Motion-Deblurring [85.76149042561507]
We propose a one-step diffusion model for deblurring (OSDD), a novel framework that reduces the denoising process to a single step. To tackle fidelity loss in diffusion models, we introduce an enhanced variational autoencoder (eVAE), which improves structural restoration. Our method achieves strong performance on both full- and no-reference metrics.
arXiv Detail & Related papers (2025-03-09T09:39:57Z) - Rethinking Video Tokenization: A Conditioned Diffusion-based Approach [58.164354605550194]
A new tokenizer, the Conditioned Diffusion-based Tokenizer (CDT), replaces the GAN-based decoder with a conditional diffusion model. It is trained from scratch using only a basic MSE diffusion loss for reconstruction, along with a KL term and an LPIPS perceptual loss. Even a scaled-down version of CDT (3x inference speedup) still performs comparably with top baselines.
arXiv Detail & Related papers (2025-03-05T17:59:19Z) - Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think [72.48325960659822]
One main bottleneck in training large-scale diffusion models for generation lies in effectively learning these representations. We study this by introducing a straightforward regularization called REPresentation Alignment (REPA), which aligns the projections of noisy input hidden states in denoising networks with clean image representations obtained from external, pretrained visual encoders. The results are striking: our simple strategy yields significant improvements in both training efficiency and generation quality when applied to popular diffusion and flow-based transformers, such as DiTs and SiTs.
arXiv Detail & Related papers (2024-10-09T14:34:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.