REPA-E: Unlocking VAE for End-to-End Tuning with Latent Diffusion Transformers
- URL: http://arxiv.org/abs/2504.10483v3
- Date: Wed, 22 Oct 2025 10:52:47 GMT
- Title: REPA-E: Unlocking VAE for End-to-End Tuning with Latent Diffusion Transformers
- Authors: Xingjian Leng, Jaskirat Singh, Yunzhong Hou, Zhenchang Xing, Saining Xie, Liang Zheng
- Abstract summary: Traditional deep-learning wisdom dictates that end-to-end training is often preferable when possible. For latent diffusion transformers, however, end-to-end training of both the VAE and the diffusion model using the standard diffusion loss is observed to be ineffective. We show that while the diffusion loss is ineffective, end-to-end training can be unlocked through the representation-alignment (REPA) loss.
- Score: 52.55041244336767
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper we tackle a fundamental question: "Can we train latent diffusion models together with the variational auto-encoder (VAE) tokenizer in an end-to-end manner?" Traditional deep-learning wisdom dictates that end-to-end training is often preferable when possible. However, for latent diffusion transformers, end-to-end training of both the VAE and the diffusion model using the standard diffusion loss is observed to be ineffective, even causing a degradation in final performance. We show that while the diffusion loss is ineffective, end-to-end training can be unlocked through the representation-alignment (REPA) loss, allowing both the VAE and the diffusion model to be jointly tuned during training. Despite its simplicity, the proposed training recipe (REPA-E) shows remarkable performance, speeding up diffusion model training by over 17x and 45x relative to the REPA and vanilla training recipes, respectively. Interestingly, end-to-end tuning with REPA-E also improves the VAE itself, leading to an improved latent space structure and better downstream generation performance. In terms of final performance, the approach sets a new state of the art, achieving FID of 1.12 and 1.69 with and without classifier-free guidance on ImageNet 256 x 256. Code is available at https://end2end-diffusion.github.io.
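A minimal sketch of one REPA-E-style training step, assuming PyTorch. All module names (`vae`, `dit`, `proj`, `encoder`) are hypothetical stand-ins for the tokenizer, the diffusion transformer, the alignment projection head, and a frozen pretrained visual encoder; the key detail, per the abstract's finding that the plain diffusion loss degrades performance, is that only the REPA loss carries gradients into the tokenizer.

```python
import torch
import torch.nn.functional as F

def repa_e_step(vae, dit, proj, encoder, x, opt, lam=0.5):
    """Hedged sketch of one end-to-end REPA-E-style update."""
    z = vae.encode(x)                             # latents from the trainable VAE
    t = torch.rand(x.shape[0], 1, 1, 1, device=x.device)
    noise = torch.randn_like(z)

    # Diffusion branch on detached latents: updates the DiT only, since
    # the diffusion loss is what hurts the VAE when trained end-to-end.
    z_t = (1 - t) * z.detach() + t * noise
    pred, _ = dit(z_t, t.flatten(), return_hidden=True)
    diff_loss = F.mse_loss(pred, noise - z.detach())

    # REPA branch on attached latents: its gradient reaches the VAE.
    z_t_e2e = (1 - t) * z + t * noise
    _, hidden = dit(z_t_e2e, t.flatten(), return_hidden=True)
    with torch.no_grad():
        target = encoder(x)                       # clean-image features, frozen encoder
    # Assumes proj(hidden) and target have matching token counts.
    repa_loss = 1.0 - F.cosine_similarity(proj(hidden), target, dim=-1).mean()

    loss = diff_loss + lam * repa_loss            # lam is an illustrative weight
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

The two-branch forward here is only one way to realize the stop-gradient; the paper's exact loss routing and its auxiliary VAE regularizers should be taken from the released code.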
Related papers
- VAE-REPA: Variational Autoencoder Representation Alignment for Efficient Diffusion Training [53.09658039757408]
This paper proposes VAE-REPA, a lightweight intrinsic guidance framework for efficient diffusion training. VAE-REPA aligns the intermediate latent features of diffusion transformers with VAE features via a lightweight projection layer, supervised by a feature alignment loss. Experiments demonstrate that VAE-REPA improves both generation quality and training convergence speed compared to vanilla diffusion transformers.
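A minimal sketch of the alignment module this abstract describes, assuming PyTorch; the class and dimension names are hypothetical. The distinguishing point versus REPA (below) is that the target features come from the VAE itself rather than an external encoder.

```python
import torch.nn as nn
import torch.nn.functional as F

class VAEFeatureAlignment(nn.Module):
    """Hedged sketch: align DiT hidden states with frozen VAE features
    through a lightweight projection layer and a cosine alignment loss."""

    def __init__(self, dit_dim: int, vae_dim: int):
        super().__init__()
        self.proj = nn.Linear(dit_dim, vae_dim)    # lightweight projection layer

    def forward(self, dit_hidden, vae_feats):
        # vae_feats act as the (detached) teacher signal here.
        aligned = self.proj(dit_hidden)            # [B, N, vae_dim]
        return 1.0 - F.cosine_similarity(aligned, vae_feats.detach(), dim=-1).mean()
```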
arXiv Detail & Related papers (2026-01-25T13:22:38Z)
- Beyond External Guidance: Unleashing the Semantic Richness Inside Diffusion Transformers for Improved Training [22.94826927321741]
Recent works have shown that guiding diffusion models with external semantic features can significantly accelerate the training of diffusion transformers (DiTs). We propose Self-Transcendence, a method that achieves fast convergence using internal feature supervision only.
arXiv Detail & Related papers (2026-01-12T17:52:11Z)
- Diffusion As Self-Distillation: End-to-End Latent Diffusion In One Model [53.77953728335891]
Latent Diffusion Models rely on a complex, three-part architecture consisting of a separate encoder, decoder, and diffusion network. We propose Diffusion as Self-Distillation (DSD), a new framework with key modifications to the training objective that stabilize the latent space. This approach enables, for the first time, the stable end-to-end training of a single network that simultaneously learns to encode, decode, and perform diffusion.
arXiv Detail & Related papers (2025-11-18T17:58:16Z)
- ADT: Tuning Diffusion Models with Adversarial Supervision [16.974169058917443]
Diffusion models have achieved outstanding image generation by reversing a forward noising process to approximate true data distributions.
We propose Adversarial Diffusion Tuning (ADT), which stimulates the inference process during optimization and aligns the final outputs with the training data.
ADT features a siamese-network discriminator with a fixed pre-trained backbone and lightweight trainable parameters.
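A minimal sketch of the discriminator design the summary describes (a fixed pretrained backbone with lightweight trainable parameters), assuming PyTorch and torchvision; the actual backbone, head, and siamese pairing in ADT may differ.

```python
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

class FrozenBackboneDiscriminator(nn.Module):
    """Hedged sketch: a frozen pretrained backbone shared across inputs
    (siamese-style), with only a small trainable head on top."""

    def __init__(self):
        super().__init__()
        self.backbone = resnet50(weights=ResNet50_Weights.DEFAULT)
        self.backbone.fc = nn.Identity()           # expose pooled features
        for p in self.backbone.parameters():
            p.requires_grad_(False)                # fixed pre-trained backbone
        self.head = nn.Sequential(                 # lightweight trainable parameters
            nn.Linear(2048, 256), nn.SiLU(), nn.Linear(256, 1))

    def forward(self, x):
        feats = self.backbone(x)   # params frozen; gradients still reach x,
        return self.head(feats)    # which the generator update needs
```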
arXiv Detail & Related papers (2025-04-15T17:37:50Z)
- One-Step Diffusion Model for Image Motion-Deblurring [85.76149042561507]
We propose a one-step diffusion model for deblurring (OSDD), a novel framework that reduces the denoising process to a single step.
To tackle fidelity loss in diffusion models, we introduce an enhanced variational autoencoder (eVAE), which improves structural restoration.
Our method achieves strong performance on both full and no-reference metrics.
arXiv Detail & Related papers (2025-03-09T09:39:57Z)
- Rethinking Video Tokenization: A Conditioned Diffusion-based Approach [58.164354605550194]
A new tokenizer, the Conditioned Diffusion-based Tokenizer (CDT), replaces the GAN-based decoder with a conditional diffusion model. It is trained from scratch using only a basic MSE diffusion loss for reconstruction, along with a KL term and an LPIPS perceptual loss. Even a scaled-down version of CDT (3× inference speedup) still performs comparably with top baselines.
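A minimal sketch of the training objective as the summary states it (basic MSE diffusion loss plus KL and LPIPS terms), assuming PyTorch and the `lpips` package; the weights are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F
import lpips  # pip install lpips

perceptual = lpips.LPIPS(net="vgg")  # LPIPS perceptual loss

def cdt_style_loss(noise_pred, noise, recon, target, mu, logvar,
                   w_kl=1e-6, w_lpips=1.0):
    """Hedged sketch: MSE diffusion loss for reconstruction, a KL term
    on the latent posterior, and an LPIPS perceptual loss."""
    mse = F.mse_loss(noise_pred, noise)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    lp = perceptual(recon, target).mean()          # images in [-1, 1]
    return mse + w_kl * kl + w_lpips * lp
```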
arXiv Detail & Related papers (2025-03-05T17:59:19Z)
- ProReflow: Progressive Reflow with Decomposed Velocity [52.249464542399636]
Flow matching aims to reflow the diffusion process of diffusion models into a straight line for a few-step and even one-step generation.
We introduce progressive reflow, which progressively reflows the diffusion model over local timestep windows until the whole diffusion process has been reflowed.
We also introduce aligned v-prediction, which highlights the importance of direction matching in flow matching over magnitude matching.
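A minimal sketch of the "direction over magnitude" idea behind aligned v-prediction, assuming PyTorch; the exact formulation and weighting in ProReflow may differ.

```python
import torch.nn.functional as F

def aligned_v_loss(v_pred, v_target, w_dir=1.0, w_mag=0.1):
    """Hedged sketch: weight direction matching above magnitude matching
    when regressing velocities in flow matching."""
    b = v_pred.shape[0]
    p, t = v_pred.reshape(b, -1), v_target.reshape(b, -1)
    dir_loss = 1.0 - F.cosine_similarity(p, t, dim=-1).mean()  # angle between velocities
    mag_loss = F.mse_loss(p.norm(dim=-1), t.norm(dim=-1))      # norm mismatch
    return w_dir * dir_loss + w_mag * mag_loss
```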
arXiv Detail & Related papers (2025-03-05T04:50:53Z)
- Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think [72.48325960659822]
One main bottleneck in training large-scale diffusion models for generation lies in effectively learning meaningful internal representations. We study this by introducing a straightforward regularization called REPresentation Alignment (REPA), which aligns the projections of noisy-input hidden states in denoising networks with clean image representations obtained from external, pretrained visual encoders. The results are striking: this simple strategy yields significant improvements in both training efficiency and generation quality when applied to popular diffusion and flow-based transformers, such as DiTs and SiTs.
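A minimal sketch of the REPA regularizer, assuming PyTorch; loading DINOv2 through torch.hub is shown as one option for the external encoder (REPA evaluates several), and the projection dimensions are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Frozen external visual encoder; DINOv2 via torch.hub as one option.
encoder = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14").eval()
for p in encoder.parameters():
    p.requires_grad_(False)

proj = nn.Linear(1152, 768)  # hypothetical: DiT hidden dim -> encoder dim

def repa_regularizer(dit_hidden, clean_images):
    """Align projected noisy-input hidden states with clean-image features.
    Assumes matching patch-token counts between the two feature maps."""
    with torch.no_grad():
        # DINOv2's forward_features returns patch tokens under this key.
        target = encoder.forward_features(clean_images)["x_norm_patchtokens"]
    return 1.0 - F.cosine_similarity(proj(dit_hidden), target, dim=-1).mean()
```

In training, this term is simply added to the usual diffusion or flow-matching loss with a scalar weight.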
arXiv Detail & Related papers (2024-10-09T14:34:53Z)
- Denoising Diffusion Autoencoders are Unified Self-supervised Learners [58.194184241363175]
This paper shows that the networks in diffusion models, namely denoising diffusion autoencoders (DDAE), are unified self-supervised learners.
DDAE has already learned strongly linearly separable representations within its intermediate layers, without auxiliary encoders.
Our diffusion-based approach achieves 95.9% and 50.0% linear evaluation accuracies on CIFAR-10 and Tiny-ImageNet, respectively.
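A minimal sketch of the linear-evaluation protocol those numbers imply, assuming PyTorch; `ddae_features` is a hypothetical hook that returns frozen intermediate activations at a chosen layer and noise timestep.

```python
import torch
import torch.nn as nn

def linear_probe(ddae_features, loader, dim, num_classes, epochs=10, device="cpu"):
    """Hedged sketch: freeze the denoising network and fit only a linear
    classifier on its intermediate features (linear evaluation)."""
    clf = nn.Linear(dim, num_classes).to(device)
    opt = torch.optim.SGD(clf.parameters(), lr=0.1, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            with torch.no_grad():
                feats = ddae_features(x)       # [B, dim] from the frozen DDAE
            loss = loss_fn(clf(feats), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return clf
```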
arXiv Detail & Related papers (2023-03-17T04:20:47Z)