Beyond External Guidance: Unleashing the Semantic Richness Inside Diffusion Transformers for Improved Training
- URL: http://arxiv.org/abs/2601.07773v1
- Date: Mon, 12 Jan 2026 17:52:11 GMT
- Title: Beyond External Guidance: Unleashing the Semantic Richness Inside Diffusion Transformers for Improved Training
- Authors: Lingchen Sun, Rongyuan Wu, Zhengqiang Zhang, Ruibin Li, Yujing Sun, Shuaizheng Liu, Lei Zhang
- Abstract summary: Recent works have shown that guiding diffusion models with external semantic features can significantly accelerate the training of diffusion transformers (DiTs). We propose Self-Transcendence, a method that achieves fast convergence using internal feature supervision only.
- Score: 22.94826927321741
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent works such as REPA have shown that guiding diffusion models with external semantic features (e.g., DINO) can significantly accelerate the training of diffusion transformers (DiTs). However, this requires the use of pretrained external networks, introducing additional dependencies and reducing flexibility. In this work, we argue that DiTs actually have the power to guide the training of themselves, and propose Self-Transcendence, a simple yet effective method that achieves fast convergence using internal feature supervision only. It is found that the slow convergence in DiT training primarily stems from the difficulty of representation learning in shallow layers. To address this, we initially train the DiT model by aligning its shallow features with the latent representations from the pretrained VAE for a short phase (e.g., 40 epochs), then apply classifier-free guidance to the intermediate features, enhancing their discriminative capability and semantic expressiveness. These enriched internal features, learned entirely within the model, are used as supervision signals to guide a new DiT training. Compared to existing self-contained methods, our approach brings a significant performance boost. It can even surpass REPA in terms of generation quality and convergence speed, but without the need for any external pretrained models. Our method is not only more flexible for different backbones but also has the potential to be adopted for a wider range of diffusion-based generative tasks. The source code of our method can be found at https://github.com/csslc/Self-Transcendence.
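The two ingredients described in the abstract, aligning shallow DiT features with VAE latents and applying a classifier-free-guidance-style extrapolation to intermediate features, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the tensor shapes, the `proj` head, and the cosine-similarity form of the alignment loss are assumptions.

```python
import torch
import torch.nn.functional as F

def alignment_loss(dit_feat: torch.Tensor, vae_latent: torch.Tensor,
                   proj: torch.nn.Module) -> torch.Tensor:
    """Negative cosine similarity between projected shallow DiT features
    and (detached) VAE latents -- internal supervision, no external encoder."""
    z = F.normalize(proj(dit_feat).flatten(1), dim=-1)
    t = F.normalize(vae_latent.detach().flatten(1), dim=-1)
    return -(z * t).sum(dim=-1).mean()

def cfg_features(feat_cond: torch.Tensor, feat_uncond: torch.Tensor,
                 w: float = 2.0) -> torch.Tensor:
    """CFG-style extrapolation applied to intermediate features:
    f = f_uncond + w * (f_cond - f_uncond)."""
    return feat_uncond + w * (feat_cond - feat_uncond)
```

At `w = 1` the extrapolation reduces to the conditional features; `w > 1` pushes them further away from the unconditional ones, which is the sense in which the guided features become more discriminative.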
Related papers
- VAE-REPA: Variational Autoencoder Representation Alignment for Efficient Diffusion Training [53.09658039757408]
This paper proposes VAE-REPA, a lightweight intrinsic guidance framework for efficient diffusion training. VAE-REPA aligns the intermediate latent features of diffusion transformers with VAE features via a lightweight projection layer, supervised by a feature alignment loss. Experiments demonstrate that VAE-REPA improves both generation quality and training convergence speed compared to vanilla diffusion transformers.
arXiv Detail & Related papers (2026-01-25T13:22:38Z) - Diffusion Guidance Is a Controllable Policy Improvement Operator [98.11511661904618]
CFGRL is trained with the simplicity of supervised learning, yet can further improve on the policies in the data. On offline RL tasks, we observe a reliable trend: increased guidance weighting leads to increased performance.
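The guidance-weighting trend mentioned above follows the standard classifier-free guidance rule, shown here in generic noise-prediction form (the paper's exact parameterization may differ; $\varnothing$ denotes the unconditional input and $w$ the guidance weight):

```latex
\hat{\epsilon}_w(x_t, c) = \epsilon_\theta(x_t, \varnothing)
  + w \,\bigl(\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing)\bigr)
```

Increasing $w$ extrapolates further toward the conditional prediction, which in the offline RL setting corresponds to sampling actions more concentrated on the improved policy.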
arXiv Detail & Related papers (2025-05-29T14:06:50Z) - Modular Diffusion Policy Training: Decoupling and Recombining Guidance and Diffusion for Offline RL [0.0]
This paper proposes modular training methods that decouple the guidance module from the diffusion model. Applying two independently trained guidance models, one during training and the other during inference, can significantly reduce normalized score variance.
arXiv Detail & Related papers (2025-05-19T22:51:58Z) - DDAE++: Enhancing Diffusion Models Towards Unified Generative and Discriminative Learning [53.27049077100897]
Generative pre-training has been shown to yield discriminative representations, paving the way towards unified visual generation and understanding. This work introduces self-conditioning, a mechanism that internally leverages the rich semantics inherent in the denoising network to guide its own decoding layers. The results are compelling: our method boosts both generation FID and recognition accuracy with 1% computational overhead and generalizes across diverse diffusion architectures.
arXiv Detail & Related papers (2025-05-16T08:47:16Z) - No Other Representation Component Is Needed: Diffusion Transformers Can Provide Representation Guidance by Themselves [59.79343544931784]
Self-Representation Alignment (SRA) is a simple yet effective method that obtains representation guidance in a self-distillation manner. Experimental results indicate that applying SRA to DiTs and SiTs yields consistent performance improvements.
arXiv Detail & Related papers (2025-05-05T17:58:05Z) - REPA-E: Unlocking VAE for End-to-End Tuning with Latent Diffusion Transformers [52.55041244336767]
Traditional deep-learning wisdom dictates that end-to-end training is often preferable when possible. For latent diffusion transformers, it is observed that end-to-end training of both the VAE and the diffusion model using the standard diffusion loss is ineffective. We show that while the diffusion loss is ineffective, end-to-end training can be unlocked through the representation-alignment (REPA) loss.
arXiv Detail & Related papers (2025-04-14T17:59:53Z) - PLADIS: Pushing the Limits of Attention in Diffusion Models at Inference Time by Leveraging Sparsity [9.092404060771306]
Diffusion models have shown impressive results in generating high-quality conditional samples. However, existing methods often require additional training or neural function evaluations (NFEs). We propose a novel and efficient method, termed PLADIS, which boosts pre-trained models by leveraging sparse attention.
arXiv Detail & Related papers (2025-03-10T07:23:19Z) - Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think [72.48325960659822]
One main bottleneck in training large-scale diffusion models for generation lies in effectively learning these representations. We study this by introducing a straightforward regularization called REPresentation Alignment (REPA), which aligns the projections of noisy input hidden states in denoising networks with clean image representations obtained from external, pretrained visual encoders. The results are striking: our simple strategy yields significant improvements in both training efficiency and generation quality when applied to popular diffusion and flow-based transformers, such as DiTs and SiTs.
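The REPA regularizer described above can be written generically as follows (symbols are illustrative: $h_t$ is the denoiser's hidden state at noise level $t$, $y^{*}$ the pretrained encoder's feature of the clean image, $f_\phi$ a trainable projection, and $\operatorname{sim}$ a patch-wise cosine similarity):

```latex
\mathcal{L}_{\text{REPA}} = -\,\mathbb{E}_{x, t}\!\left[\operatorname{sim}\!\bigl(f_\phi(h_t),\; y^{*}\bigr)\right]
```

This term is added to the standard diffusion or flow-matching loss; only $f_\phi$ and the denoiser are trained, while the external encoder stays frozen.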
arXiv Detail & Related papers (2024-10-09T14:34:53Z) - The Information Pathways Hypothesis: Transformers are Dynamic Self-Ensembles [24.52890377175555]
We propose a general-purpose training strategy for transformers that can reduce both the memory and computational cost of self-attention by 4 to 8 times during training.
We show that an ensemble of sub-models can be formed from the subsampled pathways within a network, which can achieve better performance than its densely attended counterpart.
arXiv Detail & Related papers (2023-06-02T17:28:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.