Representation Entanglement for Generation: Training Diffusion Transformers Is Much Easier Than You Think
- URL: http://arxiv.org/abs/2507.01467v1
- Date: Wed, 02 Jul 2025 08:29:18 GMT
- Title: Representation Entanglement for Generation: Training Diffusion Transformers Is Much Easier Than You Think
- Authors: Ge Wu, Shen Zhang, Ruijing Shi, Shanghua Gao, Zhenyuan Chen, Lei Wang, Zhaowei Chen, Hongcheng Gao, Yao Tang, Jian Yang, Ming-Ming Cheng, Xiang Li
- Abstract summary: REPA and its variants effectively mitigate training challenges in diffusion models by incorporating external visual representations from pretrained models. We argue that the external alignment, which is absent during the entire denoising inference process, falls short of fully harnessing the potential of discriminative representations. We propose Representation Entanglement for Generation (REG), which entangles low-level image latents with a single high-level class token from pretrained foundation models for denoising.
- Score: 56.539823627694304
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: REPA and its variants effectively mitigate training challenges in diffusion models by incorporating external visual representations from pretrained models, through alignment between the noisy hidden projections of denoising networks and foundational clean image representations. We argue that the external alignment, which is absent during the entire denoising inference process, falls short of fully harnessing the potential of discriminative representations. In this work, we propose a straightforward method called Representation Entanglement for Generation (REG), which entangles low-level image latents with a single high-level class token from pretrained foundation models for denoising. REG acquires the capability to produce coherent image-class pairs directly from pure noise, substantially improving both generation quality and training efficiency. This is accomplished with negligible additional inference overhead, requiring only a single additional token for denoising (<0.5\% increase in FLOPs and latency). The inference process concurrently reconstructs both image latents and their corresponding global semantics, where the acquired semantic knowledge actively guides and enhances the image generation process. On ImageNet 256$\times$256, SiT-XL/2 + REG demonstrates remarkable convergence acceleration, achieving $\textbf{63}\times$ and $\textbf{23}\times$ faster training than SiT-XL/2 and SiT-XL/2 + REPA, respectively. More impressively, SiT-L/2 + REG trained for merely 400K iterations outperforms SiT-XL/2 + REPA trained for 4M iterations ($\textbf{10}\times$ longer). Code is available at: https://github.com/Martinser/REG.
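The abstract describes the mechanism concretely enough to sketch. Below is a minimal, illustrative PyTorch sketch of the entanglement idea as stated above: the noisy image-latent tokens are concatenated with one extra token carrying a (noised) class embedding from a frozen pretrained encoder, and the transformer denoises both jointly. All names here (`REGDenoiser`, `foundation_encoder`, `noise_fn`, the backbone's call signature) are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of the REG idea as described in the abstract (not the authors' code).
# Assumptions: `backbone` is a DiT/SiT-style transformer over token sequences, and
# `foundation_encoder` is a frozen pretrained encoder whose class token supplies the
# high-level semantics; both are placeholders for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class REGDenoiser(nn.Module):
    def __init__(self, backbone: nn.Module, latent_dim: int, class_token_dim: int):
        super().__init__()
        self.backbone = backbone                                    # token-wise transformer denoiser
        self.to_token = nn.Linear(class_token_dim, latent_dim)      # embed class token into latent width
        self.from_token = nn.Linear(latent_dim, class_token_dim)    # read the class token back out

    def forward(self, noisy_latents, noisy_class_token, t):
        # Entangle: append ONE extra token to the image-latent sequence.
        cls_tok = self.to_token(noisy_class_token).unsqueeze(1)     # (B, 1, D)
        tokens = torch.cat([noisy_latents, cls_tok], dim=1)         # (B, N+1, D)
        out = self.backbone(tokens, t)                              # joint denoising
        latent_pred = out[:, :-1]                                   # image-latent branch
        class_pred = self.from_token(out[:, -1])                    # semantic branch
        return latent_pred, class_pred

def reg_training_loss(model, foundation_encoder, clean_latents, clean_images, t, noise_fn):
    # `noise_fn` is an assumed helper returning (noised input, denoising target) for a timestep t.
    with torch.no_grad():
        clean_cls = foundation_encoder(clean_images)                # frozen high-level class token
    noisy_latents, latent_target = noise_fn(clean_latents, t)
    noisy_cls, cls_target = noise_fn(clean_cls, t)                  # the class token is noised and denoised too
    latent_pred, cls_pred = model(noisy_latents, noisy_cls, t)
    return F.mse_loss(latent_pred, latent_target) + F.mse_loss(cls_pred, cls_target)
```

At inference, both the image latents and the class token would start from pure noise and be reconstructed together, so the only extra cost is the one appended token.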
Related papers
- REPA Works Until It Doesn't: Early-Stopped, Holistic Alignment Supercharges Diffusion Training [58.33728862521732]
Diffusion Transformers (DiTs) deliver state-of-the-art image quality, yet their training remains notoriously slow. A recent remedy -- representation alignment (REPA) that matches DiT hidden features to those of a non-generative teacher (e.g. DINO) -- dramatically accelerates the early epochs but plateaus or even degrades performance later. We trace this failure to a capacity mismatch: once the generative student begins modelling the joint data distribution, the teacher's lower-dimensional embeddings and attention patterns become a straitjacket rather than a guide. We introduce HASTE.
arXiv Detail & Related papers (2025-05-22T15:34:33Z) - Conjuring Positive Pairs for Efficient Unification of Representation Learning and Image Synthesis [3.5900418884504095]
Unified Self-Supervised Learning (SSL) methods have started to bridge the gap between representation learning and generative modeling. Recent unified SSL methods rely solely on semantic token reconstruction, which requires an external tokenizer during training. We introduce Sorcen, a novel unified SSL framework incorporating a synergic Contrastive-Reconstruction objective.
arXiv Detail & Related papers (2025-03-19T09:53:11Z) - Towards Better Alignment: Training Diffusion Models with Reinforcement Learning Against Sparse Rewards [52.90573877727541]
Reinforcement learning (RL) has been considered for diffusion model fine-tuning, but its effectiveness is limited by the challenge of sparse rewards. B$^2$-DiffuRL is compatible with existing optimization algorithms.
arXiv Detail & Related papers (2025-03-14T09:45:19Z) - When Worse is Better: Navigating the compression-generation tradeoff in visual tokenization [92.17160980120404]
We introduce Causally Regularized Tokenization (CRT), which uses knowledge of the stage 2 generation modeling procedure to embed useful inductive biases in stage 1 latents. CRT makes stage 1 reconstruction performance worse, but makes stage 2 generation performance better by making the tokens easier to model. We match state-of-the-art discrete autoregressive ImageNet generation (2.18 FID) with less than half the tokens per image.
arXiv Detail & Related papers (2024-12-20T20:32:02Z) - E-CAR: Efficient Continuous Autoregressive Image Generation via Multistage Modeling [17.62612090885471]
E-CAR (Efficient Continuous Auto-Regressive Image Generation via Multistage Modeling) operates by generating tokens at increasing resolutions while simultaneously denoising the image at each stage. E-CAR achieves image quality comparable to DiT (Peebles & Xie, 2023) while requiring 10$\times$ fewer FLOPs and providing a 5$\times$ speedup when generating a 256$\times$256 image.
arXiv Detail & Related papers (2024-12-18T18:59:53Z) - Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think [72.48325960659822]
One main bottleneck in training large-scale diffusion models for generation lies in effectively learning these representations. We study this by introducing a straightforward regularization called REPresentation Alignment (REPA), which aligns the projections of noisy input hidden states in denoising networks with clean image representations obtained from external, pretrained visual encoders (a minimal sketch of this alignment loss appears after this list). The results are striking: our simple strategy yields significant improvements in both training efficiency and generation quality when applied to popular diffusion and flow-based transformers, such as DiTs and SiTs.
arXiv Detail & Related papers (2024-10-09T14:34:53Z) - You Only Sample Once: Taming One-Step Text-to-Image Synthesis by Self-Cooperative Diffusion GANs [13.133574069588896]
YOSO is a novel generative model designed for rapid, scalable, and high-fidelity one-step image synthesis with high training stability and mode coverage. We show that our method can serve as a one-step generation model trained from scratch with competitive performance. In particular, YOSO-PixArt-$\alpha$ can generate images in one step when trained at 512 resolution, with the capability of adapting to 1024 resolution without extra explicit training, requiring only 10 A800 days for fine-tuning.
arXiv Detail & Related papers (2024-03-19T17:34:27Z) - ACT-Diffusion: Efficient Adversarial Consistency Training for One-step Diffusion Models [59.90959789767886]
We show that optimizing the consistency training loss minimizes the Wasserstein distance between the target and generated distributions.
By incorporating a discriminator into the consistency training framework, our method achieves improved FID scores on the CIFAR10, ImageNet 64$\times$64, and LSUN Cat 256$\times$256 datasets.
arXiv Detail & Related papers (2023-11-23T16:49:06Z)
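For the REPA entry above, the described regularizer can be sketched as a per-token alignment between projected noisy hidden states of the denoiser and frozen clean-image features from a pretrained encoder. This is a hedged, illustrative sketch only; the head architecture, the negative-cosine-similarity form, and the `lambda_repa` weighting are assumptions rather than the paper's exact configuration.

```python
# Hedged sketch of a REPA-style alignment regularizer (illustrative, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class REPAHead(nn.Module):
    """Small MLP that projects the denoiser's noisy hidden states into the
    feature space of a frozen pretrained visual encoder (assumed architecture)."""
    def __init__(self, hidden_dim: int, target_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.SiLU(),
            nn.Linear(hidden_dim, target_dim),
        )

    def forward(self, hidden_states):
        return self.proj(hidden_states)

def repa_alignment_loss(hidden_states, clean_features, head: REPAHead):
    # Align projected noisy hidden tokens with frozen clean-image features, token by token.
    pred = F.normalize(head(hidden_states), dim=-1)        # (B, N, D_target)
    target = F.normalize(clean_features.detach(), dim=-1)  # features from the frozen encoder
    return -(pred * target).sum(dim=-1).mean()             # negative cosine similarity

# Usage sketch: total_loss = denoising_loss + lambda_repa * repa_alignment_loss(h, feats, head)
```

The HASTE entry above argues that this kind of alignment helps mainly early in training; in this sketch that would amount to annealing `lambda_repa` toward zero, or dropping the term entirely, after an early-stopping point.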