TREAD: Token Routing for Efficient Architecture-agnostic Diffusion Training
- URL: http://arxiv.org/abs/2501.04765v1
- Date: Wed, 08 Jan 2025 18:38:25 GMT
- Title: TREAD: Token Routing for Efficient Architecture-agnostic Diffusion Training
- Authors: Felix Krause, Timy Phan, Vincent Tao Hu, Björn Ommer
- Abstract summary: This work aims to improve the training efficiency of the diffusion backbone by using predefined routes that store token information until it is reintroduced to deeper layers of the model, rather than discarding these tokens entirely.
Unlike most current approaches, TREAD achieves this without architectural modifications.
We show that our method reduces the computational cost and simultaneously boosts model performance on the standard benchmark ImageNet-1K 256×256 in class-conditional synthesis.
- Score: 23.54555663670558
- Abstract: Diffusion models have emerged as the mainstream approach for visual generation. However, these models usually suffer from sample inefficiency and high training costs. This issue is particularly pronounced in the standard diffusion transformer architecture due to its quadratic complexity relative to input length. Recent works have addressed this by reducing the number of tokens processed in the model, often through masking. In contrast, this work aims to improve the training efficiency of the diffusion backbone by using predefined routes that store token information until it is reintroduced to deeper layers of the model, rather than discarding these tokens entirely. Further, we combine multiple routes and introduce an adapted auxiliary loss that accounts for all applied routes. Our method is not limited to the common transformer-based model; it can also be applied to state-space models. Unlike most current approaches, TREAD achieves this without architectural modifications. Finally, we show that our method reduces the computational cost and simultaneously boosts model performance on the standard benchmark ImageNet-1K 256×256 in class-conditional synthesis. Both benefits compound to a convergence speedup of 9.55× at 400K training iterations compared to DiT, and of 25.39× compared to the best benchmark performance of DiT at 7M training iterations.
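As a rough sketch of the routing idea described in the abstract, the PyTorch snippet below processes only a random subset of tokens between two block indices and scatters the stored tokens back to their original positions afterwards. The class name, the single-route setup, and the hyperparameters are illustrative assumptions; the paper additionally combines multiple routes with an adapted auxiliary loss, which this sketch omits.

```python
import torch
import torch.nn as nn

class RoutedBlocks(nn.Module):
    """Sketch of a single TREAD-style route (illustrative, not the official
    implementation): between blocks `route_start` and `route_end`, only a
    random subset of tokens is processed; the rest are stored unchanged and
    scattered back to their original positions when the route ends."""

    def __init__(self, blocks, route_start, route_end, keep_ratio=0.5):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)
        self.route_start, self.route_end = route_start, route_end
        self.keep_ratio = keep_ratio

    def forward(self, x):  # x: (batch, num_tokens, dim)
        B, N, D = x.shape
        stored = None
        for i, block in enumerate(self.blocks):
            if self.training and i == self.route_start:
                n_keep = max(1, int(N * self.keep_ratio))
                perm = torch.rand(B, N, device=x.device).argsort(dim=1)
                keep = perm[:, :n_keep].sort(dim=1).values  # keep token order
                skip = perm[:, n_keep:].sort(dim=1).values
                keep = keep.unsqueeze(-1).expand(-1, -1, D)
                skip = skip.unsqueeze(-1).expand(-1, -1, D)
                stored = (keep, skip, x.gather(1, skip))  # bypassed tokens
                x = x.gather(1, keep)                     # routed subset
            if stored is not None and i == self.route_end:
                keep, skip, skipped_tokens = stored
                full = x.new_empty(B, N, D)
                full.scatter_(1, keep, x)               # processed tokens
                full.scatter_(1, skip, skipped_tokens)  # reintroduced tokens
                x, stored = full, None
            x = block(x)
        return x
```

In a DiT-style model, `blocks` would be the transformer blocks; any modules that preserve the token count work here, consistent with the claim that routing needs no architectural modifications.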
Related papers
- Masked Generative Nested Transformers with Decode Time Scaling [21.34984197218021]
In this work, we aim to address the computational-efficiency bottleneck of inference in visual generation algorithms.
We design a decode-time model-scaling schedule to utilize compute effectively, and we cache and reuse some of the computation.
Our experiments show that, with almost 3× less compute than the baseline, our model obtains competitive performance.
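A toy illustration of what a decode-time scaling schedule could look like; the even split across nested model sizes and the large-to-small ordering are assumptions made for illustration, not the paper's actual schedule.

```python
def decode_schedule(num_steps, sizes=("large", "base", "small")):
    """Toy decode-time model-scaling schedule (illustrative): split the
    decoding steps evenly across nested model sizes, largest first."""
    per = -(-num_steps // len(sizes))  # ceiling division
    return [sizes[min(i // per, len(sizes) - 1)] for i in range(num_steps)]

# decode_schedule(8)
# -> ['large', 'large', 'large', 'base', 'base', 'base', 'small', 'small']
```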
arXiv Detail & Related papers (2025-02-01T09:41:01Z)
- LazyDiT: Lazy Learning for the Acceleration of Diffusion Transformers [79.07412045476872]
Diffusion Transformers have emerged as the preeminent models for a wide array of generative tasks.
We show that performing the full computation of the model at each diffusion step is unnecessary, as some computations can be skipped by lazily reusing the results of previous steps.
We propose a lazy learning framework that efficiently leverages cached results from earlier steps to skip redundant computations.
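The sketch below caches layer outputs across diffusion steps and reuses them when a layer's input has barely changed. Note that LazyDiT learns when to skip; the similarity-threshold heuristic here merely stands in for that learned decision, and the class name and tolerance are illustrative assumptions.

```python
import torch

class LazyCache:
    """Heuristic stand-in for lazy reuse across diffusion steps: if a
    layer's input is nearly unchanged since the previous step, return the
    cached output instead of recomputing the layer."""

    def __init__(self, tol=1e-2):
        self.tol = tol
        self.prev_in, self.prev_out = {}, {}

    def __call__(self, key, layer, x):
        prev = self.prev_in.get(key)
        if prev is not None and prev.shape == x.shape:
            # Relative change of this layer's input since the last step.
            delta = (x - prev).norm() / (prev.norm() + 1e-8)
            if delta < self.tol:
                return self.prev_out[key]  # skip the redundant computation
        out = layer(x)
        self.prev_in[key] = x.detach()
        self.prev_out[key] = out.detach()
        return out
```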
arXiv Detail & Related papers (2024-12-17T01:12:35Z)
- Truncated Consistency Models [57.50243901368328]
Training consistency models requires learning to map all intermediate points along probability flow (PF) ODE trajectories to their corresponding endpoints.
We empirically find that this training paradigm limits the one-step generation performance of consistency models.
We propose a new parameterization of the consistency function and a two-stage training procedure that prevents the truncated-time training from collapsing to a trivial solution.
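A minimal sketch of the truncated-time idea, assuming a two-stage recipe in which stage one trains on the full time range and stage two restricts training times to the later portion of the trajectory; the ranges and names are illustrative, and the paper's actual parameterization of the consistency function differs in detail.

```python
import torch

def sample_training_times(batch_size, stage, t_split=0.5, t_max=1.0):
    """Two-stage time sampling sketch: stage 1 covers the full range
    [0, t_max]; stage 2 truncates to [t_split, t_max]. The paper's new
    parameterization keeps this truncated training from collapsing to a
    trivial solution, which this sketch does not model."""
    lo = 0.0 if stage == 1 else t_split
    return lo + (t_max - lo) * torch.rand(batch_size)
```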
arXiv Detail & Related papers (2024-10-18T22:38:08Z)
- Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think [72.48325960659822]
One main bottleneck in training large-scale diffusion models for generation lies in effectively learning high-quality internal representations.
We study this by introducing a straightforward regularization called REPresentation Alignment (REPA), which aligns the projections of noisy input hidden states in denoising networks with clean image representations obtained from external, pretrained visual encoders.
The results are striking: our simple strategy yields significant improvements in both training efficiency and generation quality when applied to popular diffusion and flow-based transformers, such as DiTs and SiTs.
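A sketch of what such an alignment term can look like: project the denoiser's hidden states and maximize patch-wise cosine similarity with features of the clean image from a frozen, pretrained encoder. The function name and shapes are assumptions; `proj` stands for a trainable projection head and `target_feats` for the external encoder's output.

```python
import torch.nn.functional as F

def repa_style_loss(hidden, target_feats, proj):
    # hidden: (B, N, D) hidden states from the denoising network
    # target_feats: (B, N, D_enc) frozen-encoder features of the clean image
    h = F.normalize(proj(hidden), dim=-1)
    z = F.normalize(target_feats, dim=-1)
    return -(h * z).sum(dim=-1).mean()  # negative mean cosine similarity
```

This term would be added to the usual denoising objective with a weighting coefficient.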
arXiv Detail & Related papers (2024-10-09T14:34:53Z)
- KIND: Knowledge Integration and Diversion in Diffusion Models [40.442303050947395]
We introduce KIND, which performs Knowledge INtegration and Diversion in diffusion models.
KIND redefines traditional pre-training methods by adjusting training objectives from maximizing model performance on current tasks to condensing transferable common knowledge.
Results indicate that KIND achieves state-of-the-art performance compared to other PEFT and learngene methods.
arXiv Detail & Related papers (2024-08-14T07:22:28Z)
- Time-, Memory- and Parameter-Efficient Visual Adaptation [75.28557015773217]
We propose an adaptation method which does not backpropagate gradients through the backbone.
We achieve this by designing a lightweight network in parallel that operates on features from the frozen, pretrained backbone.
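A minimal sketch of this design, with the module structure and dimensions as illustrative assumptions: a small side network consumes detached intermediate features of the frozen backbone, so gradients never propagate through the backbone.

```python
import torch
import torch.nn as nn

class SideAdapter(nn.Module):
    """Lightweight parallel adapter sketch: mixes pooled intermediate
    features from a frozen backbone and classifies them; only the adapter's
    parameters receive gradients."""

    def __init__(self, feat_dims, hidden=64, num_classes=10):
        super().__init__()
        self.mixers = nn.ModuleList(nn.Linear(d, hidden) for d in feat_dims)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, feats):  # feats: list of (B, d_i) pooled features
        h = 0
        for mixer, f in zip(self.mixers, feats):
            # detach() blocks backpropagation into the frozen backbone
            h = h + torch.relu(mixer(f.detach()))
        return self.head(h)
```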
arXiv Detail & Related papers (2024-02-05T10:55:47Z)
- ACT-Diffusion: Efficient Adversarial Consistency Training for One-step Diffusion Models [59.90959789767886]
We show that optimizing the consistency training loss minimizes the Wasserstein distance between the target and generated distributions.
By incorporating a discriminator into the consistency training framework, our method achieves improved FID scores on the CIFAR10, ImageNet 64×64, and LSUN Cat 256×256 datasets.
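A sketch of combining the two objectives; the specific losses, the teacher/student interface, and the generator term are assumptions for illustration rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def adversarial_consistency_losses(f_student, f_ema, disc, x_t, x_s, t, s):
    """Illustrative pairing of a consistency loss with an adversarial term:
    the student's output at time t should match the EMA teacher's output at
    an earlier time s on the same trajectory, while a discriminator pushes
    the student's outputs toward the data distribution."""
    pred = f_student(x_t, t)
    with torch.no_grad():
        target = f_ema(x_s, s)
    consistency = F.mse_loss(pred, target)
    adversarial = -disc(pred).mean()  # non-saturating generator loss
    return consistency, adversarial
```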
arXiv Detail & Related papers (2023-11-23T16:49:06Z)
- Top-KAST: Top-K Always Sparse Training [50.05611544535801]
We propose Top-KAST, a method that preserves constant sparsity throughout training.
We show that it performs comparably to or better than previous works when training models on the established ImageNet benchmark.
In addition to our ImageNet results, we also demonstrate our approach in the domain of language modeling.
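A minimal sketch of the core primitive, magnitude-based top-K masking; the full method also maintains a larger backward set and a sparsity-exploration term, which this sketch omits.

```python
import torch

def topk_mask(weight, density):
    """Keep only the largest-magnitude fraction `density` of the weights;
    everything below the implied threshold is zeroed out."""
    k = max(1, int(weight.numel() * density))
    # k-th largest magnitude = (numel - k + 1)-th smallest magnitude
    thresh = weight.abs().flatten().kthvalue(weight.numel() - k + 1).values
    return (weight.abs() >= thresh).to(weight.dtype)

w = torch.randn(256, 256)
w_sparse = w * topk_mask(w, density=0.1)  # forward pass uses sparse weights
```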
arXiv Detail & Related papers (2021-06-07T11:13:05Z)
- Transfer Learning Between Different Architectures Via Weights Injection [0.0]
This work presents a naive algorithm for parameter transfer between different architectures with a computationally cheap injection technique.
The primary objective is to speed up the training of neural networks from scratch.
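A naive sketch in this spirit, with the matching rule and function name as illustrative assumptions: for every parameter shared by name between source and destination, copy the overlapping slice and leave the rest at its initialization.

```python
def inject_weights(src_state, dst_model):
    """Copy the overlapping region of each same-name parameter from a
    source state dict into a (possibly differently sized) destination
    model; unmatched or incompatible parameters are left untouched."""
    dst_state = dst_model.state_dict()
    for name, dst in dst_state.items():
        src = src_state.get(name)
        if src is None or src.dim() != dst.dim():
            continue  # no structurally compatible counterpart
        region = tuple(slice(0, min(a, b)) for a, b in zip(src.shape, dst.shape))
        dst[region].copy_(src[region])
    dst_model.load_state_dict(dst_state)
```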
arXiv Detail & Related papers (2021-01-07T20:42:35Z)