Exploring Vision Transformers as Diffusion Learners
- URL: http://arxiv.org/abs/2212.13771v1
- Date: Wed, 28 Dec 2022 10:32:59 GMT
- Title: Exploring Vision Transformers as Diffusion Learners
- Authors: He Cao, Jianan Wang, Tianhe Ren, Xianbiao Qi, Yihao Chen, Yuan Yao,
Lei Zhang
- Abstract summary: We systematically explore vision Transformers as diffusion learners for various generative tasks.
With our improvements, the performance of a vanilla ViT-based backbone (IU-ViT) is boosted to be on par with that of traditional U-Net-based methods.
We are the first to successfully train a single diffusion model on a text-to-image task beyond 64x64 resolution.
- Score: 15.32238726790633
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Score-based diffusion models have captured widespread attention and
fueled fast progress in recent vision generative tasks. In this paper, we focus
on the diffusion model backbone, which has been largely neglected so far. We
systematically explore vision Transformers as diffusion learners for various
generative tasks. With our improvements, the performance of a vanilla ViT-based
backbone (IU-ViT) is boosted to be on par with that of traditional U-Net-based methods.
We further offer a hypothesis on the implications of disentangling the
generative backbone into an encoder-decoder structure and show proof-of-concept
experiments verifying the effectiveness of a stronger encoder for generative
tasks with an ASymmetriC ENcoder Decoder (ASCEND). Our improvements achieve
competitive results on CIFAR-10, CelebA, LSUN, CUB Bird and high-resolution
text-to-image tasks. To the best of our knowledge, we are the first to
successfully train a single diffusion model on a text-to-image task beyond 64x64
resolution.
resolution. We hope this will motivate people to rethink the modeling choices
and the training pipelines for diffusion-based generative models.
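The abstract does not spell out the IU-ViT architecture, so the following is only a rough PyTorch illustration of the core idea: a plain vision Transformer acting as the diffusion noise predictor. The class name `ViTDenoiser`, the patch size, depth, and the un-patchify head are all illustrative assumptions, not the paper's design.

```python
# Minimal sketch of a plain ViT used as a diffusion noise predictor.
# Illustrative assumptions only -- not the IU-ViT design from the paper.
import math
import torch
import torch.nn as nn

def timestep_embedding(t, dim):
    # Standard sinusoidal embedding of the diffusion timestep t: (B,) -> (B, dim).
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, device=t.device) / half)
    args = t[:, None].float() * freqs[None]
    return torch.cat([args.sin(), args.cos()], dim=-1)

class ViTDenoiser(nn.Module):
    def __init__(self, img_size=32, patch=4, dim=256, depth=6, heads=8):
        super().__init__()
        self.patch = patch
        n_patches = (img_size // patch) ** 2
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)  # patchify
        self.pos = nn.Parameter(torch.zeros(1, n_patches, dim))
        self.t_mlp = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4,
                                           batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, patch * patch * 3)  # per-patch noise prediction

    def forward(self, x_t, t):
        B, _, H, W = x_t.shape
        tokens = self.embed(x_t).flatten(2).transpose(1, 2) + self.pos
        # Inject the timestep by adding its embedding to every token.
        tokens = tokens + self.t_mlp(timestep_embedding(t, tokens.shape[-1]))[:, None]
        tokens = self.blocks(tokens)
        out = self.head(tokens)                     # (B, N, patch*patch*3)
        p, g = self.patch, H // self.patch
        out = out.view(B, g, g, p, p, 3)            # un-patchify back to an image
        return out.permute(0, 5, 1, 3, 2, 4).reshape(B, 3, H, W)
```

Training would follow the usual denoising objective, e.g. `F.mse_loss(model(x_t, t), eps)` where `eps` is the noise used to corrupt the clean image. As we read the abstract, the ASCEND experiments correspond, roughly, to giving the encoder side of such a backbone more capacity than the decoder side.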
Related papers
- Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think [72.48325960659822]
One main bottleneck in training large-scale diffusion models for generation lies in effectively learning meaningful internal representations.
We study this by introducing a straightforward regularization called REPresentation Alignment (REPA), which aligns the projections of noisy input hidden states in denoising networks with clean image representations obtained from external, pretrained visual encoders.
The results are striking: our simple strategy yields significant improvements in both training efficiency and generation quality when applied to popular diffusion and flow-based transformers, such as DiTs and SiTs.
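As a rough sketch of the REPA idea described above (not the paper's exact formulation): project a hidden state of the denoiser and align it with features of the clean image from a frozen pretrained encoder. The projection head, the use of cosine similarity, and the loss weight are assumptions.

```python
# Sketch of a REPA-style alignment regularizer; details assumed, not from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RepaLoss(nn.Module):
    def __init__(self, hidden_dim, target_dim):
        super().__init__()
        # Small MLP projecting denoiser hidden states into the frozen encoder's space.
        self.proj = nn.Sequential(nn.Linear(hidden_dim, target_dim), nn.SiLU(),
                                  nn.Linear(target_dim, target_dim))

    def forward(self, hidden, target):
        # hidden: (B, N, hidden_dim) hidden states of the denoiser on noisy input.
        # target: (B, N, target_dim) clean-image features from a frozen encoder.
        h = F.normalize(self.proj(hidden), dim=-1)
        z = F.normalize(target, dim=-1)
        return -(h * z).sum(dim=-1).mean()  # maximize cosine similarity

# total_loss = denoising_loss + lambda_repa * repa(hidden_states, frozen_features)
```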
arXiv Detail & Related papers (2024-10-09T14:34:53Z)
- Diffusion Augmented Agents: A Framework for Efficient Exploration and Transfer Learning [6.06616040517684]
DAAG hindsight-relabels the agent's past experience by using diffusion models to transform videos.
A large language model orchestrates this autonomous process without requiring human supervision.
Results show that DAAG improves the learning of reward detectors, the transfer of past experience, and the acquisition of new tasks.
arXiv Detail & Related papers (2024-07-30T13:01:31Z)
- U-KAN Makes Strong Backbone for Medical Image Segmentation and Generation [48.40120035775506]
Kolmogorov-Arnold Networks (KANs) reshape neural network learning via stacks of non-linear learnable activation functions.
We investigate, modify, and re-design the established U-Net pipeline by integrating dedicated KAN layers into the tokenized intermediate representation, termed U-KAN.
We further delve into the potential of U-KAN as an alternative U-Net noise predictor in diffusion models, demonstrating its applicability in generating task-oriented model architectures.
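The summary above gives no layer equations, so the following is a deliberately simplified KAN-style layer, just to illustrate "a learnable activation per edge": it uses a radial-basis expansion where the KAN literature typically uses B-splines, and the residual linear path is a common variant, not necessarily U-KAN's choice.

```python
# Simplified KAN-style layer: every input-output edge applies its own learnable
# activation, parameterized here by a small radial-basis expansion (a stand-in
# for the B-spline parameterization used in the KAN literature).
import torch
import torch.nn as nn

class SimpleKANLayer(nn.Module):
    def __init__(self, in_dim, out_dim, n_basis=8, grid=2.0):
        super().__init__()
        self.register_buffer('centers', torch.linspace(-grid, grid, n_basis))
        # One coefficient vector per (input, output) edge.
        self.coef = nn.Parameter(torch.randn(in_dim, out_dim, n_basis) * 0.1)
        self.base = nn.Linear(in_dim, out_dim)  # residual linear path (common variant)

    def forward(self, x):
        # x: (B, in_dim) -> per-input radial-basis features: (B, in_dim, n_basis)
        phi = torch.exp(-(x[..., None] - self.centers) ** 2)
        # Sum each edge's learnable activation over the inputs: (B, out_dim)
        edge = torch.einsum('bik,iok->bo', phi, self.coef)
        return self.base(x) + edge
```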
arXiv Detail & Related papers (2024-06-05T04:13:03Z)
- Neural Network Parameter Diffusion [50.85251415173792]
Diffusion models have achieved remarkable success in image and video generation.
In this work, we demonstrate that diffusion models can also generate high-performing neural network parameters.
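As a toy illustration of diffusing parameters rather than pixels (the paper's actual architecture and preprocessing are not given in the summary): flatten trained checkpoints into vectors and train a denoiser over them. `ParamDenoiser` and the training step below are hypothetical.

```python
# Toy sketch: treat flattened network parameters as the data a diffusion model denoises.
import torch
import torch.nn as nn
import torch.nn.functional as F

def flatten_params(model):
    # Concatenate all parameters of a trained checkpoint into a single vector.
    return torch.cat([p.detach().flatten() for p in model.parameters()])

class ParamDenoiser(nn.Module):
    def __init__(self, dim, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, hidden), nn.SiLU(),
                                 nn.Linear(hidden, hidden), nn.SiLU(),
                                 nn.Linear(hidden, dim))

    def forward(self, x_t, t_norm):
        # Condition on the normalized timestep by simple concatenation.
        return self.net(torch.cat([x_t, t_norm[:, None]], dim=-1))

def ddpm_step(denoiser, x0, alphas_bar):
    # One DDPM-style training step on a batch of parameter vectors x0: (B, dim);
    # alphas_bar holds the cumulative noise schedule on the same device as x0.
    t = torch.randint(0, len(alphas_bar), (x0.shape[0],), device=x0.device)
    a = alphas_bar[t][:, None]
    eps = torch.randn_like(x0)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps
    return F.mse_loss(denoiser(x_t, t.float() / len(alphas_bar)), eps)
```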
arXiv Detail & Related papers (2024-02-20T16:59:03Z)
- Self-Play Fine-Tuning of Diffusion Models for Text-to-Image Generation [59.184980778643464]
Fine-tuning diffusion models remains an underexplored frontier in generative artificial intelligence (GenAI).
In this paper, we introduce an innovative technique called self-play fine-tuning for diffusion models (SPIN-Diffusion).
Our approach offers an alternative to conventional supervised fine-tuning and RL strategies, significantly improving both model performance and alignment.
arXiv Detail & Related papers (2024-02-15T18:59:18Z)
- SODA: Bottleneck Diffusion Models for Representation Learning [75.7331354734152]
We introduce SODA, a self-supervised diffusion model, designed for representation learning.
The model incorporates an image encoder, which distills a source view into a compact representation that guides the generation of related novel views.
We show that by imposing a tight bottleneck between the encoder and a denoising decoder, we can turn diffusion models into strong representation learners.
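The summary describes a compact latent that guides generation; below is a minimal sketch of such a bottleneck encoder. The layer sizes and the way `z` conditions the denoiser are assumptions, not SODA's actual design.

```python
# Sketch of a SODA-style bottleneck: compress a source view into one compact
# vector z, then condition a denoiser on z while it reconstructs a related view.
import torch
import torch.nn as nn

class BottleneckEncoder(nn.Module):
    def __init__(self, z_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 64, 4, 2, 1), nn.SiLU(),
            nn.Conv2d(64, 128, 4, 2, 1), nn.SiLU(),
            nn.AdaptiveAvgPool2d(1))
        self.fc = nn.Linear(128, z_dim)  # the tight bottleneck: one vector per image

    def forward(self, source_view):
        return self.fc(self.conv(source_view).flatten(1))

# Training (pseudocode):
#   z = encoder(source_view)                           # compact representation
#   loss = mse(denoiser(noisy_target, t, cond=z), eps)
# The bottleneck forces z to carry the semantics needed for the novel view,
# which is what makes the encoder useful for representation learning.
```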
arXiv Detail & Related papers (2023-11-29T18:53:34Z)
- Diffusion Models as Masked Autoencoders [52.442717717898056]
We revisit generatively pre-training visual representations in light of recent interest in denoising diffusion models.
While directly pre-training with diffusion models does not produce strong representations, we condition diffusion models on masked input and formulate them as masked autoencoders (DiffMAE).
We perform a comprehensive study on the pros and cons of design choices and build connections between diffusion models and masked autoencoders.
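The masked-conditioning idea in miniature: the masking ratio, patch size, and restricting the loss to masked regions are assumptions of this sketch, not necessarily DiffMAE's exact recipe.

```python
# Sketch of DiffMAE-style training: visible patches stay clean, masked patches
# carry the noisy signal, and the denoising loss is applied on masked regions.
import torch
import torch.nn.functional as F

def diffmae_step(denoiser, x0, t, alphas_bar, patch=8, mask_ratio=0.75):
    B, C, H, W = x0.shape
    g = H // patch
    # Random per-patch mask (1 = masked), upsampled to pixel resolution.
    mask = (torch.rand(B, 1, g, g, device=x0.device) < mask_ratio).float()
    mask = F.interpolate(mask, scale_factor=patch, mode='nearest')
    eps = torch.randn_like(x0)
    a = alphas_bar[t][:, None, None, None]
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps
    model_in = mask * x_t + (1 - mask) * x0   # clean where visible, noisy where masked
    pred = denoiser(model_in, t)
    # Average the squared error over masked pixels only.
    return ((pred - eps) ** 2 * mask).sum() / (mask.sum() * C).clamp(min=1)
```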
arXiv Detail & Related papers (2023-04-06T17:59:56Z)
- Object-Centric Slot Diffusion [30.722428924152382]
We introduce Latent Slot Diffusion (LSD), a novel model that serves dual purposes.
We demonstrate that LSD significantly outperforms state-of-the-art transformer-based decoders.
We also conduct a preliminary investigation into the integration of pre-trained diffusion models in LSD.
arXiv Detail & Related papers (2023-03-20T02:40:16Z)
- Denoising Diffusion Autoencoders are Unified Self-supervised Learners [58.194184241363175]
This paper shows that the networks in diffusion models, namely denoising diffusion autoencoders (DDAE), are unified self-supervised learners.
DDAE has already learned linearly separable representations within its intermediate layers, without auxiliary encoders.
Our diffusion-based approach achieves 95.9% and 50.0% linear evaluation accuracy on CIFAR-10 and Tiny-ImageNet, respectively.
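The linear-evaluation protocol in miniature: lightly noise the input at a fixed timestep, read activations from one intermediate layer of the frozen denoiser, and fit a linear classifier on them. The choice of layer and timestep below is an assumption; the paper presumably tunes both.

```python
# Sketch of a DDAE-style linear probe on frozen diffusion-denoiser activations.
import torch
import torch.nn as nn

@torch.no_grad()
def extract_features(denoiser, x0, t_fixed, alphas_bar, layer_name):
    # Grab one intermediate layer's output via a forward hook.
    feats = {}
    handle = dict(denoiser.named_modules())[layer_name].register_forward_hook(
        lambda module, inputs, output: feats.update(h=output))
    t = torch.full((x0.shape[0],), t_fixed, dtype=torch.long, device=x0.device)
    a = alphas_bar[t][:, None, None, None]
    denoiser(a.sqrt() * x0 + (1 - a).sqrt() * torch.randn_like(x0), t)
    handle.remove()
    return feats['h'].flatten(1)  # (B, D) features for the linear classifier

# probe = nn.Linear(feature_dim, num_classes)
# Train the probe with cross-entropy on these frozen features; the diffusion
# model itself is never updated during linear evaluation.
```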
arXiv Detail & Related papers (2023-03-17T04:20:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all of the information above) and is not responsible for any consequences of its use.