One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale
- URL: http://arxiv.org/abs/2303.06555v2
- Date: Tue, 30 May 2023 17:42:56 GMT
- Title: One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale
- Authors: Fan Bao, Shen Nie, Kaiwen Xue, Chongxuan Li, Shi Pu, Yaole Wang, Gang Yue, Yue Cao, Hang Su, Jun Zhu
- Abstract summary: This paper proposes a unified diffusion framework (dubbed UniDiffuser) to fit all distributions relevant to a set of multi-modal data in one model.
Inspired by the unified view, UniDiffuser learns all distributions simultaneously with a minimal modification to the original diffusion model.
- Score: 36.590918776922905
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper proposes a unified diffusion framework (dubbed UniDiffuser) to fit
all distributions relevant to a set of multi-modal data in one model. Our key
insight is -- learning diffusion models for marginal, conditional, and joint
distributions can be unified as predicting the noise in the perturbed data,
where the perturbation levels (i.e. timesteps) can be different for different
modalities. Inspired by the unified view, UniDiffuser learns all distributions
simultaneously with a minimal modification to the original diffusion model --
perturbs data in all modalities instead of a single modality, inputs individual
timesteps in different modalities, and predicts the noise of all modalities
instead of a single modality. UniDiffuser is parameterized by a transformer for
diffusion models to handle input types of different modalities. Implemented on
large-scale paired image-text data, UniDiffuser is able to perform image, text,
text-to-image, image-to-text, and image-text pair generation by setting proper
timesteps without additional overhead. In particular, UniDiffuser is able to
produce perceptually realistic samples in all tasks and its quantitative
results (e.g., the FID and CLIP score) are not only superior to existing
general-purpose models but also comparable to the bespoke models (e.g., Stable
Diffusion and DALL-E 2) in representative tasks (e.g., text-to-image
generation).
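For readers who want the mechanics at a glance, below is a minimal sketch of the unified training objective the abstract describes: both modalities are perturbed with independent timesteps and a single transformer predicts the noise of both. The model interface, tensor shapes, and noise schedule are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the unified noise-prediction objective described above.
# The model interface, shapes, and schedule are illustrative assumptions.
import torch
import torch.nn.functional as F

T = 1000  # number of diffusion steps (assumed)

def alpha_bar(t):
    # Placeholder cosine-style cumulative noise schedule; the paper's exact
    # schedule may differ.
    return torch.cos(0.5 * torch.pi * t.float() / T) ** 2

def training_step(model, x_img, x_txt):
    """One joint training step: perturb BOTH modalities with INDEPENDENT
    timesteps and predict the noise of both, as the abstract describes."""
    b = x_img.shape[0]
    t_img = torch.randint(0, T, (b,), device=x_img.device)
    t_txt = torch.randint(0, T, (b,), device=x_txt.device)

    eps_img, eps_txt = torch.randn_like(x_img), torch.randn_like(x_txt)
    a_img = alpha_bar(t_img).view(b, *([1] * (x_img.dim() - 1)))
    a_txt = alpha_bar(t_txt).view(b, *([1] * (x_txt.dim() - 1)))

    z_img = a_img.sqrt() * x_img + (1 - a_img).sqrt() * eps_img
    z_txt = a_txt.sqrt() * x_txt + (1 - a_txt).sqrt() * eps_txt

    # One transformer takes both perturbed modalities and both timesteps and
    # returns a noise estimate per modality (assumed interface).
    pred_img, pred_txt = model(z_img, z_txt, t_img, t_txt)
    return F.mse_loss(pred_img, eps_img) + F.mse_loss(pred_txt, eps_txt)

# At sampling time, fixing the timesteps selects the task (illustrative):
# t_txt = 0 with clean text  -> text-to-image;
# t_img = 0 with a clean image -> image-to-text;
# both timesteps free          -> joint image-text generation;
# conditioning timestep = T    -> marginal (unconditional) generation.
```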
Related papers
- TabDiff: a Multi-Modal Diffusion Model for Tabular Data Generation [91.50296404732902]
We introduce TabDiff, a joint diffusion framework that models all multi-modal distributions of tabular data in one model.
Our key innovation is the development of a joint continuous-time diffusion process for numerical and categorical data.
TabDiff achieves superior average performance over existing competitive baselines, with up to a 22.5% improvement over the state-of-the-art model on pairwise column correlation estimation.
arXiv Detail & Related papers (2024-10-27T22:58:47Z)
- DiffScaler: Enhancing the Generative Prowess of Diffusion Transformers [34.611309081801345]
This paper focuses on enabling a single pre-trained diffusion transformer model to scale across multiple datasets swiftly.
We propose DiffScaler, an efficient scaling strategy for diffusion models where we train a minimal amount of parameters to adapt to different tasks.
We find that transformer-based diffusion models significantly outperform CNN-based diffusion models when fine-tuned on smaller datasets.
arXiv Detail & Related papers (2024-04-15T17:55:43Z)
- Boosting Diffusion Models with Moving Average Sampling in Frequency Domain [101.43824674873508]
Diffusion models rely on the current sample to denoise the next one, possibly resulting in denoising instability.
In this paper, we reinterpret the iterative denoising process as model optimization and leverage a moving average mechanism to ensemble all the prior samples.
We name the complete approach "Moving Average Sampling in Frequency domain (MASF)".
arXiv Detail & Related papers (2024-03-26T16:57:55Z)
- Selective Hourglass Mapping for Universal Image Restoration Based on Diffusion Model [36.57703763466984]
We propose DiffUIR, an advanced selective hourglass mapping strategy based on a diffusion model.
We achieve state-of-the-art performance on five image restoration tasks across 22 benchmarks, in both the universal setting and the zero-shot generalization setting.
arXiv Detail & Related papers (2024-03-17T09:41:20Z)
- Diffusion Random Feature Model [0.0]
We present a diffusion model-inspired deep random feature model that is interpretable.
We derive generalization bounds between the distribution of sampled data and the true distribution using properties of score matching.
We validate our findings by generating samples on the Fashion-MNIST dataset and on instrumental audio data.
arXiv Detail & Related papers (2023-10-06T17:59:05Z)
- DiffDis: Empowering Generative Diffusion Model with Cross-Modal Discrimination Capability [75.9781362556431]
We propose DiffDis to unify the cross-modal generative and discriminative pretraining into one single framework under the diffusion process.
We show that DiffDis outperforms single-task models on both the image generation and the image-text discriminative tasks.
arXiv Detail & Related papers (2023-08-18T05:03:48Z)
- Unite and Conquer: Plug & Play Multi-Modal Synthesis using Diffusion Models [54.1843419649895]
We propose a solution based on denoising diffusion probabilistic models (DDPMs).
Our motivation for choosing diffusion models over other generative models comes from the flexible internal structure of diffusion models.
Our method can unite multiple diffusion models trained on multiple sub-tasks and conquer the combined task.
arXiv Detail & Related papers (2022-12-01T18:59:55Z)
- f-DM: A Multi-stage Diffusion Model via Progressive Signal Transformation [56.04628143914542]
Diffusion models (DMs) have recently emerged as SoTA tools for generative modeling in various domains.
We propose f-DM, a generalized family of DMs which allows progressive signal transformation.
We apply f-DM in image generation tasks with a range of functions, including down-sampling, blurring, and learned transformations.
arXiv Detail & Related papers (2022-10-10T18:49:25Z)
- Image Generation with Multimodal Priors using Denoising Diffusion Probabilistic Models [54.1843419649895]
A major challenge in using generative models to accomplish this task is the lack of paired data containing all modalities and corresponding outputs.
We propose a solution based on denoising diffusion probabilistic models to generate images under multi-modal priors (see the sketch after this entry).
arXiv Detail & Related papers (2022-06-10T12:23:05Z)
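As a rough illustration of sampling under several modal priors with pretrained diffusion models, one simple composition rule is to average the noise estimates of independently trained conditional models at every reverse step. The weighted-average rule and the `m(x_t, t, c)` predictor interface below are assumptions chosen for simplicity, not necessarily the paper's exact procedure.

```python
# Generic illustration (assumed combination rule, not the paper's algorithm):
# combine per-model noise predictions from modality-specific conditional
# diffusion models at each reverse diffusion step.
import torch

@torch.no_grad()
def multi_prior_noise_estimate(x_t, t, models, conditions, weights):
    """models[i] is a conditional noise predictor eps_i(x_t, t, c_i);
    weights[i] is a scalar importance weight for that prior."""
    eps = sum(w * m(x_t, t, c) for m, c, w in zip(models, conditions, weights))
    eps = eps / sum(weights)
    # The combined estimate can then be plugged into any standard
    # DDPM/DDIM update to obtain x_{t-1}.
    return eps
```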
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.