MonoFormer: One Transformer for Both Diffusion and Autoregression
- URL: http://arxiv.org/abs/2409.16280v1
- Date: Tue, 24 Sep 2024 17:51:04 GMT
- Title: MonoFormer: One Transformer for Both Diffusion and Autoregression
- Authors: Chuyang Zhao, Yuxing Song, Wenhao Wang, Haocheng Feng, Errui Ding, Yifan Sun, Xinyan Xiao, Jingdong Wang
- Abstract summary: We propose to study a simple idea: share one transformer for both autoregression and diffusion.
Experimental results show that our approach achieves comparable image generation performance to current state-of-the-art methods.
- Score: 70.81047437281583
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Most existing multimodality methods use separate backbones for autoregression-based discrete text generation and diffusion-based continuous visual generation, or the same backbone by discretizing the visual data so that autoregression can be used for both text and visual generation. In this paper, we propose to study a simple idea: share one transformer for both autoregression and diffusion. The feasibility comes from two main aspects: (i) the transformer has been successfully applied to diffusion for visual generation, and (ii) transformer training for autoregression and diffusion is very similar, the difference lying merely in the attention mask: diffusion uses a bidirectional attention mask, whereas autoregression uses a causal attention mask. Experimental results show that our approach achieves image generation performance comparable to current state-of-the-art methods while maintaining text generation capability. The project is publicly available at https://monoformer.github.io/.
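The attention-mask observation from the abstract can be illustrated with a short sketch (this is not MonoFormer's released code; module names, dimensions, and the two branches shown are illustrative assumptions): the same transformer weights serve both objectives, and only the mask switches between causal attention for autoregressive text and bidirectional attention for diffusion over continuous image latents.

```python
# Minimal sketch of sharing one transformer for autoregression and diffusion.
# Hypothetical module; only the attention mask differs between the two passes.
import torch
import torch.nn as nn

class SharedTransformer(nn.Module):
    def __init__(self, dim=512, heads=8, layers=6):
        super().__init__()
        block = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, num_layers=layers)

    def forward(self, x, causal: bool):
        # x: (batch, seq_len, dim) embedded text tokens or noisy image latents
        seq_len = x.size(1)
        if causal:
            # Autoregressive pass: each position attends only to earlier positions.
            mask = nn.Transformer.generate_square_subsequent_mask(seq_len)
        else:
            # Diffusion pass: full bidirectional attention over the noisy latents.
            mask = None
        return self.blocks(x, mask=mask)

model = SharedTransformer()
text_emb = torch.randn(2, 16, 512)        # embedded text tokens (assumed shape)
noisy_latents = torch.randn(2, 64, 512)   # embedded noisy image latents (assumed shape)
ar_hidden = model(text_emb, causal=True)         # feeds a next-token prediction head
diff_hidden = model(noisy_latents, causal=False) # feeds a denoising prediction head
```

In the paper the two branches are trained jointly with their respective objectives (next-token prediction for text, denoising for images); the sketch omits embeddings, timestep conditioning, and the loss terms.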
Related papers
- Generalizable Origin Identification for Text-Guided Image-to-Image Diffusion Models [39.234894330025114]
Text-guided image-to-image diffusion models excel in translating images based on textual prompts.
This motivates us to introduce the task of origin IDentification for text-guided Image-to-image Diffusion models (ID$^2$).
A straightforward solution to ID$^2$ involves training a specialized deep embedding model to extract and compare features from both query and reference images.
arXiv Detail & Related papers (2025-01-04T20:34:53Z)
- ACDiT: Interpolating Autoregressive Conditional Modeling and Diffusion Transformer [95.80384464922147]
Continuous visual generation requires a full-sequence, diffusion-based approach.
We present ACDiT, an Autoregressive blockwise Conditional Diffusion Transformer.
We demonstrate that ACDiT can be seamlessly used in visual understanding tasks despite being trained on the diffusion objective.
arXiv Detail & Related papers (2024-12-10T18:13:20Z)
- Show-o: One Single Transformer to Unify Multimodal Understanding and Generation [24.58881004205822]
We present a unified transformer, i.e., Show-o, that unifies multimodal understanding and generation.
Unlike fully autoregressive models, Show-o unifies autoregressive and (discrete) diffusion modeling to adaptively handle inputs and outputs of various and mixed modalities.
arXiv Detail & Related papers (2024-08-22T16:32:32Z)
- CogView3: Finer and Faster Text-to-Image Generation via Relay Diffusion [36.95767748269613]
We propose CogView3, an innovative cascaded framework that enhances the performance of text-to-image diffusion.
CogView3 is the first model implementing relay diffusion in the realm of text-to-image generation.
Our experimental results demonstrate that CogView3 outperforms SDXL, the current state-of-the-art open-source text-to-image diffusion model, by 77.0% in human evaluations.
arXiv Detail & Related papers (2024-03-08T07:32:50Z)
- On the Multi-modal Vulnerability of Diffusion Models [56.08923332178462]
We propose MMP-Attack to manipulate the generation results of diffusion models by appending a specific suffix to the original prompt.
Our goal is to induce diffusion models to generate a specific object while simultaneously eliminating the original object.
arXiv Detail & Related papers (2024-02-02T12:39:49Z)
- InfoDiffusion: Information Entropy Aware Diffusion Process for Non-Autoregressive Text Generation [33.52794666968048]
We propose InfoDiffusion, a non-autoregressive text diffusion model.
Our approach introduces a "keyinfo-first" generation strategy and incorporates a noise schedule based on the amount of text information.
Experimental results show that InfoDiffusion outperforms the baseline model in terms of generation quality and diversity.
arXiv Detail & Related papers (2023-10-18T14:01:39Z)
- MosaicFusion: Diffusion Models as Data Augmenters for Large Vocabulary Instance Segmentation [104.03166324080917]
We present MosaicFusion, a simple yet effective diffusion-based data augmentation approach for large vocabulary instance segmentation.
Our method is training-free and does not rely on any label supervision.
Experimental results on the challenging LVIS long-tailed and open-vocabulary benchmarks demonstrate that MosaicFusion can significantly improve the performance of existing instance segmentation models.
arXiv Detail & Related papers (2023-09-22T17:59:42Z)
- DiffDis: Empowering Generative Diffusion Model with Cross-Modal Discrimination Capability [75.9781362556431]
We propose DiffDis to unify the cross-modal generative and discriminative pretraining into one single framework under the diffusion process.
We show that DiffDis outperforms single-task models on both the image generation and the image-text discriminative tasks.
arXiv Detail & Related papers (2023-08-18T05:03:48Z)
- eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers [87.52504764677226]
Large-scale diffusion-based generative models have led to breakthroughs in text-conditioned high-resolution image synthesis.
We train an ensemble of text-to-image diffusion models specialized for different stages of synthesis.
Our ensemble of diffusion models, called eDiffi, results in improved text alignment while maintaining the same inference cost.
arXiv Detail & Related papers (2022-11-02T17:43:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.