MDTv2: Masked Diffusion Transformer is a Strong Image Synthesizer
- URL: http://arxiv.org/abs/2303.14389v2
- Date: Wed, 21 Feb 2024 15:45:20 GMT
- Title: MDTv2: Masked Diffusion Transformer is a Strong Image Synthesizer
- Authors: Shanghua Gao, Pan Zhou, Ming-Ming Cheng, Shuicheng Yan
- Abstract summary: Diffusion probabilistic models (DPMs) often lack the contextual reasoning ability to learn relations among object parts in an image.
We propose a mask latent modeling scheme to explicitly enhance the DPMs' ability to learn contextual relations among object semantic parts in an image.
Experimental results show that MDTv2 achieves superior image synthesis performance, e.g., a new SOTA FID score of 1.58 on the ImageNet dataset, and learns more than 10x faster than the previous SOTA DiT.
- Score: 158.06850125920923
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite its success in image synthesis, we observe that diffusion
probabilistic models (DPMs) often lack contextual reasoning ability to learn
the relations among object parts in an image, leading to a slow learning
process. To solve this issue, we propose a Masked Diffusion Transformer (MDT)
that introduces a mask latent modeling scheme to explicitly enhance the DPMs'
ability to learn contextual relations among object semantic parts in an
image. During training, MDT operates in the latent space to mask certain
tokens. Then, an asymmetric diffusion transformer is designed to predict masked
tokens from unmasked ones while maintaining the diffusion generation process.
Our MDT can reconstruct the full information of an image from its incomplete
contextual input, thus enabling it to learn the associated relations among
image tokens. We further improve MDT with a more efficient macro network
structure and training strategy, named MDTv2. Experimental results show that
MDTv2 achieves superior image synthesis performance, e.g., a new SOTA FID score
of 1.58 on the ImageNet dataset, and has more than 10x faster learning speed
than the previous SOTA DiT. The source code is released at
https://github.com/sail-sg/MDT.
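To make the mask latent modeling scheme concrete, here is a minimal sketch of how latent tokens might be randomly masked during training. It is not the authors' implementation: the function name, tensor shapes, and the 0.3 mask ratio are illustrative assumptions; the released code at https://github.com/sail-sg/MDT is the authoritative reference.

```python
import torch

def mask_latent_tokens(latent_tokens: torch.Tensor, mask_ratio: float = 0.3):
    """Randomly hide a fraction of latent tokens; return kept tokens and the mask.

    latent_tokens: (batch, num_tokens, dim) tokens of the noised latent image.
    """
    b, n, d = latent_tokens.shape
    num_keep = int(n * (1.0 - mask_ratio))

    # Per-sample random ordering of token indices; keep the first num_keep.
    noise = torch.rand(b, n, device=latent_tokens.device)
    ids_keep = noise.argsort(dim=1)[:, :num_keep]

    # Gather the unmasked tokens that the encoder actually sees.
    kept = torch.gather(latent_tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, d))

    # Boolean mask over all positions: True where a token was masked out.
    mask = torch.ones(b, n, dtype=torch.bool, device=latent_tokens.device)
    mask.scatter_(1, ids_keep, False)
    return kept, mask
```

In a full training step, the asymmetric diffusion transformer would encode only the kept tokens and be trained to predict the masked ones alongside the standard diffusion (noise-prediction) objective, which is the contextual-relation signal the abstract describes.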
Related papers
- MDT-A2G: Exploring Masked Diffusion Transformers for Co-Speech Gesture Generation [44.74056930805525]
We introduce a novel Masked Diffusion Transformer for co-speech gesture generation, referred to as MDT-A2G.
This model employs a mask modeling scheme specifically designed to strengthen temporal relation learning among sequence gestures.
Experimental results demonstrate that MDT-A2G excels in gesture generation, boasting a learning speed that is over 6x faster than traditional diffusion transformers.
arXiv Detail & Related papers (2024-08-06T17:29:01Z)
- Unified Auto-Encoding with Masked Diffusion [15.264296748357157]
We propose a unified self-supervised objective, dubbed Unified Masked Diffusion (UMD).
UMD combines patch-based and noise-based corruption techniques within a single auto-encoding framework.
It achieves strong performance in downstream generative and representation learning tasks.
arXiv Detail & Related papers (2024-06-25T16:24:34Z)
- Fast Training of Diffusion Models with Masked Transformers [107.77340216247516]
We propose an efficient approach to train large diffusion models with masked transformers.
Specifically, we randomly mask out a high proportion of patches in diffused input images during training.
Experiments on ImageNet-256x256 and ImageNet-512x512 show that our approach achieves competitive and even better generative performance than the state-of-the-art Diffusion Transformer (DiT) model.
arXiv Detail & Related papers (2023-06-15T17:38:48Z)
- DiffuMask: Synthesizing Images with Pixel-level Annotations for Semantic Segmentation Using Diffusion Models [68.21154597227165]
We show that it is possible to automatically obtain accurate semantic masks of synthetic images generated by an off-the-shelf Stable Diffusion model.
Our approach, called DiffuMask, exploits the potential of the cross-attention map between text and image.
arXiv Detail & Related papers (2023-03-21T08:43:15Z)
- Multi-scale Transformer Network with Edge-aware Pre-training for Cross-Modality MR Image Synthesis [52.41439725865149]
Cross-modality magnetic resonance (MR) image synthesis can be used to generate missing modalities from given ones.
Existing (supervised learning) methods often require large amounts of paired multi-modal data to train an effective synthesis model.
We propose a Multi-scale Transformer Network (MT-Net) with edge-aware pre-training for cross-modality MR image synthesis.
arXiv Detail & Related papers (2022-12-02T11:40:40Z)
- Multimodal Masked Autoencoders Learn Transferable Representations [127.35955819874063]
We propose a simple and scalable network architecture, the Multimodal Masked Autoencoder (M3AE).
M3AE learns a unified encoder for both vision and language data via masked token prediction.
We provide an empirical study of M3AE trained on a large-scale image-text dataset, and find that M3AE is able to learn generalizable representations that transfer well to downstream tasks.
arXiv Detail & Related papers (2022-05-27T19:09:42Z)
- MaskGIT: Masked Generative Image Transformer [49.074967597485475]
MaskGIT learns to predict randomly masked tokens by attending to tokens in all directions.
Experiments demonstrate that MaskGIT significantly outperforms the state-of-the-art transformer model on the ImageNet dataset.
arXiv Detail & Related papers (2022-02-08T23:54:06Z)
- DCT-Mask: Discrete Cosine Transform Mask Representation for Instance Segmentation [50.70679435176346]
We propose a new mask representation by applying the discrete cosine transform (DCT) to encode the high-resolution binary grid mask into a compact vector.
Our method, termed DCT-Mask, could be easily integrated into most pixel-based instance segmentation methods (a rough sketch of this encoding appears after this list).
arXiv Detail & Related papers (2020-11-19T15:00:21Z)
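As a rough illustration of the DCT-Mask idea above, the sketch below encodes a binary mask into a short vector of low-frequency DCT coefficients and decodes an approximate mask back. The block size k, the 0.5 threshold, and the function names are assumptions for illustration, not the paper's implementation.

```python
import numpy as np
from scipy.fft import dctn, idctn

def encode_mask(mask: np.ndarray, k: int = 16) -> np.ndarray:
    """Encode an (H, W) binary mask as the top-left k x k block of its 2-D DCT."""
    coeffs = dctn(mask.astype(np.float64), norm="ortho")
    return coeffs[:k, :k].flatten()  # compact vector of low-frequency terms

def decode_mask(vector: np.ndarray, shape: tuple, k: int = 16) -> np.ndarray:
    """Reconstruct an approximate binary mask from the compact DCT vector."""
    coeffs = np.zeros(shape)
    coeffs[:k, :k] = vector.reshape(k, k)
    return (idctn(coeffs, norm="ortho") >= 0.5).astype(np.uint8)
```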