Cross-view Masked Diffusion Transformers for Person Image Synthesis
- URL: http://arxiv.org/abs/2402.01516v2
- Date: Mon, 3 Jun 2024 14:53:20 GMT
- Title: Cross-view Masked Diffusion Transformers for Person Image Synthesis
- Authors: Trung X. Pham, Zhang Kang, Chang D. Yoo,
- Abstract summary: We present X-MDPT, a novel diffusion model designed for pose-guided human image generation.
X-MDPT distinguishes itself by employing masked diffusion transformers that operate on latent patches.
Our model outperforms state-of-the-art approaches on the DeepFashion dataset.
- Score: 21.242398582282522
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present X-MDPT ($\underline{Cross}$-view $\underline{M}$asked $\underline{D}$iffusion $\underline{P}$rediction $\underline{T}$ransformers), a novel diffusion model designed for pose-guided human image generation. X-MDPT distinguishes itself by employing masked diffusion transformers that operate on latent patches, a departure from the commonly-used Unet structures in existing works. The model comprises three key modules: 1) a denoising diffusion Transformer, 2) an aggregation network that consolidates conditions into a single vector for the diffusion process, and 3) a mask cross-prediction module that enhances representation learning with semantic information from the reference image. X-MDPT demonstrates scalability, improving FID, SSIM, and LPIPS with larger models. Despite its simple design, our model outperforms state-of-the-art approaches on the DeepFashion dataset while exhibiting efficiency in terms of training parameters, training time, and inference speed. Our compact 33MB model achieves an FID of 7.42, surpassing a prior Unet latent diffusion approach (FID 8.07) using only $11\times$ fewer parameters. Our best model surpasses the pixel-based diffusion with $\frac{2}{3}$ of the parameters and achieves $5.43 \times$ faster inference. The code is available at https://github.com/trungpx/xmdpt.
Related papers
- Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model [101.65105730838346]
We introduce Transfusion, a recipe for training a multi-modal model over discrete and continuous data.
We pretrain multiple Transfusion models up to 7B parameters from scratch on a mixture of text and image data.
Our experiments show that Transfusion scales significantly better than quantizing images and training a language model over discrete image tokens.
arXiv Detail & Related papers (2024-08-20T17:48:20Z) - SparseDM: Toward Sparse Efficient Diffusion Models [20.783533300147866]
We propose a method based on the improved Straight-Through Estimator to improve the deployment efficiency of diffusion models.
Experiments on four datasets conducted on a state-of-the-art Transformer-based diffusion model demonstrate that our method reduces MACs by $50%$ while increasing FID by only 1.5 on average.
arXiv Detail & Related papers (2024-04-16T10:31:06Z) - FreeSeg-Diff: Training-Free Open-Vocabulary Segmentation with Diffusion Models [56.71672127740099]
We focus on the task of image segmentation, which is traditionally solved by training models on closed-vocabulary datasets.
We leverage different and relatively small-sized, open-source foundation models for zero-shot open-vocabulary segmentation.
Our approach (dubbed FreeSeg-Diff), which does not rely on any training, outperforms many training-based approaches on both Pascal VOC and COCO datasets.
arXiv Detail & Related papers (2024-03-29T10:38:25Z) - DFormer: Diffusion-guided Transformer for Universal Image Segmentation [86.73405604947459]
The proposed DFormer views universal image segmentation task as a denoising process using a diffusion model.
At inference, our DFormer directly predicts the masks and corresponding categories from a set of randomly-generated masks.
Our DFormer outperforms the recent diffusion-based panoptic segmentation method Pix2Seq-D with a gain of 3.6% on MS COCO val 2017 set.
arXiv Detail & Related papers (2023-06-06T06:33:32Z) - SnapFusion: Text-to-Image Diffusion Model on Mobile Devices within Two
Seconds [88.06788636008051]
Text-to-image diffusion models can create stunning images from natural language descriptions that rival the work of professional artists and photographers.
These models are large, with complex network architectures and tens of denoising iterations, making them computationally expensive and slow to run.
We present a generic approach that unlocks running text-to-image diffusion models on mobile devices in less than $2$ seconds.
arXiv Detail & Related papers (2023-06-01T17:59:25Z) - LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image
Diffusion Models with Large Language Models [62.75006608940132]
This work proposes to enhance prompt understanding capabilities in text-to-image diffusion models.
Our method leverages a pretrained large language model for grounded generation in a novel two-stage process.
Our method significantly outperforms the base diffusion model and several strong baselines in accurately generating images.
arXiv Detail & Related papers (2023-05-23T03:59:06Z) - I$^2$SB: Image-to-Image Schr\"odinger Bridge [87.43524087956457]
Image-to-Image Schr"odinger Bridge (I$2$SB) is a new class of conditional diffusion models.
I$2$SB directly learns the nonlinear diffusion processes between two given distributions.
We show that I$2$SB surpasses standard conditional diffusion models with more interpretable generative processes.
arXiv Detail & Related papers (2023-02-12T08:35:39Z) - Designing BERT for Convolutional Networks: Sparse and Hierarchical
Masked Modeling [23.164631160130092]
We extend the success of BERT-style pre-training, or the masked image modeling, to convolutional networks (convnets)
We treat unmasked pixels as sparse voxels of 3D point clouds and use sparse convolution to encode.
This is the first use of sparse convolution for 2D masked modeling.
arXiv Detail & Related papers (2023-01-09T18:59:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.