GUNet: A Graph Convolutional Network United Diffusion Model for Stable and Diversity Pose Generation
- URL: http://arxiv.org/abs/2409.11689v1
- Date: Wed, 18 Sep 2024 04:05:59 GMT
- Title: GUNet: A Graph Convolutional Network United Diffusion Model for Stable and Diversity Pose Generation
- Authors: Shuowen Liang, Sisi Li, Qingyun Wang, Cen Zhang, Kaiquan Zhu, Tian Yang
- Abstract summary: We propose PoseDiffusion, a framework whose main model is GUNet.
It is the first generative framework for pose skeletons based on a diffusion model, and it also contains a series of variants fine-tuned from a Stable Diffusion model.
Results show that PoseDiffusion outperforms existing SoTA algorithms in terms of stability and diversity of text-driven pose skeleton generation.
- Score: 7.0646249774097525
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pose skeleton images are an important reference in pose-controllable image generation. To enrich the source of skeleton images, recent works have investigated generating pose skeletons from natural language. These methods are based on GANs; however, it remains challenging to generate diverse, structurally correct, and aesthetically pleasing human pose skeletons from varied textual inputs. To address this problem, we propose PoseDiffusion, a framework whose main model is GUNet. It is the first generative framework for this task based on a diffusion model, and it also contains a series of variants fine-tuned from a Stable Diffusion model. PoseDiffusion demonstrates several desired properties that existing methods lack. 1) Correct Skeletons. GUNet, the denoising model of PoseDiffusion, incorporates graph convolutional neural networks and learns the spatial relationships of the human skeleton by introducing skeletal information during training. 2) Diversity. We decouple the key points of the skeleton, characterise them separately, and use cross-attention to introduce textual conditions. Experimental results show that PoseDiffusion outperforms existing SoTA algorithms in the stability and diversity of text-driven pose skeleton generation. Qualitative analyses further demonstrate its superiority for controllable generation in Stable Diffusion.
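The two mechanisms named in the abstract, graph convolution over the skeleton and cross-attention to the text condition, can be illustrated with a short sketch. The block below is a minimal PyTorch interpretation, not the authors' code: the adjacency handling, layer sizes, and all module names are assumptions.

```python
# Hypothetical GCN + cross-attention denoising block in the spirit of the
# GUNet description; shapes, topology handling, and names are assumptions.
import torch
import torch.nn as nn

class GCNDenoiseBlock(nn.Module):
    def __init__(self, dim: int, adjacency: torch.Tensor, text_dim: int):
        super().__init__()
        # Symmetrically normalised adjacency: A_hat = D^{-1/2} (A + I) D^{-1/2}.
        a = adjacency + torch.eye(adjacency.size(0))
        d = a.sum(-1).rsqrt()
        self.register_buffer("a_hat", d[:, None] * a * d[None, :])
        self.gcn = nn.Linear(dim, dim)  # shared per-keypoint transform
        self.attn = nn.MultiheadAttention(dim, num_heads=4, kdim=text_dim,
                                          vdim=text_dim, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        # x: (B, J, dim) noisy per-keypoint features; text_tokens: (B, T, text_dim).
        # Graph convolution mixes each keypoint's features with its neighbours'.
        x = x + torch.relu(self.a_hat @ self.gcn(self.norm1(x)))
        # Each keypoint token attends to the text condition (cross-attention).
        attn_out, _ = self.attn(self.norm2(x), text_tokens, text_tokens)
        return x + attn_out
```

Here `adjacency` would be the binary bone-connectivity matrix of the keypoint graph (e.g., a 17x17 matrix for a COCO-style skeleton), so each noisy keypoint feature is refined along skeleton edges before attending to the text embedding, matching the abstract's "correct skeletons" and "diversity" mechanisms respectively.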
Related papers
- From Text to Pose to Image: Improving Diffusion Model Control and Quality [0.5183511047901651]
We introduce a text-to-pose (T2P) generative model alongside a new sampling algorithm, and a new pose adapter that incorporates more pose keypoints for higher pose fidelity.
Together, these two new state-of-the-art models enable, for the first time, a generative text-to-pose-to-image framework for higher pose control in diffusion models.
arXiv Detail & Related papers (2024-11-19T21:34:50Z)
- GRPose: Learning Graph Relations for Human Image Generation with Pose Priors [21.971188335727074]
We propose a framework delving into the graph relations of pose priors to provide control information for human image generation.
Our model achieves superior performance, with a 9.98% increase in pose average precision compared to the latest benchmark model.
arXiv Detail & Related papers (2024-08-29T13:58:34Z)
- FreeSeg-Diff: Training-Free Open-Vocabulary Segmentation with Diffusion Models [56.71672127740099]
We focus on the task of image segmentation, which is traditionally solved by training models on closed-vocabulary datasets.
We leverage several relatively small, open-source foundation models for zero-shot open-vocabulary segmentation.
Our approach (dubbed FreeSeg-Diff), which does not rely on any training, outperforms many training-based approaches on both Pascal VOC and COCO datasets.
arXiv Detail & Related papers (2024-03-29T10:38:25Z)
- Coarse-to-Fine Latent Diffusion for Pose-Guided Person Image Synthesis [65.7968515029306]
We propose a novel Coarse-to-Fine Latent Diffusion (CFLD) method for Pose-Guided Person Image Synthesis (PGPIS).
A perception-refined decoder is designed to progressively refine a set of learnable queries and extract semantic understanding of person images as a coarse-grained prompt.
arXiv Detail & Related papers (2024-02-28T06:07:07Z)
- Introducing Shape Prior Module in Diffusion Model for Medical Image Segmentation [7.7545714516743045]
We propose an end-to-end framework called VerseDiff-UNet, which leverages the denoising diffusion probabilistic model (DDPM).
Our approach integrates the diffusion model into a standard U-shaped architecture; a generic DDPM training step is sketched after this entry.
We evaluate our method on a single dataset of spine images acquired through X-ray imaging.
arXiv Detail & Related papers (2023-09-12T03:05:00Z)
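For context on the DDPM objective the entry above builds on, here is the standard noise-prediction training step with a linear beta schedule. This is generic textbook DDPM, not the VerseDiff-UNet specifics; `model` stands in for any U-shaped denoiser.

```python
# Standard DDPM noise-prediction training step (generic sketch, not the
# VerseDiff-UNet specifics). `model(x_t, t)` is any U-shaped denoiser
# trained to predict the noise that was added to x0.
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)              # linear beta schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)     # cumulative alpha_bar_t

def ddpm_loss(model, x0: torch.Tensor) -> torch.Tensor:
    t = torch.randint(0, T, (x0.size(0),), device=x0.device)
    noise = torch.randn_like(x0)
    ab = alphas_bar.to(x0.device)[t].view(-1, *([1] * (x0.dim() - 1)))
    # Forward diffusion: x_t = sqrt(ab) * x0 + sqrt(1 - ab) * noise.
    x_t = ab.sqrt() * x0 + (1.0 - ab).sqrt() * noise
    return F.mse_loss(model(x_t, t), noise)        # epsilon-prediction loss
```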
- Pose Modulated Avatars from Video [22.395774558845336]
We develop a two-branch neural network that is adaptive and explicit in the frequency domain.
The first branch is a graph neural network that models correlations among body parts locally.
The second branch combines these correlation features into a set of global frequencies and then modulates the feature encoding; a sketch of this two-branch design follows this entry.
arXiv Detail & Related papers (2023-08-23T06:49:07Z)
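A rough sketch of the two-branch idea described in the entry above: a stand-in graph layer models local correlations among body parts, and a pooled read-out produces global frequency scales that modulate a Fourier feature encoding of query points. The shapes, the pooling, and the modulation form are assumptions from the blurb, not the paper's architecture.

```python
# Hypothetical two-branch sketch: a stand-in graph layer models local
# body-part correlations; pooled features set global frequency scales
# that modulate a Fourier encoding of query points. All details assumed.
import torch
import torch.nn as nn

class PoseModulatedEncoding(nn.Module):
    def __init__(self, part_dim: int, num_bands: int = 6):
        super().__init__()
        self.gnn = nn.Linear(part_dim, part_dim)   # stand-in for one GNN layer
        self.to_freq = nn.Linear(part_dim, num_bands)
        self.register_buffer("bands", 2.0 ** torch.arange(num_bands))

    def forward(self, points, part_feats, adjacency):
        # points: (B, N, 3) query points; part_feats: (B, P, part_dim);
        # adjacency: (P, P) body-part connectivity.
        h = torch.relu(adjacency @ self.gnn(part_feats))     # local message pass
        scales = torch.sigmoid(self.to_freq(h.mean(dim=1)))  # (B, num_bands)
        freqs = scales * self.bands                          # modulated bands
        ang = points[..., None] * freqs[:, None, None, :]    # (B, N, 3, bands)
        return torch.cat([ang.sin(), ang.cos()], dim=-1).flatten(2)
```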
- Person Image Synthesis via Denoising Diffusion Model [116.34633988927429]
We show how denoising diffusion models can be applied for high-fidelity person image synthesis.
Our results on two large-scale benchmarks and a user study demonstrate the photorealism of our proposed approach under challenging scenarios.
arXiv Detail & Related papers (2022-11-22T18:59:50Z)
- SinDiffusion: Learning a Diffusion Model from a Single Natural Image [159.4285444680301]
We present SinDiffusion, which leverages denoising diffusion models to capture the internal distribution of patches from a single natural image.
It is based on two core designs. First, SinDiffusion is trained with a single model at a single scale instead of multiple models with progressive growing of scales.
Second, we identify that a patch-level receptive field of the diffusion network is crucial and effective for capturing the image's patch statistics.
arXiv Detail & Related papers (2022-11-22T18:00:03Z)
- OCD: Learning to Overfit with Conditional Diffusion Models [95.1828574518325]
We present a dynamic model in which the weights are conditioned on an input sample x.
We learn to match the weights that would be obtained by fine-tuning a base model on x and its label y; a sketch of this conditioning follows this entry.
arXiv Detail & Related papers (2022-10-02T09:42:47Z)
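The "weights conditioned on an input sample" idea in the OCD entry above can be sketched as a hypernetwork that predicts a per-sample weight delta for one layer of a frozen base model. Per the paper's title, the delta generator is a conditional diffusion model; the hypothetical sketch below uses a plain MLP stand-in and shows only the conditioning structure, with all names and sizes assumed.

```python
# Hypothetical sketch of per-sample ("overfitted") weights: a hypernetwork
# maps an input embedding to a weight delta for one frozen layer. In OCD
# the delta is produced by a conditional diffusion model; a plain MLP
# stands in here. The target is delta ~ finetune(base, x, y) - base.
import torch
import torch.nn as nn

class ConditionalWeights(nn.Module):
    def __init__(self, feat_dim: int, out_dim: int):
        super().__init__()
        self.base = nn.Linear(feat_dim, out_dim)   # frozen base head
        for p in self.base.parameters():
            p.requires_grad_(False)
        self.hyper = nn.Sequential(                # sample -> weight delta
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, out_dim * feat_dim),
        )

    def forward(self, x_embed: torch.Tensor) -> torch.Tensor:
        # x_embed: (B, feat_dim) embedding of the input sample x.
        delta = self.hyper(x_embed).view(-1, self.base.out_features,
                                         self.base.in_features)
        w = self.base.weight + delta               # per-sample weight matrix
        return torch.einsum("bof,bf->bo", w, x_embed) + self.base.bias
```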
- DANBO: Disentangled Articulated Neural Body Representations via Graph Neural Networks [12.132886846993108]
High-resolution models enable photo-realistic avatars but at the cost of requiring studio settings not available to end users.
Our goal is to create avatars directly from raw images without relying on expensive studio setups and surface tracking.
We introduce a three-stage method that induces two inductive biases to better disentangle pose-dependent deformation.
arXiv Detail & Related papers (2022-05-03T17:56:46Z)
- Neural Rendering of Humans in Novel View and Pose from Monocular Video [68.37767099240236]
We introduce a new method that generates photo-realistic humans under novel views and poses given a monocular video as input.
Our method significantly outperforms existing approaches under unseen poses and novel views given monocular videos as input.
arXiv Detail & Related papers (2022-04-04T03:09:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.