GUNet: A Graph Convolutional Network United Diffusion Model for Stable and Diversity Pose Generation
- URL: http://arxiv.org/abs/2409.11689v1
- Date: Wed, 18 Sep 2024 04:05:59 GMT
- Title: GUNet: A Graph Convolutional Network United Diffusion Model for Stable and Diversity Pose Generation
- Authors: Shuowen Liang, Sisi Li, Qingyun Wang, Cen Zhang, Kaiquan Zhu, Tian Yang
- Abstract summary: We propose PoseDiffusion, a framework with GUNet as its main model.
It is the first generative framework for this task based on a diffusion model, and it also includes a series of variants fine-tuned from a Stable Diffusion model.
Results show that PoseDiffusion outperforms existing SoTA algorithms in terms of stability and diversity of text-driven pose skeleton generation.
- Score: 7.0646249774097525
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pose skeleton images are an important reference in pose-controllable image generation. To enrich the source of skeleton images, recent works have investigated the generation of pose skeletons from natural language. These methods are based on GANs. However, it remains challenging to generate diverse, structurally correct, and aesthetically pleasing human pose skeletons from varied textual inputs. To address this problem, we propose PoseDiffusion, a framework with GUNet as its main model. It is the first generative framework for this task based on a diffusion model, and it also includes a series of variants fine-tuned from a Stable Diffusion model. PoseDiffusion demonstrates several desired properties that outperform existing methods. 1) Correct Skeletons. GUNet, the denoising model of PoseDiffusion, is designed to incorporate graph convolutional neural networks. By introducing skeletal information during training, it learns the spatial relationships of the human skeleton. 2) Diversity. We decouple the key points of the skeleton and characterise them separately, and use cross-attention to introduce textual conditions. Experimental results show that PoseDiffusion outperforms existing SoTA algorithms in terms of stability and diversity of text-driven pose skeleton generation. Qualitative analyses further demonstrate its superiority for controllable generation in Stable Diffusion.
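The abstract describes GUNet as a denoiser that incorporates graph convolutions so the network can exploit the skeleton's connectivity. As a minimal sketch of the underlying idea (not the authors' code; the joint layout, edge list, and use of raw 2-D coordinates as node features are illustrative assumptions), one weight-free GCN propagation step over a toy five-joint skeleton looks like this:

```python
import math

# Hypothetical 5-joint skeleton: head(0)-neck(1)-pelvis(2),
# plus left(3) and right(4) shoulders attached to the neck.
EDGES = [(0, 1), (1, 2), (1, 3), (1, 4)]
N = 5  # number of joints

def normalized_adjacency(n, edges):
    """Build A_hat = D^{-1/2} (A + I) D^{-1/2}, the standard GCN
    propagation matrix with self-loops."""
    a = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        a[i][j] = a[j][i] = 1.0
    for i in range(n):
        a[i][i] = 1.0  # self-loop
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

def gcn_propagate(a_hat, x):
    """One weight-free graph-convolution step: each joint's feature
    becomes a degree-normalized average of itself and its neighbours,
    so information flows only along skeletal edges."""
    n, d = len(x), len(x[0])
    return [[sum(a_hat[i][k] * x[k][j] for k in range(n)) for j in range(d)]
            for i in range(n)]

# 2-D joint coordinates used as node features for illustration.
coords = [[0.0, 2.0], [0.0, 1.0], [0.0, 0.0], [-1.0, 1.0], [1.0, 1.0]]
a_hat = normalized_adjacency(N, EDGES)
smoothed = gcn_propagate(a_hat, coords)
```

In a full denoiser, learnable weight matrices and a per-keypoint text cross-attention stage (as the abstract describes) would follow each propagation; the point of the sketch is only that the propagation matrix is fixed by the skeleton topology, which is how structural priors enter the network.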
Related papers
- Free-viewpoint Human Animation with Pose-correlated Reference Selection [31.429327964922184]
Diffusion-based human animation aims to animate a human character based on a source human image as well as driving signals such as a sequence of poses.
Existing approaches are able to generate high-fidelity poses, but struggle with significant viewpoint changes.
We propose a pose-correlated reference selection diffusion network, supporting substantial viewpoint variations in human animation.
arXiv Detail & Related papers (2024-12-23T05:22:44Z)
- From Text to Pose to Image: Improving Diffusion Model Control and Quality [0.5183511047901651]
We introduce a text-to-pose (T2P) generative model alongside a new sampling algorithm, and a new pose adapter that incorporates more pose keypoints for higher pose fidelity.
Together, these two new state-of-the-art models enable, for the first time, a generative text-to-pose-to-image framework for higher pose control in diffusion models.
arXiv Detail & Related papers (2024-11-19T21:34:50Z)
- GRPose: Learning Graph Relations for Human Image Generation with Pose Priors [21.91374799527015]
We propose a framework that delves into the graph relations of pose priors to provide control information for human image generation.
The main idea is to establish a graph topological structure between the pose priors and latent representation of diffusion models.
A pose perception loss is introduced based on a pretrained pose estimation network to minimize the pose differences.
arXiv Detail & Related papers (2024-08-29T13:58:34Z)
- FreeSeg-Diff: Training-Free Open-Vocabulary Segmentation with Diffusion Models [56.71672127740099]
We focus on the task of image segmentation, which is traditionally solved by training models on closed-vocabulary datasets.
We leverage different and relatively small-sized, open-source foundation models for zero-shot open-vocabulary segmentation.
Our approach (dubbed FreeSeg-Diff), which does not rely on any training, outperforms many training-based approaches on both Pascal VOC and COCO datasets.
arXiv Detail & Related papers (2024-03-29T10:38:25Z)
- Coarse-to-Fine Latent Diffusion for Pose-Guided Person Image Synthesis [65.7968515029306]
We propose a novel Coarse-to-Fine Latent Diffusion (CFLD) method for Pose-Guided Person Image Synthesis (PGPIS).
A perception-refined decoder is designed to progressively refine a set of learnable queries and extract semantic understanding of person images as a coarse-grained prompt.
arXiv Detail & Related papers (2024-02-28T06:07:07Z)
- Pose Modulated Avatars from Video [22.395774558845336]
We develop a two-branch neural network that is adaptive and explicit in the frequency domain.
The first branch is a graph neural network that models correlations among body parts locally.
The second branch combines these correlation features into a set of global frequencies and then modulates the feature encoding.
arXiv Detail & Related papers (2023-08-23T06:49:07Z)
- Person Image Synthesis via Denoising Diffusion Model [116.34633988927429]
We show how denoising diffusion models can be applied for high-fidelity person image synthesis.
Our results on two large-scale benchmarks and a user study demonstrate the photorealism of our proposed approach under challenging scenarios.
arXiv Detail & Related papers (2022-11-22T18:59:50Z)
- SinDiffusion: Learning a Diffusion Model from a Single Natural Image [159.4285444680301]
We present SinDiffusion, which leverages denoising diffusion models to capture the internal distribution of patches from a single natural image.
It is based on two core designs. First, SinDiffusion is trained with a single model at a single scale instead of multiple models with progressive growing of scales.
Second, we identify that a patch-level receptive field of the diffusion network is crucial and effective for capturing the image's patch statistics.
arXiv Detail & Related papers (2022-11-22T18:00:03Z)
- OCD: Learning to Overfit with Conditional Diffusion Models [95.1828574518325]
We present a dynamic model in which the weights are conditioned on an input sample x.
We learn to match those weights that would be obtained by finetuning a base model on x and its label y.
arXiv Detail & Related papers (2022-10-02T09:42:47Z)
- DANBO: Disentangled Articulated Neural Body Representations via Graph Neural Networks [12.132886846993108]
High-resolution models enable photo-realistic avatars but at the cost of requiring studio settings not available to end users.
Our goal is to create avatars directly from raw images without relying on expensive studio setups and surface tracking.
We introduce a three-stage method with two inductive biases that better disentangle pose-dependent deformation.
arXiv Detail & Related papers (2022-05-03T17:56:46Z)
- Neural Rendering of Humans in Novel View and Pose from Monocular Video [68.37767099240236]
We introduce a new method that generates photo-realistic humans under novel views and poses given a monocular video as input.
Our method significantly outperforms existing approaches under unseen poses and novel views given monocular videos as input.
arXiv Detail & Related papers (2022-04-04T03:09:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.