Stable-Pose: Leveraging Transformers for Pose-Guided Text-to-Image Generation
- URL: http://arxiv.org/abs/2406.02485v2
- Date: Tue, 05 Nov 2024 09:46:45 GMT
- Title: Stable-Pose: Leveraging Transformers for Pose-Guided Text-to-Image Generation
- Authors: Jiajun Wang, Morteza Ghahremani, Yitong Li, Björn Ommer, Christian Wachinger
- Abstract summary: Stable-Pose is a novel adapter model that introduces a coarse-to-fine attention masking strategy into a vision Transformer.
We leverage the query-key self-attention mechanism of ViTs to explore the interconnections among different anatomical parts in human pose skeletons.
Stable-Pose achieved an AP score of 57.1 on the LAION-Human dataset, an approximately 13% improvement over the established ControlNet technique.
- Score: 32.190055780969466
- Abstract: Controllable text-to-image (T2I) diffusion models have shown impressive performance in generating high-quality visual content through the incorporation of various conditions. Current methods, however, exhibit limited performance when guided by skeleton human poses, especially in complex pose conditions such as side or rear perspectives of human figures. To address this issue, we present Stable-Pose, a novel adapter model that introduces a coarse-to-fine attention masking strategy into a vision Transformer (ViT) to gain accurate pose guidance for T2I models. Stable-Pose is designed to adeptly handle pose conditions within pre-trained Stable Diffusion, providing a refined and efficient way of aligning pose representation during image synthesis. We leverage the query-key self-attention mechanism of ViTs to explore the interconnections among different anatomical parts in human pose skeletons. Masked pose images are used to smoothly refine the attention maps based on target pose-related features in a hierarchical manner, transitioning from coarse to fine levels. Additionally, our loss function is formulated to allocate increased emphasis to the pose region, thereby augmenting the model's precision in capturing intricate pose details. We assessed the performance of Stable-Pose across five public datasets under a wide range of indoor and outdoor human pose scenarios. Stable-Pose achieved an AP score of 57.1 on the LAION-Human dataset, an approximately 13% improvement over the established ControlNet technique. The project page and code are available at https://github.com/ai-med/StablePose.
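The abstract describes two concrete mechanisms: query-key self-attention restricted by a pose mask that is relaxed from coarse to fine across ViT stages, and a diffusion training loss re-weighted toward the pose region. Below is a minimal PyTorch sketch of how such a scheme could look; the function names, tensor shapes, and dilation schedule are illustrative assumptions, not the authors' implementation (see the linked repository for the actual code).

```python
# Illustrative sketch only: all names, shapes, and the dilation schedule are
# assumptions made for this example, not Stable-Pose's actual implementation.
import torch
import torch.nn.functional as F


def coarse_to_fine_masks(pose_map, num_levels=3):
    """Build a hierarchy of pose masks, from heavily dilated (coarse) to
    tight (fine), by max-pooling a binary pose map with shrinking kernels.

    pose_map: (batch, 1, H, W) float map, 1.0 on the rendered skeleton.
    Returns a list of (batch, H*W) boolean token masks, coarsest first.
    """
    masks = []
    for level in range(num_levels):
        kernel = 2 ** (num_levels - level) + 1  # e.g. 9, 5, 3
        dilated = F.max_pool2d(pose_map, kernel, stride=1, padding=kernel // 2)
        masks.append(dilated.flatten(2).squeeze(1) > 0)
    return masks


def masked_self_attention(q, k, v, token_mask):
    """Query-key self-attention in which keys outside the current pose mask
    are blocked, so tokens only aggregate pose-related context.

    q, k, v: (batch, tokens, dim); token_mask: (batch, tokens) bool.
    """
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~token_mask[:, None, :], float("-inf"))
    return F.softmax(scores, dim=-1) @ v


def pose_weighted_loss(eps_pred, eps_true, pose_map, weight=2.0):
    """Noise-prediction MSE with extra emphasis inside the pose region."""
    w = 1.0 + (weight - 1.0) * pose_map  # (batch, 1, H, W), broadcasts over C
    return (w * (eps_pred - eps_true) ** 2).mean()


if __name__ == "__main__":
    b, side, d = 2, 8, 32                    # 8x8 patch grid -> 64 tokens
    q = k = v = torch.randn(b, side * side, d)
    pose_map = (torch.rand(b, 1, side, side) > 0.7).float()
    masks = coarse_to_fine_masks(pose_map)   # coarsest mask first
    out = masked_self_attention(q, k, v, masks[0])
    print(out.shape)                         # torch.Size([2, 64, 32])
```

Under this reading, early ViT stages attend within a broadly dilated neighborhood of the skeleton and later stages tighten the mask, which matches the coarse-to-fine refinement the abstract describes.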
Related papers
- VividPose: Advancing Stable Video Diffusion for Realistic Human Image Animation [79.99551055245071]
We propose VividPose, an end-to-end pipeline that ensures superior temporal stability.
An identity-aware appearance controller integrates additional facial information without compromising other appearance details.
A geometry-aware pose controller utilizes both dense rendering maps from SMPL-X and sparse skeleton maps.
VividPose exhibits superior generalization capabilities on our proposed in-the-wild dataset.
arXiv Detail & Related papers (2024-05-28T13:18:32Z) - Lifting by Image -- Leveraging Image Cues for Accurate 3D Human Pose Estimation [10.374944534302234]
The "lifting from 2D pose" method has been the dominant approach to 3D Human Pose Estimation (3DHPE).
Rich semantic and texture information in images can contribute to a more accurate "lifting" procedure.
In this paper, we offer new insight into the causes of poor generalization and into the effectiveness of image features.
arXiv Detail & Related papers (2023-12-25T07:50:58Z) - RePoseDM: Recurrent Pose Alignment and Gradient Guidance for Pose Guided Image Synthesis [14.50214193838818]
The pose-guided person image synthesis task requires re-rendering a reference image so that the result has a photorealistic appearance and flawless pose transfer.
We propose recurrent pose alignment to provide pose-aligned texture features as conditional guidance.
This helps in learning plausible pose transfer trajectories that result in photorealism and undistorted texture details.
arXiv Detail & Related papers (2023-10-24T15:16:19Z) - PoseVocab: Learning Joint-structured Pose Embeddings for Human Avatar Modeling [30.93155530590843]
We present PoseVocab, a novel pose encoding method that can encode high-fidelity human details.
Given multi-view RGB videos of a character, PoseVocab constructs key poses and latent embeddings based on the training poses.
Experiments show that our method outperforms other state-of-the-art baselines.
arXiv Detail & Related papers (2023-04-25T17:25:36Z) - AdaptivePose++: A Powerful Single-Stage Network for Multi-Person Pose Regression [66.39539141222524]
We propose to represent the human parts as adaptive points and introduce a fine-grained body representation method.
With the proposed body representation, we deliver a compact single-stage multi-person pose regression network, termed AdaptivePose.
We employ AdaptivePose for both 2D and 3D multi-person pose estimation tasks to verify its effectiveness.
arXiv Detail & Related papers (2022-10-08T12:54:20Z) - Single-view 3D Body and Cloth Reconstruction under Complex Poses [37.86174829271747]
We extend existing implicit function-based models to deal with images of humans with arbitrary poses and self-occluded limbs.
We learn an implicit function that maps the input image to a 3D body shape with a low level of detail.
We then learn a displacement map, conditioned on the smoothed surface, which encodes the high-frequency details of the clothes and body.
arXiv Detail & Related papers (2022-05-09T07:34:06Z) - FixMyPose: Pose Correctional Captioning and Retrieval [67.20888060019028]
We introduce a new captioning dataset, FixMyPose, to support automated pose correction systems.
We collect descriptions of correcting a "current" pose to look like a "target" pose.
To avoid ML biases, we maintain a balance across characters with diverse demographics.
arXiv Detail & Related papers (2021-04-04T21:45:44Z) - 3D Human Pose Estimation with Spatial and Temporal Transformers [59.433208652418976]
We present PoseFormer, a purely transformer-based approach for 3D human pose estimation in videos.
Inspired by recent developments in vision transformers, we design a spatial-temporal transformer structure.
We quantitatively and qualitatively evaluate our method on two popular and standard benchmark datasets.
arXiv Detail & Related papers (2021-03-18T18:14:37Z) - Unsupervised 3D Human Pose Representation with Viewpoint and Pose Disentanglement [63.853412753242615]
Learning a good 3D human pose representation is important for human pose-related tasks.
We propose a novel Siamese denoising autoencoder to learn a 3D pose representation.
Our approach achieves state-of-the-art performance on two inherently different tasks.
arXiv Detail & Related papers (2020-07-14T14:25:22Z) - Self-Supervised 3D Human Pose Estimation via Part Guided Novel Image Synthesis [72.34794624243281]
We propose a self-supervised learning framework to disentangle variations from unlabeled video frames.
Our differentiable formalization, bridging the representation gap between the 3D pose and spatial part maps, allows us to operate on videos with diverse camera movements.
arXiv Detail & Related papers (2020-04-09T07:55:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.