UniPose: A Unified Multimodal Framework for Human Pose Comprehension, Generation and Editing
- URL: http://arxiv.org/abs/2411.16781v1
- Date: Mon, 25 Nov 2024 08:06:30 GMT
- Title: UniPose: A Unified Multimodal Framework for Human Pose Comprehension, Generation and Editing
- Authors: Yiheng Li, Ruibing Hou, Hong Chang, Shiguang Shan, Xilin Chen
- Abstract summary: We present UniPose, a framework to comprehend, generate, and edit human poses across various modalities.
Specifically, we apply a pose tokenizer to convert 3D poses into discrete pose tokens, enabling seamless integration into the LLM within a unified vocabulary.
Benefiting from a unified learning strategy, UniPose effectively transfers knowledge across different pose-relevant tasks, adapts to unseen tasks, and exhibits extended capabilities.
- Abstract: Human pose plays a crucial role in the digital age. While recent works have achieved impressive progress in understanding and generating human poses, they often support only a single modality of control signals and operate in isolation, limiting their application in real-world scenarios. This paper presents UniPose, a framework employing Large Language Models (LLMs) to comprehend, generate, and edit human poses across various modalities, including images, text, and 3D SMPL poses. Specifically, we apply a pose tokenizer to convert 3D poses into discrete pose tokens, enabling seamless integration into the LLM within a unified vocabulary. To further enhance the fine-grained pose perception capabilities, we facilitate UniPose with a mixture of visual encoders, among them a pose-specific visual encoder. Benefiting from a unified learning strategy, UniPose effectively transfers knowledge across different pose-relevant tasks, adapts to unseen tasks, and exhibits extended capabilities. This work serves as the first attempt at building a general-purpose framework for pose comprehension, generation, and editing. Extensive experiments highlight UniPose's competitive and even superior performance across various pose-relevant tasks.
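The abstract describes a pose tokenizer that discretizes continuous 3D (SMPL) poses so they can share a vocabulary with an LLM's text tokens. A minimal sketch of that idea, assuming a VQ-style nearest-neighbor codebook (the codebook size, pose dimensionality, and `<pose_k>` token naming are illustrative assumptions, not details from the paper):

```python
import numpy as np

# Hypothetical VQ-style pose tokenizer: a continuous pose vector (here a
# flattened 24-joint x 3 axis-angle SMPL parameterization) is mapped to the
# nearest entry in a codebook. The resulting integer id can be exposed to an
# LLM as a special vocabulary token such as "<pose_17>".

rng = np.random.default_rng(0)
CODEBOOK_SIZE = 256   # number of discrete pose tokens (illustrative)
POSE_DIM = 72         # e.g. 24 SMPL joints x 3 axis-angle parameters

# Random codes stand in for a learned codebook (e.g. from a VQ-VAE encoder).
codebook = rng.normal(size=(CODEBOOK_SIZE, POSE_DIM))

def tokenize_pose(pose: np.ndarray) -> int:
    """Return the index of the codebook entry nearest to the pose."""
    dists = np.linalg.norm(codebook - pose, axis=1)
    return int(np.argmin(dists))

def detokenize(token_id: int) -> np.ndarray:
    """Map a discrete pose token back to its codebook vector."""
    return codebook[token_id]

pose = rng.normal(size=POSE_DIM)
tok = tokenize_pose(pose)
pose_symbol = f"<pose_{tok}>"  # text form shared with the LLM vocabulary
```

In the actual framework the codebook is learned end to end and detokenization reconstructs a full 3D pose; the sketch only illustrates how discretization lets poses travel through a text-token interface.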
Related papers
- PoseEmbroider: Towards a 3D, Visual, Semantic-aware Human Pose Representation [38.958695275774616]
We introduce a new transformer-based model, trained in a retrieval fashion, which can take as input any combination of the aforementioned modalities.
We showcase the potential of such an embroidered pose representation for (1) SMPL regression from image with optional text cue; and (2) on the task of fine-grained instruction generation.
arXiv Detail & Related papers (2024-09-10T14:09:39Z) - QPoser: Quantized Explicit Pose Prior Modeling for Controllable Pose Generation [27.93210245241248]
An explicit pose prior model should satisfy three desirable properties.
QPoser is a controllable explicit pose prior model which guarantees correctness and expressiveness.
QPoser significantly outperforms state-of-the-art approaches in representing expressive and correct poses.
arXiv Detail & Related papers (2023-12-02T10:44:34Z) - ChatPose: Chatting about 3D Human Pose [47.70287492050979]
ChatPose is a framework to understand and reason about 3D human poses from images or textual descriptions.
Our work is motivated by the human ability to intuitively understand postures from a single image or a brief description.
arXiv Detail & Related papers (2023-11-30T18:59:52Z) - VINECS: Video-based Neural Character Skinning [82.39776643541383]
We propose a fully automated approach for creating a fully rigged character with pose-dependent skinning weights.
We show that our approach outperforms state-of-the-art while not relying on dense 4D scans.
arXiv Detail & Related papers (2023-07-03T08:35:53Z) - PoseVocab: Learning Joint-structured Pose Embeddings for Human Avatar
Modeling [30.93155530590843]
We present PoseVocab, a novel pose encoding method that can encode high-fidelity human details.
Given multi-view RGB videos of a character, PoseVocab constructs key poses and latent embeddings based on the training poses.
Experiments show that our method outperforms other state-of-the-art baselines.
arXiv Detail & Related papers (2023-04-25T17:25:36Z) - Real-Time Neural Character Rendering with Pose-Guided Multiplane Images [75.62730144924566]
We propose pose-guided multiplane image (MPI) synthesis which can render an animatable character in real scenes with photorealistic quality.
We use a portable camera rig to capture the multi-view images along with the driving signal for the moving subject.
arXiv Detail & Related papers (2022-04-25T17:51:38Z) - Unsupervised Cross-Modal Alignment for Multi-Person 3D Pose Estimation [52.94078950641959]
We present a deployment friendly, fast bottom-up framework for multi-person 3D human pose estimation.
We adopt a novel neural representation of multi-person 3D pose which unifies the position of person instances with their corresponding 3D pose representation.
We propose a practical deployment paradigm where paired 2D or 3D pose annotations are unavailable.
arXiv Detail & Related papers (2020-08-04T07:54:25Z) - Self-Supervised 3D Human Pose Estimation via Part Guided Novel Image Synthesis [72.34794624243281]
We propose a self-supervised learning framework to disentangle variations from unlabeled video frames.
Our differentiable formalization, bridging the representation gap between the 3D pose and spatial part maps, allows us to operate on videos with diverse camera movements.
arXiv Detail & Related papers (2020-04-09T07:55:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.