UniPose: A Unified Multimodal Framework for Human Pose Comprehension, Generation and Editing
- URL: http://arxiv.org/abs/2411.16781v1
- Date: Mon, 25 Nov 2024 08:06:30 GMT
- Title: UniPose: A Unified Multimodal Framework for Human Pose Comprehension, Generation and Editing
- Authors: Yiheng Li, Ruibing Hou, Hong Chang, Shiguang Shan, Xilin Chen
- Abstract summary: We present UniPose, a framework to comprehend, generate, and edit human poses across various modalities.
Specifically, we apply a pose tokenizer to convert 3D poses into discrete pose tokens, enabling seamless integration into the LLM within a unified vocabulary.
Benefiting from a unified learning strategy, UniPose effectively transfers knowledge across different pose-relevant tasks, adapts to unseen tasks, and exhibits extended capabilities.
- Score: 79.68232381605661
- Abstract: Human pose plays a crucial role in the digital age. While recent works have achieved impressive progress in understanding and generating human poses, they often support only a single modality of control signals and operate in isolation, limiting their application in real-world scenarios. This paper presents UniPose, a framework employing Large Language Models (LLMs) to comprehend, generate, and edit human poses across various modalities, including images, text, and 3D SMPL poses. Specifically, we apply a pose tokenizer to convert 3D poses into discrete pose tokens, enabling seamless integration into the LLM within a unified vocabulary. To further enhance the fine-grained pose perception capabilities, we facilitate UniPose with a mixture of visual encoders, among them a pose-specific visual encoder. Benefiting from a unified learning strategy, UniPose effectively transfers knowledge across different pose-relevant tasks, adapts to unseen tasks, and exhibits extended capabilities. This work serves as the first attempt at building a general-purpose framework for pose comprehension, generation, and editing. Extensive experiments highlight UniPose's competitive and even superior performance across various pose-relevant tasks.
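The abstract's core mechanism, converting continuous 3D poses into discrete tokens that share a vocabulary with the LLM's text tokens, can be illustrated with a minimal vector-quantization sketch. This is an assumption-laden illustration, not UniPose's actual implementation: the codebook here is random rather than learned, and all names (`tokenize_pose`, `to_llm_token`, the sizes) are hypothetical.

```python
import numpy as np

# Illustrative VQ-style pose tokenization: a continuous pose vector is
# quantized to its nearest codebook entry, and the entry's index is offset
# past the text vocabulary so pose and text tokens live in one vocabulary.

rng = np.random.default_rng(0)

CODEBOOK_SIZE = 256      # number of discrete pose tokens (hypothetical)
POSE_DIM = 72            # e.g. 24 SMPL joints x 3 axis-angle parameters
TEXT_VOCAB_SIZE = 32000  # size of the base LLM text vocabulary (hypothetical)

# Stand-in for a learned codebook of pose prototypes.
codebook = rng.normal(size=(CODEBOOK_SIZE, POSE_DIM))

def tokenize_pose(pose: np.ndarray) -> int:
    """Quantize one pose vector to the id of its nearest codebook entry."""
    dists = np.linalg.norm(codebook - pose, axis=1)
    return int(np.argmin(dists))

def to_llm_token(pose_token: int) -> int:
    """Offset pose ids past the text vocabulary, forming a unified vocabulary."""
    return TEXT_VOCAB_SIZE + pose_token

def detokenize(pose_token: int) -> np.ndarray:
    """Decode a pose token back to its (quantized) continuous pose."""
    return codebook[pose_token]

pose = rng.normal(size=POSE_DIM)
tok = tokenize_pose(pose)
unified_id = to_llm_token(tok)
recon = detokenize(tok)
```

Under this scheme the LLM can emit pose tokens inline with text, and a decoder maps them back to (quantized) 3D poses; the real system would train the codebook end-to-end rather than sample it randomly.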
Related papers
- UniReal: Universal Image Generation and Editing via Learning Real-world Dynamics [74.10447111842504]
UniReal is a unified framework designed to address various image generation and editing tasks.
Inspired by recent video generation models, we propose a unifying approach that treats image-level tasks as discontinuous video generation.
Although designed for image-level tasks, we leverage videos as a scalable source for universal supervision.
arXiv Detail & Related papers (2024-12-10T18:59:55Z)
- PoseEmbroider: Towards a 3D, Visual, Semantic-aware Human Pose Representation [38.958695275774616]
We introduce a new transformer-based model, trained in a retrieval fashion, which can take as input any combination of the aforementioned modalities.
We showcase the potential of such an embroidered pose representation for (1) SMPL regression from image with optional text cue; and (2) on the task of fine-grained instruction generation.
arXiv Detail & Related papers (2024-09-10T14:09:39Z)
- QPoser: Quantized Explicit Pose Prior Modeling for Controllable Pose Generation [27.93210245241248]
A desirable explicit pose prior model should satisfy three key abilities.
QPoser is a controllable explicit pose prior model which guarantees correctness and expressiveness.
QPoser significantly outperforms state-of-the-art approaches in representing expressive and correct poses.
arXiv Detail & Related papers (2023-12-02T10:44:34Z)
- ChatPose: Chatting about 3D Human Pose [47.70287492050979]
ChatPose is a framework to understand and reason about 3D human poses from images or textual descriptions.
Our work is motivated by the human ability to intuitively understand postures from a single image or a brief description.
arXiv Detail & Related papers (2023-11-30T18:59:52Z)
- VINECS: Video-based Neural Character Skinning [82.39776643541383]
We propose a fully automated approach for creating a fully rigged character with pose-dependent skinning weights.
We show that our approach outperforms state-of-the-art while not relying on dense 4D scans.
arXiv Detail & Related papers (2023-07-03T08:35:53Z)
- PoseVocab: Learning Joint-structured Pose Embeddings for Human Avatar Modeling [30.93155530590843]
We present PoseVocab, a novel pose encoding method that can encode high-fidelity human details.
Given multi-view RGB videos of a character, PoseVocab constructs key poses and latent embeddings based on the training poses.
Experiments show that our method outperforms other state-of-the-art baselines.
arXiv Detail & Related papers (2023-04-25T17:25:36Z)
- Real-Time Neural Character Rendering with Pose-Guided Multiplane Images [75.62730144924566]
We propose pose-guided multiplane image (MPI) synthesis which can render an animatable character in real scenes with photorealistic quality.
We use a portable camera rig to capture the multi-view images along with the driving signal for the moving subject.
arXiv Detail & Related papers (2022-04-25T17:51:38Z)
- Self-Supervised 3D Human Pose Estimation via Part Guided Novel Image Synthesis [72.34794624243281]
We propose a self-supervised learning framework to disentangle variations from unlabeled video frames.
Our differentiable formalization, bridging the representation gap between the 3D pose and spatial part maps, allows us to operate on videos with diverse camera movements.
arXiv Detail & Related papers (2020-04-09T07:55:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.