CLIP-Hand3D: Exploiting 3D Hand Pose Estimation via Context-Aware Prompting
- URL: http://arxiv.org/abs/2309.16140v1
- Date: Thu, 28 Sep 2023 03:40:37 GMT
- Title: CLIP-Hand3D: Exploiting 3D Hand Pose Estimation via Context-Aware Prompting
- Authors: Shaoxiang Guo, Qing Cai, Lin Qi, Junyu Dong
- Abstract summary: We make one of the first attempts to propose a novel 3D hand pose estimator for monocular images, dubbed CLIP-Hand3D.
We maximize semantic consistency for a pair of pose-text features following a CLIP-based contrastive learning paradigm.
Experiments on several public hand benchmarks show that the proposed model attains significantly faster inference than comparable methods while achieving state-of-the-art performance.
- Score: 38.678165053219644
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contrastive Language-Image Pre-training (CLIP) has begun to emerge in many
computer vision tasks and has achieved promising performance. However, it
remains underexplored whether CLIP can be generalized to 3D hand pose
estimation, as bridging text prompts with pose-aware features presents
significant challenges due to the discrete nature of joint positions in 3D
space. In this paper, we make one of the first attempts to extend CLIP to 3D hand
pose estimation, proposing a novel monocular estimator, dubbed CLIP-Hand3D, which
successfully bridges the gap between text prompts and irregular, fine-grained pose
distributions. In particular, the distribution order of hand joints along various
3D space directions is derived from pose labels, forming corresponding text
prompts that are subsequently encoded into text representations.
Simultaneously, the 21 hand joints in 3D space are retrieved, and their spatial
distribution (along the x, y, and z axes) is encoded to form pose-aware features.
Subsequently, we maximize semantic consistency for a pair of pose-text features
following a CLIP-based contrastive learning paradigm. Furthermore, a
coarse-to-fine mesh regressor is designed, which is capable of effectively
querying joint-aware cues from the feature pyramid. Extensive experiments on
several public hand benchmarks show that the proposed model attains a
significantly faster inference speed while achieving state-of-the-art
performance compared to methods using backbones of similar scale.
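
The abstract describes deriving the distribution order of hand joints along each 3D axis from pose labels and turning it into text prompts. Below is a minimal sketch of one way such ordering prompts could be constructed; the joint names, prompt template, and `pose_to_prompts` helper are illustrative assumptions, since the paper's exact prompt format is not given here.

```python
import numpy as np

# Hypothetical joint naming and prompt template; the paper's exact
# wording is not specified in the abstract, so this is a sketch only.
JOINT_NAMES = [
    "wrist",
    "thumb_mcp", "thumb_pip", "thumb_dip", "thumb_tip",
    "index_mcp", "index_pip", "index_dip", "index_tip",
    "middle_mcp", "middle_pip", "middle_dip", "middle_tip",
    "ring_mcp", "ring_pip", "ring_dip", "ring_tip",
    "pinky_mcp", "pinky_pip", "pinky_dip", "pinky_tip",
]

def pose_to_prompts(joints_xyz: np.ndarray) -> list:
    """Turn a (21, 3) pose label into one joint-ordering prompt per axis."""
    assert joints_xyz.shape == (21, 3)
    prompts = []
    for axis, name in enumerate(("x", "y", "z")):
        order = np.argsort(joints_xyz[:, axis])  # joint indices, small to large
        ordered = ", ".join(JOINT_NAMES[i] for i in order)
        prompts.append(
            f"Along the {name} axis, the hand joints are ordered as: {ordered}."
        )
    return prompts
```

Each prompt would then be encoded by a text encoder (e.g., CLIP's) into the text representations the abstract mentions.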
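
For the alignment step, the abstract states that semantic consistency is maximized for pose-text feature pairs under a CLIP-based contrastive paradigm. The following is a minimal PyTorch sketch of the standard symmetric CLIP objective applied to such pairs; the function name and temperature value are assumptions, not the paper's reported settings.

```python
import torch
import torch.nn.functional as F

def pose_text_contrastive_loss(pose_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched (pose, text) pairs.

    pose_feats, text_feats: (B, D) embeddings; index i of each tensor
    corresponds to the same hand sample.
    """
    pose = F.normalize(pose_feats, dim=-1)
    text = F.normalize(text_feats, dim=-1)
    logits = pose @ text.t() / temperature           # (B, B) similarity matrix
    targets = torch.arange(pose.size(0), device=pose.device)
    loss_p2t = F.cross_entropy(logits, targets)      # pose -> text direction
    loss_t2p = F.cross_entropy(logits.t(), targets)  # text -> pose direction
    return 0.5 * (loss_p2t + loss_t2p)
```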
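
The abstract also mentions a coarse-to-fine mesh regressor that queries joint-aware cues from the feature pyramid. One plausible realization, sketched below under stated assumptions, uses learnable joint queries refined by cross-attention over successively finer pyramid levels; the actual model regresses a full hand mesh, and every layer choice and dimension here is an assumption.

```python
import torch
import torch.nn as nn

class CoarseToFineJointQuery(nn.Module):
    """Sketch: learnable joint queries attend to coarse-to-fine
    feature-pyramid tokens; sizes are illustrative, not from the paper."""

    def __init__(self, num_joints=21, dim=256, num_levels=3, heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_joints, dim))
        self.attn = nn.ModuleList(
            nn.MultiheadAttention(dim, heads, batch_first=True)
            for _ in range(num_levels)
        )
        self.head = nn.Linear(dim, 3)  # per-joint 3D coordinate

    def forward(self, pyramid):
        """pyramid: list of (B, H_l*W_l, dim) token maps, coarse to fine."""
        b = pyramid[0].size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        for attn, feats in zip(self.attn, pyramid):
            refined, _ = attn(q, feats, feats)  # joints query image tokens
            q = q + refined                     # residual refinement per level
        return self.head(q)                     # (B, 21, 3) joint estimates
```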
Related papers
- UPose3D: Uncertainty-Aware 3D Human Pose Estimation with Cross-View and Temporal Cues [55.69339788566899]
UPose3D is a novel approach for multi-view 3D human pose estimation.
It improves robustness and flexibility without requiring direct 3D annotations.
arXiv Detail & Related papers (2024-04-23T00:18:00Z)
- A Single 2D Pose with Context is Worth Hundreds for 3D Human Pose Estimation [18.72362803593654]
The dominant paradigm in 3D human pose estimation that lifts a 2D pose sequence to 3D heavily relies on long-term temporal clues.
This can be attributed to their inherent inability to perceive spatial context as plain 2D joint coordinates carry no visual cues.
We propose a straightforward yet powerful solution: leveraging the readily available intermediate visual representations produced by off-the-shelf (pre-trained) 2D pose detectors.
arXiv Detail & Related papers (2023-11-06T18:04:13Z)
- What's in your hands? 3D Reconstruction of Generic Objects in Hands [49.12461675219253]
Our work aims to reconstruct hand-held objects given a single RGB image.
In contrast to prior works that typically assume known 3D templates and reduce the problem to 3D pose estimation, our work reconstructs generic hand-held objects without knowing their 3D templates.
arXiv Detail & Related papers (2022-04-14T17:59:02Z)
- 3D Hand Pose and Shape Estimation from RGB Images for Improved Keypoint-Based Hand-Gesture Recognition [25.379923604213626]
This paper presents a keypoint-based end-to-end framework for 3D hand pose and shape estimation.
It is successfully applied to the hand-gesture recognition task as a case study.
arXiv Detail & Related papers (2021-09-28T17:07:43Z)
- MM-Hand: 3D-Aware Multi-Modal Guided Hand Generative Network for 3D Hand Pose Synthesis [81.40640219844197]
Estimating the 3D hand pose from a monocular RGB image is important but challenging.
A solution is training on large-scale RGB hand images with accurate 3D hand keypoint annotations.
We have developed a learning-based approach to synthesize realistic, diverse, and 3D pose-preserving hand images.
arXiv Detail & Related papers (2020-10-02T18:27:34Z)
- Unsupervised Cross-Modal Alignment for Multi-Person 3D Pose Estimation [52.94078950641959]
We present a deployment-friendly, fast, bottom-up framework for multi-person 3D human pose estimation.
We adopt a novel neural representation of multi-person 3D pose which unifies the position of person instances with their corresponding 3D pose representation.
We propose a practical deployment paradigm where paired 2D or 3D pose annotations are unavailable.
arXiv Detail & Related papers (2020-08-04T07:54:25Z)
- Self-Supervised 3D Human Pose Estimation via Part Guided Novel Image Synthesis [72.34794624243281]
We propose a self-supervised learning framework to disentangle variations from unlabeled video frames.
Our differentiable formalization, bridging the representation gap between the 3D pose and spatial part maps, allows us to operate on videos with diverse camera movements.
arXiv Detail & Related papers (2020-04-09T07:55:01Z)