HandMIM: Pose-Aware Self-Supervised Learning for 3D Hand Mesh Estimation
- URL: http://arxiv.org/abs/2307.16061v1
- Date: Sat, 29 Jul 2023 19:46:06 GMT
- Title: HandMIM: Pose-Aware Self-Supervised Learning for 3D Hand Mesh Estimation
- Authors: Zuyan Liu, Gaojie Lin, Congyi Wang, Min Zheng, Feida Zhu
- Abstract summary: We propose a novel self-supervised pre-training strategy for regressing 3D hand mesh parameters.
Our proposed approach, named HandMIM, achieves strong performance on various hand mesh estimation tasks.
- Score: 5.888156950854715
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With an enormous number of hand images generated over time, unleashing pose
knowledge from unlabeled images for supervised hand mesh estimation is an
emerging yet challenging topic. To alleviate this issue, semi-supervised and
self-supervised approaches have been proposed, but they are limited by the
reliance on detection models or conventional ResNet backbones. In this paper,
inspired by the rapid progress of Masked Image Modeling (MIM) in visual
classification tasks, we propose a novel self-supervised pre-training strategy
for regressing 3D hand mesh parameters. Our approach involves a unified and
multi-granularity strategy that includes a pseudo keypoint alignment module in
the teacher-student framework for learning pose-aware semantic class tokens.
For patch tokens with detailed locality, we adopt a self-distillation scheme
between the teacher and student networks based on MIM pre-training. To better fit
low-level regression tasks, we incorporate pixel reconstruction tasks for
multi-level representation learning. Additionally, we design a strong pose
estimation baseline using a simple vanilla vision Transformer (ViT) as the
backbone and attach a PyMAF head on top of the output tokens for regression. Extensive
experiments demonstrate that our proposed approach, named HandMIM, achieves
strong performance on various hand mesh estimation tasks. Notably, HandMIM
outperforms specially optimized architectures, achieving 6.29 mm and 8.00 mm
PAVPE (Procrustes-aligned vertex position error) on the challenging FreiHAND and HO3Dv2 test sets,
respectively, establishing new state-of-the-art records on 3D hand mesh
estimation.
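The multi-granularity objective described in the abstract (pose-aware class-token alignment, masked patch-token distillation, and pixel reconstruction under a teacher-student framework) can be sketched in toy form. Everything below is a hypothetical stand-in, not the paper's implementation: a single linear layer replaces the ViT encoder, mean-pooling replaces the class token, plain L2 losses replace the actual objectives, and the teacher is updated by a standard exponential moving average.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy shapes: 16 patch tokens of dim 32, embedding dim 64.
N_TOKENS, D_IN, D_EMB = 16, 32, 64

def encode(tokens, W):
    """Embed patch tokens; mean-pool as a stand-in for the [CLS] token."""
    patch_emb = tokens @ W              # (N_TOKENS, D_EMB) patch embeddings
    cls_emb = patch_emb.mean(axis=0)    # pose-aware semantic summary
    return cls_emb, patch_emb

W_student = rng.normal(0.0, 0.1, (D_IN, D_EMB))
W_teacher = W_student.copy()            # teacher starts as a copy of the student

tokens = rng.normal(size=(N_TOKENS, D_IN))
mask = rng.permutation(N_TOKENS) < int(0.6 * N_TOKENS)  # mask 60% of tokens
student_in = np.where(mask[:, None], 0.0, tokens)       # student: masked view

cls_s, patch_s = encode(student_in, W_student)
cls_t, patch_t = encode(tokens, W_teacher)              # teacher: full view

# Multi-granularity losses (all plain L2 here, purely for illustration):
loss_cls = np.mean((cls_s - cls_t) ** 2)                    # class-token alignment
loss_patch = np.mean((patch_s[mask] - patch_t[mask]) ** 2)  # masked-token distillation
recon = patch_s @ W_student.T                               # toy pixel-reconstruction head
loss_pix = np.mean((recon[mask] - tokens[mask]) ** 2)
loss = loss_cls + loss_patch + loss_pix

# The teacher tracks the student by exponential moving average (EMA):
m = 0.996
W_teacher = m * W_teacher + (1 - m) * W_student
```

The design choice the sketch highlights is that the three losses operate at different granularities of the same encoder output: a global token for pose semantics, per-patch tokens for locality, and raw-pixel targets for the low-level regression regime.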
Related papers
- Fine-Grained Multi-View Hand Reconstruction Using Inverse Rendering [11.228453237603834]
We present a novel fine-grained multi-view hand mesh reconstruction method that leverages inverse rendering to restore hand poses and intricate details.
We also introduce a novel Hand Albedo and Mesh (HAM) optimization module to refine both the hand mesh and textures.
Our proposed approach outperforms the state-of-the-art methods on both reconstruction accuracy and rendering quality.
arXiv Detail & Related papers (2024-07-08T07:28:24Z)
- Mesh Represented Recycle Learning for 3D Hand Pose and Mesh Estimation [3.126179109712709]
We propose a mesh represented recycle learning strategy for 3D hand pose and mesh estimation.
To be specific, a hand pose and mesh estimation model first predicts parametric 3D hand annotations.
Second, synthetic hand images are generated with self-estimated hand mesh representations.
Third, the synthetic hand images are fed into the same model again.
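The three steps above can be sketched as a toy self-consistency loop. The "model" and "renderer" below are hypothetical one-line stand-ins, not the paper's networks; the point is only the data flow: real image → parameters → synthetic image → parameters again.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy "model": a 64-d image vector mapped to 10 parametric mesh annotations.
W = rng.normal(0.0, 0.1, (64, 10))

def estimate(img, W):
    return img @ W                       # step 1: predict 3D hand annotations

def render(params):
    # step 2: toy "renderer" producing a synthetic image from mesh params
    return np.tile(params, 7)[:64]

real_img = rng.normal(size=64)
params_real = estimate(real_img, W)
synth_img = render(params_real)          # synthetic hand image
params_synth = estimate(synth_img, W)    # step 3: same model, recycled input

# A self-consistency objective between the two predictions closes the loop.
recycle_loss = np.mean((params_real - params_synth) ** 2)
```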
arXiv Detail & Related papers (2023-10-18T09:50:09Z)
- MOCA: Self-supervised Representation Learning by Predicting Masked Online Codebook Assignments [72.6405488990753]
Self-supervised learning can mitigate the data-hungry training requirements of Vision Transformer networks.
We propose a single-stage and standalone method, MOCA, which unifies both desired properties.
We achieve new state-of-the-art results on low-shot settings and strong experimental results in various evaluation protocols.
arXiv Detail & Related papers (2023-07-18T15:46:20Z)
- PixMIM: Rethinking Pixel Reconstruction in Masked Image Modeling [83.67628239775878]
Masked Image Modeling (MIM) has achieved promising progress with the advent of Masked Autoencoders (MAE) and BEiT.
This paper undertakes a fundamental analysis of MIM from the perspective of pixel reconstruction.
We propose a remarkably simple and effective method, PixMIM, that entails two strategies.
arXiv Detail & Related papers (2023-03-04T13:38:51Z)
- CroCo: Self-Supervised Pre-training for 3D Vision Tasks by Cross-View Completion [20.121597331207276]
Masked Image Modeling (MIM) has recently been established as a potent pre-training paradigm.
In this paper we seek to learn representations that transfer well to a wide variety of 3D vision and lower-level geometric downstream tasks.
Our experiments show that our pretext task leads to significantly improved performance for monocular 3D vision downstream tasks.
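As a rough illustration of cross-view completion (with made-up shapes, and a single softmax attention step standing in for the cross-attention decoder), one can reconstruct the masked patches of a first view from a second, overlapping view:

```python
import numpy as np

rng = np.random.default_rng(3)
N, D = 8, 16                                     # 8 patch tokens per view, dim 16
view1 = rng.normal(size=(N, D))
view2 = view1 + 0.05 * rng.normal(size=(N, D))   # overlapping second view

mask = rng.permutation(N) < N // 2               # mask half of view 1
view1_vis = np.where(mask[:, None], 0.0, view1)  # masked tokens zeroed out

# One dot-product attention step from view-1 slots onto view-2 tokens,
# a toy stand-in for the cross-attention decoder.
scores = view1_vis @ view2.T / np.sqrt(D)
scores = np.exp(scores - scores.max(axis=1, keepdims=True))
attn = scores / scores.sum(axis=1, keepdims=True)
recon = attn @ view2                             # completed view-1 tokens

# Reconstruction loss only on the masked positions.
loss = np.mean((recon[mask] - view1[mask]) ** 2)
```

The second view gives the pretext task geometric grounding that single-view masked modeling lacks, which is the intuition behind its transfer to 3D downstream tasks.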
arXiv Detail & Related papers (2022-10-19T16:50:36Z)
- Exploring The Role of Mean Teachers in Self-supervised Masked Auto-Encoders [64.03000385267339]
Masked image modeling (MIM) has become a popular strategy for self-supervised learning (SSL) of visual representations with Vision Transformers.
We present a simple SSL method, the Reconstruction-Consistent Masked Auto-Encoder (RC-MAE) by adding an EMA teacher to MAE.
RC-MAE converges faster and requires less memory usage than state-of-the-art self-distillation methods during pre-training.
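The EMA-teacher idea can be sketched in a few lines of toy numpy. A single linear map stands in for the MAE encoder-decoder, and the shapes and 75% mask ratio are illustrative assumptions: the teacher receives the same masked input, and its reconstruction serves as an extra consistency target alongside the usual pixel loss.

```python
import numpy as np

rng = np.random.default_rng(2)
D = 32
W_s = rng.normal(0.0, 0.1, (D, D))          # student "autoencoder" weights
W_t = W_s.copy()                            # teacher initialized from the student

x = rng.normal(size=D)
mask = rng.permutation(D) < int(0.75 * D)   # mask 75% of the input
x_masked = np.where(mask, 0.0, x)

rec_s = x_masked @ W_s                      # student reconstruction
rec_t = x_masked @ W_t                      # teacher reconstruction (same view)

loss_mae = np.mean((rec_s[mask] - x[mask]) ** 2)   # usual MAE pixel target
loss_consist = np.mean((rec_s - rec_t) ** 2)       # teacher-consistency term
loss = loss_mae + loss_consist

# After each optimizer step, the teacher tracks the student via EMA:
m = 0.999
W_t = m * W_t + (1 - m) * W_s
```

Because the teacher is a slowly moving average of the student rather than a separate network, the consistency term adds almost no memory beyond one extra weight copy, which matches the efficiency claim above.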
arXiv Detail & Related papers (2022-10-05T08:08:55Z)
- Direct Multi-view Multi-person 3D Pose Estimation [138.48139701871213]
We present Multi-view Pose transformer (MvP) for estimating multi-person 3D poses from multi-view images.
MvP directly regresses the multi-person 3D poses in a clean and efficient way, without relying on intermediate tasks.
We show experimentally that our MvP model outperforms the state-of-the-art methods on several benchmarks while being much more efficient.
arXiv Detail & Related papers (2021-11-07T13:09:20Z)
- Graph-Based 3D Multi-Person Pose Estimation Using Multi-View Images [79.70127290464514]
We decompose the task into two stages: person localization and pose estimation.
We then propose three task-specific graph neural networks for effective message passing.
Our approach achieves state-of-the-art performance on CMU Panoptic and Shelf datasets.
arXiv Detail & Related papers (2021-09-13T11:44:07Z)
- Hand Image Understanding via Deep Multi-Task Learning [34.515382305252814]
We propose a novel Hand Image Understanding (HIU) framework to extract comprehensive information of the hand object from a single RGB image.
Our method significantly outperforms the state-of-the-art approaches on various widely-used datasets.
arXiv Detail & Related papers (2021-07-24T16:28:06Z)
- Weakly-Supervised 3D Human Pose Learning via Multi-view Images in the Wild [101.70320427145388]
We propose a weakly-supervised approach that does not require 3D annotations and learns to estimate 3D poses from unlabeled multi-view data.
We evaluate our proposed approach on two large scale datasets.
arXiv Detail & Related papers (2020-03-17T08:47:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.