Generative Model-Based Loss to the Rescue: A Method to Overcome
Annotation Errors for Depth-Based Hand Pose Estimation
- URL: http://arxiv.org/abs/2007.03073v2
- Date: Sun, 30 May 2021 11:36:43 GMT
- Authors: Jiayi Wang, Franziska Mueller, Florian Bernard, Christian Theobalt
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose to use a model-based generative loss for training hand pose
estimators on depth images based on a volumetric hand model. This additional
loss allows training of a hand pose estimator that accurately infers the entire
set of 21 hand keypoints while only using supervision for 6 easy-to-annotate
keypoints (fingertips and wrist). We show that our partially-supervised method
achieves results that are comparable to those of fully-supervised methods which
enforce articulation consistency. Moreover, for the first time we demonstrate
that such an approach can be used to train on datasets that have erroneous
annotations, i.e. "ground truth" with notable measurement errors, while
obtaining predictions that explain the depth images better than the given
"ground truth".
Related papers
- Weakly-Supervised 3D Hand Reconstruction with Knowledge Prior and Uncertainty Guidance
Fully-supervised monocular 3D hand reconstruction is often difficult because capturing the requisite 3D data entails deploying specialized equipment in a controlled environment.
We introduce a weakly-supervised method that avoids such requirements by leveraging fundamental principles well-established in the understanding of the human hand's unique structure and functionality.
Our method achieves a nearly 21% performance improvement on the widely adopted FreiHAND dataset.
arXiv Detail & Related papers (2024-07-17T04:05:34Z)
- Self-supervised 3D Human Pose Estimation from a Single Image
We propose a new self-supervised method for predicting 3D human body pose from a single image.
The prediction network is trained from a dataset of unlabelled images depicting people in typical poses and a set of unpaired 2D poses.
arXiv Detail & Related papers (2023-04-05T10:26:21Z)
- Deformer: Dynamic Fusion Transformer for Robust Hand Pose Estimation
Existing methods often struggle to generate plausible hand poses when the hand is heavily occluded or blurred.
In videos, the movements of the hand allow us to observe various parts of the hand that may be occluded or blurred in a single frame.
We propose the Deformer: a framework that implicitly reasons about the relationship between hand parts within the same image.
arXiv Detail & Related papers (2023-03-09T02:24:30Z)
- Monitored Distillation for Positive Congruent Depth Completion
We propose a method to infer a dense depth map from a single image, its calibration, and the associated sparse point cloud.
In order to leverage existing models that produce putative depth maps (teacher models), we propose an adaptive knowledge distillation approach.
We consider the scenario of a blind ensemble where we do not have access to ground truth for model selection nor training.
arXiv Detail & Related papers (2022-03-30T03:35:56Z)
- Predict, Prevent, and Evaluate: Disentangled Text-Driven Image Manipulation Empowered by Pre-Trained Vision-Language Model
We propose a novel framework, i.e., Predict, Prevent, and Evaluate (PPE), for disentangled text-driven image manipulation.
Our method approaches the targets by exploiting the power of the large scale pre-trained vision-language model CLIP.
Extensive experiments show that the proposed PPE framework achieves much better quantitative and qualitative results than the up-to-date StyleCLIP baseline.
arXiv Detail & Related papers (2021-11-26T06:49:26Z)
- Adversarial Motion Modelling helps Semi-supervised Hand Pose Estimation
We propose to combine ideas from adversarial training and motion modelling to tap into unlabeled videos.
We show that an adversarial objective leads to better properties of the hand pose estimator via semi-supervised training on unlabeled video sequences.
The main advantage of our approach is that we can make use of unpaired videos and joint sequence data both of which are much easier to attain than paired training data.
arXiv Detail & Related papers (2021-06-10T17:50:19Z)
- Calibrating Self-supervised Monocular Depth Estimation
In recent years, many methods have demonstrated the ability of neural networks to learn depth and pose changes in a sequence of images, using only self-supervision as the training signal.
We show that by incorporating prior information about the camera configuration and the environment, we can remove the scale ambiguity and predict depth directly, still using the self-supervised formulation and without relying on any additional sensors.
arXiv Detail & Related papers (2020-09-16T14:35:45Z)
- Self-Supervised Learning for Monocular Depth Estimation from Aerial Imagery
We present a method for self-supervised learning for monocular depth estimation from aerial imagery.
For this, we only use an image sequence from a single moving camera and learn to simultaneously estimate depth and pose information.
By sharing the weights between pose and depth estimation, we achieve a relatively small model, which favors real-time application.
arXiv Detail & Related papers (2020-08-17T12:20:46Z)
- Leveraging Photometric Consistency over Time for Sparsely Supervised Hand-Object Reconstruction
We present a method to leverage photometric consistency across time when annotations are only available for a sparse subset of frames in a video.
Our model is trained end-to-end on color images to jointly reconstruct hands and objects in 3D by inferring their poses.
We achieve state-of-the-art results on 3D hand-object reconstruction benchmarks and demonstrate that our approach allows us to improve the pose estimation accuracy.
arXiv Detail & Related papers (2020-04-28T12:03:14Z)
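A photometric-consistency term of the kind the entry above relies on can be sketched generically: backproject reference pixels using their depth, transform them with the relative camera pose, reproject into the other frame, and penalize color differences. This is a simplified stand-in (single-channel images, nearest-neighbor sampling, known intrinsics `K` and relative pose `R`, `t` are all assumptions), not that paper's exact formulation:

```python
import numpy as np

def warp_photometric_loss(img_ref, img_src, depth_ref, K, R, t):
    """Mean absolute color difference between reference pixels and the
    source-frame pixels they warp to under depth and relative pose."""
    H, W = depth_ref.shape
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    loss, count = 0.0, 0
    for v in range(H):
        for u in range(W):
            z = depth_ref[v, u]
            if z <= 0:
                continue
            # backproject pixel (u, v) to a 3D point in the reference camera
            p = np.array([(u - cx) * z / fx, (v - cy) * z / fy, z])
            # move the point into the source camera frame
            q = R @ p + t
            if q[2] <= 0:
                continue
            # reproject and sample the source image (nearest neighbor)
            us = int(round(fx * q[0] / q[2] + cx))
            vs = int(round(fy * q[1] / q[2] + cy))
            if 0 <= us < W and 0 <= vs < H:
                loss += abs(float(img_ref[v, u]) - float(img_src[vs, us]))
                count += 1
    return loss / max(count, 1)
```

When pose and depth are correct, the warped colors match and the loss vanishes, which is what allows frames without annotations to still constrain the reconstruction.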
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.