Single-to-Dual-View Adaptation for Egocentric 3D Hand Pose Estimation
- URL: http://arxiv.org/abs/2403.04381v2
- Date: Sat, 9 Mar 2024 11:02:48 GMT
- Title: Single-to-Dual-View Adaptation for Egocentric 3D Hand Pose Estimation
- Authors: Ruicong Liu, Takehiko Ohkawa, Mingfang Zhang, Yoichi Sato
- Abstract summary: We propose a novel Single-to-Dual-view adaptation (S2DHand) solution that adapts a pre-trained single-view estimator to dual views.
S2DHand achieves significant improvements on arbitrary camera pairs under both in-dataset and cross-dataset settings.
- Score: 16.95807780754898
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The pursuit of accurate 3D hand pose estimation stands as a keystone for
understanding human activity in the realm of egocentric vision. The majority of
existing estimation methods still rely on single-view images as input, leading
to limitations such as a restricted field of view and depth ambiguity. To
address these problems, adding a second camera to better capture the shape of
hands is a practical direction. However, existing multi-view hand pose
estimation methods suffer from two main drawbacks: 1) they require multi-view
annotations for training, which are expensive to obtain; 2) at test time, the
model becomes inapplicable if the camera parameters/layout differ from those
used in training. In this paper, we propose a novel Single-to-Dual-view adaptation
(S2DHand) solution that adapts a pre-trained single-view estimator to dual
views. Compared with existing multi-view training methods, 1) our adaptation
process is unsupervised, eliminating the need for multi-view annotation; 2) our
method can handle arbitrary dual-view pairs with unknown camera parameters,
making the model applicable to diverse camera settings. Specifically, S2DHand
is built on two stereo constraints: pair-wise cross-view consensus and the
invariance of the transformation between the two views. These two stereo
constraints are used in a complementary manner to
generate pseudo-labels, allowing reliable adaptation. Evaluation results reveal
that S2DHand achieves significant improvements on arbitrary camera pairs under
both in-dataset and cross-dataset settings, and outperforms existing adaptation
methods. Project page:
https://github.com/MickeyLLG/S2DHand.
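The abstract names the two stereo constraints but not how they produce pseudo-labels. The sketch below is a minimal, hypothetical rendering in Python/NumPy: the function names, the Kabsch-based rotation estimate, the tolerance, and the averaging rule are assumptions for illustration, not details of the S2DHand implementation.

```python
# Hypothetical sketch of combining the two stereo constraints to gate
# pseudo-labels; names, threshold, and fusion rule are illustrative only.
import numpy as np

def estimate_relative_rotation(joints_a: np.ndarray, joints_b: np.ndarray) -> np.ndarray:
    """Closed-form (Kabsch) rotation R such that joints_a ~ R @ joints_b.

    joints_a, joints_b: (J, 3) root-relative 3D hand joints predicted by the
    single-view estimator in each camera's coordinate frame.
    """
    a = joints_a - joints_a.mean(axis=0)
    b = joints_b - joints_b.mean(axis=0)
    u, _, vt = np.linalg.svd(a.T @ b)
    d = np.sign(np.linalg.det(u @ vt))          # guard against reflections
    return u @ np.diag([1.0, 1.0, d]) @ vt

def pseudo_label(joints_a, joints_b, running_rot, tol_deg=10.0):
    """Fuse the two single-view predictions into a pseudo-label, or reject.

    Invariance of transformation: the camera pair is rigid, so the rotation
    implied by the current predictions should match a running estimate
    accumulated over many frames. Cross-view consensus: when it does, map the
    view-b prediction into view-a coordinates and average the two predictions.
    """
    rot = estimate_relative_rotation(joints_a, joints_b)
    cos = (np.trace(running_rot.T @ rot) - 1.0) / 2.0   # geodesic distance
    angle = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
    if angle > tol_deg:
        return None                      # constraints disagree: unreliable pair
    b_in_a = (running_rot @ joints_b.T).T
    return 0.5 * (joints_a + b_in_a)     # consensus pseudo-label in view-a frame
```

Under these assumptions, the two constraints complement each other as the abstract describes: the invariance check filters out frames where the pair of predictions is unreliable, and the consensus average denoises the pseudo-labels that survive.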
Related papers
- Self-learning Canonical Space for Multi-view 3D Human Pose Estimation [57.969696744428475]
Multi-view 3D human pose estimation is naturally superior to single-view estimation, but accurate multi-view annotations are hard to obtain.
We propose a fully self-supervised framework, named cascaded multi-view aggregating network (CMANet).
CMANet outperforms state-of-the-art methods in extensive quantitative and qualitative analyses.
arXiv Detail & Related papers (2024-03-19T04:54:59Z) - Multi-View Person Matching and 3D Pose Estimation with Arbitrary
Uncalibrated Camera Networks [36.49915280876899]
Cross-view person matching and 3D human pose estimation in multi-camera networks are difficult when the cameras are extrinsically uncalibrated.
Existing approaches require either large amounts of 3D data to train neural networks or known camera poses to impose geometric constraints.
We present a method, PME, that solves both tasks without requiring either.
arXiv Detail & Related papers (2023-12-04T01:28:38Z) - Two Views Are Better than One: Monocular 3D Pose Estimation with Multiview Consistency [0.493599216374976]
We propose a novel multiview-consistency loss that enables adding training data with only 2D supervision; a toy version of such a loss is sketched after this entry.
Our experiments demonstrate that two views offset by 90 degrees are enough to obtain good performance, with only marginal improvements from adding more views.
This research introduces new possibilities for domain adaptation in 3D pose estimation, providing a practical and cost-effective way to customize models for specific applications.
arXiv Detail & Related papers (2023-11-21T08:21:55Z) - CameraPose: Weakly-Supervised Monocular 3D Human Pose Estimation by
- CameraPose: Weakly-Supervised Monocular 3D Human Pose Estimation by Leveraging In-the-wild 2D Annotations [25.05308239278207]
We present CameraPose, a weakly-supervised framework for 3D human pose estimation from a single image.
By adding a camera parameter branch, any in-the-wild 2D annotations can be fed into our pipeline to boost the training diversity.
We also introduce a refinement network module with confidence-guided loss to further improve the quality of noisy 2D keypoints extracted by 2D pose estimators.
arXiv Detail & Related papers (2023-01-08T05:07:41Z) - VirtualPose: Learning Generalizable 3D Human Pose Models from Virtual
Data [69.64723752430244]
We introduce VirtualPose, a two-stage learning framework to exploit the hidden "free lunch" specific to this task.
The first stage transforms images to abstract geometry representations (AGR), and then the second maps them to 3D poses.
It addresses the generalization issue from two aspects: (1) the first stage can be trained on diverse 2D datasets to reduce the risk of over-fitting to limited appearance; (2) the second stage can be trained on diverse AGR synthesized from a large number of virtual cameras and poses.
arXiv Detail & Related papers (2022-07-20T14:47:28Z) - MetaPose: Fast 3D Pose from Multiple Views without 3D Supervision [72.5863451123577]
We show how to train a neural model that can perform accurate 3D pose and camera estimation.
Our method outperforms both classical bundle adjustment and weakly-supervised monocular 3D baselines.
arXiv Detail & Related papers (2021-08-10T18:39:56Z) - CanonPose: Self-Supervised Monocular 3D Human Pose Estimation in the
Wild [31.334715988245748]
We propose a self-supervised approach that learns a single image 3D pose estimator from unlabeled multi-view data.
In contrast to most existing methods, we do not require calibrated cameras and can therefore learn from moving cameras.
Key to the success are new, unbiased reconstruction objectives that mix information across views and training samples.
arXiv Detail & Related papers (2020-11-30T10:42:27Z) - Synthetic Training for Monocular Human Mesh Recovery [100.38109761268639]
This paper aims to estimate the 3D meshes of multiple body parts, which differ greatly in scale, from a single RGB image.
The main challenge is the lack of training data with complete 3D annotations for all body parts in 2D images.
We propose a depth-to-scale (D2S) projection that incorporates per-joint depth differences into the projection function to derive per-joint scale variants; a toy rendering of this idea follows this entry.
arXiv Detail & Related papers (2020-10-27T03:31:35Z) - Self-Supervised 3D Human Pose Estimation via Part Guided Novel Image
- Self-Supervised 3D Human Pose Estimation via Part Guided Novel Image Synthesis [72.34794624243281]
We propose a self-supervised learning framework to disentangle variations from unlabeled video frames.
Our differentiable formalization, bridging the representation gap between the 3D pose and spatial part maps, allows us to operate on videos with diverse camera movements.
arXiv Detail & Related papers (2020-04-09T07:55:01Z) - Weakly-Supervised 3D Human Pose Learning via Multi-view Images in the
Wild [101.70320427145388]
We propose a weakly-supervised approach that does not require 3D annotations and learns to estimate 3D poses from unlabeled multi-view data.
We evaluate our proposed approach on two large-scale datasets.
arXiv Detail & Related papers (2020-03-17T08:47:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.