Semi-Supervised 3D Hand-Object Poses Estimation with Interactions in Time
- URL: http://arxiv.org/abs/2106.05266v1
- Date: Wed, 9 Jun 2021 17:59:34 GMT
- Title: Semi-Supervised 3D Hand-Object Poses Estimation with Interactions in Time
- Authors: Shaowei Liu, Hanwen Jiang, Jiarui Xu, Sifei Liu, Xiaolong Wang
- Abstract summary: We propose a unified framework for estimating the 3D hand and object poses with semi-supervised learning.
We build a joint learning framework where we perform explicit contextual reasoning between hand and object representations by a Transformer.
Our method not only improves hand pose estimation on a challenging real-world dataset, but also substantially improves object pose estimation, which has fewer ground truths per instance.
- Score: 22.574069344246052
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Estimating 3D hand and object pose from a single image is an extremely
challenging problem: hands and objects are often self-occluded during
interactions, and the 3D annotations are scarce as even humans cannot directly
label the ground-truths from a single image perfectly. To tackle these
challenges, we propose a unified framework for estimating the 3D hand and
object poses with semi-supervised learning. We build a joint learning framework
where we perform explicit contextual reasoning between hand and object
representations by a Transformer. Going beyond limited 3D annotations in a
single image, we leverage the spatial-temporal consistency in large-scale
hand-object videos as a constraint for generating pseudo labels in
semi-supervised learning. Our method not only improves hand pose estimation on a
challenging real-world dataset, but also substantially improves object pose
estimation, which has fewer ground truths per instance. By training with large-scale
diverse videos, our model also generalizes better across multiple out-of-domain
datasets. Project page and code: https://stevenlsw.github.io/Semi-Hand-Object
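To make the contextual-reasoning step concrete, here is a minimal PyTorch sketch of joint hand-object reasoning with a Transformer. It assumes per-region feature tokens from some backbone; the class name, token counts, and dimensions are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of "explicit contextual reasoning between hand and
# object representations by a Transformer". All names and hyperparameters
# are assumptions for illustration, not the paper's released code.
import torch
import torch.nn as nn

class HandObjectContextReasoning(nn.Module):
    def __init__(self, dim=256, heads=8, layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        # Learned type embeddings mark which tokens belong to the hand
        # and which to the object before attention mixes them.
        self.hand_embed = nn.Parameter(torch.zeros(1, 1, dim))
        self.obj_embed = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, hand_tokens, obj_tokens):
        # hand_tokens: (B, Nh, dim); obj_tokens: (B, No, dim)
        x = torch.cat([hand_tokens + self.hand_embed,
                       obj_tokens + self.obj_embed], dim=1)
        x = self.encoder(x)  # self-attention shares context across both
        nh = hand_tokens.shape[1]
        return x[:, :nh], x[:, nh:]  # refined hand / object features

# Dummy usage: 21 hand-joint tokens and 8 object-corner tokens.
model = HandObjectContextReasoning()
hand, obj = model(torch.randn(2, 21, 256), torch.randn(2, 8, 256))
```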
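The semi-supervised part can be sketched in the same spirit: run the current model over unlabeled video, then keep as pseudo labels only the predictions whose motion is temporally smooth. The smoothness test and threshold below are illustrative assumptions; the paper's exact spatial-temporal constraint may differ.

```python
# Hedged sketch of selecting pseudo labels on unlabeled video by a
# temporal-smoothness (spatial-temporal consistency) test.
import torch

@torch.no_grad()
def select_pseudo_labels(pred_joints, max_accel=0.01):
    """pred_joints: (T, J, 3) per-frame 3D joint predictions for one clip.
    Returns a (T,) boolean mask of frames trusted as pseudo labels."""
    vel = pred_joints[1:] - pred_joints[:-1]      # frame-to-frame velocity
    accel = vel[1:] - vel[:-1]                    # frame-to-frame acceleration
    jerkiness = accel.norm(dim=-1).mean(dim=-1)   # mean joint acceleration
    mask = torch.zeros(pred_joints.shape[0], dtype=torch.bool)
    mask[1:-1] = jerkiness < max_accel            # keep only smooth frames
    return mask
```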
Related papers
- Reconstructing Hand-Held Objects in 3D from Images and Videos [53.277402172488735]
Given a monocular RGB video, we aim to reconstruct hand-held object geometry in 3D, over time.
We present MCC-Hand-Object (MCC-HO), which jointly reconstructs hand and object geometry given a single RGB image.
We then prompt a text-to-3D generative model using GPT-4(V) to retrieve a 3D object model that matches the object in the image.
arXiv Detail & Related papers (2024-04-09T17:55:41Z)
- HOLD: Category-agnostic 3D Reconstruction of Interacting Hands and Objects from Video [70.11702620562889]
We present HOLD, the first category-agnostic method that reconstructs an articulated hand and object jointly from a monocular interaction video.
We develop a compositional articulated implicit model that can disentangle the 3D hand and object from 2D images.
Our method does not rely on 3D hand-object annotations while outperforming fully-supervised baselines in both in-the-lab and challenging in-the-wild settings.
arXiv Detail & Related papers (2023-11-30T10:50:35Z)
- What's in your hands? 3D Reconstruction of Generic Objects in Hands [49.12461675219253]
Our work aims to reconstruct hand-held objects given a single RGB image.
In contrast to prior works that typically assume known 3D templates and reduce the problem to 3D pose estimation, our work reconstructs generic hand-held objects without knowing their 3D templates.
arXiv Detail & Related papers (2022-04-14T17:59:02Z)
- H2O: Two Hands Manipulating Objects for First Person Interaction Recognition [70.46638409156772]
We present a comprehensive framework for egocentric interaction recognition using markerless 3D annotations of two hands manipulating objects.
Our method produces annotations of the 3D pose of two hands and the 6D pose of the manipulated objects, along with their interaction labels for each frame.
Our dataset, called H2O (2 Hands and Objects), provides synchronized multi-view RGB-D images, interaction labels, object classes, ground-truth 3D poses for left & right hands, 6D object poses, ground-truth camera poses, object meshes and scene point clouds.
arXiv Detail & Related papers (2021-04-22T17:10:42Z)
- Perceiving 3D Human-Object Spatial Arrangements from a Single Image in the Wild [96.08358373137438]
We present a method that infers spatial arrangements and shapes of humans and objects in a globally consistent 3D scene.
Our method runs on datasets without any scene- or object-level 3D supervision.
arXiv Detail & Related papers (2020-07-30T17:59:50Z)
- Leveraging Photometric Consistency over Time for Sparsely Supervised Hand-Object Reconstruction [118.21363599332493]
We present a method to leverage photometric consistency across time when annotations are only available for a sparse subset of frames in a video.
Our model is trained end-to-end on color images to jointly reconstruct hands and objects in 3D by inferring their poses.
We achieve state-of-the-art results on 3D hand-object reconstruction benchmarks and demonstrate that our approach improves pose estimation accuracy.
arXiv Detail & Related papers (2020-04-28T12:03:14Z)
- Measuring Generalisation to Unseen Viewpoints, Articulations, Shapes and Objects for 3D Hand Pose Estimation under Hand-Object Interaction [137.28465645405655]
HANDS'19 is a challenge to evaluate the abilities of current 3D hand pose estimators (HPEs) to interpolate and extrapolate the poses of a training set.
We show that the accuracy of state-of-the-art methods can drop, and that they fail mostly on poses absent from the training set.
arXiv Detail & Related papers (2020-03-30T19:28:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.