MOHO: Learning Single-view Hand-held Object Reconstruction with
Multi-view Occlusion-Aware Supervision
- URL: http://arxiv.org/abs/2310.11696v2
- Date: Wed, 13 Mar 2024 07:39:10 GMT
- Authors: Chenyangguang Zhang, Guanlong Jiao, Yan Di, Gu Wang, Ziqin Huang,
Ruida Zhang, Fabian Manhardt, Bowen Fu, Federico Tombari, Xiangyang Ji
- Abstract summary: We present a novel framework to exploit Multi-view Occlusion-aware supervision from hand-object videos for Hand-held Object reconstruction.
We tackle two predominant challenges in such a setting: hand-induced occlusion and the object's self-occlusion.
Experiments on the HO3D and DexYCB datasets demonstrate that 2D-supervised MOHO outperforms 3D-supervised methods by a large margin.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Previous works on single-view hand-held object reconstruction
typically rely on supervision from 3D ground-truth models, which are hard to
collect in the real world. In contrast, readily accessible hand-object videos
offer a promising training data source, but they only give heavily occluded
object observations. In this paper, we present a novel synthetic-to-real
framework to exploit Multi-view Occlusion-aware supervision from hand-object
videos for Hand-held Object reconstruction (MOHO) from a single image, tackling
two predominant challenges in such a setting: hand-induced occlusion and the
object's self-occlusion. First, in the synthetic pre-training stage, we render
a large-scale synthetic dataset, SOMVideo, with hand-object images and
multi-view occlusion-free supervision, adopted to address hand-induced
occlusion in both 2D and 3D spaces. Second, in the real-world finetuning stage,
MOHO leverages amodal-mask-weighted geometric supervision to mitigate the
unfaithful guidance caused by hand-occluded supervising views in the real
world. Moreover, domain-consistent occlusion-aware features are amalgamated in
MOHO to resist the object's self-occlusion and infer the complete object shape.
Extensive experiments on the HO3D and DexYCB datasets demonstrate that
2D-supervised MOHO outperforms 3D-supervised methods by a large margin.
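The amodal-mask-weighted geometric supervision lends itself to a short
illustration. Below is a minimal PyTorch sketch of the general idea:
rendering losses are trusted only where the object is visible, while the full
(amodal) silhouette is still supervised. The tensor layout, the `hand_mask`
and `amodal_mask` inputs, and the weighting scheme are our assumptions for
illustration, not MOHO's published formulation.

```python
import torch
import torch.nn.functional as F

def amodal_weighted_loss(rendered_rgb, target_rgb, rendered_mask,
                         amodal_mask, hand_mask):
    """Illustrative occlusion-aware re-weighting (not MOHO's exact loss).

    rendered_rgb, target_rgb: (B, H, W, 3); masks: (B, H, W) in [0, 1].
    """
    # Trust photometric supervision only where the object is visible
    # in the supervising view: object present AND hand absent.
    visible = amodal_mask * (1.0 - hand_mask)                  # (B, H, W)
    rgb_err = ((rendered_rgb - target_rgb) ** 2).mean(dim=-1)  # (B, H, W)
    rgb_loss = (visible * rgb_err).sum() / visible.sum().clamp(min=1.0)

    # Supervise the rendered silhouette against the amodal (occlusion-
    # completed) mask, pushing the network to explain hidden regions too.
    mask_loss = F.binary_cross_entropy(
        rendered_mask.clamp(1e-5, 1.0 - 1e-5), amodal_mask)

    return rgb_loss + mask_loss
```

The key design choice is that the weighting is amodal: pixels hidden by the
hand still contribute through the completed silhouette rather than being
discarded outright.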
Related papers
- Object-level Scene Deocclusion
We present a new self-supervised PArallel visible-to-COmplete diffusion framework, named PACO, for object-level scene deocclusion.
To train PACO, we create a large-scale dataset with 500k samples to enable self-supervised learning.
Experiments on COCOA and various real-world scenes demonstrate the superior capability of PACO for scene deocclusion, surpassing the state of the art by a large margin.
  (arXiv 2024-06-11)
- In-Hand 3D Object Reconstruction from a Monocular RGB Video
Our work aims to reconstruct a 3D object that is held and rotated by a hand in front of a static RGB camera.
Previous methods that use implicit neural representations to recover the geometry of a generic hand-held object from multi-view images have achieved compelling results on the visible part of the object.
  (arXiv 2023-12-27)
- HOLD: Category-agnostic 3D Reconstruction of Interacting Hands and Objects from Video
We present HOLD, the first category-agnostic method that reconstructs an articulated hand and object jointly from a monocular interaction video.
We develop a compositional articulated implicit model that can disentangle the 3D hand and object from 2D images.
Our method does not rely on 3D hand-object annotations, yet outperforms fully-supervised baselines in both in-the-lab and challenging in-the-wild settings.
  (arXiv 2023-11-30)
- D-SCo: Dual-Stream Conditional Diffusion for Monocular Hand-Held Object Reconstruction
We introduce a centroid-fixed dual-stream conditional diffusion model for monocular hand-held object reconstruction.
First, to prevent the object centroid from deviating, we utilize a novel hand-constrained centroid-fixing paradigm.
Second, we introduce a dual-stream denoiser to semantically and geometrically model hand-object interactions.
  (arXiv 2023-11-23)
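As a rough, heavily simplified sketch of what a centroid-fixed, dual-stream
diffusion step could look like (not D-SCo's actual architecture; the module
shapes, conditioning, and the omitted time embedding are all assumptions):

```python
import torch
import torch.nn as nn

class DualStreamDenoiser(nn.Module):
    """Toy two-stream noise predictor: one stream conditions on a semantic
    image feature, the other on a geometric hand feature. Illustrative only;
    time-step conditioning is omitted for brevity."""
    def __init__(self, d: int = 128):
        super().__init__()
        self.point_in = nn.Linear(3, d)
        self.sem = nn.Linear(2 * d, d)   # point tokens + image feature
        self.geo = nn.Linear(2 * d, d)   # point tokens + hand feature
        self.out = nn.Linear(2 * d, 3)   # per-point noise prediction

    def forward(self, x_t, img_feat, hand_feat):
        # x_t: (B, N, 3) noisy object points; features: (B, d)
        h = self.point_in(x_t)
        s = torch.relu(self.sem(torch.cat(
            [h, img_feat[:, None].expand_as(h)], dim=-1)))
        g = torch.relu(self.geo(torch.cat(
            [h, hand_feat[:, None].expand_as(h)], dim=-1)))
        return self.out(torch.cat([s, g], dim=-1))

def diffusion_step_loss(model, points, centroid, img_feat, hand_feat,
                        alpha_bar_t):
    """points: (B, N, 3); centroid: (B, 3) hand-constrained centroid
    estimate; alpha_bar_t: scalar tensor from the noise schedule."""
    # Centroid fixing: express points relative to the estimated centroid,
    # so the denoiser models local shape rather than global drift.
    x0 = points - centroid[:, None]
    noise = torch.randn_like(x0)
    x_t = alpha_bar_t.sqrt() * x0 + (1.0 - alpha_bar_t).sqrt() * noise
    return ((model(x_t, img_feat, hand_feat) - noise) ** 2).mean()
```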
- SHOWMe: Benchmarking Object-agnostic Hand-Object 3D Reconstruction
We introduce the SHOWMe dataset which consists of 96 videos, annotated with real and detailed hand-object 3D textured meshes.
We consider a rigid hand-object scenario, in which the pose of the hand with respect to the object remains constant during the whole video sequence.
This assumption allows us to register sub-millimetre-precise ground-truth 3D scans to the image sequences in SHOWMe.
  (arXiv 2023-09-19)
- 3D Reconstruction of Objects in Hands without Real World 3D Supervision
We propose modules to leverage indirect 3D supervision to scale up the learning of models for reconstructing hand-held objects.
Specifically, we extract multi-view 2D mask supervision from videos and 3D shape priors from shape collections.
We use these indirect 3D cues to train occupancy networks that predict the 3D shape of objects from a single RGB image.
  (arXiv 2023-05-04)
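A generic single-view occupancy network of the kind this summary describes
can be sketched as an image encoder plus an MLP over 3D query points; this is
an illustrative baseline under assumed dimensions, not the authors' exact
model:

```python
import torch
import torch.nn as nn
import torchvision

class SingleViewOccupancyNet(nn.Module):
    """Generic sketch: ResNet-18 image feature + MLP occupancy classifier."""
    def __init__(self, feat_dim: int = 512, hidden: int = 256):
        super().__init__()
        backbone = torchvision.models.resnet18(weights=None)
        self.encoder = nn.Sequential(*list(backbone.children())[:-1])
        self.mlp = nn.Sequential(
            nn.Linear(3 + feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),   # occupancy logit per query point
        )

    def forward(self, image, points):
        # image: (B, 3, H, W); points: (B, N, 3) 3D query locations
        feat = self.encoder(image).flatten(1)                  # (B, 512)
        feat = feat[:, None].expand(-1, points.size(1), -1)    # (B, N, 512)
        return self.mlp(torch.cat([points, feat], dim=-1)).squeeze(-1)
```

The indirect cues enter through the labels: under multi-view mask
supervision, for instance, a query point that projects outside the object
mask in any view can be labeled unoccupied.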
- Unsupervised Style-based Explicit 3D Face Reconstruction from Single Image
In this work, we propose a general adversarial learning framework for solving Unsupervised 2D to Explicit 3D Style Transfer.
Specifically, we merge two architectures: the unsupervised explicit 3D reconstruction network of Wu et al. and the Generative Adversarial Network (GAN) named StarGAN-v2.
We show that our solution outperforms well-established solutions such as DepthNet in 3D reconstruction and Pix2NeRF in conditional style transfer.
  (arXiv 2023-04-24)
- Towards unconstrained joint hand-object reconstruction from RGB videos
Reconstructing hand-object manipulations holds great potential for robotics and learning from human demonstrations.
We first propose a learning-free fitting approach for hand-object reconstruction which can seamlessly handle two-hand object interactions.
  (arXiv 2021-08-16)
- Reconstructing Hand-Object Interactions in the Wild
We propose an optimization-based procedure which does not require direct 3D supervision.
We exploit all available related data (2D bounding boxes, 2D hand keypoints, 2D instance masks, 3D object models, 3D in-the-lab MoCap) to provide constraints for the 3D reconstruction.
Our method produces compelling reconstructions on the challenging in-the-wild data from the EPIC Kitchens and the 100 Days of Hands datasets.
  (arXiv 2020-12-17)
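To give a flavor of such an optimization-based procedure, the sketch below
fits a rigid object pose to detected 2D keypoints of a known 3D model by
minimizing reprojection error with gradient descent; the correspondences,
intrinsics, and single loss term are illustrative assumptions, whereas the
actual method combines many more constraints (boxes, masks, hand keypoints,
MoCap priors).

```python
import torch

def hat(k):
    """Skew-symmetric matrix of a 3-vector (built differentiably)."""
    z = torch.zeros((), dtype=k.dtype, device=k.device)
    return torch.stack([
        torch.stack([z, -k[2], k[1]]),
        torch.stack([k[2], z, -k[0]]),
        torch.stack([-k[1], k[0], z]),
    ])

def axis_angle_to_matrix(r):
    """Rodrigues' formula: axis-angle (3,) -> rotation matrix (3, 3)."""
    theta = r.norm().clamp(min=1e-8)
    K = hat(r / theta)
    I = torch.eye(3, dtype=r.dtype, device=r.device)
    return I + torch.sin(theta) * K + (1.0 - torch.cos(theta)) * (K @ K)

def fit_object_pose(model_pts, kps_2d, K_cam, iters=500):
    """model_pts: (N, 3) object-frame keypoints of a known 3D model;
    kps_2d: (N, 2) detected image keypoints; K_cam: (3, 3) intrinsics."""
    # Small random init avoids the norm singularity at exactly zero.
    r = (1e-3 * torch.randn(3)).requires_grad_()           # axis-angle rotation
    t = torch.tensor([0.0, 0.0, 0.5], requires_grad=True)  # translation, z > 0
    opt = torch.optim.Adam([r, t], lr=1e-2)
    for _ in range(iters):
        R = axis_angle_to_matrix(r)
        cam_pts = model_pts @ R.T + t                # object -> camera frame
        proj = cam_pts @ K_cam.T                     # pinhole projection
        uv = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)
        loss = ((uv - kps_2d) ** 2).mean()           # 2D reprojection error
        opt.zero_grad()
        loss.backward()
        opt.step()
    return r.detach(), t.detach()
```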