Learning Hand-Held Object Reconstruction from In-The-Wild Videos
- URL: http://arxiv.org/abs/2305.03036v1
- Date: Thu, 4 May 2023 17:56:48 GMT
- Title: Learning Hand-Held Object Reconstruction from In-The-Wild Videos
- Authors: Aditya Prakash, Matthew Chang, Matthew Jin, Saurabh Gupta
- Abstract summary: We learn data-driven 3D shape priors using synthetic objects from the ObMan dataset.
We use these indirect 3D cues to train occupancy networks that predict the 3D shape of objects from a single RGB image.
- Score: 19.16274394098004
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Prior works for reconstructing hand-held objects from a single image rely on
direct 3D shape supervision, which is challenging to gather at scale in the
real world. Consequently, these approaches do not generalize well when presented
with novel objects in in-the-wild settings. While 3D supervision is a major
bottleneck, there is an abundance of in-the-wild raw video data showing
hand-object interactions. In this paper, we automatically extract 3D
supervision (via multiview 2D supervision) from such raw video data to scale up
the learning of models for hand-held object reconstruction. This requires
tackling two key challenges: unknown camera pose and occlusion. For the former,
we use hand pose (predicted from existing techniques, e.g. FrankMocap) as a
proxy for object pose. For the latter, we learn data-driven 3D shape priors
using synthetic objects from the ObMan dataset. We use these indirect 3D cues
to train occupancy networks that predict the 3D shape of objects from a single
RGB image. Our experiments on the MOW and HO3D datasets show the effectiveness
of these supervisory signals at predicting the 3D shape for real-world
hand-held objects without any direct real-world 3D supervision.
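The occupancy-network formulation in the abstract can be illustrated with a minimal sketch (hypothetical names and dimensions; the paper's actual architecture, training losses, and the hand-pose-as-object-pose conditioning are not specified here): a small MLP scores whether a 3D query point, expressed in a canonical (e.g. hand-centric) frame, lies inside the object, conditioned on a feature vector extracted from the RGB image.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_occupancy(point, image_feature, weights):
    """Toy occupancy head: concatenate a 3D query point with a global
    image feature and run a small MLP ending in a sigmoid, so the
    output can be read as P(point is inside the object)."""
    x = np.concatenate([point, image_feature])
    for W, b in weights[:-1]:
        x = np.maximum(W @ x + b, 0.0)       # ReLU hidden layers
    W, b = weights[-1]
    logit = (W @ x + b).item()
    return 1.0 / (1.0 + np.exp(-logit))      # sigmoid -> occupancy in (0, 1)

# Randomly initialized weights stand in for a trained network.
feat_dim, hidden = 16, 32
weights = [
    (rng.normal(size=(hidden, 3 + feat_dim)) * 0.1, np.zeros(hidden)),
    (rng.normal(size=(hidden, hidden)) * 0.1, np.zeros(hidden)),
    (rng.normal(size=(1, hidden)) * 0.1, np.zeros(1)),
]

image_feature = rng.normal(size=feat_dim)    # stand-in for a CNN encoding of the RGB crop
query = np.array([0.1, -0.2, 0.05])          # 3D query point in the canonical frame
occ = mlp_occupancy(query, image_feature, weights)
print(f"occupancy at {query}: {occ:.3f}")
```

In the paper's setting, such a network would be supervised indirectly: multiview 2D masks from raw video (with hand pose standing in for camera/object pose) and ObMan-derived shape priors constrain the predicted occupancies, rather than direct real-world 3D labels.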
Related papers
- Unsupervised Learning of Category-Level 3D Pose from Object-Centric Videos [15.532504015622159]
Category-level 3D pose estimation is a fundamentally important problem in computer vision and robotics.
We tackle the problem of learning to estimate the category-level 3D pose only from casually taken object-centric videos.
arXiv Detail & Related papers (2024-07-05T09:43:05Z)
- Reconstructing Hand-Held Objects in 3D [53.277402172488735]
We present a paradigm for handheld object reconstruction that builds on recent breakthroughs in large language/vision models and 3D object datasets.
We use GPT-4(V) to retrieve a 3D object model that matches the object in the image and rigidly align the model to the network-inferred geometry.
Experiments demonstrate that MCC-HO achieves state-of-the-art performance on lab and Internet datasets.
arXiv Detail & Related papers (2024-04-09T17:55:41Z)
- SUGAR: Pre-training 3D Visual Representations for Robotics [85.55534363501131]
We introduce a novel 3D pre-training framework for robotics named SUGAR.
SUGAR captures semantic, geometric and affordance properties of objects through 3D point clouds.
We show that SUGAR's 3D representation outperforms state-of-the-art 2D and 3D representations.
arXiv Detail & Related papers (2024-04-01T21:23:03Z)
- HOLD: Category-agnostic 3D Reconstruction of Interacting Hands and Objects from Video [70.11702620562889]
HOLD -- the first category-agnostic method that reconstructs an articulated hand and object jointly from a monocular interaction video.
We develop a compositional articulated implicit model that can disentangle the 3D hand and object from 2D images.
Our method does not rely on 3D hand-object annotations while outperforming fully-supervised baselines in both in-the-lab and challenging in-the-wild settings.
arXiv Detail & Related papers (2023-11-30T10:50:35Z)
- Decaf: Monocular Deformation Capture for Face and Hand Interactions [77.75726740605748]
This paper introduces the first method that allows tracking human hands interacting with human faces in 3D from single monocular RGB videos.
We model hands as articulated objects inducing non-rigid face deformations during an active interaction.
Our method relies on a new hand-face motion and interaction capture dataset with realistic face deformations acquired with a markerless multi-view camera system.
arXiv Detail & Related papers (2023-09-28T17:59:51Z)
- D3D-HOI: Dynamic 3D Human-Object Interactions from Videos [49.38319295373466]
We introduce D3D-HOI: a dataset of monocular videos with ground truth annotations of 3D object pose, shape and part motion during human-object interactions.
Our dataset consists of several common articulated objects captured from diverse real-world scenes and camera viewpoints.
We leverage the estimated 3D human pose for more accurate inference of the object spatial layout and dynamics.
arXiv Detail & Related papers (2021-08-19T00:49:01Z)
- Unsupervised object-centric video generation and decomposition in 3D [36.08064849807464]
We propose to model a video as the view seen while moving through a scene with multiple 3D objects and a 3D background.
Our model is trained from monocular videos without any supervision, yet learns to generate coherent 3D scenes containing several moving objects.
arXiv Detail & Related papers (2020-07-07T18:01:29Z)
- From Image Collections to Point Clouds with Self-supervised Shape and Pose Networks [53.71440550507745]
Reconstructing 3D models from 2D images is one of the fundamental problems in computer vision.
We propose a deep learning technique for 3D object reconstruction from a single image.
We learn both 3D point cloud reconstruction and pose estimation networks in a self-supervised manner.
arXiv Detail & Related papers (2020-05-05T04:25:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.