D3D-HOI: Dynamic 3D Human-Object Interactions from Videos
- URL: http://arxiv.org/abs/2108.08420v1
- Date: Thu, 19 Aug 2021 00:49:01 GMT
- Title: D3D-HOI: Dynamic 3D Human-Object Interactions from Videos
- Authors: Xiang Xu, Hanbyul Joo, Greg Mori, Manolis Savva
- Abstract summary: We introduce D3D-HOI: a dataset of monocular videos with ground truth annotations of 3D object pose, shape and part motion during human-object interactions.
Our dataset consists of several common articulated objects captured from diverse real-world scenes and camera viewpoints.
We leverage the estimated 3D human pose for more accurate inference of the object spatial layout and dynamics.
- Score: 49.38319295373466
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We introduce D3D-HOI: a dataset of monocular videos with ground truth
annotations of 3D object pose, shape and part motion during human-object
interactions. Our dataset consists of several common articulated objects
captured from diverse real-world scenes and camera viewpoints. Each manipulated
object (e.g., microwave oven) is represented with a matching 3D parametric
model. This data allows us to evaluate the reconstruction quality of
articulated objects and establish a benchmark for this challenging task. In
particular, we leverage the estimated 3D human pose for more accurate inference
of the object spatial layout and dynamics. We evaluate this approach on our
dataset, demonstrating that human-object relations can significantly reduce the
ambiguity of articulated object reconstructions from challenging real-world
videos. Code and dataset are available at
https://github.com/facebookresearch/d3d-hoi.
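To make the annotation contents concrete, below is a minimal, hypothetical sketch (in Python) of how per-frame ground truth of the kind described above, object pose, shape, and part motion tied to a matching parametric model, could be organized. The class and field names are illustrative assumptions only and do not mirror the actual D3D-HOI annotation files in the linked repository.

```python
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class FrameAnnotation:
    """Hypothetical per-frame record for one manipulated object.

    Field names are illustrative and not the real D3D-HOI format.
    """
    frame_index: int
    rotation: np.ndarray      # (3, 3) object orientation in camera coordinates
    translation: np.ndarray   # (3,) object position in camera coordinates
    scale: np.ndarray         # (3,) per-axis size of the parametric model
    part_motion: float        # articulation state, e.g. door opening angle in radians


@dataclass
class VideoAnnotation:
    """Hypothetical container for one monocular interaction video."""
    video_id: str
    object_category: str      # e.g. "microwave"
    cad_model_id: str         # identifier of the matching 3D parametric model
    frames: List[FrameAnnotation]


def max_articulation(video: VideoAnnotation) -> float:
    """Example query: the largest part-motion value observed in the clip."""
    return max(frame.part_motion for frame in video.frames)
```

A record like this is enough to evaluate articulated reconstruction: predicted pose and part motion per frame can be compared directly against the annotated fields.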
Related papers
- DO3D: Self-supervised Learning of Decomposed Object-aware 3D Motion and Depth from Monocular Videos [76.01906393673897]
We propose a self-supervised method to jointly learn 3D motion and depth from monocular videos.
Our system contains a depth estimation module to predict depth, and a new decomposed object-wise 3D motion (DO3D) estimation module to predict ego-motion and 3D object motion.
Our model delivers superior performance in all evaluated settings.
arXiv Detail & Related papers (2024-03-09T12:22:46Z)
- HOLD: Category-agnostic 3D Reconstruction of Interacting Hands and Objects from Video [70.11702620562889]
We introduce HOLD, the first category-agnostic method that reconstructs an articulated hand and object jointly from a monocular interaction video.
We develop a compositional articulated implicit model that can disentangle the 3D hand and object from 2D images.
Our method does not rely on 3D hand-object annotations while outperforming fully-supervised baselines in both in-the-lab and challenging in-the-wild settings.
arXiv Detail & Related papers (2023-11-30T10:50:35Z)
- AutoDecoding Latent 3D Diffusion Models [95.7279510847827]
We present a novel approach to the generation of static and articulated 3D assets that has a 3D autodecoder at its core.
The 3D autodecoder framework embeds properties learned from the target dataset in the latent space.
We then identify the appropriate intermediate volumetric latent space, and introduce robust normalization and de-normalization operations.
arXiv Detail & Related papers (2023-07-07T17:59:14Z)
- Tracking Objects with 3D Representation from Videos [57.641129788552675]
We propose a new 2D Multiple Object Tracking (MOT) paradigm, called P3DTrack, which learns 3D object representations from pseudo 3D object labels in monocular videos.
arXiv Detail & Related papers (2023-06-08T17:58:45Z)
- 3D Reconstruction of Objects in Hands without Real World 3D Supervision [12.70221786947807]
We propose modules that scale up the learning of models for reconstructing hand-held objects without relying on direct real-world 3D supervision.
Specifically, we extract multiview 2D mask supervision from videos and 3D shape priors from shape collections.
We use these indirect 3D cues to train occupancy networks that predict the 3D shape of objects from a single RGB image.
arXiv Detail & Related papers (2023-05-04T17:56:48Z)
- Articulated 3D Human-Object Interactions from RGB Videos: An Empirical Analysis of Approaches and Challenges [19.21834600205309]
We canonicalize the task of articulated 3D human-object interaction reconstruction from RGB video.
We use five families of methods for this task: 3D plane estimation, 3D cuboid estimation, CAD model fitting, implicit field fitting, and free-form mesh fitting.
Our experiments show that all methods struggle to obtain highly accurate results, even when provided with ground-truth information.
arXiv Detail & Related papers (2022-09-12T21:03:25Z)
- Estimating 3D Motion and Forces of Human-Object Interactions from Internet Videos [49.52070710518688]
We introduce a method to reconstruct the 3D motion of a person interacting with an object from a single RGB video.
Our method estimates the 3D poses of the person together with the object pose, the contact positions and the contact forces on the human body.
arXiv Detail & Related papers (2021-11-02T13:40:18Z)
- Objectron: A Large Scale Dataset of Object-Centric Videos in the Wild with Pose Annotations [0.0]
We introduce the Objectron dataset to advance the state of the art in 3D object detection.
The dataset contains object-centric short videos with pose annotations for nine categories and includes 4 million annotated images in 14,819 annotated videos.
arXiv Detail & Related papers (2020-12-18T00:34:18Z)