Diffusion-Guided Reconstruction of Everyday Hand-Object Interaction Clips
- URL: http://arxiv.org/abs/2309.05663v1
- Date: Mon, 11 Sep 2023 17:58:30 GMT
- Title: Diffusion-Guided Reconstruction of Everyday Hand-Object Interaction Clips
- Authors: Yufei Ye, Poorvi Hebbar, Abhinav Gupta, Shubham Tulsiani
- Abstract summary: We tackle the task of reconstructing hand-object interactions from short video clips.
Our approach casts 3D inference as a per-video optimization and recovers a neural 3D representation of the object shape.
We empirically evaluate our approach on egocentric videos, and observe significant improvements over prior single-view and multi-view methods.
- Score: 38.02945794078731
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We tackle the task of reconstructing hand-object interactions from short
video clips. Given an input video, our approach casts 3D inference as a
per-video optimization and recovers a neural 3D representation of the object
shape, as well as the time-varying motion and hand articulation. While the
input video naturally provides some multi-view cues to guide 3D inference,
these are insufficient on their own due to occlusions and limited viewpoint
variations. To obtain accurate 3D, we augment the multi-view signals with
generic data-driven priors to guide reconstruction. Specifically, we learn a
diffusion network to model the conditional distribution of (geometric)
renderings of objects conditioned on hand configuration and category label, and
leverage it as a prior to guide the novel-view renderings of the reconstructed
scene. We empirically evaluate our approach on egocentric videos across 6
object categories, and observe significant improvements over prior single-view
and multi-view methods. Finally, we demonstrate our system's ability to
reconstruct arbitrary clips from YouTube, showing both 1st and 3rd person
interactions.
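The abstract describes a per-video optimization in which a multi-view reconstruction loss is combined with guidance from a learned diffusion prior over novel-view renderings. The following is a minimal toy sketch of that idea, not the authors' code: the 1-D variable `x` stands in for a rendered novel view, the diffusion network (conditioned on hand configuration and category) is abstracted into a Gaussian prior score, and `guidance_weight` is a hypothetical hyperparameter balancing the two signals.

```python
import numpy as np

# Toy sketch of diffusion-guided per-video optimization (score-distillation
# style). Assumptions: a 1-D scalar "rendering" x, two noisy observations
# standing in for sparse multi-view cues, and a Gaussian stand-in for the
# learned diffusion prior over renderings.

def data_grad(x, observations):
    # Gradient of the multi-view reprojection loss: sum_i (x - obs_i)^2 / 2.
    return np.sum(x - observations)

def prior_score(x, prior_mean=1.0, prior_std=0.5):
    # Score (gradient of log-density) of the Gaussian prior; guidance pushes
    # x toward regions the "diffusion prior" considers likely.
    return -(x - prior_mean) / prior_std**2

def optimize(observations, guidance_weight=0.1, lr=0.05, steps=500):
    x = 0.0
    for _ in range(steps):
        # Descend the data loss while ascending the prior log-density.
        g = data_grad(x, observations) - guidance_weight * prior_score(x)
        x -= lr * g
    return x

obs = np.array([0.2, 0.3])   # sparse, occluded views roughly agree on ~0.25
x_star = optimize(obs)
print(round(x_star, 3))      # prints 0.375: between the data fit and the prior mean
```

The fixed point lies between the multi-view estimate (0.25) and the prior mean (1.0), illustrating how the prior regularizes under-constrained views without overriding the observed evidence; in the actual method both terms operate on full geometric renderings rather than scalars.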
Related papers
- MultiPly: Reconstruction of Multiple People from Monocular Video in the Wild [32.6521941706907]
We present MultiPly, a novel framework to reconstruct multiple people in 3D from monocular in-the-wild videos.
We first define a layered neural representation for the entire scene, composited by individual human and background models.
We learn the layered neural representation from videos via our layer-wise differentiable volume rendering.
arXiv Detail & Related papers (2024-06-03T17:59:57Z)
- Multi-view Inversion for 3D-aware Generative Adversarial Networks [3.95944314850151]
Current 3D GAN inversion methods for human heads typically use only one single frontal image to reconstruct the whole 3D head model.
This leaves out meaningful information when multi-view data or dynamic videos are available.
Our method builds on existing state-of-the-art 3D GAN inversion techniques to allow for consistent and simultaneous inversion of multiple views of the same subject.
arXiv Detail & Related papers (2023-12-08T19:28:40Z)
- HOLD: Category-agnostic 3D Reconstruction of Interacting Hands and Objects from Video [70.11702620562889]
HOLD is the first category-agnostic method that reconstructs an articulated hand and object jointly from a monocular interaction video.
We develop a compositional articulated implicit model that can disentangle the 3D hand and object from 2D images.
Our method does not rely on 3D hand-object annotations while outperforming fully-supervised baselines in both in-the-lab and challenging in-the-wild settings.
arXiv Detail & Related papers (2023-11-30T10:50:35Z)
- AutoDecoding Latent 3D Diffusion Models [95.7279510847827]
We present a novel approach to the generation of static and articulated 3D assets that has a 3D autodecoder at its core.
The 3D autodecoder framework embeds properties learned from the target dataset in the latent space.
We then identify the appropriate intermediate volumetric latent space, and introduce robust normalization and de-normalization operations.
arXiv Detail & Related papers (2023-07-07T17:59:14Z)
- Multiview Compressive Coding for 3D Reconstruction [77.95706553743626]
We introduce a simple framework that operates on 3D points of single objects or whole scenes.
Our model, Multiview Compressive Coding, learns to compress the input appearance and geometry to predict the 3D structure.
arXiv Detail & Related papers (2023-01-19T18:59:52Z)
- IVT: An End-to-End Instance-guided Video Transformer for 3D Pose Estimation [6.270047084514142]
Video 3D human pose estimation aims to localize the 3D coordinates of human joints from videos.
IVT enables learning temporal contextual depth information from visual features, and 3D poses directly from video frames.
Experiments on three widely-used 3D pose estimation benchmarks show that the proposed IVT achieves state-of-the-art performances.
arXiv Detail & Related papers (2022-08-06T02:36:33Z)
- Reconstructing and grounding narrated instructional videos in 3D [99.22297066405741]
We aim to reconstruct such objects and to localize associated narrations in 3D.
We propose an approach for correspondence estimation combining learnt local features and dense flow.
We demonstrate the effectiveness of our approach for the domain of car maintenance.
arXiv Detail & Related papers (2021-09-09T16:49:10Z)
- Deep3DPose: Realtime Reconstruction of Arbitrarily Posed Human Bodies from Single RGB Images [5.775625085664381]
We introduce an approach that accurately reconstructs 3D human poses and detailed 3D full-body geometric models from single images in realtime.
Key idea of our approach is a novel end-to-end multi-task deep learning framework that uses single images to predict five outputs simultaneously.
We show the system advances the frontier of 3D human body and pose reconstruction from single images by quantitative evaluations and comparisons with state-of-the-art methods.
arXiv Detail & Related papers (2021-06-22T04:26:11Z)
- Learning monocular 3D reconstruction of articulated categories from motion [39.811816510186475]
Video self-supervision forces the consistency of consecutive 3D reconstructions by a motion-based cycle loss.
We introduce an interpretable model of 3D template deformations that controls a 3D surface through the displacement of a small number of local, learnable handles.
We obtain state-of-the-art reconstructions with diverse shapes, viewpoints and textures for multiple articulated object categories.
arXiv Detail & Related papers (2021-03-30T13:50:27Z)
- Human Mesh Recovery from Multiple Shots [85.18244937708356]
We propose a framework for improved 3D reconstruction and mining of long sequences with pseudo ground truth 3D human mesh.
We show that the resulting data is beneficial in the training of various human mesh recovery models.
The tools we develop open the door to processing and analyzing, in 3D, content from a large library of edited media.
arXiv Detail & Related papers (2020-12-17T18:58:02Z)
- Online Adaptation for Consistent Mesh Reconstruction in the Wild [147.22708151409765]
We pose video-based reconstruction as a self-supervised online adaptation problem applied to any incoming test video.
We demonstrate that our algorithm recovers temporally consistent and reliable 3D structures from videos of non-rigid objects including those of animals captured in the wild.
arXiv Detail & Related papers (2020-12-06T07:22:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.