Visibility Aware Human-Object Interaction Tracking from Single RGB
Camera
- URL: http://arxiv.org/abs/2303.16479v2
- Date: Tue, 31 Oct 2023 16:27:27 GMT
- Title: Visibility Aware Human-Object Interaction Tracking from Single RGB
Camera
- Authors: Xianghui Xie and Bharat Lal Bhatnagar and Gerard Pons-Moll
- Abstract summary: We propose a novel method to track the 3D human, object, contacts between them, and their relative translation across frames from a single RGB camera.
We condition our neural field reconstructions for human and object on per-frame SMPL model estimates obtained by pre-fitting SMPL to a video sequence.
Human and object motion from visible frames provides valuable information to infer the occluded object.
- Score: 40.817960406002506
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Capturing the interactions between humans and their environment in 3D is
important for many applications in robotics, graphics, and vision. Recent works
that reconstruct the 3D human and object from a single RGB image do not produce
consistent relative translation across frames because they assume a fixed
depth. Moreover, their performance drops significantly when the object is
occluded. In this work, we propose a novel method to track the 3D human,
object, contacts between them, and their relative translation across frames
from a single RGB camera, while being robust to heavy occlusions. Our method is
built on two key insights. First, we condition our neural field reconstructions
for human and object on per-frame SMPL model estimates obtained by pre-fitting
SMPL to a video sequence. This improves neural reconstruction accuracy and
produces coherent relative translation across frames. Second, human and object
motion from visible frames provides valuable information to infer the occluded
object. We propose a novel transformer-based neural network that explicitly
uses object visibility and human motion to leverage neighbouring frames to make
predictions for the occluded frames. Building on these insights, our method is
able to track both human and object robustly even under occlusions. Experiments
on two datasets show that our method significantly improves over the
state-of-the-art methods. Our code and pretrained models are available at:
https://virtualhumans.mpi-inf.mpg.de/VisTracker
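The second insight lends itself to a compact illustration. Below is a minimal, hypothetical PyTorch sketch (not the authors' released code) of a transformer that attends over neighbouring frames, conditioning each frame's token on the per-frame human motion estimate and an explicit object-visibility flag so that occluded frames can borrow evidence from visible ones. All module names, feature dimensions, and the 6D-rotation-plus-translation pose format are illustrative assumptions.

```python
# Minimal sketch (not the authors' released code): a transformer that
# infills the object pose in occluded frames from neighbouring visible
# frames. Dimensions and the pose parameterisation are assumptions.
import torch
import torch.nn as nn

class VisibilityAwareInfiller(nn.Module):
    def __init__(self, d_model=256, n_heads=8, n_layers=4, max_len=512):
        super().__init__()
        self.human_proj = nn.Linear(72, d_model)    # per-frame SMPL pose estimate
        self.object_proj = nn.Linear(9, d_model)    # object 6D rotation + translation
        self.vis_embed = nn.Embedding(2, d_model)   # 0 = occluded, 1 = visible
        self.pos_embed = nn.Parameter(torch.zeros(1, max_len, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, 9)           # refined object pose per frame

    def forward(self, human_motion, object_pose, visibility):
        # human_motion: (B, T, 72); object_pose: (B, T, 9); visibility: (B, T) in {0, 1}
        T = human_motion.shape[1]
        x = (self.human_proj(human_motion)
             + self.object_proj(object_pose)
             + self.vis_embed(visibility.long())
             + self.pos_embed[:, :T])
        # Self-attention lets occluded frames aggregate evidence from
        # visible neighbours along the time axis.
        return self.head(self.encoder(x))

model = VisibilityAwareInfiller()
pred = model(torch.randn(1, 30, 72), torch.randn(1, 30, 9),
             torch.randint(0, 2, (1, 30)))
print(pred.shape)  # torch.Size([1, 30, 9])
```

In the paper the network predicts poses for occluded frames from visible ones; the visibility embedding above is one plausible way to make that conditioning explicit.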
Related papers
- InterTrack: Tracking Human Object Interaction without Object Templates [34.31283776812698]
We present a method to track human object interaction without any object shape templates.
We decompose the 4D tracking problem into per-frame pose tracking and canonical shape optimization.
Our method significantly outperforms previous template-based video tracking and single-frame reconstruction methods.
arXiv Detail & Related papers (2024-08-25T22:26:46Z)
- ROAM: Robust and Object-Aware Motion Generation Using Neural Pose Descriptors [73.26004792375556]
This paper shows that robustness and generalisation to novel scene objects in 3D object-aware character synthesis can be achieved by training a motion model with as few as one reference object.
We leverage an implicit feature representation trained on object-only datasets, which encodes an SE(3)-equivariant descriptor field around the object.
We demonstrate substantial improvements in 3D virtual character motion and interaction quality and robustness to scenarios with unseen objects.
arXiv Detail & Related papers (2023-08-24T17:59:51Z)
- DORT: Modeling Dynamic Objects in Recurrent for Multi-Camera 3D Object Detection and Tracking [67.34803048690428]
We propose to model Dynamic Objects in RecurrenT (DORT) to tackle this problem.
DORT extracts object-wise local volumes for motion estimation, which also alleviates the heavy computational burden.
It is flexible and practical, and can be plugged into most camera-based 3D object detectors.
arXiv Detail & Related papers (2023-03-29T12:33:55Z)
- BundleSDF: Neural 6-DoF Tracking and 3D Reconstruction of Unknown Objects [89.2314092102403]
We present a near real-time method for 6-DoF tracking of an unknown object from a monocular RGBD video sequence.
Our method works for arbitrary rigid objects, even when visual texture is largely absent.
arXiv Detail & Related papers (2023-03-24T17:13:49Z)
- Scene-Aware 3D Multi-Human Motion Capture from a Single Camera [83.06768487435818]
We consider the problem of estimating the 3D position of multiple humans in a scene as well as their body shape and articulation from a single RGB video recorded with a static camera.
We leverage recent advances in computer vision using large-scale pre-trained models for a variety of modalities, including 2D body joints, joint angles, normalized disparity maps, and human segmentation masks.
In particular, we estimate the scene depth and unique person scale from normalized disparity predictions using the 2D body joints and joint angles.
arXiv Detail & Related papers (2023-01-12T18:01:28Z)
- PIZZA: A Powerful Image-only Zero-Shot Zero-CAD Approach to 6 DoF Tracking [27.283648727847268]
We present a method for tracking the 6D motion of objects in RGB video sequences when neither the training images nor the 3D geometry of the objects are available.
In contrast to previous works, our method can therefore handle unknown objects in the open world immediately, without retraining.
Our results on challenging datasets are on par with previous works that require much more information.
arXiv Detail & Related papers (2022-09-15T19:55:13Z)
- Learning Dynamic View Synthesis With Few RGBD Cameras [60.36357774688289]
We propose to utilize RGBD cameras to synthesize free-viewpoint videos of dynamic indoor scenes.
We generate point clouds from the RGBD frames and then render them into free-viewpoint videos via neural feature rendering.
We introduce a simple Regional Depth-Inpainting module that adaptively inpaints missing depth values to render complete novel views.
arXiv Detail & Related papers (2022-04-22T03:17:35Z)
- CHORE: Contact, Human and Object REconstruction from a single RGB image [40.817960406002506]
CHORE is a novel method that learns to jointly reconstruct the human and the object from a single RGB image.
We compute a neural reconstruction of the human and the object, represented implicitly with two unsigned distance fields (see the sketch after this list).
Experiments show that our joint reconstruction learned with the proposed strategy significantly outperforms the SOTA.
arXiv Detail & Related papers (2022-04-05T18:38:06Z)
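For concreteness, the double unsigned-distance-field representation used by CHORE (the last entry above, and the basis of the main paper's reconstructions) can be sketched as follows. This is a hypothetical illustration rather than the released model: the MLP shapes and the 2 cm contact threshold are assumptions.

```python
# Minimal sketch (assumed architecture, not the released CHORE model):
# two unsigned distance fields for human and object, with contact
# candidates read off where both predicted distances are small.
import torch
import torch.nn as nn

def udf_mlp(in_dim=3, hidden=256, depth=4):
    layers, d = [], in_dim
    for _ in range(depth):
        layers += [nn.Linear(d, hidden), nn.ReLU()]
        d = hidden
    layers += [nn.Linear(d, 1), nn.Softplus()]  # unsigned distance >= 0
    return nn.Sequential(*layers)

human_udf, object_udf = udf_mlp(), udf_mlp()

points = torch.rand(4096, 3) * 2 - 1      # query points in a normalized box
d_h = human_udf(points).squeeze(-1)       # distance to human surface
d_o = object_udf(points).squeeze(-1)      # distance to object surface
contact = (d_h < 0.02) & (d_o < 0.02)     # 2 cm threshold: illustrative only
print(int(contact.sum()), "candidate contact points")
```

In the actual methods the fields are additionally conditioned on image features (and, in the main paper, on per-frame SMPL estimates); the contact read-out shown here is one plausible way to realise the "contacts" output.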