Human Mesh Recovery from Multiple Shots
- URL: http://arxiv.org/abs/2012.09843v1
- Date: Thu, 17 Dec 2020 18:58:02 GMT
- Title: Human Mesh Recovery from Multiple Shots
- Authors: Georgios Pavlakos, Jitendra Malik, Angjoo Kanazawa
- Abstract summary: We propose a framework for improved 3D reconstruction and mining of long sequences with pseudo ground truth 3D human mesh.
We show that the resulting data is beneficial in the training of various human mesh recovery models.
The tools we develop open the door to 3D processing and analysis of content from a large library of edited media.
- Score: 85.18244937708356
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Videos from edited media like movies are a useful, yet under-explored source
of information. The rich variety of appearance and interactions between humans
depicted over a large temporal context in these films could be a valuable
source of data. However, the richness of data comes at the expense of
fundamental challenges such as abrupt shot changes and close-up shots of actors
with heavy truncation, which limit the applicability of existing human 3D
understanding methods. In this paper, we address these limitations with an
insight that while shot changes of the same scene incur a discontinuity between
frames, the 3D structure of the scene still changes smoothly. This allows us to
handle frames before and after a shot change as a multi-view signal that
provides strong cues to recover the 3D state of the actors. We propose a
multi-shot optimization framework, which leads to improved 3D reconstruction
and mining of long sequences with pseudo ground truth 3D human mesh. We show
that the resulting data is beneficial in the training of various human mesh
recovery models: for single images, we achieve improved robustness; for video, we
propose a pure transformer-based temporal encoder, which can naturally handle
missing observations due to shot changes in the input frames. We demonstrate
the importance of the insight and proposed models through extensive
experiments. The tools we develop open the door to 3D processing and analysis
of content from a large library of edited media, which could be helpful for
many downstream applications. Project page:
https://geopavlakos.github.io/multishot
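The abstract's two technical claims can be made concrete with short sketches. First, the multi-shot insight: a data term scores each frame's 2D evidence independently, while a smoothness term couples consecutive 3D estimates, including pairs that straddle a cut, since the 3D motion is assumed continuous even when the 2D view changes abruptly. The following is a minimal illustration under those assumptions, not the authors' released code; the names (multishot_objective, data_terms, smoothness_weight) are hypothetical.

```python
import numpy as np

def multishot_objective(poses_3d, data_terms, smoothness_weight=1.0):
    """Score a sequence of per-frame 3D body parameters.

    poses_3d:    (T, D) array, one D-dim parameter vector per frame.
    data_terms:  T callables, each scoring one frame's 2D evidence
                 (e.g., a keypoint reprojection error).
    The smoothness term runs over *all* consecutive frame pairs,
    including pairs that straddle a shot change: the 2D appearance
    jumps at a cut, but the 3D motion is assumed to stay continuous.
    """
    data = sum(f(p) for f, p in zip(data_terms, poses_3d))
    smooth = np.sum(np.square(np.diff(poses_3d, axis=0)))
    return data + smoothness_weight * smooth
```

Second, the claim that a pure transformer temporal encoder "can naturally handle missing observations": a standard key padding mask excludes unobserved frames from attention, so a cut simply removes some keys rather than corrupting the sequence. This too is a hedged sketch of the general mechanism, not the paper's architecture.

```python
import torch
import torch.nn as nn

class MaskedTemporalEncoder(nn.Module):
    """Transformer encoder over per-frame features that ignores
    unobserved frames via a key padding mask (positional encoding
    omitted for brevity)."""
    def __init__(self, dim=256, heads=4, layers=3):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, feats, missing):
        # feats: (B, T, dim); missing: (B, T) bool, True where a frame
        # is unobserved (e.g., dropped at a shot boundary). Masked
        # positions are excluded from attention, so the remaining
        # frames still yield a coherent temporal encoding.
        return self.encoder(feats, src_key_padding_mask=missing)
```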
Related papers
- ReconX: Reconstruct Any Scene from Sparse Views with Video Diffusion Model [16.14713604672497]
ReconX is a novel 3D scene reconstruction paradigm that reframes the ambiguous reconstruction challenge as a temporal generation task.
The proposed ReconX first constructs a global point cloud and encodes it into a contextual space as the 3D structure condition.
Guided by the condition, the video diffusion model then synthesizes video frames that are both detail-preserved and exhibit a high degree of 3D consistency.
arXiv Detail & Related papers (2024-08-29T17:59:40Z) - MultiPly: Reconstruction of Multiple People from Monocular Video in the Wild [32.6521941706907]
We present MultiPly, a novel framework to reconstruct multiple people in 3D from monocular in-the-wild videos.
We first define a layered neural representation for the entire scene, composited by individual human and background models.
We learn the layered neural representation from videos via our layer-wise differentiable volume rendering.
arXiv Detail & Related papers (2024-06-03T17:59:57Z) - Decaf: Monocular Deformation Capture for Face and Hand Interactions [77.75726740605748]
This paper introduces the first method that allows tracking human hands interacting with human faces in 3D from single monocular RGB videos.
We model hands as articulated objects inducing non-rigid face deformations during an active interaction.
Our method relies on a new hand-face motion and interaction capture dataset with realistic face deformations acquired with a markerless multi-view camera system.
arXiv Detail & Related papers (2023-09-28T17:59:51Z) - Towards Robust and Smooth 3D Multi-Person Pose Estimation from Monocular
Videos in the Wild [10.849750765175754]
POTR-3D is a sequence-to-sequence 2D-to-3D lifting model for 3DMPPE.
It generalizes robustly to diverse unseen views, recovers poses under heavy occlusion, and reliably produces more natural, smoother outputs.
arXiv Detail & Related papers (2023-09-15T06:17:22Z) - Diffusion-Guided Reconstruction of Everyday Hand-Object Interaction
Clips [38.02945794078731]
We tackle the task of reconstructing hand-object interactions from short video clips.
Our approach casts 3D inference as a per-video optimization and recovers a neural 3D representation of the object shape.
We empirically evaluate our approach on egocentric videos, and observe significant improvements over prior single-view and multi-view methods.
arXiv Detail & Related papers (2023-09-11T17:58:30Z) - PointOdyssey: A Large-Scale Synthetic Dataset for Long-Term Point
Tracking [90.29143475328506]
We introduce PointOdyssey, a large-scale synthetic dataset, and data generation framework.
Our goal is to advance the state-of-the-art by placing emphasis on long videos with naturalistic motion.
We animate deformable characters using real-world motion capture data, we build 3D scenes to match the motion capture environments, and we render camera viewpoints using trajectories mined via structure-from-motion on real videos.
arXiv Detail & Related papers (2023-07-27T17:58:11Z) - Differentiable Blocks World: Qualitative 3D Decomposition by Rendering
Primitives [70.32817882783608]
We present an approach that produces a simple, compact, and actionable 3D world representation by means of 3D primitives.
Unlike existing primitive decomposition methods that rely on 3D input data, our approach operates directly on images.
We show that the resulting textured primitives faithfully reconstruct the input images and accurately model the visible 3D points.
arXiv Detail & Related papers (2023-07-11T17:58:31Z) - Visibility Aware Human-Object Interaction Tracking from Single RGB
Camera [40.817960406002506]
We propose a novel method to track the 3D human, object, contacts between them, and their relative translation across frames from a single RGB camera.
We condition our neural field reconstructions for human and object on per-frame SMPL model estimates obtained by pre-fitting SMPL to a video sequence.
Human and object motion from visible frames provides valuable information to infer the occluded object.
arXiv Detail & Related papers (2023-03-29T06:23:44Z) - BundleSDF: Neural 6-DoF Tracking and 3D Reconstruction of Unknown
Objects [89.2314092102403]
We present a near real-time method for 6-DoF tracking of an unknown object from a monocular RGBD video sequence.
Our method works for arbitrary rigid objects, even when visual texture is largely absent.
arXiv Detail & Related papers (2023-03-24T17:13:49Z) - Scene-Aware 3D Multi-Human Motion Capture from a Single Camera [83.06768487435818]
We consider the problem of estimating the 3D position of multiple humans in a scene as well as their body shape and articulation from a single RGB video recorded with a static camera.
We leverage recent advances in computer vision using large-scale pre-trained models for a variety of modalities, including 2D body joints, joint angles, normalized disparity maps, and human segmentation masks.
In particular, we estimate the scene depth and unique person scale from normalized disparity predictions using the 2D body joints and joint angles.
arXiv Detail & Related papers (2023-01-12T18:01:28Z)