Replay: Multi-modal Multi-view Acted Videos for Casual Holography
- URL: http://arxiv.org/abs/2307.12067v1
- Date: Sat, 22 Jul 2023 12:24:07 GMT
- Title: Replay: Multi-modal Multi-view Acted Videos for Casual Holography
- Authors: Roman Shapovalov, Yanir Kleiman, Ignacio Rocco, David Novotny, Andrea
Vedaldi, Changan Chen, Filippos Kokkinos, Ben Graham, Natalia Neverova
- Abstract summary: Replay is a collection of multi-view, multi-modal videos of humans interacting socially.
Overall, the dataset contains over 4000 minutes of footage and over 7 million timestamped high-resolution frames.
The Replay dataset has many potential applications, such as novel-view synthesis, 3D reconstruction, novel-view acoustic synthesis, human body and face analysis, and training generative models.
- Score: 76.49914880351167
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce Replay, a collection of multi-view, multi-modal videos of humans
interacting socially. Each scene is filmed at high production quality from
different viewpoints, using several static cameras as well as wearable action
cameras, and is recorded with a large array of microphones placed at different
positions in the room. Overall, the dataset contains over 4000 minutes of footage and
over 7 million timestamped high-resolution frames annotated with camera poses
and partially with foreground masks. The Replay dataset has many potential
applications, such as novel-view synthesis, 3D reconstruction, novel-view
acoustic synthesis, human body and face analysis, and training generative
models. We provide a benchmark for training and evaluating novel-view
synthesis, with two scenarios of different difficulty. Finally, we evaluate
several baseline state-of-the-art methods on the new benchmark.
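As a rough illustration of how such a benchmark could be consumed, the sketch below groups timestamped multi-view frames and holds out one camera as the novel view to synthesize. This is a minimal sketch under assumed conventions: the Frame fields, microsecond timestamps, and the hold-one-camera-out protocol are illustrative assumptions, not the official Replay format.

```python
# Minimal sketch: organizing timestamped multi-view frames for a
# novel-view synthesis split. All field names and conventions here are
# hypothetical, not taken from the Replay release.
from dataclasses import dataclass
from typing import Dict, Iterator, List, Optional, Tuple


@dataclass
class Frame:
    camera_id: str            # static or wearable camera that captured it
    timestamp_us: int         # shared timestamp across devices, in microseconds
    image_path: str           # high-resolution RGB frame on disk
    pose: List[float]         # camera pose, e.g. a flattened 4x4 world-to-camera matrix
    mask_path: Optional[str]  # foreground mask; only a subset of frames has one


def group_by_timestamp(frames: List[Frame]) -> Dict[int, List[Frame]]:
    """Bucket frames from all cameras by timestamp: each bucket is one
    multi-view snapshot of the scene."""
    buckets: Dict[int, List[Frame]] = {}
    for frame in frames:
        buckets.setdefault(frame.timestamp_us, []).append(frame)
    return buckets


def novel_view_pairs(
    frames: List[Frame], held_out_camera: str
) -> Iterator[Tuple[int, List[Frame], Frame]]:
    """Yield (timestamp, source views, target view) triples, treating one
    held-out camera as the novel view to synthesize."""
    for ts, views in sorted(group_by_timestamp(frames).items()):
        sources = [v for v in views if v.camera_id != held_out_camera]
        targets = [v for v in views if v.camera_id == held_out_camera]
        if sources and targets:
            yield ts, sources, targets[0]
```

Holding out a camera close to the remaining views versus one far from them would naturally produce two difficulty levels, in the spirit of the two benchmark scenarios mentioned above, though the paper's actual split criteria may differ.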
Related papers
- PIV3CAMS: a multi-camera dataset for multiple computer vision problems and its application to novel view-point synthesis [120.4361056355332]
This thesis introduces PIV3CAMS, a dataset of Paired Image and Video data from three CAMeraS.
The PIV3CAMS dataset consists of 8385 pairs of images and 82 pairs of videos taken from three different cameras.
In addition to reproducing a current state-of-the-art algorithm, we investigate several alternative models that integrate depth information geometrically.
arXiv Detail & Related papers (2024-07-26T12:18:29Z)
- Generative Camera Dolly: Extreme Monocular Dynamic Novel View Synthesis [43.02778060969546]
We propose a controllable monocular dynamic view synthesis pipeline.
Our model does not require depth as input, and does not explicitly model 3D scene geometry.
We believe our framework can potentially unlock powerful applications in rich dynamic scene understanding, perception for robotics, and interactive 3D video viewing experiences for virtual reality.
arXiv Detail & Related papers (2024-05-23T17:59:52Z)
- Holoported Characters: Real-time Free-viewpoint Rendering of Humans from Sparse RGB Cameras [65.54875149514274]
We present the first approach to render highly realistic free-viewpoint videos of a human actor in general apparel.
At inference, our method only requires four camera views of the moving actor and the respective 3D skeletal pose.
It handles actors in wide clothing, and reproduces even fine-scale dynamic detail.
arXiv Detail & Related papers (2023-12-12T16:45:52Z)
- Total-Recon: Deformable Scene Reconstruction for Embodied View Synthesis [76.72505510632904]
We present Total-Recon, the first method to reconstruct deformable scenes from long monocular RGBD videos.
Our method hierarchically decomposes the scene into the background and objects, whose motion is decomposed into root-body motion and local articulations.
arXiv Detail & Related papers (2023-04-24T17:59:52Z)
- Deep 3D Mask Volume for View Synthesis of Dynamic Scenes [49.45028543279115]
We introduce a multi-view video dataset, captured with a custom 10-camera rig at 120 FPS.
The dataset contains 96 high-quality outdoor scenes showing various visual effects and human interactions.
We develop a new algorithm, Deep 3D Mask Volume, which enables temporally-stable view extrapolation from binocular videos of dynamic scenes, captured by static cameras.
arXiv Detail & Related papers (2021-08-30T17:55:28Z)
- DeepMultiCap: Performance Capture of Multiple Characters Using Sparse Multiview Cameras [63.186486240525554]
DeepMultiCap is a novel method for multi-person performance capture using sparse multi-view cameras.
Our method can capture time varying surface details without the need of using pre-scanned template models.
arXiv Detail & Related papers (2021-05-01T14:32:13Z)
- Deep Multimodal Feature Encoding for Video Ordering [34.27175264084648]
We present a way to learn a compact multimodal feature representation that encodes all these modalities.
Our model parameters are learned through a proxy task of inferring the temporal ordering of a set of unordered videos in a timeline.
We analyze and evaluate the individual and joint modalities on two challenging tasks: (i) inferring the temporal ordering of a set of videos (a minimal sketch of such an ordering objective follows this entry); and (ii) action recognition.
arXiv Detail & Related papers (2020-04-05T14:02:23Z)
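The ordering proxy task in the last entry lends itself to a compact illustration. The sketch below implements a pairwise margin ranking loss over clip embeddings: if clip A precedes clip B in the timeline, a learned scoring function should rank A earlier. The linear scorer and margin here are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch of a pairwise temporal-ordering objective: a hypothetical linear
# scorer should assign a higher "earliness" score to the clip that comes
# first. This is an illustrative stand-in, not the paper's exact loss.
from typing import List


def score(weights: List[float], embedding: List[float]) -> float:
    """Hypothetical linear scoring function over a clip embedding."""
    return sum(w * e for w, e in zip(weights, embedding))


def ordering_loss(
    weights: List[float],
    emb_earlier: List[float],
    emb_later: List[float],
    margin: float = 1.0,
) -> float:
    """Margin ranking loss: penalize when the earlier clip does not
    out-score the later clip by at least `margin`."""
    return max(0.0, margin - (score(weights, emb_earlier) - score(weights, emb_later)))


# Example: with these toy embeddings the later clip out-scores the
# earlier one, so the ordering is violated and the loss is positive.
loss = ordering_loss([1.0, -0.5], emb_earlier=[0.2, 0.8], emb_later=[0.9, 0.1])
```

Summing this loss over sampled clip pairs and minimizing it with respect to both the scorer and the upstream multimodal encoder would drive the encoder to capture temporal structure, which is the premise of the proxy task.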
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.