Temporal View Synthesis of Dynamic Scenes through 3D Object Motion
Estimation with Multi-Plane Images
- URL: http://arxiv.org/abs/2208.09463v1
- Date: Fri, 19 Aug 2022 17:40:13 GMT
- Title: Temporal View Synthesis of Dynamic Scenes through 3D Object Motion
Estimation with Multi-Plane Images
- Authors: Nagabhushan Somraj, Pranali Sancheti, Rajiv Soundararajan
- Abstract summary: We study the problem of temporal view synthesis (TVS), where the goal is to predict the next frames of a video.
In this work, we consider the TVS of dynamic scenes in which both the user and objects are moving.
We predict the motion of objects by isolating and estimating the 3D object motion in the past frames and then extrapolating it.
- Score: 8.185918509343816
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The challenge of graphically rendering high frame-rate videos on low compute
devices can be addressed through periodic prediction of future frames to
enhance the user experience in virtual reality applications. This is studied
through the problem of temporal view synthesis (TVS), where the goal is to
predict the next frames of a video given the previous frames and the head poses
of the previous and the next frames. In this work, we consider the TVS of
dynamic scenes in which both the user and objects are moving. We design a
framework that decouples the motion into user and object motion to effectively
use the available user motion while predicting the next frames. We predict the
motion of objects by isolating and estimating the 3D object motion in the past
frames and then extrapolating it. We employ multi-plane images (MPI) as a 3D
representation of the scenes and model the object motion as the 3D displacement
between the corresponding points in the MPI representation. In order to handle
the sparsity in MPIs while estimating the motion, we incorporate partial
convolutions and masked correlation layers to estimate corresponding points.
The predicted object motion is then integrated with the given user or camera
motion to generate the next frame. Using a disocclusion infilling module, we
synthesize the regions uncovered due to the camera and object motion. We
develop a new synthetic dataset for TVS of dynamic scenes consisting of 800
videos at full HD resolution. We show through experiments on our dataset and
the MPI Sintel dataset that our model outperforms all the competing methods in
the literature.
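
To make the sparsity handling described above concrete, here is a minimal PyTorch sketch of a partial convolution layer of the kind the abstract says is applied to sparse MPI planes. It follows the standard partial-convolution formulation (mask out invalid inputs, renormalize by the valid fraction, and propagate an updated validity mask); it does not reproduce the paper's masked correlation layers or motion estimator, and all layer names, channel sizes, and the toy validity mask are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PartialConv2d(nn.Module):
    """Convolution that ignores invalid (empty) pixels of a sparse input,
    such as the unfilled regions of an MPI plane, and renormalizes the
    output by the fraction of valid pixels under each kernel window."""

    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1, padding=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding, bias=False)
        self.bias = nn.Parameter(torch.zeros(out_ch))
        # Fixed all-ones kernel used only to count valid pixels per window.
        self.register_buffer("ones", torch.ones(1, 1, kernel_size, kernel_size))
        self.window = kernel_size * kernel_size
        self.stride, self.padding = stride, padding

    def forward(self, x, mask):
        # x: (B, C, H, W) features; mask: (B, 1, H, W) with 1 = valid, 0 = empty.
        with torch.no_grad():
            valid = F.conv2d(mask, self.ones, stride=self.stride, padding=self.padding)
            scale = self.window / valid.clamp(min=1.0)   # renormalization factor
            new_mask = (valid > 0).float()               # window saw any valid pixel
        out = self.conv(x * mask) * scale                # convolve valid inputs only
        out = (out + self.bias.view(1, -1, 1, 1)) * new_mask
        return out, new_mask


# Toy usage on one sparse RGBA MPI plane (all sizes are illustrative).
plane = torch.randn(1, 4, 64, 64)                        # RGBA features of a plane
valid = (torch.rand(1, 1, 64, 64) > 0.7).float()         # sparse validity mask
layer = PartialConv2d(in_ch=4, out_ch=16)
features, updated_mask = layer(plane, valid)             # (1, 16, 64, 64), (1, 1, 64, 64)
```

The returned mask marks every window that saw at least one valid pixel, so the invalid region shrinks after each layer; stacking such layers is what lets features be estimated even in the empty parts of a sparse MPI plane.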
Related papers
- DO3D: Self-supervised Learning of Decomposed Object-aware 3D Motion and
Depth from Monocular Videos [76.01906393673897]
We propose a self-supervised method to jointly learn 3D motion and depth from monocular videos.
Our system contains a depth estimation module to predict depth, and a new decomposed object-wise 3D motion (DO3D) estimation module to predict ego-motion and 3D object motion.
Our model delivers superior performance in all evaluated settings.
arXiv Detail & Related papers (2024-03-09T12:22:46Z)
- Delving into Motion-Aware Matching for Monocular 3D Object Tracking [81.68608983602581]
We find that the motion cues of objects across different time frames are critical for 3D multi-object tracking.
We propose MoMA-M3T, a framework that mainly consists of three motion-aware components.
We conduct extensive experiments on the nuScenes and KITTI datasets to demonstrate that our MoMA-M3T achieves competitive performance against state-of-the-art methods.
arXiv Detail & Related papers (2023-08-22T17:53:58Z)
- AutoDecoding Latent 3D Diffusion Models [95.7279510847827]
We present a novel approach to the generation of static and articulated 3D assets that has a 3D autodecoder at its core.
The 3D autodecoder framework embeds properties learned from the target dataset in the latent space.
We then identify the appropriate intermediate volumetric latent space, and introduce robust normalization and de-normalization operations.
arXiv Detail & Related papers (2023-07-07T17:59:14Z)
- Tracking by 3D Model Estimation of Unknown Objects in Videos [122.56499878291916]
We argue that representing tracked objects purely in 2D is limited and instead propose to guide and improve 2D tracking with an explicit 3D object representation.
Our representation tackles a complex long-term dense correspondence problem between all 3D points on the object for all video frames.
The proposed optimization minimizes a novel loss function to estimate the best 3D shape, texture, and 6DoF pose.
arXiv Detail & Related papers (2023-04-13T11:32:36Z)
- Exploring Optical-Flow-Guided Motion and Detection-Based Appearance for Temporal Sentence Grounding [61.57847727651068]
Temporal sentence grounding aims to localize a target segment in an untrimmed video semantically according to a given sentence query.
Most previous works focus on learning frame-level features of each whole frame in the entire video, and directly match them with the textual information.
We propose a novel Motion- and Appearance-guided 3D Semantic Reasoning Network (MA3SRN), which incorporates optical-flow-guided motion-aware, detection-based appearance-aware, and 3D-aware object-level features.
arXiv Detail & Related papers (2022-03-06T13:57:09Z)
- Motion-from-Blur: 3D Shape and Motion Estimation of Motion-blurred Objects in Videos [115.71874459429381]
We propose a method for jointly estimating the 3D motion, 3D shape, and appearance of highly motion-blurred objects from a video.
Experiments on benchmark datasets demonstrate that our method outperforms previous methods for fast moving object deblurring and 3D reconstruction.
arXiv Detail & Related papers (2021-11-29T11:25:14Z)
- Unsupervised Video Prediction from a Single Frame by Estimating 3D Dynamic Scene Structure [42.3091008598491]
We develop a model that first estimates the latent 3D structure of the scene, including the segmentation of any moving objects.
It then predicts future frames by simulating the object and camera dynamics, and rendering the resulting views.
Experiments on two challenging datasets of natural videos show that our model can estimate 3D structure and motion segmentation from a single frame.
arXiv Detail & Related papers (2021-06-16T18:00:12Z)
- 3D-OES: Viewpoint-Invariant Object-Factorized Environment Simulators [24.181604511269096]
We propose an action-conditioned dynamics model that predicts scene changes caused by object and agent interactions in a viewpoint-invariant 3D neural scene representation space.
In this space, objects do not interfere with one another and their appearance persists over time and across viewpoints.
We show that our model generalizes its predictions well across varying numbers and appearances of interacting objects, as well as across camera viewpoints.
arXiv Detail & Related papers (2020-11-12T16:15:52Z)
- Synergetic Reconstruction from 2D Pose and 3D Motion for Wide-Space Multi-Person Video Motion Capture in the Wild [3.0015034534260665]
We propose a markerless motion capture method that achieves accurate and smooth pose estimation from multiple cameras.
The proposed method predicts each person's 3D pose and determines the bounding box in the multi-camera images.
We evaluated the proposed method using various datasets and a real sports field.
arXiv Detail & Related papers (2020-01-16T02:14:59Z)