Novel View Video Prediction Using a Dual Representation
- URL: http://arxiv.org/abs/2106.03956v1
- Date: Mon, 7 Jun 2021 20:41:33 GMT
- Title: Novel View Video Prediction Using a Dual Representation
- Authors: Sarah Shiraz, Krishna Regmi, Shruti Vyas, Yogesh S. Rawat, Mubarak
Shah
- Abstract summary: Given a set of input video clips from a single/multiple views, our network is able to predict the video from a novel view.
The proposed approach does not require any priors and is able to predict the video from wider angular distances, up to 45 degrees.
A comparison with state-of-the-art novel view video prediction methods shows an improvement of 26.1% in SSIM, 13.6% in PSNR, and 60% in FVD scores without using explicit priors from target views.
- Score: 51.58657840049716
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We address the problem of novel view video prediction; given a set of input
video clips from a single/multiple views, our network is able to predict the
video from a novel view. The proposed approach does not require any priors and
is able to predict the video from wider angular distances, up to 45 degrees, as
compared to the recent studies predicting small variations in viewpoint.
Moreover, our method relies only on RGB frames to learn a dual representation
which is used to generate the video from a novel viewpoint. The dual
representation encompasses a view-dependent and a global representation which
incorporates complementary details to enable novel view video prediction. We
demonstrate the effectiveness of our framework on two real-world datasets:
NTU-RGB+D and CMU Panoptic. A comparison with state-of-the-art novel view
video prediction methods shows an improvement of 26.1% in SSIM, 13.6% in PSNR,
and 60% in FVD scores without using explicit priors from target views.
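To make the dual-representation idea concrete, below is a minimal sketch assuming a PyTorch-style setup; the class and function names (DualRepPredictor, psnr), layer sizes, and fusion scheme are illustrative assumptions, not the authors' architecture. A view-dependent branch keeps per-view spatio-temporal detail, a global branch aggregates content shared across the input views, and a decoder fuses the two into the novel-view clip; a small PSNR helper illustrates one of the metrics reported above.

```python
# Minimal sketch (not the paper's implementation) of a dual representation for
# novel view video prediction: a view-dependent encoder applied per input view
# and a global encoder aggregated across views, fused by a decoder.
import torch
import torch.nn as nn

class DualRepPredictor(nn.Module):          # hypothetical name
    def __init__(self, ch=3, feat=64):
        super().__init__()
        # View-dependent branch: preserves per-view spatio-temporal detail.
        self.view_enc = nn.Sequential(
            nn.Conv3d(ch, feat, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv3d(feat, feat, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Global branch: pooled over views, captures view-independent content.
        self.global_enc = nn.Sequential(
            nn.Conv3d(ch, feat, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Decoder fuses both representations into the novel-view video.
        self.dec = nn.Sequential(
            nn.Conv3d(2 * feat, feat, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv3d(feat, ch, kernel_size=3, padding=1), nn.Sigmoid(),
        )

    def forward(self, views):
        # views: (B, V, C, T, H, W) RGB clips from V input viewpoints.
        b, v, c, t, h, w = views.shape
        flat = views.view(b * v, c, t, h, w)
        view_feat = self.view_enc(flat).view(b, v, -1, t, h, w).mean(dim=1)
        glob_feat = self.global_enc(flat).view(b, v, -1, t, h, w).max(dim=1).values
        return self.dec(torch.cat([view_feat, glob_feat], dim=1))

def psnr(pred, target, max_val=1.0):
    # Peak signal-to-noise ratio, one of the metrics reported in the abstract.
    mse = torch.mean((pred - target) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)

# Toy usage: two input views, 8-frame 64x64 clips, one novel-view target clip.
model = DualRepPredictor()
inputs = torch.rand(1, 2, 3, 8, 64, 64)
target = torch.rand(1, 3, 8, 64, 64)
pred = model(inputs)
print(pred.shape, psnr(pred, target).item())
```

SSIM and FVD, the other metrics reported above, compare the same predicted and ground-truth clips, SSIM frame by frame and FVD on features extracted by a pretrained video network.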
Related papers
- AID: Adapting Image2Video Diffusion Models for Instruction-guided Video Prediction [88.70116693750452]
Text-guided video prediction (TVP) involves predicting the motion of future frames from the initial frame according to an instruction.
Previous TVP methods have made significant breakthroughs by adapting Stable Diffusion for this task.
We introduce the Multi-Modal Large Language Model (MLLM) to predict future video states based on initial frames and text instructions.
arXiv Detail & Related papers (2024-06-10T17:02:08Z)
- Revisiting Feature Prediction for Learning Visual Representations from Video [62.08833572467379]
V-JEPA is a collection of vision models trained solely using a feature prediction objective.
The models are trained on 2 million videos collected from public datasets.
Our results show that learning by predicting video features leads to versatile visual representations that perform well on both motion and appearance-based tasks (a hedged sketch of this feature-prediction objective appears after this list).
arXiv Detail & Related papers (2024-02-15T18:59:11Z)
- Viewport Prediction for Volumetric Video Streaming by Exploring Video Saliency and Trajectory Information [45.31198546289057]
This paper proposes a novel approach named Saliency and Trajectory Viewport Prediction (STVP).
It aims to improve the precision of viewport prediction in volumetric video streaming.
In particular, we introduce a novel sampling method, Uniform Random Sampling (URS), to reduce computational complexity.
arXiv Detail & Related papers (2023-11-28T03:45:29Z)
- Causal Video Summarizer for Video Exploration [74.27487067877047]
Causal Video Summarizer (CVS) is proposed to capture the interactive information between the video and query.
Based on the evaluation of the existing multi-modal video summarization dataset, experimental results show that the proposed approach is effective.
arXiv Detail & Related papers (2023-07-04T22:52:16Z)
- Learning video embedding space with Natural Language Supervision [1.6822770693792823]
We propose a novel approach to map the video embedding space to natural language.
We propose a two-stage approach that first extracts visual features from each frame of a video using a pre-trained CNN, and then uses the CLIP model to encode the visual features for the video domain.
arXiv Detail & Related papers (2023-03-25T23:24:57Z)
- Optimizing Video Prediction via Video Frame Interpolation [53.16726447796844]
We present a new optimization framework for video prediction via video frame interpolation, inspired by the photo-realistic results of video frame interpolation.
Our framework is based on optimization with a pretrained differentiable video frame interpolation module, without the need for a training dataset.
Our approach outperforms other video prediction methods that require a large amount of training data or extra semantic information (a minimal sketch of this optimization idea appears after this list).
arXiv Detail & Related papers (2022-06-27T17:03:46Z)
- OCSampler: Compressing Videos to One Clip with Single-step Sampling [82.0417131211353]
We propose a framework named OCSampler to explore a compact yet effective video representation with one short clip.
Our basic motivation is that efficient video recognition lies in processing a whole sequence at once rather than picking up frames sequentially.
arXiv Detail & Related papers (2022-01-12T09:50:38Z)
- Interventional Video Grounding with Dual Contrastive Learning [16.0734337895897]
Video grounding aims to localize a moment from an untrimmed video for a given textual query.
We propose a novel paradigm from the perspective of causal inference to uncover the causality behind the model and data.
We also introduce a dual contrastive learning approach to better align the text and video.
arXiv Detail & Related papers (2021-06-21T12:11:28Z)
- Sequential View Synthesis with Transformer [13.200139959163574]
We introduce a sequential rendering decoder to predict an image sequence, including the target view, based on the learned representations.
We evaluate our model on various challenging datasets and demonstrate that our model not only gives consistent predictions but also doesn't require any retraining for finetuning.
arXiv Detail & Related papers (2020-04-09T14:15:27Z)
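For the related entry "Revisiting Feature Prediction for Learning Visual Representations from Video" (V-JEPA) above, the shape of a feature-prediction objective can be sketched as follows. This is a hedged illustration under simplifying assumptions: the encoders and predictor are tiny MLPs over precomputed patch embeddings, and the pooled-context predictor is a stand-in, so only the structure of the loss (regress target-encoder features of masked tokens from the visible context) is meant to carry over.

```python
# Illustrative feature-prediction objective (not the V-JEPA code): predict the
# target-encoder features of masked video tokens from visible context tokens.
import torch
import torch.nn as nn

dim, num_tokens = 128, 64                      # token size, tokens per clip

context_encoder = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
target_encoder = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
predictor = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

def feature_prediction_loss(tokens, mask):
    # tokens: (B, N, dim) patch embeddings of a video clip.
    # mask:   (N,) boolean, True for tokens whose features must be predicted.
    with torch.no_grad():                          # target features carry no gradient
        targets = target_encoder(tokens)[:, mask]
    context = context_encoder(tokens[:, ~mask])    # encode visible tokens only
    # Predict every masked token's feature from the pooled context; a real
    # implementation would use a transformer predictor with positional cues.
    pooled = context.mean(dim=1, keepdim=True)
    preds = predictor(pooled).expand(-1, int(mask.sum()), -1)
    return torch.mean((preds - targets) ** 2)

# Toy usage with random "patch embeddings" and a random mask.
tokens = torch.randn(2, num_tokens, dim)
mask = torch.rand(num_tokens) < 0.5
loss = feature_prediction_loss(tokens, mask)
loss.backward()
print(loss.item())
```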
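The entry "Optimizing Video Prediction via Video Frame Interpolation" above casts prediction as the inverse of interpolation, which is simple enough to sketch. The snippet below is an illustrative assumption rather than the paper's code: interpolate is a toy averaging stand-in for the pretrained differentiable interpolation module, and the function names and optimizer settings are arbitrary. The candidate future frame is optimized so that interpolating between the previous observed frame and the candidate reconstructs the current observed frame.

```python
# Illustrative optimization-based prediction: find a future frame such that a
# differentiable interpolation of (previous, future) reproduces the current frame.
import torch

def interpolate(frame_a, frame_b):
    # Toy stand-in for a pretrained differentiable frame interpolation model
    # that synthesizes the midpoint frame between frame_a and frame_b.
    return 0.5 * (frame_a + frame_b)

def predict_next_frame(prev_frame, curr_frame, steps=200, lr=0.1):
    # Initialize the candidate future frame from the current frame and optimize it.
    future = curr_frame.clone().requires_grad_(True)
    opt = torch.optim.Adam([future], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        recon = interpolate(prev_frame, future)        # should match curr_frame
        loss = torch.mean((recon - curr_frame) ** 2)
        loss.backward()
        opt.step()
    return future.detach()

# Toy usage on random 3x64x64 "frames".
prev_f = torch.rand(3, 64, 64)
curr_f = torch.rand(3, 64, 64)
next_f = predict_next_frame(prev_f, curr_f)
print(next_f.shape)
```

With the averaging stand-in the optimum is simply the linear extrapolation 2*curr - prev; the paper's premise is that replacing it with a strong interpolation network pushes the optimized frame toward a photo-realistic prediction, with no training dataset required.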