Novel View Video Prediction Using a Dual Representation
- URL: http://arxiv.org/abs/2106.03956v1
- Date: Mon, 7 Jun 2021 20:41:33 GMT
- Title: Novel View Video Prediction Using a Dual Representation
- Authors: Sarah Shiraz, Krishna Regmi, Shruti Vyas, Yogesh S. Rawat, Mubarak
Shah
- Abstract summary: Given a set of input video clips from a single/multiple views, our network is able to predict the video from a novel view.
The proposed approach does not require any priors and is able to predict the video from wider angular distances, up to 45 degrees.
A comparison with state-of-the-art novel view video prediction methods shows an improvement of 26.1% in SSIM, 13.6% in PSNR, and 60% in FVD scores without using explicit priors from target views.
- Score: 51.58657840049716
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We address the problem of novel view video prediction; given a set of input
video clips from a single/multiple views, our network is able to predict the
video from a novel view. The proposed approach does not require any priors and
is able to predict the video from wider angular distances, up to 45 degrees, as
compared to the recent studies predicting small variations in viewpoint.
Moreover, our method relies only on RGB frames to learn a dual representation
which is used to generate the video from a novel viewpoint. The dual
representation encompasses a view-dependent and a global representation which
incorporates complementary details to enable novel view video prediction. We
demonstrate the effectiveness of our framework on two real-world datasets:
NTU-RGB+D and CMU Panoptic. A comparison with state-of-the-art novel view
video prediction methods shows an improvement of 26.1% in SSIM, 13.6% in PSNR,
and 60% in FVD scores without using explicit priors from target views.
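To make the dual-representation idea concrete, below is a minimal sketch assuming a PyTorch-style setup; the class and function names (DualRepPredictor, psnr), layer sizes, and fusion scheme are illustrative assumptions, not the authors' architecture. A view-dependent branch keeps per-view spatio-temporal detail, a global branch aggregates content shared across the input views, and a decoder fuses the two into the novel-view clip; a small PSNR helper illustrates one of the metrics reported above.

```python
# Minimal sketch (not the paper's implementation) of a dual representation for
# novel view video prediction: a view-dependent encoder applied per input view
# and a global encoder aggregated across views, fused by a decoder.
import torch
import torch.nn as nn

class DualRepPredictor(nn.Module):          # hypothetical name
    def __init__(self, ch=3, feat=64):
        super().__init__()
        # View-dependent branch: preserves per-view spatio-temporal detail.
        self.view_enc = nn.Sequential(
            nn.Conv3d(ch, feat, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv3d(feat, feat, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Global branch: pooled over views, captures view-independent content.
        self.global_enc = nn.Sequential(
            nn.Conv3d(ch, feat, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Decoder fuses both representations into the novel-view video.
        self.dec = nn.Sequential(
            nn.Conv3d(2 * feat, feat, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv3d(feat, ch, kernel_size=3, padding=1), nn.Sigmoid(),
        )

    def forward(self, views):
        # views: (B, V, C, T, H, W) RGB clips from V input viewpoints.
        b, v, c, t, h, w = views.shape
        flat = views.view(b * v, c, t, h, w)
        view_feat = self.view_enc(flat).view(b, v, -1, t, h, w).mean(dim=1)
        glob_feat = self.global_enc(flat).view(b, v, -1, t, h, w).max(dim=1).values
        return self.dec(torch.cat([view_feat, glob_feat], dim=1))

def psnr(pred, target, max_val=1.0):
    # Peak signal-to-noise ratio, one of the metrics reported in the abstract.
    mse = torch.mean((pred - target) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)

# Toy usage: two input views, 8-frame 64x64 clips, one novel-view target clip.
model = DualRepPredictor()
inputs = torch.rand(1, 2, 3, 8, 64, 64)
target = torch.rand(1, 3, 8, 64, 64)
pred = model(inputs)
print(pred.shape, psnr(pred, target).item())
```

SSIM and FVD, the other metrics reported above, compare the same predicted and ground-truth clips, SSIM frame by frame and FVD on features extracted by a pretrained video network.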
Related papers
- AID: Adapting Image2Video Diffusion Models for Instruction-guided Video Prediction [88.70116693750452]
Text-guided video prediction (TVP) involves predicting the motion of future frames from the initial frame according to an instruction.
Previous TVP methods have made significant breakthroughs by adapting Stable Diffusion for this task.
We introduce the Multi-Modal Large Language Model (MLLM) to predict future video states based on initial frames and text instructions.
arXiv Detail & Related papers (2024-06-10T17:02:08Z)
- Revisiting Feature Prediction for Learning Visual Representations from Video [62.08833572467379]
V-JEPA is a collection of vision models trained solely using a feature prediction objective.
The models are trained on 2 million videos collected from public datasets.
Our results show that learning by predicting video features leads to versatile visual representations that perform well on both motion and appearance-based tasks (a hedged sketch of this feature-prediction objective appears after this list).
arXiv Detail & Related papers (2024-02-15T18:59:11Z)
- Viewport Prediction for Volumetric Video Streaming by Exploring Video Saliency and Trajectory Information [45.31198546289057]
This paper proposes a novel approach named Saliency and Trajectory Viewport Prediction (STVP).
It aims to improve the precision of viewport prediction in volumetric video streaming.
In particular, we introduce a novel sampling method, Uniform Random Sampling (URS), to reduce computational complexity.
arXiv Detail & Related papers (2023-11-28T03:45:29Z)
- Causal Video Summarizer for Video Exploration [74.27487067877047]
Causal Video Summarizer (CVS) is proposed to capture the interactive information between the video and query.
Based on the evaluation of the existing multi-modal video summarization dataset, experimental results show that the proposed approach is effective.
arXiv Detail & Related papers (2023-07-04T22:52:16Z)
- Learning video embedding space with Natural Language Supervision [1.6822770693792823]
We propose a novel approach to map the video embedding space to natural language.
We propose a two-stage approach that first extracts visual features from each frame of a video using a pre-trained CNN, and then uses the CLIP model to encode the visual features for the video domain.
arXiv Detail & Related papers (2023-03-25T23:24:57Z)
- Optimizing Video Prediction via Video Frame Interpolation [53.16726447796844]
We present a new optimization framework for video prediction via video frame interpolation, inspired by the photo-realistic results of video frame interpolation.
Our framework is based on optimization with a pretrained differentiable video frame interpolation module, without the need for a training dataset.
Our approach outperforms other video prediction methods that require a large amount of training data or extra semantic information (a minimal sketch of this optimization idea appears after this list).
arXiv Detail & Related papers (2022-06-27T17:03:46Z)
- OCSampler: Compressing Videos to One Clip with Single-step Sampling [82.0417131211353]
We propose a framework named OCSampler to explore a compact yet effective video representation with one short clip.
Our basic motivation is that efficient video recognition lies in processing a whole sequence at once rather than picking up frames sequentially.
arXiv Detail & Related papers (2022-01-12T09:50:38Z)
- Interventional Video Grounding with Dual Contrastive Learning [16.0734337895897]
Video grounding aims to localize a moment from an untrimmed video for a given textual query.
We propose a novel paradigm from the perspective of causal inference to uncover the causality behind the model and data.
We also introduce a dual contrastive learning approach to better align the text and video.
arXiv Detail & Related papers (2021-06-21T12:11:28Z)
- Sequential View Synthesis with Transformer [13.200139959163574]
We introduce a sequential rendering decoder to predict an image sequence, including the target view, based on the learned representations.
We evaluate our model on various challenging datasets and demonstrate that our model not only gives consistent predictions but also doesn't require any retraining for finetuning.
arXiv Detail & Related papers (2020-04-09T14:15:27Z)
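For the related entry "Revisiting Feature Prediction for Learning Visual Representations from Video" (V-JEPA) above, the shape of a feature-prediction objective can be sketched as follows. This is a hedged illustration under simplifying assumptions: the encoders and predictor are tiny MLPs over precomputed patch embeddings, and the pooled-context predictor is a stand-in, so only the structure of the loss (regress target-encoder features of masked tokens from the visible context) is meant to carry over.

```python
# Illustrative feature-prediction objective (not the V-JEPA code): predict the
# target-encoder features of masked video tokens from visible context tokens.
import torch
import torch.nn as nn

dim, num_tokens = 128, 64                      # token size, tokens per clip

context_encoder = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
target_encoder = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
predictor = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

def feature_prediction_loss(tokens, mask):
    # tokens: (B, N, dim) patch embeddings of a video clip.
    # mask:   (N,) boolean, True for tokens whose features must be predicted.
    with torch.no_grad():                          # target features carry no gradient
        targets = target_encoder(tokens)[:, mask]
    context = context_encoder(tokens[:, ~mask])    # encode visible tokens only
    # Predict every masked token's feature from the pooled context; a real
    # implementation would use a transformer predictor with positional cues.
    pooled = context.mean(dim=1, keepdim=True)
    preds = predictor(pooled).expand(-1, int(mask.sum()), -1)
    return torch.mean((preds - targets) ** 2)

# Toy usage with random "patch embeddings" and a random mask.
tokens = torch.randn(2, num_tokens, dim)
mask = torch.rand(num_tokens) < 0.5
loss = feature_prediction_loss(tokens, mask)
loss.backward()
print(loss.item())
```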
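The entry "Optimizing Video Prediction via Video Frame Interpolation" above casts prediction as the inverse of interpolation, which is simple enough to sketch. The snippet below is an illustrative assumption rather than the paper's code: interpolate is a toy averaging stand-in for the pretrained differentiable interpolation module, and the function names and optimizer settings are arbitrary. The candidate future frame is optimized so that interpolating between the previous observed frame and the candidate reconstructs the current observed frame.

```python
# Illustrative optimization-based prediction: find a future frame such that a
# differentiable interpolation of (previous, future) reproduces the current frame.
import torch

def interpolate(frame_a, frame_b):
    # Toy stand-in for a pretrained differentiable frame interpolation model
    # that synthesizes the midpoint frame between frame_a and frame_b.
    return 0.5 * (frame_a + frame_b)

def predict_next_frame(prev_frame, curr_frame, steps=200, lr=0.1):
    # Initialize the candidate future frame from the current frame and optimize it.
    future = curr_frame.clone().requires_grad_(True)
    opt = torch.optim.Adam([future], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        recon = interpolate(prev_frame, future)        # should match curr_frame
        loss = torch.mean((recon - curr_frame) ** 2)
        loss.backward()
        opt.step()
    return future.detach()

# Toy usage on random 3x64x64 "frames".
prev_f = torch.rand(3, 64, 64)
curr_f = torch.rand(3, 64, 64)
next_f = predict_next_frame(prev_f, curr_f)
print(next_f.shape)
```

With the averaging stand-in the optimum is simply the linear extrapolation 2*curr - prev; the paper's premise is that replacing it with a strong interpolation network pushes the optimized frame toward a photo-realistic prediction, with no training dataset required.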