Video Extrapolation in Space and Time
- URL: http://arxiv.org/abs/2205.02084v2
- Date: Thu, 5 May 2022 08:13:10 GMT
- Title: Video Extrapolation in Space and Time
- Authors: Yunzhi Zhang and Jiajun Wu
- Abstract summary: We study the problem of Video Extrapolation in Space and Time (VEST).
We propose a model that leverages the self-supervision and the complementary cues from both tasks.
Our method achieves performance better than or comparable to several state-of-the-art NVS and VP methods.
- Score: 10.755019286246979
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Novel view synthesis (NVS) and video prediction (VP) are typically considered
disjoint tasks in computer vision. However, they can both be seen as ways to
observe the spatial-temporal world: NVS aims to synthesize a scene from a new
point of view, while VP aims to see a scene from a new point of time. These two
tasks provide complementary signals to obtain a scene representation, as
viewpoint changes from spatial observations inform depth, and temporal
observations inform the motion of cameras and individual objects. Inspired by
these observations, we propose to study the problem of Video Extrapolation in
Space and Time (VEST). We propose a model that leverages the self-supervision
and the complementary cues from both tasks, while existing methods can only
solve one of them. Experiments show that our method achieves performance better
than or comparable to several state-of-the-art NVS and VP methods on indoor and
outdoor real-world datasets.
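To make the complementary supervision described above concrete, here is a minimal, hypothetical PyTorch sketch of how a single training objective could sum a photometric novel-view-synthesis loss and a video-prediction loss over a shared scene representation. The model, heads, and function names are illustrative assumptions rather than the authors' implementation, and the toy network omits camera-pose and time conditioning for brevity.

```python
# Minimal sketch (not the paper's code): one shared encoder, two task heads,
# and a joint self-supervised photometric (L1) loss for NVS and VP.
import torch
import torch.nn as nn


class VESTModel(nn.Module):
    """Toy stand-in for a model with a shared scene representation."""

    def __init__(self, channels: int = 3):
        super().__init__()
        # Shared encoder feeding two lightweight heads, one per task.
        self.encoder = nn.Conv2d(channels, 16, kernel_size=3, padding=1)
        self.nvs_head = nn.Conv2d(16, channels, kernel_size=3, padding=1)
        self.vp_head = nn.Conv2d(16, channels, kernel_size=3, padding=1)

    def forward(self, frame: torch.Tensor):
        feat = torch.relu(self.encoder(frame))
        # NVS head: synthesize the scene from a new viewpoint.
        novel_view = self.nvs_head(feat)
        # VP head: extrapolate the scene to the next time step.
        next_frame = self.vp_head(feat)
        return novel_view, next_frame


def joint_self_supervised_loss(model, frame, target_view, target_next):
    """Sum of photometric (L1) reconstruction losses for both tasks."""
    novel_view, next_frame = model(frame)
    loss_nvs = torch.mean(torch.abs(novel_view - target_view))
    loss_vp = torch.mean(torch.abs(next_frame - target_next))
    return loss_nvs + loss_vp


if __name__ == "__main__":
    model = VESTModel()
    frame = torch.rand(1, 3, 64, 64)        # current frame
    target_view = torch.rand(1, 3, 64, 64)  # frame from another camera
    target_next = torch.rand(1, 3, 64, 64)  # frame at the next timestamp
    loss = joint_self_supervised_loss(model, frame, target_view, target_next)
    loss.backward()
    print(float(loss))
```

In this sketch the two losses share gradients through the encoder, which is one simple way the depth cues from viewpoint changes and the motion cues from temporal changes could reinforce each other during training.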
Related papers
- Spatial Traces: Enhancing VLA Models with Spatial-Temporal Understanding [44.99833362998488]
We introduce a method that projects visual traces of key points from observations onto depth maps, enabling models to capture both spatial and temporal information simultaneously. Experiments in SimplerEnv show that the mean number of tasks successfully solved increased by 4% compared to SpatialVLA and by 19% compared to TraceVLA.
arXiv Detail & Related papers (2025-08-12T15:53:45Z) - CoSpace: Benchmarking Continuous Space Perception Ability for Vision-Language Models [12.150101028377565]
We present CoSpace, a benchmark designed to assess the Continuous Space perception ability of Vision-Language Models (VLMs).
Results reveal that there exist pitfalls in the continuous space perception ability of most evaluated models, including proprietary ones.
arXiv Detail & Related papers (2025-03-18T11:31:58Z) - Scene Summarization: Clustering Scene Videos into Spatially Diverse Frames [23.229623379422303]
Scene summarization is the task of condensing long, continuous scene videos into a compact set of spatially diverse frames that facilitate global spatial reasoning. We propose SceneSum, a two-stage self-supervised pipeline that first clusters video frames using visual place recognition to promote spatial diversity, then selects representatives from each cluster under resource constraints. Experiments on real and simulated indoor datasets show that SceneSum produces more spatially informative summaries and outperforms existing video summarization baselines.
arXiv Detail & Related papers (2023-11-28T22:18:26Z) - Dense Video Object Captioning from Disjoint Supervision [77.47084982558101]
We propose a new task and model for dense video object captioning.
This task unifies spatial and temporal localization in video.
We show how our model improves upon a number of strong baselines for this new task.
arXiv Detail & Related papers (2023-06-20T17:57:23Z) - Learning Space-Time Semantic Correspondences [68.06065984976365]
Given a source video, a target video, and a set of space-time key-points in the source video, the task requires predicting a set of keypoints in the target video.
We believe that this task is important for fine-grained video understanding, potentially enabling applications such as activity coaching, sports analysis, robot imitation learning, and more.
arXiv Detail & Related papers (2023-06-16T23:15:12Z) - Learning Fine-grained View-Invariant Representations from Unpaired
Ego-Exo Videos via Temporal Alignment [71.16699226211504]
We propose to learn fine-grained action features that are invariant to the viewpoints by aligning egocentric and exocentric videos in time.
To this end, we propose AE2, a self-supervised embedding approach with two key designs.
For evaluation, we establish a benchmark for fine-grained video understanding in the ego-exo context.
arXiv Detail & Related papers (2023-06-08T19:54:08Z) - Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding [112.3913646778859]
We propose a simple yet effective video-language modeling framework, S-ViLM.
It includes two novel designs, inter-clip spatial grounding and intra-clip temporal grouping, to promote learning region-object alignment and temporal-aware features.
S-ViLM surpasses the state-of-the-art methods substantially on four representative downstream tasks.
arXiv Detail & Related papers (2023-03-28T22:45:07Z) - TBP-Former: Learning Temporal Bird's-Eye-View Pyramid for Joint
Perception and Prediction in Vision-Centric Autonomous Driving [45.785865869298576]
Vision-centric joint perception and prediction has become an emerging trend in autonomous driving research.
It predicts the future states of the participants in the surrounding environment from raw RGB images.
It is still a critical challenge to synchronize features obtained at multiple camera views and timestamps.
arXiv Detail & Related papers (2023-03-17T14:20:28Z) - STAU: A SpatioTemporal-Aware Unit for Video Prediction and Beyond [78.129039340528]
We propose a spatiotemporal-aware unit (STAU) for video prediction and beyond.
Our STAU can outperform other methods on all tasks in terms of performance and efficiency.
arXiv Detail & Related papers (2022-04-20T13:42:51Z) - Space-time Neural Irradiance Fields for Free-Viewpoint Video [54.436478702701244]
We present a method that learns a neural irradiance field for dynamic scenes from a single video.
Our learned representation enables free-view rendering of the input video.
arXiv Detail & Related papers (2020-11-25T18:59:28Z) - VPN: Learning Video-Pose Embedding for Activities of Daily Living [6.719751155411075]
Recent 3D ConvNets are too rigid to capture subtle visual patterns across an action.
We propose a novel Video-Pose Network: VPN.
Experiments show that VPN outperforms the state-of-the-art results for action classification on a large scale human activity dataset.
arXiv Detail & Related papers (2020-07-06T20:39:08Z)