SIAM: A Simple Alternating Mixer for Video Prediction
- URL: http://arxiv.org/abs/2311.11683v2
- Date: Mon, 20 May 2024 16:46:02 GMT
- Title: SIAM: A Simple Alternating Mixer for Video Prediction
- Authors: Xin Zheng, Ziang Peng, Yuan Cao, Hongming Shan, Junping Zhang
- Abstract summary: Video prediction, predicting future frames from the previous ones, has broad applications such as autonomous driving and weather forecasting.
We explicitly model spatial, temporal, and spatiotemporal features in a unified encoder-decoder framework and propose a novel simple alternating Mixer (SIAM).
The novelty of SIAM lies in the design of dimension alternating mixing (DaMi) blocks, which can model spatial, temporal, and spatiotemporal features by alternating the dimensions of the feature maps.
- Score: 42.03590872477933
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video prediction, predicting future frames from the previous ones, has broad applications such as autonomous driving and weather forecasting. Existing state-of-the-art methods typically focus on extracting either spatial, temporal, or spatiotemporal features from videos. Different feature focuses, resulting from different network architectures, may make the resultant models excel at some video prediction tasks but perform poorly on others. Towards a more generic video prediction solution, we explicitly model these features in a unified encoder-decoder framework and propose a novel simple alternating Mixer (SIAM). The novelty of SIAM lies in the design of dimension alternating mixing (DaMi) blocks, which can model spatial, temporal, and spatiotemporal features through alternating the dimensions of the feature maps. Extensive experimental results demonstrate the superior performance of the proposed SIAM on four benchmark video datasets covering both synthetic and real-world scenarios.
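To make the dimension-alternating idea concrete, below is a minimal sketch of a DaMi-style block in PyTorch, assuming (B, T, C, H, W) feature maps. The specific mixers (a 2D convolution over H and W, an MLP over T, and a 3D convolution over T, H, W) are illustrative stand-ins, not the paper's actual layer design.
```python
import torch
import torch.nn as nn

class DaMiSketch(nn.Module):
    """Illustrative dimension-alternating mixing block (not the paper's exact design)."""

    def __init__(self, channels: int, seq_len: int, hidden: int = 64):
        super().__init__()
        self.spatial_mix = nn.Conv2d(channels, channels, 3, padding=1)  # mixes H, W
        self.temporal_mix = nn.Sequential(                              # mixes T
            nn.Linear(seq_len, hidden), nn.GELU(), nn.Linear(hidden, seq_len)
        )
        self.st_mix = nn.Conv3d(channels, channels, 3, padding=1)       # mixes T, H, W jointly

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, c, h, w = x.shape
        # Spatial mixing: fold time into the batch dimension.
        y = self.spatial_mix(x.reshape(b * t, c, h, w)).reshape(b, t, c, h, w)
        # Temporal mixing: rotate T to the last axis and apply an MLP over it.
        y = y.permute(0, 2, 3, 4, 1)                     # (B, C, H, W, T)
        y = self.temporal_mix(y).permute(0, 4, 1, 2, 3)  # back to (B, T, C, H, W)
        # Spatiotemporal mixing: treat (T, H, W) as one 3-D volume.
        y = self.st_mix(y.permute(0, 2, 1, 3, 4)).permute(0, 2, 1, 3, 4)
        return x + y  # residual connection

x = torch.randn(2, 10, 8, 16, 16)  # (B, T, C, H, W)
print(DaMiSketch(channels=8, seq_len=10)(x).shape)  # torch.Size([2, 10, 8, 16, 16])
```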
Related papers
- SIGMA: Sinkhorn-Guided Masked Video Modeling [69.31715194419091]
Sinkhorn-guided Masked Video Modeling (SIGMA) is a novel video pretraining method.
We distribute features of space-time tubes evenly across a limited number of learnable clusters.
Experimental results on ten datasets validate the effectiveness of SIGMA in learning more performant, temporally-aware, and robust video representations.
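The even distribution of features across clusters can be sketched with standard Sinkhorn-Knopp normalization, the general technique the name SIGMA points to; the score matrix shape, iteration count, and temperature below are assumptions, not the paper's settings.
```python
import torch

def sinkhorn(scores: torch.Tensor, n_iters: int = 3, eps: float = 0.05) -> torch.Tensor:
    # scores: (N, K) similarities between N space-time tube features and
    # K learnable cluster centers; eps is a softmax temperature (assumed value).
    q = torch.exp(scores / eps)  # positive assignment matrix
    q = q / q.sum()              # normalize total mass to 1
    n, k = q.shape
    for _ in range(n_iters):
        q = q / q.sum(dim=0, keepdim=True) / k  # each cluster receives mass 1/K
        q = q / q.sum(dim=1, keepdim=True) / n  # each feature distributes mass 1/N
    return q * n  # rows now sum to 1: a per-feature assignment distribution

features_to_clusters = torch.randn(128, 16)  # 128 tube features, 16 clusters
assignments = sinkhorn(features_to_clusters)
print(assignments.sum(dim=1)[:3])            # ~1.0 per feature
```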
arXiv Detail & Related papers (2024-07-22T08:04:09Z) - Predicting Long-horizon Futures by Conditioning on Geometry and Time [49.86180975196375]
We explore the task of generating future sensor observations conditioned on the past.
We leverage the large-scale pretraining of image diffusion models which can handle multi-modality.
We create a benchmark for video prediction on a diverse set of videos spanning indoor and outdoor scenes.
arXiv Detail & Related papers (2024-04-17T16:56:31Z) - MIMO Is All You Need: A Strong Multi-In-Multi-Out Baseline for Video Prediction [46.687394176382746]
Existing approaches for video prediction build their models on a Single-In-Single-Out (SISO) architecture, which predicts future frames recursively one at a time.
Alternatively, a Multi-In-Multi-Out (MIMO) architecture that outputs all the future frames in one shot naturally breaks this recursion.
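A toy contrast between the two rollout schemes, with hypothetical convolutional predictors standing in for a real backbone: the SISO loop feeds its own outputs back as inputs, while the MIMO model emits every future frame in a single forward pass.
```python
import torch
import torch.nn as nn

T_IN, T_OUT, C, H, W = 4, 4, 1, 32, 32

# Toy SISO step model: maps T_IN past frames (stacked on the channel axis) to 1 next frame.
siso_step = nn.Conv2d(T_IN * C, C, 3, padding=1)

def siso_rollout(past: torch.Tensor) -> torch.Tensor:
    """Recursive SISO prediction: each new frame re-enters the context, so errors accumulate."""
    frames = list(past.unbind(dim=1))           # T_IN tensors of shape (B, C, H, W)
    for _ in range(T_OUT):
        ctx = torch.cat(frames[-T_IN:], dim=1)  # (B, T_IN*C, H, W)
        frames.append(siso_step(ctx))           # prediction feeds back as input
    return torch.stack(frames[T_IN:], dim=1)    # (B, T_OUT, C, H, W)

# Toy MIMO model: maps all past frames to all future frames at once,
# so no prediction is ever conditioned on another prediction.
mimo_model = nn.Conv2d(T_IN * C, T_OUT * C, 3, padding=1)

def mimo_rollout(past: torch.Tensor) -> torch.Tensor:
    b = past.shape[0]
    out = mimo_model(past.reshape(b, T_IN * C, H, W))
    return out.reshape(b, T_OUT, C, H, W)

past = torch.randn(2, T_IN, C, H, W)
print(siso_rollout(past).shape, mimo_rollout(past).shape)
```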
arXiv Detail & Related papers (2022-12-09T03:57:13Z) - Learning Fine-Grained Visual Understanding for Video Question Answering via Decoupling Spatial-Temporal Modeling [28.530765643908083]
We decouple spatial-temporal modeling and integrate an image- and a video-language model to learn fine-grained visual understanding.
We propose a novel pre-training objective, Temporal Referring Modeling, which requires the model to identify temporal positions of events in video sequences.
Our model outperforms previous work pre-trained on orders of magnitude larger datasets.
arXiv Detail & Related papers (2022-10-08T07:03:31Z) - Flexible Diffusion Modeling of Long Videos [15.220686350342385]
We introduce a generative model that can, at test time, sample any subset of video frames conditioned on any other subset.
We demonstrate improved video modeling over prior work on a number of datasets and sample temporally coherent videos over 25 minutes in length.
We additionally release a new video modeling dataset and semantically meaningful metrics based on videos generated in the CARLA self-driving car simulator.
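The "any subset conditioned on any other subset" interface can be illustrated with a simple frame partition; the helper below is hypothetical and only shows how a boolean mask would separate conditioning frames from frames left for the sampler to generate.
```python
import torch

def split_frames(video: torch.Tensor, observed_idx: list[int]):
    # video: (B, T, C, H, W); observed_idx marks the conditioning frames.
    t = video.shape[1]
    mask = torch.zeros(t, dtype=torch.bool)
    mask[observed_idx] = True
    observed = video[:, mask]    # frames the model conditions on
    to_sample = video[:, ~mask]  # frames the sampler would generate
    return observed, to_sample, mask

clip = torch.randn(1, 8, 3, 64, 64)
obs, gen, mask = split_frames(clip, observed_idx=[0, 3, 7])
print(obs.shape, gen.shape)  # (1, 3, 3, 64, 64) (1, 5, 3, 64, 64)
```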
arXiv Detail & Related papers (2022-05-23T17:51:48Z) - STAU: A SpatioTemporal-Aware Unit for Video Prediction and Beyond [78.129039340528]
We propose a SpatioTemporal-Aware Unit (STAU) for video prediction and beyond.
Our STAU can outperform other methods on all tasks in terms of performance and efficiency.
arXiv Detail & Related papers (2022-04-20T13:42:51Z) - STRPM: A Spatiotemporal Residual Predictive Model for High-Resolution Video Prediction [78.129039340528]
We propose a Spatiotemporal Residual Predictive Model (STRPM) for high-resolution video prediction.
Experimental results show that STRPM can generate more satisfactory results compared with various existing methods.
arXiv Detail & Related papers (2022-03-30T06:24:00Z) - Video Prediction at Multiple Scales with Hierarchical Recurrent Networks [24.536256844130996]
We propose MSPred, a novel video prediction model able to forecast future possible outcomes at different levels of granularity simultaneously.
By combining spatial and temporal downsampling, MSPred is able to efficiently predict abstract representations over long time horizons.
In our experiments, we demonstrate that our proposed model accurately predicts future video frames, as well as higher-level representations, in various scenarios.
arXiv Detail & Related papers (2022-03-17T13:08:28Z) - Spatio-Temporal Self-Attention Network for Video Saliency Prediction [13.873682190242365]
3D convolutional neural networks have achieved promising results for video tasks in computer vision.
We propose a novel Spatio-Temporal Self-Attention 3D Network (STSANet) for video saliency prediction.
arXiv Detail & Related papers (2021-08-24T12:52:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.