Pair-wise Layer Attention with Spatial Masking for Video Prediction
- URL: http://arxiv.org/abs/2311.11289v1
- Date: Sun, 19 Nov 2023 10:29:05 GMT
- Title: Pair-wise Layer Attention with Spatial Masking for Video Prediction
- Authors: Ping Li, Chenhan Zhang, Zheng Yang, Xianghua Xu, Mingli Song
- Abstract summary: We develop a Pair-wise Layer Attention (PLA) module to enhance the layer-wise semantic dependency of the feature maps.
We also present a Pair-wise Layer Attention with Spatial Masking (PLA-SM) framework for video prediction.
- Score: 46.17429511620538
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Video prediction yields future frames by employing the historical frames and
has exhibited its great potential in many applications, e.g., meteorological
prediction, and autonomous driving. Previous works often decode the ultimate
high-level semantic features to future frames without texture details, which
deteriorates the prediction quality. Motivated by this, we develop a Pair-wise
Layer Attention (PLA) module to enhance the layer-wise semantic dependency of
the feature maps derived from the U-shape structure in the Translator, by coupling
low-level visual cues and high-level features. Hence, the texture details of
predicted frames are enriched. Moreover, most existing methods capture the
spatiotemporal dynamics by the Translator, but fail to sufficiently utilize the
spatial features of the Encoder. This inspires us to design a Spatial Masking (SM)
module to mask part of the encoding features during pretraining, which increases
the visibility of the remaining feature pixels to the Decoder. To this end, we present a
Pair-wise Layer Attention with Spatial Masking (PLA-SM) framework for video
prediction to capture the spatiotemporal dynamics, which reflect the motion
trend. Extensive experiments and rigorous ablation studies on five benchmarks
demonstrate the advantages of the proposed approach. The code is available at
GitHub.
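The abstract describes two mechanisms: pair-wise layer attention that couples low-level cues with high-level features inside the U-shaped Translator, and spatial masking of encoder features during pretraining. The sketch below is a rough illustration of both ideas, not the authors' released code; the module names, the query/key assignment, and the masking ratio are all assumptions.

```python
# Hedged sketch of pair-wise layer attention between a low-level and a
# high-level feature map, plus random spatial masking of encoder features.
# All names (PairwiseLayerAttention, spatial_mask, ratio) are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PairwiseLayerAttention(nn.Module):
    """Couples a low-level map (keys/values) with a high-level map (queries)."""
    def __init__(self, channels, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, high, low):
        # high, low: (B, C, H, W); low-level cues are resized to the high-level grid.
        B, C, H, W = high.shape
        low = F.interpolate(low, size=(H, W), mode="bilinear", align_corners=False)
        q = high.flatten(2).transpose(1, 2)   # (B, HW, C) high-level queries
        kv = low.flatten(2).transpose(1, 2)   # (B, HW, C) low-level keys/values
        out, _ = self.attn(q, kv, kv)         # attend high-level queries to low-level cues
        return high + out.transpose(1, 2).reshape(B, C, H, W)

def spatial_mask(feat, ratio=0.5):
    """Randomly zero a fraction of spatial positions of an encoder feature map."""
    B, C, H, W = feat.shape
    keep = (torch.rand(B, 1, H, W, device=feat.device) > ratio).float()
    return feat * keep

if __name__ == "__main__":
    pla = PairwiseLayerAttention(channels=64)
    high = torch.randn(2, 64, 16, 16)   # deeper-stage features
    low = torch.randn(2, 64, 32, 32)    # shallower-stage features with texture detail
    fused = pla(high, low)
    masked = spatial_mask(fused, ratio=0.5)   # masking would be applied only during pretraining
    print(fused.shape, masked.shape)
```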
Related papers
- SIGMA: Sinkhorn-Guided Masked Video Modeling [69.31715194419091]
Sinkhorn-guided Masked Video Modelling (SIGMA) is a novel video pretraining method.
We distribute features of space-time tubes evenly across a limited number of learnable clusters.
Experimental results on ten datasets validate the effectiveness of SIGMA in learning more performant, temporally-aware, and robust video representations.
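The balanced assignment alluded to above is typically obtained with a few Sinkhorn-Knopp normalization steps. The snippet below is a generic illustration of that step; the shapes, iteration count, and temperature are assumptions, not taken from SIGMA.

```python
# Illustrative Sinkhorn-Knopp normalization for balanced cluster assignment,
# in the spirit of distributing tube features evenly across learnable clusters.
import torch
import torch.nn.functional as F

def sinkhorn(scores, n_iters=3, eps=0.05):
    """scores: (N, K) similarity between N tube features and K cluster prototypes."""
    Q = torch.exp(scores / eps)
    Q = Q / Q.sum()                              # start from a joint distribution
    N, K = Q.shape
    for _ in range(n_iters):
        Q = Q / Q.sum(dim=0, keepdim=True) / K   # columns sum to 1/K -> uniform cluster usage
        Q = Q / Q.sum(dim=1, keepdim=True) / N   # rows sum to 1/N
    return Q * N                                 # each row is a soft assignment over K clusters

features = F.normalize(torch.randn(128, 256), dim=1)     # space-time tube features (assumed shape)
prototypes = F.normalize(torch.randn(32, 256), dim=1)    # learnable cluster prototypes
assignments = sinkhorn(features @ prototypes.T)
print(assignments.shape, assignments.sum(dim=1)[:3])     # rows sum to ~1
```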
arXiv Detail & Related papers (2024-07-22T08:04:09Z)
- Hierarchical Graph Pattern Understanding for Zero-Shot VOS [102.21052200245457]
This paper proposes a new hierarchical graph neural network (GNN) architecture, termed Hierarchical Graph Pattern Understanding (HGPU), for zero-shot video object segmentation (ZS-VOS).
Inspired by the strong ability of GNNs in capturing structural relations, HGPU innovatively leverages motion cues (i.e., optical flow) to enhance the high-order representations from the neighbors of target frames.
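As a rough illustration of combining per-frame appearance with motion cues in a graph over neighboring frames, one message-passing step might look like the toy layer below; this is not the HGPU architecture, and the names and uniform adjacency are made up for the example.

```python
# Toy sketch: appearance features of frames are enhanced with flow features
# of their neighbors through one graph message-passing step.
import torch
import torch.nn as nn

class FrameGraphLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.msg = nn.Linear(2 * dim, dim)   # combines neighbor appearance + flow cues

    def forward(self, app_feats, flow_feats, adj):
        # app_feats, flow_feats: (T, D) per-frame features; adj: (T, T) neighbor weights
        messages = self.msg(torch.cat([app_feats, flow_feats], dim=-1))  # (T, D)
        aggregated = adj @ messages                                      # neighborhood aggregation
        return torch.relu(app_feats + aggregated)

T, D = 5, 64
layer = FrameGraphLayer(D)
adj = torch.ones(T, T) / T                     # fully connected, uniform weights (assumption)
out = layer(torch.randn(T, D), torch.randn(T, D), adj)
print(out.shape)   # (5, 64)
```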
arXiv Detail & Related papers (2023-12-15T04:13:21Z)
- You Can Ground Earlier than See: An Effective and Efficient Pipeline for Temporal Sentence Grounding in Compressed Videos [56.676761067861236]
Given an untrimmed video, temporal sentence grounding aims to locate a target moment semantically according to a sentence query.
Previous works have achieved decent success, but they focus only on high-level visual features extracted from decoded frames.
We propose a new setting, compressed-domain TSG, which directly utilizes compressed videos rather than fully-decompressed frames as the visual input.
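A schematic of what compressed-domain input could look like for such a model: one decoded I-frame per GOP plus motion vectors and residuals for the P-frames, pooled into a single feature vector. The field names and shapes are illustrative assumptions, not the paper's pipeline.

```python
# Schematic compressed-domain clip representation: a GOP is one decoded
# I-frame plus motion vectors and residuals, so full decoding is avoided.
from dataclasses import dataclass
import numpy as np

@dataclass
class CompressedGOP:
    i_frame: np.ndarray         # (H, W, 3) decoded intra frame
    motion_vectors: np.ndarray  # (P, H // 16, W // 16, 2) block motion per P-frame
    residuals: np.ndarray       # (P, H, W, 3) coarse residual signal per P-frame

def gop_features(gop: CompressedGOP) -> np.ndarray:
    """Toy fusion: pool each stream into one vector for a grounding model."""
    return np.concatenate([
        gop.i_frame.mean(axis=(0, 1)),
        gop.motion_vectors.mean(axis=(0, 1, 2)),
        gop.residuals.mean(axis=(0, 1, 2)),
    ])

gop = CompressedGOP(
    i_frame=np.zeros((240, 320, 3)),
    motion_vectors=np.zeros((11, 15, 20, 2)),
    residuals=np.zeros((11, 240, 320, 3)),
)
print(gop_features(gop).shape)   # (8,)
```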
arXiv Detail & Related papers (2023-03-14T12:53:27Z)
- STDepthFormer: Predicting Spatio-temporal Depth from Video with a Self-supervised Transformer Model [0.0]
A self-supervised model that simultaneously predicts a sequence of future frames from video input with a spatio-temporal attention network is proposed.
The proposed model leverages prior scene knowledge such as object shape and texture similar to single-image depth inference methods.
It is implicitly capable of forecasting the motion of objects in the scene, rather than requiring complex models involving multi-object detection, segmentation and tracking.
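A minimal sketch of joint spatio-temporal attention, in which tokens from all frames of a short clip attend to each other in a single sequence; the class and shapes below are assumptions for illustration, not the STDepthFormer model.

```python
# Minimal spatio-temporal attention: frames are flattened to tokens and
# self-attention mixes information across both space and time.
import torch
import torch.nn as nn

class SpatioTemporalAttention(nn.Module):
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, clip_tokens):
        # clip_tokens: (B, T * H * W, D) -- all frames flattened into one token sequence
        attended, _ = self.attn(clip_tokens, clip_tokens, clip_tokens)
        return self.norm(clip_tokens + attended)

B, T, H, W, D = 2, 4, 8, 8, 64
tokens = torch.randn(B, T * H * W, D)
out = SpatioTemporalAttention(D)(tokens)
print(out.shape)   # (2, 256, 64)
```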
arXiv Detail & Related papers (2023-03-02T12:22:51Z)
- A new way of video compression via forward-referencing using deep learning [0.0]
This paper explores a new way of video coding by modelling human pose from the already-encoded frames.
It is expected that the proposed approach can overcome the limitations of traditional backward-referencing frames.
Experimental results show that the proposed approach can achieve on average up to 2.83 dB PSNR gain and 25.93% residual savings for high motion video sequences.
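As a toy illustration of the forward-referencing idea, pose keypoints can be extrapolated from already-encoded frames and then handed to a pose-conditioned generator; everything below (joint count, linear extrapolation, variable names) is a simplifying assumption rather than the paper's method.

```python
# Toy forward-referencing sketch: extrapolate human pose keypoints from the
# two most recently encoded frames; a hypothetical pose-conditioned generator
# would then synthesize the forward reference frame.
import numpy as np

def extrapolate_pose(prev_pose: np.ndarray, curr_pose: np.ndarray) -> np.ndarray:
    """Linearly extrapolate (J, 2) joint coordinates one frame ahead."""
    return curr_pose + (curr_pose - prev_pose)

prev_pose = np.random.rand(17, 2)                # 17 COCO-style joints (assumed)
curr_pose = prev_pose + np.array([0.01, 0.0])    # small rightward motion
pred_pose = extrapolate_pose(prev_pose, curr_pose)
# pred_pose would feed a generator that renders the forward reference frame
# used alongside the usual backward references when encoding the next frame.
print(pred_pose.shape)   # (17, 2)
```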
arXiv Detail & Related papers (2022-08-13T16:19:11Z)
- SatMAE: Pre-training Transformers for Temporal and Multi-Spectral Satellite Imagery [74.82821342249039]
We present SatMAE, a pre-training framework for temporal or multi-spectral satellite imagery based on the Masked Autoencoder (MAE).
To leverage temporal information, we include a temporal embedding along with independently masking image patches across time.
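The two ingredients mentioned above, a temporal embedding and masking patches independently at each timestep, could be sketched as follows; the shapes, mask ratio, and variable names are assumptions, not the SatMAE code.

```python
# Sketch: add a per-timestep temporal embedding to patch tokens, then draw a
# separate random keep-set for every timestep instead of one shared mask.
import torch
import torch.nn as nn

T, N, D = 3, 196, 128                 # timesteps, patches per image, embed dim (assumed)
mask_ratio = 0.75

tokens = torch.randn(1, T, N, D)                        # patch embeddings per timestep
temporal_embed = nn.Parameter(torch.zeros(1, T, 1, D))  # learned temporal embedding
tokens = tokens + temporal_embed                        # broadcast over patches

num_keep = int(N * (1 - mask_ratio))
keep_idx = torch.rand(1, T, N).argsort(dim=-1)[..., :num_keep]   # independent mask per timestep
visible = torch.gather(tokens, 2, keep_idx.unsqueeze(-1).expand(-1, -1, -1, D))
print(visible.shape)   # (1, 3, 49, 128) -- only visible patches go to the encoder
```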
arXiv Detail & Related papers (2022-07-17T01:35:29Z)
- CCVS: Context-aware Controllable Video Synthesis [95.22008742695772]
This paper introduces a self-supervised learning approach to the synthesis of new video clips from old ones.
It conditions the synthesis process on contextual information for temporal continuity and ancillary information for fine control.
arXiv Detail & Related papers (2021-07-16T17:57:44Z)
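A purely schematic sketch of conditioning a synthesis step on both a context summary (for temporal continuity) and an ancillary control signal (for fine control); the fusion layer below is a made-up stand-in, not the CCVS architecture.

```python
# Schematic conditional synthesis step: fuse the current latent with context
# features (continuity) and an ancillary control vector (fine control).
import torch
import torch.nn as nn

class ConditionalSynthesizer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.fuse = nn.Linear(3 * dim, dim)

    def forward(self, latent, context, control):
        # latent: current frame latent; context: summary of previous frames;
        # control: ancillary signal steering the outcome.
        return self.fuse(torch.cat([latent, context, control], dim=-1))

D = 32
step = ConditionalSynthesizer(D)
next_latent = step(torch.randn(1, D), torch.randn(1, D), torch.randn(1, D))
print(next_latent.shape)   # (1, 32)
```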