STAU: A SpatioTemporal-Aware Unit for Video Prediction and Beyond
- URL: http://arxiv.org/abs/2204.09456v1
- Date: Wed, 20 Apr 2022 13:42:51 GMT
- Title: STAU: A SpatioTemporal-Aware Unit for Video Prediction and Beyond
- Authors: Zheng Chang, Xinfeng Zhang, Shanshe Wang, Siwei Ma, and Wen Gao
- Abstract summary: We propose a SpatioTemporal-Aware Unit (STAU) for video prediction and beyond.
Our STAU can outperform other methods on all tasks in terms of performance and efficiency.
- Score: 78.129039340528
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video prediction aims to predict future frames by modeling the complex
spatiotemporal dynamics in videos. However, most existing methods model the
temporal information and the spatial information of videos independently and
have not fully explored the correlations between the two. In this paper, we
propose a SpatioTemporal-Aware Unit (STAU) for video prediction and beyond
that exploits the significant spatiotemporal correlations in videos. On the
one hand, motion-aware attention weights are learned from the spatial states
to help aggregate the temporal states in the temporal domain. On the other
hand, appearance-aware attention weights are learned from the temporal states
to help aggregate the spatial states in the spatial domain. In this way, the
temporal and spatial information become mutually aware in both domains, and
the spatiotemporal receptive field is greatly broadened for more reliable
spatiotemporal modeling. Experiments are conducted not only on traditional
video prediction tasks but also on tasks beyond video prediction, including
early action recognition and object detection. Experimental results show that
our STAU outperforms other methods on all tasks in terms of both performance
and computational efficiency.
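To make the mechanism concrete, here is a minimal PyTorch sketch of the cross-domain attention idea: attention weights learned from one domain's state aggregate the other domain's stack of states. This is not the authors' STAU implementation; the 1x1-convolution projections, dot-product attention form, and all tensor shapes are assumptions for illustration only.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossDomainAttention(nn.Module):
    """Aggregate a stack of states with attention weights learned from a
    guide state taken from the other domain (hypothetical illustration)."""
    def __init__(self, channels):
        super().__init__()
        self.query = nn.Conv2d(channels, channels, kernel_size=1)
        self.key = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, guide, states):
        # guide:  (B, C, H, W)    current state from the other domain
        # states: (B, T, C, H, W) stack of past states to aggregate
        B, T, C, H, W = states.shape
        q = self.query(guide).flatten(1)                              # (B, C*H*W)
        k = self.key(states.reshape(B * T, C, H, W)).reshape(B, T, -1)
        attn = F.softmax(torch.einsum('bd,btd->bt', q, k) / (C * H * W) ** 0.5, dim=1)
        return torch.einsum('bt,btchw->bchw', attn, states)           # (B, C, H, W)

# Motion-aware path:     the spatial state guides aggregation of temporal states.
# Appearance-aware path: the temporal state guides aggregation of spatial states.
unit = CrossDomainAttention(channels=64)

Under these assumptions, the same module would be instantiated twice, once per direction, which is what makes the two domains "aware" of each other.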
Related papers
- Triplet Attention Transformer for Spatiotemporal Predictive Learning [9.059462850026216]
We propose an innovative triplet attention transformer designed to capture both inter-frame dynamics and intra-frame static features.
The model incorporates the Triplet Attention Module (TAM), which replaces traditional recurrent units by exploring self-attention mechanisms in temporal, spatial, and channel dimensions.
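The summary does not give the TAM's internals, so the following PyTorch fragment only illustrates the general idea of self-attention applied along the temporal, spatial, and channel axes of a video tensor; the module names and shapes are hypothetical.

import torch
import torch.nn as nn

B, T, C, H, W = 2, 4, 8, 6, 6
x = torch.randn(B, T, C, H, W)

# Temporal attention: tokens are the T time steps, features are C*H*W.
t_attn = nn.MultiheadAttention(embed_dim=C * H * W, num_heads=1, batch_first=True)
xt = x.reshape(B, T, C * H * W)
xt, _ = t_attn(xt, xt, xt)

# Spatial attention: tokens are the H*W positions within each frame.
s_attn = nn.MultiheadAttention(embed_dim=C, num_heads=1, batch_first=True)
xs = x.permute(0, 1, 3, 4, 2).reshape(B * T, H * W, C)
xs, _ = s_attn(xs, xs, xs)

# Channel attention: tokens are the C channels, features are H*W.
c_attn = nn.MultiheadAttention(embed_dim=H * W, num_heads=1, batch_first=True)
xc = x.reshape(B * T, C, H * W)
xc, _ = c_attn(xc, xc, xc)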
arXiv Detail & Related papers (2023-10-28T12:49:33Z) - On the Importance of Spatial Relations for Few-shot Action Recognition [109.2312001355221]
In this paper, we investigate the importance of spatial relations and propose a more accurate few-shot action recognition method.
A novel Spatial Alignment Cross Transformer (SA-CT) learns to re-adjust the spatial relations and incorporates the temporal information.
Experiments reveal that, even without using any temporal information, the performance of SA-CT is comparable to temporal-based methods on three of the four benchmarks.
arXiv Detail & Related papers (2023-08-14T12:58:02Z) - Spatio-Temporal Branching for Motion Prediction using Motion Increments [55.68088298632865]
Human motion prediction (HMP) has emerged as a popular research topic due to its diverse applications.
Traditional methods rely on hand-crafted features and machine learning techniques.
We propose a novel spatio-temporal branching network using incremental information for HMP.
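As a hedged illustration of predicting with motion increments (the branching architecture itself is not specified in the summary), a model can regress the next frame-to-frame increment and add it to the last observed pose; the IncrementPredictor below is hypothetical.

import torch
import torch.nn as nn

class IncrementPredictor(nn.Module):
    def __init__(self, joints_dim, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(joints_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, joints_dim)

    def forward(self, poses):
        # poses: (B, T, J) observed pose sequence
        increments = poses[:, 1:] - poses[:, :-1]   # frame-to-frame motion
        h, _ = self.rnn(increments)
        delta = self.head(h[:, -1])                 # predicted next increment
        return poses[:, -1] + delta                 # next pose = last + delta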
arXiv Detail & Related papers (2023-08-02T12:04:28Z) - TempSAL -- Uncovering Temporal Information for Deep Saliency Prediction [64.63645677568384]
We introduce a novel saliency prediction model that learns to output saliency maps in sequential time intervals.
Our approach locally modulates the saliency predictions by combining the learned temporal maps.
Our code will be publicly available on GitHub.
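The exact modulation scheme is not given in the summary; the sketch below merely shows one plausible way to combine per-interval saliency maps using pixelwise learned weights. The TemporalMapCombiner name and its 1x1-convolution weighting are hypothetical.

import torch
import torch.nn as nn

class TemporalMapCombiner(nn.Module):
    def __init__(self, num_intervals):
        super().__init__()
        # 1x1 conv predicts a per-pixel weight for each temporal map.
        self.weights = nn.Conv2d(num_intervals, num_intervals, kernel_size=1)

    def forward(self, temporal_maps):
        # temporal_maps: (B, T, H, W), one saliency map per time interval
        w = torch.softmax(self.weights(temporal_maps), dim=1)
        return (w * temporal_maps).sum(dim=1, keepdim=True)  # (B, 1, H, W)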
arXiv Detail & Related papers (2023-01-05T22:10:16Z) - Spatio-temporal Tendency Reasoning for Human Body Pose and Shape
Estimation from Videos [10.50306784245168]
We present a spatio-temporal tendency reasoning (STR) network for recovering human body pose and shape from videos.
Our STR aims to learn accurate spatial motion sequences in an unconstrained environment.
Our STR remains competitive with the state-of-the-art on three datasets.
arXiv Detail & Related papers (2022-10-07T16:09:07Z) - ST-HOI: A Spatial-Temporal Baseline for Human-Object Interaction
Detection in Videos [91.29436920371003]
We propose a simple yet effective architecture named Spatial-Temporal HOI Detection (ST-HOI).
We use temporal information such as human and object trajectories, correctly-localized visual features, and spatial-temporal masking pose features.
We construct a new video HOI benchmark dubbed VidHOI where our proposed approach serves as a solid baseline.
arXiv Detail & Related papers (2021-05-25T07:54:35Z) - Learning Self-Similarity in Space and Time as Generalized Motion for
Action Recognition [42.175450800733785]
We propose a rich motion representation based on spatio-temporal self-similarity (STSS).
We leverage the whole volume of STSS and let our model learn to extract an effective motion representation from it.
The proposed neural block, dubbed SELFY, can be easily inserted into neural architectures and trained end-to-end without additional supervision.
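As a rough sketch of what an STSS volume looks like (the window size, cosine similarity, and zero padding here are assumptions, not SELFY's actual design), one can compute similarities between each feature vector and its space-time neighbours:

import torch
import torch.nn.functional as F

def stss(feats, radius=1):
    # feats: (B, T, C, H, W) feature maps; returns (B, T, K, H, W) where
    # K = (2*radius+1)**3 similarities to space-time neighbours.
    B, T, C, H, W = feats.shape
    f = F.normalize(feats, dim=2)
    # Zero-pad W, H, and T by `radius` (channels are left unpadded).
    padded = F.pad(f, (radius, radius, radius, radius, 0, 0, radius, radius))
    sims = []
    for dt in range(2 * radius + 1):
        for dy in range(2 * radius + 1):
            for dx in range(2 * radius + 1):
                shifted = padded[:, dt:dt + T, :, dy:dy + H, dx:dx + W]
                sims.append((f * shifted).sum(dim=2))  # cosine similarity
    return torch.stack(sims, dim=2)                    # (B, T, K, H, W)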
arXiv Detail & Related papers (2021-02-14T07:32:55Z) - A Spatial-Temporal Attentive Network with Spatial Continuity for
Trajectory Prediction [74.00750936752418]
We propose a novel model named Spatial-Temporal Attentive Network with Spatial Continuity (STAN-SC).
First, a spatial-temporal attention mechanism is presented to explore the most useful and important information.
Second, a joint feature sequence is built from the sequence and instant state information to keep the generated trajectories spatially continuous.
arXiv Detail & Related papers (2020-03-13T04:35:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.