Spatio-Temporal Self-Attention Network for Video Saliency Prediction
- URL: http://arxiv.org/abs/2108.10696v1
- Date: Tue, 24 Aug 2021 12:52:47 GMT
- Title: Spatio-Temporal Self-Attention Network for Video Saliency Prediction
- Authors: Ziqiang Wang, Zhi Liu, Gongyang Li, Tianhong Zhang, Lihua Xu, Jijun Wang
- Abstract summary: 3D convolutional neural networks have achieved promising results for video tasks in computer vision.
We propose a novel Spatio-Temporal Self-Attention 3D Network (STSANet) for video saliency prediction.
- Score: 13.873682190242365
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: 3D convolutional neural networks have achieved promising results for video
tasks in computer vision, including video saliency prediction, which is explored
in this paper. However, 3D convolution encodes visual representations only over a
fixed local space-time window determined by its kernel size, whereas human attention
is often drawn to related visual features at different times of a video. To
overcome this limitation, we propose a novel Spatio-Temporal Self-Attention 3D
Network (STSANet) for video saliency prediction, in which multiple
Spatio-Temporal Self-Attention (STSA) modules are employed at different levels
of a 3D convolutional backbone to directly capture long-range relations between
spatio-temporal features at different time steps. In addition, we propose an
Attentional Multi-Scale Fusion (AMSF) module to integrate multi-level features
with the perception of context in semantic and spatio-temporal subspaces.
Extensive experiments demonstrate the contributions of key components of our
method, and the results on DHF1K, Hollywood-2, UCF, and DIEM benchmark datasets
clearly show that the proposed model outperforms state-of-the-art models.
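The core mechanism described above, self-attention computed jointly over the spatial and temporal positions of 3D-CNN features so that every time step can attend to every other, can be illustrated with a minimal PyTorch sketch. This is an assumption-laden illustration, not the authors' STSANet code: the module and parameter names are invented, and the multi-level placement of STSA modules and the AMSF fusion are omitted.

```python
# Minimal sketch of spatio-temporal self-attention over 3D-CNN features.
# Names, shapes, and the reduced projection size are illustrative assumptions.
import torch
import torch.nn as nn


class SpatioTemporalSelfAttention(nn.Module):
    """Attends across all (time, height, width) positions of a 3D feature map."""

    def __init__(self, channels: int, reduced: int = 64):
        super().__init__()
        # 1x1x1 convolutions project features into query/key/value spaces.
        self.query = nn.Conv3d(channels, reduced, kernel_size=1)
        self.key = nn.Conv3d(channels, reduced, kernel_size=1)
        self.value = nn.Conv3d(channels, channels, kernel_size=1)
        self.scale = reduced ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T, H, W) features from one level of a 3D backbone.
        b, c, t, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)   # (B, N, reduced), N = T*H*W
        k = self.key(x).flatten(2)                      # (B, reduced, N)
        v = self.value(x).flatten(2).transpose(1, 2)    # (B, N, C)
        attn = torch.softmax(q @ k * self.scale, dim=-1)  # (B, N, N) long-range relations
        out = (attn @ v).transpose(1, 2).reshape(b, c, t, h, w)
        return x + out  # residual connection keeps the backbone features


if __name__ == "__main__":
    feats = torch.randn(2, 256, 8, 14, 14)  # hypothetical backbone output
    print(SpatioTemporalSelfAttention(256)(feats).shape)  # (2, 256, 8, 14, 14)
```

In the paper, such modules are applied at several levels of the backbone and their outputs are merged by the AMSF module; that multi-level fusion is beyond this sketch.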
Related papers
- RAVEN: Rethinking Adversarial Video Generation with Efficient Tri-plane Networks [93.18404922542702]
We present a novel video generative model designed to address long-term spatial and temporal dependencies.
Our approach incorporates a hybrid explicit-implicit tri-plane representation inspired by 3D-aware generative frameworks.
Our model synthesizes high-fidelity video clips at a resolution of $256\times256$ pixels, with durations extending to more than $5$ seconds at a frame rate of 30 fps.
arXiv Detail & Related papers (2024-01-11T16:48:44Z) - You Can Ground Earlier than See: An Effective and Efficient Pipeline for
Temporal Sentence Grounding in Compressed Videos [56.676761067861236]
Given an untrimmed video, temporal sentence grounding aims to locate a target moment semantically according to a sentence query.
Previous works have achieved decent success, but they focus only on high-level visual features extracted from decoded frames.
We propose a new setting, compressed-domain TSG, which directly utilizes compressed videos rather than fully-decompressed frames as the visual input.
arXiv Detail & Related papers (2023-03-14T12:53:27Z) - Exploring Optical-Flow-Guided Motion and Detection-Based Appearance for
Temporal Sentence Grounding [61.57847727651068]
Temporal sentence grounding aims to localize a target segment in an untrimmed video semantically according to a given sentence query.
Most previous works focus on learning frame-level features of each whole frame in the entire video, and directly match them with the textual information.
We propose a novel Motion- and Appearance-guided 3D Semantic Reasoning Network (MA3SRN), which incorporates optical-flow-guided motion-aware, detection-based appearance-aware, and 3D-aware object-level features.
arXiv Detail & Related papers (2022-03-06T13:57:09Z) - Spatial-Temporal Correlation and Topology Learning for Person
Re-Identification in Videos [78.45050529204701]
We propose a novel framework to pursue discriminative and robust representation by modeling cross-scale spatial-temporal correlation.
CTL utilizes a CNN backbone and a key-points estimator to extract semantic local features from the human body.
It explores a context-reinforced topology to construct multi-scale graphs by considering both global contextual information and physical connections of the human body.
arXiv Detail & Related papers (2021-04-15T14:32:12Z) - GTA: Global Temporal Attention for Video Action Understanding [51.476605514802806]
We introduce Global Temporal Attention (GTA), which performs global temporal attention on top of spatial attention in a decoupled manner.
Tests on 2D and 3D networks demonstrate that our approach consistently enhances temporal modeling and provides state-of-the-art performance on three video action recognition datasets.
arXiv Detail & Related papers (2020-12-15T18:58:21Z) - Multi-Temporal Convolutions for Human Action Recognition in Videos [83.43682368129072]
We present a novel multi-temporal convolution block capable of extracting features at multiple temporal resolutions (see the sketch after this list).
The proposed blocks are lightweight and can be integrated into any 3D-CNN architecture.
arXiv Detail & Related papers (2020-11-08T10:40:26Z) - Comparison of Spatiotemporal Networks for Learning Video Related Tasks [0.0]
Many methods for learning from sequences involve temporally processing 2D CNN features from the individual frames or directly utilizing 3D convolutions within high-performing 2D CNN architectures.
This work constructs an MNIST-based video dataset with parameters controlling relevant facets of common video-related tasks: classification, ordering, and speed estimation.
Models trained on this dataset are shown to differ in key ways depending on the task and their use of 2D convolutions, 3D convolutions, or convolutional LSTMs.
arXiv Detail & Related papers (2020-09-15T19:57:50Z) - A Graph Attention Spatio-temporal Convolutional Network for 3D Human
Pose Estimation in Video [7.647599484103065]
We improve the learning of constraints in the human skeleton by modeling local and global spatial information via attention mechanisms.
Our approach effectively mitigates depth ambiguity and self-occlusion, generalizes to half upper body estimation, and achieves competitive performance on 2D-to-3D video pose estimation.
arXiv Detail & Related papers (2020-03-11T14:54:40Z) - Interpreting video features: a comparison of 3D convolutional networks
and convolutional LSTM networks [1.462434043267217]
We compare how 3D convolutional networks and convolutional LSTM networks learn features across temporally dependent frames.
Our findings indicate that the 3D convolutional model concentrates on shorter events in the input sequence, and places its spatial focus on fewer, contiguous areas.
arXiv Detail & Related papers (2020-02-02T11:27:07Z)
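For the multi-temporal convolution block summarized in the Multi-Temporal Convolutions entry above, the following minimal PyTorch sketch shows the general idea: parallel 3D convolutions with different temporal kernel sizes whose outputs are fused. The kernel sizes, channel split, and fusion layer are assumptions for illustration, not the published design.

```python
# Minimal sketch of a multi-temporal convolution block: parallel 3D convolutions
# with different temporal kernel sizes. Kernel choices and fusion are assumptions.
import torch
import torch.nn as nn


class MultiTemporalConvBlock(nn.Module):
    """Extracts features at several temporal resolutions and fuses them."""

    def __init__(self, in_channels: int, out_channels: int, temporal_kernels=(1, 3, 5)):
        super().__init__()
        branch_channels = out_channels // len(temporal_kernels)
        self.branches = nn.ModuleList([
            nn.Conv3d(
                in_channels,
                branch_channels,
                kernel_size=(kt, 3, 3),
                padding=(kt // 2, 1, 1),  # keeps T, H, W unchanged
            )
            for kt in temporal_kernels
        ])
        self.fuse = nn.Conv3d(branch_channels * len(temporal_kernels), out_channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T, H, W); each branch covers a different temporal extent.
        return self.fuse(torch.cat([branch(x) for branch in self.branches], dim=1))


if __name__ == "__main__":
    clip = torch.randn(2, 64, 16, 28, 28)
    print(MultiTemporalConvBlock(64, 96)(clip).shape)  # (2, 96, 16, 28, 28)
```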