SSAN: Separable Self-Attention Network for Video Representation Learning
- URL: http://arxiv.org/abs/2105.13033v1
- Date: Thu, 27 May 2021 10:02:04 GMT
- Title: SSAN: Separable Self-Attention Network for Video Representation Learning
- Authors: Xudong Guo, Xun Guo, Yan Lu
- Abstract summary: We propose a separable self-attention (SSA) module, which models spatial and temporal correlations sequentially.
By adding the SSA module into a 2D CNN, we build an SSA network (SSAN) for video representation learning.
Our approach outperforms state-of-the-art methods on the Something-Something and Kinetics-400 datasets.
- Score: 11.542048296046524
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Self-attention has been successfully applied to video representation learning
due to the effectiveness of modeling long range dependencies. Existing
approaches build the dependencies merely by computing the pairwise correlations
along spatial and temporal dimensions simultaneously. However, spatial correlations and temporal correlations capture different kinds of contextual information: scene context and temporal reasoning, respectively. Intuitively, learning spatial
contextual information first will benefit temporal modeling. In this paper, we
propose a separable self-attention (SSA) module, which models spatial and
temporal correlations sequentially, so that spatial contexts can be efficiently
used in temporal modeling. By adding the SSA module into a 2D CNN, we build an SSA network (SSAN) for video representation learning. On the task of video action recognition, our approach outperforms state-of-the-art methods on the Something-Something and Kinetics-400 datasets. Our models often outperform counterparts while using shallower networks and fewer modalities. We further verify the semantic learning ability of our method on the visual-language task of video
retrieval, which showcases the homogeneity of video representations and text
embeddings. On MSR-VTT and Youcook2 datasets, video representations learnt by
SSA significantly improve the state-of-the-art performance.
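The sketch below illustrates the separable (spatial-then-temporal) self-attention idea described in the abstract: attention is first computed across spatial positions within each frame, and the resulting spatially contextualized features are then attended to across time. It is a minimal sketch based only on the abstract; the tensor layout, head count, pre-norm residual structure, and use of PyTorch's nn.MultiheadAttention are assumptions, not the authors' implementation.

```python
# Minimal sketch of separable (spatial-then-temporal) self-attention.
# Based only on the abstract's description; shapes and module choices are assumptions.
import torch
import torch.nn as nn


class SeparableSelfAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, N, C) video features, with N = H * W spatial positions per frame.
        b, t, n, c = x.shape

        # 1) Spatial attention: attend across positions within each frame.
        xs = x.reshape(b * t, n, c)
        qs = self.norm1(xs)
        xs = xs + self.spatial_attn(qs, qs, qs)[0]
        x = xs.reshape(b, t, n, c)

        # 2) Temporal attention: attend across frames at each spatial position,
        #    so the spatial context computed above feeds into temporal modeling.
        xt = x.permute(0, 2, 1, 3).reshape(b * n, t, c)
        qt = self.norm2(xt)
        xt = xt + self.temporal_attn(qt, qt, qt)[0]
        return xt.reshape(b, n, t, c).permute(0, 2, 1, 3)


if __name__ == "__main__":
    feats = torch.randn(2, 8, 7 * 7, 256)    # (batch, frames, positions, channels)
    out = SeparableSelfAttention(dim=256)(feats)
    print(out.shape)                          # torch.Size([2, 8, 49, 256])
```

Factorizing the attention this way also reduces cost from attending over T*N tokens jointly to two smaller attentions over N and T tokens, which is one practical motivation for separable designs.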
Related papers
- Helping Hands: An Object-Aware Ego-Centric Video Recognition Model [60.350851196619296]
We introduce an object-aware decoder for improving the performance of video representations on ego-centric videos.
We show that the model can act as a drop-in replacement for an ego-centric video model to improve performance through visual-text grounding.
arXiv Detail & Related papers (2023-08-15T17:58:11Z)
- Self-Supervised Video Representation Learning via Latent Time Navigation [12.721647696921865]
Self-supervised video representation learning aims at maximizing similarity between different temporal segments of one video.
We propose Latent Time Navigation (LTN) to capture fine-grained motions.
Our experimental analysis suggests that learning video representations by LTN consistently improves performance of action classification.
arXiv Detail & Related papers (2023-05-10T20:06:17Z)
- Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding [112.3913646778859]
We propose a simple yet effective video-language modeling framework, S-ViLM.
It includes two novel designs, inter-clip spatial grounding and intra-clip temporal grouping, to promote learning region-object alignment and temporal-aware features.
S-ViLM surpasses the state-of-the-art methods substantially on four representative downstream tasks.
arXiv Detail & Related papers (2023-03-28T22:45:07Z)
- Leaping Into Memories: Space-Time Deep Feature Synthesis [93.10032043225362]
We propose LEAPS, an architecture-independent method for synthesizing videos from internal models.
We quantitatively and qualitatively evaluate the applicability of LEAPS by inverting a range of convolutional and attention-based architectures on Kinetics-400.
arXiv Detail & Related papers (2023-03-17T12:55:22Z)
- Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring [82.84513669453744]
Image-text pretrained models, e.g., CLIP, have shown impressive general multi-modal knowledge learned from large-scale image-text data pairs.
We revisit temporal modeling in the context of image-to-video knowledge transferring.
We present a simple and effective temporal modeling mechanism that extends the CLIP model to diverse video tasks.
arXiv Detail & Related papers (2023-01-26T14:12:02Z)
- Surgical Skill Assessment via Video Semantic Aggregation [20.396898001950156]
We propose a skill assessment framework, Video Semantic Aggregation (ViSA), which discovers different semantic parts and aggregates them across spatiotemporal dimensions.
The explicit discovery of semantic parts provides an explanatory visualization that helps understand the neural network's decisions.
The experiments on two datasets show the competitiveness of ViSA compared to state-of-the-art methods.
arXiv Detail & Related papers (2022-08-04T12:24:01Z)
- Neuro-Symbolic Representations for Video Captioning: A Case for Leveraging Inductive Biases for Vision and Language [148.0843278195794]
We propose a new model architecture for learning multi-modal neuro-symbolic representations for video captioning.
Our approach uses a dictionary learning-based method of learning relations between videos and their paired text descriptions.
arXiv Detail & Related papers (2020-11-18T20:21:19Z)
- Comparison of Spatiotemporal Networks for Learning Video Related Tasks [0.0]
Many methods for learning from sequences involve temporally processing 2D CNN features from the individual frames or directly utilizing 3D convolutions within high-performing 2D CNN architectures.
This work constructs an MNIST-based video dataset with parameters controlling relevant facets of common video-related tasks: classification, ordering, and speed estimation.
Models trained on this dataset are shown to differ in key ways depending on the task and their use of 2D convolutions, 3D convolutions, or convolutional LSTMs.
arXiv Detail & Related papers (2020-09-15T19:57:50Z)
- IAUnet: Global Context-Aware Feature Learning for Person Re-Identification [106.50534744965955]
The IAU block enables features to incorporate global spatial, temporal, and channel context.
It is lightweight, end-to-end trainable, and can be easily plugged into existing CNNs to form IAUnet.
Experiments show that IAUnet performs favorably against state-of-the-art on both image and video reID tasks.
arXiv Detail & Related papers (2020-09-02T13:07:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.