Relational Self-Attention: What's Missing in Attention for Video Understanding
- URL: http://arxiv.org/abs/2111.01673v1
- Date: Tue, 2 Nov 2021 15:36:11 GMT
- Title: Relational Self-Attention: What's Missing in Attention for Video Understanding
- Authors: Manjin Kim, Heeseung Kwon, Chunyu Wang, Suha Kwak, Minsu Cho
- Abstract summary: We introduce a relational feature transform, dubbed the relational self-attention (RSA).
Our experiments and ablation studies show that the RSA network substantially outperforms convolution and self-attention counterparts.
- Score: 52.38780998425556
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Convolution has arguably been the most important feature transform for modern neural networks, leading to the advance of deep learning. The recent emergence of Transformer networks, which replace convolution layers with self-attention blocks, has revealed the limitation of stationary convolution kernels and opened the door to the era of dynamic feature transforms. The existing dynamic transforms, including self-attention, however, are all limited for video understanding, where correspondence relations in space and time, i.e., motion information, are crucial for effective representation. In this work, we introduce a relational feature transform, dubbed the relational self-attention (RSA), that leverages rich structures of spatio-temporal relations in videos by dynamically generating relational kernels and aggregating relational contexts. Our experiments and ablation studies show that the RSA network substantially outperforms convolution and self-attention counterparts, achieving the state of the art on the standard motion-centric benchmarks for video action recognition, such as Something-Something-V1 & V2, Diving48, and FineGym.
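To make the transform described above concrete, the following is a minimal, illustrative PyTorch sketch of a relational feature transform in the spirit of RSA: a kernel is generated dynamically from query-context correlations, and aggregation runs over the context augmented with its pairwise (relational) structure. The module name, projections, and shapes below are assumptions for illustration; this is not the paper's exact formulation.

import torch
import torch.nn as nn

class RelationalTransformSketch(nn.Module):
    def __init__(self, dim: int, context_size: int):
        super().__init__()
        # Static, convolution-like weights over the N context positions.
        self.basic_kernel = nn.Parameter(torch.randn(context_size))
        # Produces kernel weights from query-context correlations (relational kernel).
        self.rel_kernel = nn.Linear(context_size, context_size)
        # Embeds pairwise context-context relations back into feature space (relational context).
        self.rel_context = nn.Linear(context_size, dim)

    def forward(self, query: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # query:   (B, C)    feature at the target spatio-temporal position
        # context: (B, N, C) features of its spatio-temporal neighborhood, N == context_size
        corr = torch.einsum('bc,bnc->bn', query, context)      # query-context correlations, (B, N)
        kernel = self.basic_kernel + self.rel_kernel(corr)     # dynamic relational kernel, (B, N)
        rel = torch.einsum('bnc,bmc->bnm', context, context)   # context-context relations, (B, N, N)
        ctx = context + self.rel_context(rel)                  # relational context, (B, N, C)
        return torch.einsum('bn,bnc->bc', kernel, ctx)         # aggregated output feature, (B, C)

# Hypothetical usage: 8 queries, each with a 27-position (3x3x3) neighborhood of 64-d features.
rsa_sketch = RelationalTransformSketch(dim=64, context_size=27)
out = rsa_sketch(torch.randn(8, 64), torch.randn(8, 27, 64))
print(out.shape)  # torch.Size([8, 64])

Here the static kernel plays the role of a convolution filter, while the correlation-driven term makes the kernel content-adaptive; this is the contrast the abstract draws between stationary and dynamic feature transforms.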
Related papers
- Todyformer: Towards Holistic Dynamic Graph Transformers with Structure-Aware Tokenization [6.799413002613627]
Todyformer is a novel Transformer-based neural network tailored for dynamic graphs.
It unifies the local encoding capacity of Message-Passing Neural Networks (MPNNs) with the global encoding of Transformers.
We show that Todyformer consistently outperforms the state-of-the-art methods for downstream tasks.
arXiv Detail & Related papers (2024-02-02T23:05:30Z)
- Betrayed by Attention: A Simple yet Effective Approach for Self-supervised Video Object Segmentation [76.68301884987348]
We propose a simple yet effective approach for self-supervised video object segmentation (VOS).
Our key insight is that the inherent structural dependencies present in DINO-pretrained Transformers can be leveraged to establish robust spatio-temporal segmentation correspondences in videos.
Our method demonstrates state-of-the-art performance across multiple unsupervised VOS benchmarks and excels in complex real-world multi-object video segmentation tasks.
arXiv Detail & Related papers (2023-11-29T18:47:17Z)
- Deeply-Coupled Convolution-Transformer with Spatial-temporal Complementary Learning for Video-based Person Re-identification [91.56939957189505]
We propose a novel spatial-temporal complementary learning framework named Deeply-Coupled Convolution-Transformer (DCCT) for high-performance video-based person Re-ID.
Our framework attains better performance than most state-of-the-art methods.
arXiv Detail & Related papers (2023-04-27T12:16:44Z)
- Convolution-enhanced Evolving Attention Networks [41.684265133316096]
The Evolving Attention-enhanced Dilated Convolutional (EA-DC-) Transformer significantly outperforms state-of-the-art models.
This is the first work that explicitly models the layer-wise evolution of attention maps.
arXiv Detail & Related papers (2022-12-16T08:14:04Z)
- Time Is MattEr: Temporal Self-supervision for Video Transformers [72.42240984211283]
We design simple yet effective self-supervised tasks for video models to learn temporal dynamics better.
Our method learns the temporal order of video frames as extra self-supervision and enforces randomly shuffled frames to produce low-confidence outputs (a minimal sketch of this type of objective appears after this list).
Under various video action recognition tasks, we demonstrate the effectiveness of our method and its compatibility with state-of-the-art Video Transformers.
arXiv Detail & Related papers (2022-07-19T04:44:08Z)
- Efficient Modelling Across Time of Human Actions and Interactions [92.39082696657874]
We argue that current fixed-sized spatio-temporal kernels in 3D convolutional neural networks (CNNs) can be improved to better deal with temporal variations in the input.
We study how we can better handle variations between classes of actions by enhancing their feature differences over different layers of the architecture.
The proposed approaches are evaluated on several benchmark action recognition datasets and show competitive results.
arXiv Detail & Related papers (2021-10-05T15:39:11Z)
- Learning Asynchronous and Sparse Human-Object Interaction in Videos [56.73059840294019]
Asynchronous-Sparse Interaction Graph Networks (ASSIGN) is able to automatically detect the structure of interaction events associated with entities in a video scene.
ASSIGN is tested on human-object interaction recognition and shows superior performance in segmenting and labeling of human sub-activities and object affordances from raw videos.
arXiv Detail & Related papers (2021-03-03T23:43:55Z)
- Coarse Temporal Attention Network (CTA-Net) for Driver's Activity Recognition [14.07119502083967]
Drivers' activities are hard to distinguish since they are executed by the same subject with similar body-part movements, resulting in only subtle changes.
Our model is named Coarse Temporal Attention Network (CTA-Net), in which coarse temporal branches are introduced in a trainable glimpse.
The model then uses an innovative attention mechanism to generate high-level action specific contextual information for activity recognition.
arXiv Detail & Related papers (2021-01-17T10:15:37Z)
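The Time Is MattEr entry above describes its temporal self-supervision only at a high level. A minimal sketch of that type of objective, under assumed names and a simple order-classification formulation (not the authors' implementation), could combine a cross-entropy term for predicting the temporal ordering of the frames with a term that pushes the action classifier toward low-confidence (uniform) outputs on shuffled clips.

import torch
import torch.nn.functional as F

def temporal_selfsup_loss(order_logits, order_labels, action_logits_shuffled):
    # (a) Temporal-order prediction: classify which ordering the input frames follow.
    order_loss = F.cross_entropy(order_logits, order_labels)
    # (b) Debiasing: shuffled clips should yield low-confidence (near-uniform) action predictions.
    log_probs = F.log_softmax(action_logits_shuffled, dim=-1)
    uniform = torch.full_like(log_probs, 1.0 / log_probs.size(-1))
    debias_loss = F.kl_div(log_probs, uniform, reduction='batchmean')
    return order_loss + debias_loss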
This list is automatically generated from the titles and abstracts of the papers in this site.