Relational Self-Attention: What's Missing in Attention for Video Understanding
- URL: http://arxiv.org/abs/2111.01673v1
- Date: Tue, 2 Nov 2021 15:36:11 GMT
- Title: Relational Self-Attention: What's Missing in Attention for Video Understanding
- Authors: Manjin Kim, Heeseung Kwon, Chunyu Wang, Suha Kwak, Minsu Cho
- Abstract summary: We introduce a relational feature transform, dubbed the relational self-attention (RSA).
Our experiments and ablation studies show that the RSA network substantially outperforms convolution and self-attention counterparts.
- Score: 52.38780998425556
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Convolution has been arguably the most important feature transform for modern
neural networks, leading to the advance of deep learning. Recent emergence of
Transformer networks, which replace convolution layers with self-attention
blocks, has revealed the limitation of stationary convolution kernels and
opened the door to the era of dynamic feature transforms. The existing dynamic
transforms, including self-attention, however, are all limited for video
understanding where correspondence relations in space and time, i.e., motion
information, are crucial for effective representation. In this work, we
introduce a relational feature transform, dubbed the relational self-attention
(RSA), that leverages rich structures of spatio-temporal relations in videos by
dynamically generating relational kernels and aggregating relational contexts.
Our experiments and ablation studies show that the RSA network substantially
outperforms convolution and self-attention counterparts, achieving the state of
the art on the standard motion-centric benchmarks for video action recognition,
such as Something-Something-V1 & V2, Diving48, and FineGym.
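To make the idea concrete, below is a minimal, heavily simplified sketch of a dynamic relational transform over a local spatio-temporal window: a per-query kernel is generated on the fly and applied both to the window's features and to their pairwise correlations, which stand in for the relational context. The module name, projection layers, shapes, and aggregation here are illustrative assumptions, not the authors' exact RSA formulation.

```python
# Simplified sketch of relational self-attention over a local window.
# Illustrative only: projections, shapes, and the aggregation are assumptions,
# NOT the exact RSA module from the paper.
import torch
import torch.nn as nn


class SimplifiedRelationalAttention(nn.Module):
    def __init__(self, dim: int, window: int):
        super().__init__()
        self.window = window                      # M: positions in the local spatio-temporal window
        self.dim = dim                            # C: channel dimension
        self.to_kernel = nn.Linear(dim, window)   # query -> dynamic kernel over window positions
        self.to_rel_kernel = nn.Linear(dim, window)  # query -> dynamic kernel over pairwise relations
        self.proj_context = nn.Linear(dim, dim)   # projection of the appearance context

    def forward(self, query: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        """
        query:   (B, C)    feature at the query position
        context: (B, M, C) features in the local spatio-temporal window
        returns: (B, C)    transformed query feature
        """
        B, M, C = context.shape

        # Dynamic kernel over window positions, generated from the query feature.
        kernel = self.to_kernel(query).softmax(dim=-1)              # (B, M)

        # Relational context: pairwise correlations within the window,
        # intended to expose motion-like correspondence structure.
        rel_context = torch.bmm(context, context.transpose(1, 2))   # (B, M, M)
        rel_context = rel_context / C ** 0.5

        # Dynamic kernel over the relations, also generated from the query.
        rel_kernel = self.to_rel_kernel(query).softmax(dim=-1)      # (B, M)

        # Aggregate the appearance context with the positional kernel ...
        appearance = torch.einsum("bm,bmc->bc", kernel, self.proj_context(context))
        # ... and the relational context with the relational kernel,
        # then map the resulting relation weights back onto the window features.
        relation = torch.einsum("bm,bmn->bn", rel_kernel, rel_context)        # (B, M)
        relation = torch.einsum("bm,bmc->bc", relation.softmax(dim=-1), context)

        return appearance + relation


# Toy usage with hypothetical sizes: batch 2, a 3x3x3 window (M=27), 64 channels.
if __name__ == "__main__":
    layer = SimplifiedRelationalAttention(dim=64, window=27)
    q = torch.randn(2, 64)
    ctx = torch.randn(2, 27, 64)
    print(layer(q, ctx).shape)  # torch.Size([2, 64])
```

The point of the sketch is the contrast with a stationary convolution kernel: here the kernels depend on the query feature, and part of the aggregated signal comes from feature-to-feature correlations within the window rather than from the features alone.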
Related papers
- RepVideo: Rethinking Cross-Layer Representation for Video Generation [53.701548524818534]
We propose RepVideo, an enhanced representation framework for text-to-video diffusion models.
By accumulating features from neighboring layers to form enriched representations, this approach captures more stable semantic information.
Our experiments demonstrate that our RepVideo not only significantly enhances the ability to generate accurate spatial appearances, but also improves temporal consistency in video generation.
arXiv Detail & Related papers (2025-01-15T18:20:37Z)
- InterDyn: Controllable Interactive Dynamics with Video Diffusion Models [50.38647583839384]
We propose InterDyn, a framework that generates videos of interactive dynamics given an initial frame and a control signal encoding the motion of a driving object or actor.
Our key insight is that large video foundation models can act as both neural renderers and implicit physics simulators by learning interactive dynamics from large-scale video data.
arXiv Detail & Related papers (2024-12-16T13:57:02Z)
- Todyformer: Towards Holistic Dynamic Graph Transformers with Structure-Aware Tokenization [6.799413002613627]
Todyformer is a novel Transformer-based neural network tailored for dynamic graphs.
It unifies the local encoding capacity of Message-Passing Neural Networks (MPNNs) with the global encoding of Transformers.
We show that Todyformer consistently outperforms the state-of-the-art methods for downstream tasks.
arXiv Detail & Related papers (2024-02-02T23:05:30Z)
- Betrayed by Attention: A Simple yet Effective Approach for Self-supervised Video Object Segmentation [76.68301884987348]
We propose a simple yet effective approach for self-supervised video object segmentation (VOS).
Our key insight is that the structural dependencies inherent in DINO-pretrained Transformers can be leveraged to establish robust spatio-temporal segmentation correspondences in videos.
Our method demonstrates state-of-the-art performance across multiple unsupervised VOS benchmarks and excels in complex real-world multi-object video segmentation tasks.
arXiv Detail & Related papers (2023-11-29T18:47:17Z)
- Deeply-Coupled Convolution-Transformer with Spatial-temporal Complementary Learning for Video-based Person Re-identification [91.56939957189505]
We propose a novel spatial-temporal complementary learning framework named Deeply-Coupled Convolution-Transformer (DCCT) for high-performance video-based person Re-ID.
Our framework attains better performance than most state-of-the-art methods.
arXiv Detail & Related papers (2023-04-27T12:16:44Z)
- Convolution-enhanced Evolving Attention Networks [41.684265133316096]
Evolving Attention-enhanced Dilated Convolutional (EA-DC-) Transformer outperforms state-of-the-art models significantly.
This is the first work that explicitly models the layer-wise evolution of attention maps.
arXiv Detail & Related papers (2022-12-16T08:14:04Z)
- Learning Asynchronous and Sparse Human-Object Interaction in Videos [56.73059840294019]
Asynchronous-Sparse Interaction Graph Networks (ASSIGN) is able to automatically detect the structure of interaction events associated with entities in a video scene.
ASSIGN is tested on human-object interaction recognition and shows superior performance in segmenting and labeling human sub-activities and object affordances from raw videos.
arXiv Detail & Related papers (2021-03-03T23:43:55Z)
- Coarse Temporal Attention Network (CTA-Net) for Driver's Activity Recognition [14.07119502083967]
Distinguishing a driver's activities is challenging since they are executed by the same subject with similar body part movements, resulting in only subtle differences.
Our model is named Coarse Temporal Attention Network (CTA-Net), in which coarse temporal branches are introduced in a trainable glimpse network.
The model then uses an innovative attention mechanism to generate high-level action-specific contextual information for activity recognition.
arXiv Detail & Related papers (2021-01-17T10:15:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.