Relational Self-Attention: What's Missing in Attention for Video Understanding
- URL: http://arxiv.org/abs/2111.01673v1
- Date: Tue, 2 Nov 2021 15:36:11 GMT
- Title: Relational Self-Attention: What's Missing in Attention for Video Understanding
- Authors: Manjin Kim, Heeseung Kwon, Chunyu Wang, Suha Kwak, Minsu Cho
- Abstract summary: We introduce a relational feature transform, dubbed the relational self-attention (RSA).
Our experiments and ablation studies show that the RSA network substantially outperforms convolution and self-attention counterparts.
- Score: 52.38780998425556
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Convolution has been arguably the most important feature transform for modern
neural networks, leading to the advance of deep learning. Recent emergence of
Transformer networks, which replace convolution layers with self-attention
blocks, has revealed the limitation of stationary convolution kernels and
opened the door to the era of dynamic feature transforms. The existing dynamic
transforms, including self-attention, however, are all limited for video
understanding where correspondence relations in space and time, i.e., motion
information, are crucial for effective representation. In this work, we
introduce a relational feature transform, dubbed the relational self-attention
(RSA), that leverages rich structures of spatio-temporal relations in videos by
dynamically generating relational kernels and aggregating relational contexts.
Our experiments and ablation studies show that the RSA network substantially
outperforms convolution and self-attention counterparts, achieving the state of
the art on the standard motion-centric benchmarks for video action recognition,
such as Something-Something-V1 & V2, Diving48, and FineGym.
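To make the idea concrete, below is a minimal, heavily simplified sketch of a dynamic relational transform over a local spatio-temporal window: a per-query kernel is generated on the fly and applied both to the window's features and to their pairwise correlations, which stand in for the relational context. The module name, projection layers, shapes, and aggregation here are illustrative assumptions, not the authors' exact RSA formulation.

```python
# Simplified sketch of relational self-attention over a local window.
# Illustrative only: projections, shapes, and the aggregation are assumptions,
# NOT the exact RSA module from the paper.
import torch
import torch.nn as nn


class SimplifiedRelationalAttention(nn.Module):
    def __init__(self, dim: int, window: int):
        super().__init__()
        self.window = window                      # M: positions in the local spatio-temporal window
        self.dim = dim                            # C: channel dimension
        self.to_kernel = nn.Linear(dim, window)   # query -> dynamic kernel over window positions
        self.to_rel_kernel = nn.Linear(dim, window)  # query -> dynamic kernel over pairwise relations
        self.proj_context = nn.Linear(dim, dim)   # projection of the appearance context

    def forward(self, query: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        """
        query:   (B, C)    feature at the query position
        context: (B, M, C) features in the local spatio-temporal window
        returns: (B, C)    transformed query feature
        """
        B, M, C = context.shape

        # Dynamic kernel over window positions, generated from the query feature.
        kernel = self.to_kernel(query).softmax(dim=-1)              # (B, M)

        # Relational context: pairwise correlations within the window,
        # intended to expose motion-like correspondence structure.
        rel_context = torch.bmm(context, context.transpose(1, 2))   # (B, M, M)
        rel_context = rel_context / C ** 0.5

        # Dynamic kernel over the relations, also generated from the query.
        rel_kernel = self.to_rel_kernel(query).softmax(dim=-1)      # (B, M)

        # Aggregate the appearance context with the positional kernel ...
        appearance = torch.einsum("bm,bmc->bc", kernel, self.proj_context(context))
        # ... and the relational context with the relational kernel,
        # then map the resulting relation weights back onto the window features.
        relation = torch.einsum("bm,bmn->bn", rel_kernel, rel_context)        # (B, M)
        relation = torch.einsum("bm,bmc->bc", relation.softmax(dim=-1), context)

        return appearance + relation


# Toy usage with hypothetical sizes: batch 2, a 3x3x3 window (M=27), 64 channels.
if __name__ == "__main__":
    layer = SimplifiedRelationalAttention(dim=64, window=27)
    q = torch.randn(2, 64)
    ctx = torch.randn(2, 27, 64)
    print(layer(q, ctx).shape)  # torch.Size([2, 64])
```

The point of the sketch is the contrast with a stationary convolution kernel: here the kernels depend on the query feature, and part of the aggregated signal comes from feature-to-feature correlations within the window rather than from the features alone.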
Related papers
- RepVideo: Rethinking Cross-Layer Representation for Video Generation [53.701548524818534]
We propose RepVideo, an enhanced representation framework for text-to-video diffusion models.
By accumulating features from neighboring layers to form enriched representations, this approach captures more stable semantic information.
Our experiments demonstrate that our RepVideo not only significantly enhances the ability to generate accurate spatial appearances, but also improves temporal consistency in video generation.
arXiv Detail & Related papers (2025-01-15T18:20:37Z)
- InterDyn: Controllable Interactive Dynamics with Video Diffusion Models [50.38647583839384]
We propose InterDyn, a framework that generates videos of interactive dynamics given an initial frame and a control signal encoding the motion of a driving object or actor.
Our key insight is that large video foundation models can act as both neural renderers and implicit physics simulators by learning interactive dynamics from large-scale video data.
arXiv Detail & Related papers (2024-12-16T13:57:02Z)
- Todyformer: Towards Holistic Dynamic Graph Transformers with Structure-Aware Tokenization [6.799413002613627]
Todyformer is a novel Transformer-based neural network tailored for dynamic graphs.
It unifies the local encoding capacity of Message-Passing Neural Networks (MPNNs) with the global encoding of Transformers.
We show that Todyformer consistently outperforms the state-of-the-art methods for downstream tasks.
arXiv Detail & Related papers (2024-02-02T23:05:30Z)
- Betrayed by Attention: A Simple yet Effective Approach for Self-supervised Video Object Segmentation [76.68301884987348]
We propose a simple yet effective approach for self-supervised video object segmentation (VOS).
Our key insight is that the structural dependencies inherent in DINO-pretrained Transformers can be leveraged to establish robust spatio-temporal segmentation correspondences in videos.
Our method demonstrates state-of-the-art performance across multiple unsupervised VOS benchmarks and excels in complex real-world multi-object video segmentation tasks.
arXiv Detail & Related papers (2023-11-29T18:47:17Z)
- Deeply-Coupled Convolution-Transformer with Spatial-temporal Complementary Learning for Video-based Person Re-identification [91.56939957189505]
We propose a novel spatial-temporal complementary learning framework named Deeply-Coupled Convolution-Transformer (DCCT) for high-performance video-based person Re-ID.
Our framework attains better performance than most state-of-the-art methods.
arXiv Detail & Related papers (2023-04-27T12:16:44Z)
- Convolution-enhanced Evolving Attention Networks [41.684265133316096]
Evolving Attention-enhanced Dilated Convolutional (EA-DC-) Transformer outperforms state-of-the-art models significantly.
This is the first work that explicitly models the layer-wise evolution of attention maps.
arXiv Detail & Related papers (2022-12-16T08:14:04Z)
- Learning Asynchronous and Sparse Human-Object Interaction in Videos [56.73059840294019]
Asynchronous-Sparse Interaction Graph Networks (ASSIGN) is able to automatically detect the structure of interaction events associated with entities in a video scene.
ASSIGN is tested on human-object interaction recognition and shows superior performance in segmenting and labeling human sub-activities and object affordances from raw videos.
arXiv Detail & Related papers (2021-03-03T23:43:55Z)
- Coarse Temporal Attention Network (CTA-Net) for Driver's Activity Recognition [14.07119502083967]
Distinguishing a driver's activities is challenging since they are executed by the same subject with similar body part movements, resulting in only subtle differences.
Our model is named Coarse Temporal Attention Network (CTA-Net), in which coarse temporal branches are introduced in a trainable glimpse network.
The model then uses an innovative attention mechanism to generate high-level action-specific contextual information for activity recognition.
arXiv Detail & Related papers (2021-01-17T10:15:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.