CAST: Cross-Attention in Space and Time for Video Action Recognition
- URL: http://arxiv.org/abs/2311.18825v1
- Date: Thu, 30 Nov 2023 18:58:51 GMT
- Title: CAST: Cross-Attention in Space and Time for Video Action Recognition
- Authors: Dongho Lee, Jongseo Lee, Jinwoo Choi
- Abstract summary: We propose a novel two-stream architecture called Cross-Attention in Space and Time (CAST).
CAST achieves a balanced spatio-temporal understanding of videos using only RGB input.
Our proposed mechanism enables spatial and temporal expert models to exchange information and make synergistic predictions.
- Score: 9.87369251604361
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recognizing human actions in videos requires spatial and temporal
understanding. Most existing action recognition models lack a balanced
spatio-temporal understanding of videos. In this work, we propose a novel
two-stream architecture, called Cross-Attention in Space and Time (CAST), that
achieves a balanced spatio-temporal understanding of videos using only RGB
input. Our proposed bottleneck cross-attention mechanism enables the spatial
and temporal expert models to exchange information and make synergistic
predictions, leading to improved performance. We validate the proposed method
with extensive experiments on public benchmarks with different characteristics:
EPIC-KITCHENS-100, Something-Something-V2, and Kinetics-400. Our method
consistently shows favorable performance across these datasets, while the
performance of existing methods fluctuates depending on the dataset
characteristics.
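
The core of the method is the bottleneck cross-attention exchange between the spatial and temporal expert streams. As a rough illustration only (the module names, dimensions, and residual layout below are assumptions for this sketch, not the authors' implementation), a bidirectional bottleneck cross-attention exchange might look like the following PyTorch code:

```python
import torch
import torch.nn as nn

class BottleneckCrossAttention(nn.Module):
    """Minimal sketch: one stream queries the other's tokens through a
    low-dimensional bottleneck so the two experts can exchange information.
    Hyperparameters (dim, bottleneck_dim, num_heads) are illustrative only."""

    def __init__(self, dim=768, bottleneck_dim=192, num_heads=4):
        super().__init__()
        self.down_q = nn.Linear(dim, bottleneck_dim)   # project queries into the bottleneck
        self.down_kv = nn.Linear(dim, bottleneck_dim)  # project the other stream's tokens
        self.attn = nn.MultiheadAttention(bottleneck_dim, num_heads, batch_first=True)
        self.up = nn.Linear(bottleneck_dim, dim)       # project back to the expert's width
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)

    def forward(self, x_query, x_context):
        # x_query:   tokens of the stream receiving information, shape (B, N_q, dim)
        # x_context: tokens of the other expert stream,           shape (B, N_c, dim)
        q = self.down_q(self.norm_q(x_query))
        kv = self.down_kv(self.norm_kv(x_context))
        fused, _ = self.attn(q, kv, kv)
        return x_query + self.up(fused)  # residual keeps the expert's own features


class SpaceTimeExchangeBlock(nn.Module):
    """One exchange step: the spatial expert attends to the temporal expert
    and vice versa (bidirectional information exchange)."""

    def __init__(self, dim=768):
        super().__init__()
        self.s_from_t = BottleneckCrossAttention(dim)  # spatial stream queries temporal tokens
        self.t_from_s = BottleneckCrossAttention(dim)  # temporal stream queries spatial tokens

    def forward(self, spatial_tokens, temporal_tokens):
        s = self.s_from_t(spatial_tokens, temporal_tokens)
        t = self.t_from_s(temporal_tokens, spatial_tokens)
        return s, t
```

A full model would stack such exchanges across the layers of both expert backbones and fuse their predictions; the sketch above only conveys the exchange pattern.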
Related papers
- Patch Spatio-Temporal Relation Prediction for Video Anomaly Detection [19.643936110623653]
Video Anomaly Detection (VAD) aims to identify abnormalities within a specific context and timeframe.
Recent deep learning-based VAD models have shown promising results by generating high-resolution frames.
We propose a self-supervised learning approach for VAD through an inter-patch relationship prediction task.
arXiv Detail & Related papers (2024-03-28T03:07:16Z) - On the Importance of Spatial Relations for Few-shot Action Recognition [109.2312001355221]
In this paper, we investigate the importance of spatial relations and propose a more accurate few-shot action recognition method.
A novel Spatial Alignment Cross Transformer (SA-CT) learns to re-adjust the spatial relations and incorporates the temporal information.
Experiments reveal that, even without using any temporal information, the performance of SA-CT is comparable to that of temporal-based methods on 3 out of 4 benchmarks.
arXiv Detail & Related papers (2023-08-14T12:58:02Z) - Self-Supervised Video Representation Learning via Latent Time Navigation [12.721647696921865]
Self-supervised video representation learning aims at maximizing similarity between different temporal segments of one video.
We propose Latent Time Navigation (LTN) to capture fine-grained motions.
Our experimental analysis suggests that learning video representations by LTN consistently improves performance of action classification.
arXiv Detail & Related papers (2023-05-10T20:06:17Z) - Deeply-Coupled Convolution-Transformer with Spatial-temporal
Complementary Learning for Video-based Person Re-identification [91.56939957189505]
We propose a novel spatial-temporal complementary learning framework named Deeply-Coupled Convolution-Transformer (DCCT) for high-performance video-based person Re-ID.
Our framework attains better performance than most state-of-the-art methods.
arXiv Detail & Related papers (2023-04-27T12:16:44Z) - Leaping Into Memories: Space-Time Deep Feature Synthesis [93.10032043225362]
We propose LEAPS, an architecture-independent method for synthesizing videos from internal models.
We quantitatively and qualitatively evaluate the applicability of LEAPS by inverting a range of convolutional and attention-based architectures on Kinetics-400.
arXiv Detail & Related papers (2023-03-17T12:55:22Z) - Efficient Modelling Across Time of Human Actions and Interactions [92.39082696657874]
We argue that the current fixed-sized temporal kernels in 3D convolutional neural networks (CNNs) can be improved to better deal with temporal variations in the input.
We study how we can better handle variations between classes of actions by enhancing their feature differences over different layers of the architecture.
The proposed approaches are evaluated on several benchmark action recognition datasets and show competitive results.
arXiv Detail & Related papers (2021-10-05T15:39:11Z) - Spatial-Temporal Correlation and Topology Learning for Person
Re-Identification in Videos [78.45050529204701]
We propose a novel framework, CTL, to pursue discriminative and robust representations by modeling cross-scale spatial-temporal correlation.
CTL utilizes a CNN backbone and a key-points estimator to extract semantic local features from the human body.
It explores a context-reinforced topology to construct multi-scale graphs by considering both global contextual information and physical connections of the human body.
arXiv Detail & Related papers (2021-04-15T14:32:12Z) - Is Space-Time Attention All You Need for Video Understanding? [50.78676438502343]
We present a convolution-free approach to video classification built exclusively on self-attention over space and time.
"TimeSformer" adapts the standard Transformer architecture to video by enabling feature learning from a sequence of frame-level patches.
TimeSformer achieves state-of-the-art results on several major action recognition benchmarks (see the sketch after this list).
arXiv Detail & Related papers (2021-02-09T19:49:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.