Stacked Temporal Attention: Improving First-person Action Recognition by
Emphasizing Discriminative Clips
- URL: http://arxiv.org/abs/2112.01038v1
- Date: Thu, 2 Dec 2021 08:02:35 GMT
- Title: Stacked Temporal Attention: Improving First-person Action Recognition by
Emphasizing Discriminative Clips
- Authors: Lijin Yang, Yifei Huang, Yusuke Sugano, Yoichi Sato
- Abstract summary: Many background or noisy frames in a first-person video can distract an action recognition model during its learning process.
Previous works attempted to address this problem by applying temporal attention but failed to consider the global context of the full video.
We propose a simple yet effective Stacked Temporal Attention Module (STAM) to compute temporal attention based on the global knowledge across clips.
- Score: 39.29955809641396
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: First-person action recognition is a challenging task in video understanding.
Because of strong ego-motion and a limited field of view, many background or
noisy frames in a first-person video can distract an action recognition model
during its learning process. To encode more discriminative features, the model
needs to have the ability to focus on the most relevant part of the video for
action recognition. Previous works attempted to address this problem by applying
temporal attention but failed to consider the global context of the full video,
which is critical for determining the relatively significant parts. In this
work, we propose a simple yet effective Stacked Temporal Attention Module
(STAM) to compute temporal attention based on the global knowledge across clips
for emphasizing the most discriminative features. We achieve this by stacking
multiple self-attention layers. Instead of naive stacking, which is
experimentally proven to be ineffective, we carefully design the input to each
self-attention layer so that both the local and global context of the video are
considered when generating the temporal attention weights. Experiments
demonstrate that our proposed STAM can be built on top of most existing
backbones and boost performance on various datasets.
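The idea of weighting clips by temporal attention computed from stacked self-attention layers can be sketched as follows. This is a minimal illustration and not the authors' STAM: the abstract does not say how the input to each layer is designed, so re-injecting the original clip features at every layer is an assumption made here, as are the function names, the absence of learned projections, and the mean-pooled global descriptor used for scoring.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def self_attention(feats):
    """Scaled dot-product self-attention over clips, without learned projections."""
    d = len(feats[0])
    out = []
    for q in feats:
        w = softmax([dot(q, k) / math.sqrt(d) for k in feats])
        out.append([sum(wi * k[j] for wi, k in zip(w, feats)) for j in range(d)])
    return out


def temporal_attention_pool(clip_feats, num_layers=2):
    """Return (weights, pooled) for a list of per-clip feature vectors.

    Hypothetical helper: each layer attends over the current context and the
    result is added to the ORIGINAL clip features (an assumption, chosen so
    that local and global information are both present at every layer).
    """
    t, d = len(clip_feats), len(clip_feats[0])
    context = clip_feats
    for _ in range(num_layers):
        attended = self_attention(context)
        context = [[c + a for c, a in zip(cf, af)]
                   for cf, af in zip(clip_feats, attended)]
    # Score each clip against a global descriptor (mean of contextualized clips).
    global_vec = [sum(c[j] for c in context) / t for j in range(d)]
    weights = softmax([dot(c, global_vec) / math.sqrt(d) for c in context])
    # Temporal attention: weighted sum of the original clip features.
    pooled = [sum(w * f[j] for w, f in zip(weights, clip_feats)) for j in range(d)]
    return weights, pooled


if __name__ == "__main__":
    clips = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]  # toy features, 3 clips
    weights, pooled = temporal_attention_pool(clips)
    print(weights, pooled)
```

In a real model the clip features would come from a video backbone, and the attention layers and scoring function would have learned parameters; the sketch only shows how per-clip weights that depend on all clips can be produced and applied.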
Related papers
- No More Shortcuts: Realizing the Potential of Temporal Self-Supervision [69.59938105887538]
We propose a more challenging reformulation of temporal self-supervision as frame-level (rather than clip-level) recognition tasks.
We demonstrate experimentally that our more challenging frame-level task formulations and the removal of shortcuts drastically improve the quality of features learned through temporal self-supervision.
arXiv Detail & Related papers (2023-12-20T13:20:31Z)
- Early Action Recognition with Action Prototypes [62.826125870298306]
We propose a novel model that learns a prototypical representation of the full action for each class.
We decompose the video into short clips, where a visual encoder extracts features from each clip independently.
Later, a decoder aggregates features from all the clips in an online fashion for the final class prediction.
arXiv Detail & Related papers (2023-12-11T18:31:13Z)
- Alignment-guided Temporal Attention for Video Action Recognition [18.5171795689609]
We show that frame-by-frame alignments have the potential to increase the mutual information between frame representations.
We propose Alignment-guided Temporal Attention (ATA) to extend 1-dimensional temporal attention with parameter-free patch-level alignments between neighboring frames.
arXiv Detail & Related papers (2022-09-30T23:10:47Z)
- Video Salient Object Detection via Contrastive Features and Attention
Modules [106.33219760012048]
We propose a network with attention modules to learn contrastive features for video salient object detection.
A co-attention formulation is utilized to combine the low-level and high-level features.
We show that the proposed method requires less computation, and performs favorably against the state-of-the-art approaches.
arXiv Detail & Related papers (2021-11-03T17:40:32Z)
- CLTA: Contents and Length-based Temporal Attention for Few-shot Action
Recognition [2.0349696181833337]
We propose a Contents and Length-based Temporal Attention model, which learns customized temporal attention for the individual video.
We show that even a backbone that is not fine-tuned, combined with an ordinary softmax classifier, can achieve similar or better results than state-of-the-art few-shot action recognition approaches.
arXiv Detail & Related papers (2021-03-18T23:40:28Z)
- Coarse Temporal Attention Network (CTA-Net) for Driver's Activity
Recognition [14.07119502083967]
Driver's activities are different since they are executed by the same subject with similar body-part movements, resulting in subtle changes.
Our model is named Coarse Temporal Attention Network (CTA-Net), in which coarse temporal branches are introduced in a trainable glimpse.
The model then uses an innovative attention mechanism to generate high-level action specific contextual information for activity recognition.
arXiv Detail & Related papers (2021-01-17T10:15:37Z)
- Self-supervised Temporal Discriminative Learning for Video
Representation Learning [39.43942923911425]
Temporal-discriminative features can hardly be extracted without using an annotated large-scale video action dataset for training.
This paper proposes a novel Video-based Temporal-Discriminative Learning framework in a self-supervised manner.
arXiv Detail & Related papers (2020-08-05T13:36:59Z)
- Egocentric Action Recognition by Video Attention and Temporal Context [83.57475598382146]
We present the submission of Samsung AI Centre Cambridge to the CVPR 2020 EPIC-Kitchens Action Recognition Challenge.
In this challenge, action recognition is posed as the problem of simultaneously predicting a single 'verb' and 'noun' class label given an input trimmed video clip.
Our solution achieves strong performance on the challenge metrics without using object-specific reasoning or extra training data.
arXiv Detail & Related papers (2020-07-03T18:00:32Z)
- Multi-Granularity Reference-Aided Attentive Feature Aggregation for
Video-based Person Re-identification [98.7585431239291]
Video-based person re-identification aims at matching the same person across video clips.
In this paper, we propose an attentive feature aggregation module, namely the Multi-Granularity Reference-aided Attentive Feature Aggregation (MG-RAFA) module.
Our framework achieves the state-of-the-art ablation performance on three benchmark datasets.
arXiv Detail & Related papers (2020-03-27T03:49:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.