CLTA: Contents and Length-based Temporal Attention for Few-shot Action
Recognition
- URL: http://arxiv.org/abs/2103.10567v1
- Date: Thu, 18 Mar 2021 23:40:28 GMT
- Title: CLTA: Contents and Length-based Temporal Attention for Few-shot Action
Recognition
- Authors: Yang Bo, Yangdi Lu and Wenbo He
- Abstract summary: We propose a Contents and Length-based Temporal Attention model, which learns customized temporal attention for the individual video.
We show that even a backbone that is not fine-tuned, combined with an ordinary softmax classifier, can still achieve results similar to or better than state-of-the-art few-shot action recognition methods.
- Score: 2.0349696181833337
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Few-shot action recognition has attracted increasing attention due to the
difficulty in acquiring the properly labelled training samples. Current works
have shown that preserving spatial information and comparing video descriptors
are crucial for few-shot action recognition. However, the importance of
preserving temporal information is not well discussed. In this paper, we
propose a Contents and Length-based Temporal Attention (CLTA) model, which
learns customized temporal attention for the individual video to tackle the
few-shot action recognition problem. CLTA uses the Gaussian likelihood function as a
template to generate temporal attention, and trains learning matrices that estimate the
mean and standard deviation from both the frame contents and the video length. We show
that, with precisely captured temporal attention, even a backbone that is not fine-tuned,
paired with an ordinary softmax classifier, can still achieve results similar to or
better than the state of the art in few-shot action recognition.
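Read literally, the abstract describes a Gaussian template over frame indices whose mean and standard deviation are predicted from the frame contents and the video length. The snippet below is a minimal, hypothetical PyTorch sketch of that idea; the module name, the mean-pooling of frame contents, the length encoding, and the head dimensions are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GaussianTemporalAttention(nn.Module):
    """Hypothetical sketch: a Gaussian template over frame indices whose mean and
    standard deviation are predicted from frame contents and video length."""

    def __init__(self, feat_dim: int):
        super().__init__()
        # Learned matrices mapping (pooled frame contents, length) -> Gaussian parameters.
        self.mu_head = nn.Linear(feat_dim + 1, 1)
        self.sigma_head = nn.Linear(feat_dim + 1, 1)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (T, feat_dim) per-frame features of one video
        t = frames.size(0)
        content = frames.mean(dim=0)                      # summary of frame contents
        length = frames.new_tensor([t / 100.0])           # crude length encoding (assumption)
        ctx = torch.cat([content, length])

        mu = torch.sigmoid(self.mu_head(ctx)) * (t - 1)   # mean frame index in [0, T-1]
        sigma = F.softplus(self.sigma_head(ctx)) + 1e-3   # positive standard deviation

        idx = torch.arange(t, dtype=frames.dtype)
        attn = torch.exp(-0.5 * ((idx - mu) / sigma) ** 2)  # Gaussian likelihood template
        attn = attn / attn.sum()                             # normalise to a distribution

        return (attn.unsqueeze(1) * frames).sum(dim=0)       # attention-pooled video descriptor
```

Normalising the Gaussian weights into a distribution over frames keeps the pooled descriptor on a comparable scale regardless of video length, which is one plausible way to make the attention length-aware.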
Related papers
- On the Importance of Spatial Relations for Few-shot Action Recognition [109.2312001355221]
In this paper, we investigate the importance of spatial relations and propose a more accurate few-shot action recognition method.
A novel Spatial Alignment Cross Transformer (SA-CT) learns to re-adjust the spatial relations and incorporates the temporal information.
Experiments reveal that, even without using any temporal information, the performance of SA-CT is comparable to temporal-based methods on 3 out of 4 benchmarks.
arXiv Detail & Related papers (2023-08-14T12:58:02Z) - Zero-shot Skeleton-based Action Recognition via Mutual Information
Estimation and Maximization [26.721082316870532]
Zero-shot skeleton-based action recognition aims to recognize actions of unseen categories after training on data of seen categories.
We propose a new zero-shot skeleton-based action recognition method via mutual information (MI) estimation and maximization.
arXiv Detail & Related papers (2023-08-07T23:41:55Z) - Implicit Temporal Modeling with Learnable Alignment for Video
Recognition [95.82093301212964]
We propose a novel Implicit Learnable Alignment (ILA) method, which minimizes the temporal modeling effort while achieving incredibly high performance.
ILA achieves a top-1 accuracy of 88.7% on Kinetics-400 with much fewer FLOPs compared with Swin-L and ViViT-H.
arXiv Detail & Related papers (2023-04-20T17:11:01Z) - Video Activity Localisation with Uncertainties in Temporal Boundary [74.7263952414899]
Methods for video activity localisation over time assume implicitly that activity temporal boundaries are determined and precise.
In unscripted natural videos, different activities transition smoothly, so it is intrinsically ambiguous to label precisely when an activity starts and ends.
We introduce Elastic Moment Bounding (EMB) to accommodate flexible and adaptive activity temporal boundaries.
arXiv Detail & Related papers (2022-06-26T16:45:56Z) - Class-Incremental Learning for Action Recognition in Videos [44.923719189467164]
We tackle the catastrophic forgetting problem in the context of class-incremental learning for video recognition.
Our framework addresses this challenging task by introducing time-channel importance maps and exploiting the importance maps for learning the representations of incoming examples.
We evaluate the proposed approach on brand-new splits of class-incremental action recognition benchmarks constructed upon the UCF101, HMDB51, and Something-Something V2 datasets.
arXiv Detail & Related papers (2022-03-25T12:15:49Z) - Stacked Temporal Attention: Improving First-person Action Recognition by
Emphasizing Discriminative Clips [39.29955809641396]
Many backgrounds or noisy frames in a first-person video can distract an action recognition model during its learning process.
Previous works attempted to address this problem by applying temporal attention but failed to consider the global context of the full video.
We propose a simple yet effective Stacked Temporal Attention Module (STAM) to compute temporal attention based on the global knowledge across clips.
arXiv Detail & Related papers (2021-12-02T08:02:35Z) - Self-Regulated Learning for Egocentric Video Activity Anticipation [147.9783215348252]
Self-Regulated Learning (SRL) aims to consecutively regulate the intermediate representation so that it emphasizes the novel information in the frame at the current time-stamp.
SRL sharply outperforms existing state-of-the-art in most cases on two egocentric video datasets and two third-person video datasets.
arXiv Detail & Related papers (2021-11-23T03:29:18Z) - Efficient Modelling Across Time of Human Actions and Interactions [92.39082696657874]
We argue that current fixed-sized temporal kernels in 3D convolutional neural networks (CNNs) can be improved to better deal with temporal variations in the input.
We also study how to better handle variations between classes of actions by enhancing their feature differences across different layers of the architecture.
The proposed approaches are evaluated on several benchmark action recognition datasets and show competitive results.
arXiv Detail & Related papers (2021-10-05T15:39:11Z) - Action Forecasting with Feature-wise Self-Attention [20.068238091354583]
We present a new architecture for human action forecasting from videos.
A temporal recurrent encoder captures temporal information of input videos.
A self-attention model is used to attend to relevant feature dimensions of the input space (a generic sketch of this encoder-plus-feature-attention pattern appears after this list).
arXiv Detail & Related papers (2021-07-19T01:55:30Z) - Few-shot Action Recognition with Permutation-invariant Attention [169.61294360056925]
We build on a C3D encoder that encodes video blocks to capture short-range action patterns.
We exploit spatial and temporal attention modules and naturalistic self-supervision.
Our method outperforms the state of the art on the HMDB51, UCF101, and miniMIT datasets.
arXiv Detail & Related papers (2020-01-12T10:58:09Z)
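The "Action Forecasting with Feature-wise Self-Attention" entry above describes a two-stage pattern: a recurrent encoder over time followed by attention over feature dimensions rather than over time steps. The sketch below illustrates that generic pattern only; the GRU encoder, the sigmoid gating, the classifier head, and all names are assumptions for illustration and are not the paper's architecture.

```python
import torch
import torch.nn as nn


class FeatureWiseAttentionForecaster(nn.Module):
    """Illustrative sketch: temporal recurrent encoder + attention over feature dimensions."""

    def __init__(self, in_dim: int, hidden_dim: int, num_classes: int):
        super().__init__()
        self.encoder = nn.GRU(in_dim, hidden_dim, batch_first=True)  # temporal recurrent encoder
        self.attn = nn.Linear(hidden_dim, hidden_dim)                # one score per feature dimension
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (batch, time, in_dim) per-frame or per-clip features
        _, h = self.encoder(clips)               # h: (1, batch, hidden_dim) final hidden state
        h = h.squeeze(0)
        weights = torch.sigmoid(self.attn(h))    # attention weights over feature dimensions
        return self.classifier(weights * h)      # re-weighted features -> future-action logits


# Usage with made-up dimensions: 4 videos, 16 time steps of 512-d features, 20 classes.
model = FeatureWiseAttentionForecaster(in_dim=512, hidden_dim=256, num_classes=20)
logits = model(torch.randn(4, 16, 512))
```

Gating feature dimensions with a sigmoid, instead of applying softmax across time steps, is one simple way to realise attention over the feature axis; the paper itself may weight features differently.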