Few-Shot Fine-Grained Action Recognition via Bidirectional Attention and
Contrastive Meta-Learning
- URL: http://arxiv.org/abs/2108.06647v1
- Date: Sun, 15 Aug 2021 02:21:01 GMT
- Title: Few-Shot Fine-Grained Action Recognition via Bidirectional Attention and
Contrastive Meta-Learning
- Authors: Jiahao Wang, Yunhong Wang, Sheng Liu, Annan Li
- Abstract summary: Fine-grained action recognition is attracting increasing attention due to the emerging demand for specific action understanding in real-world applications.
We propose a few-shot fine-grained action recognition problem, aiming to recognize novel fine-grained actions with only a few samples given for each class.
Although progress has been made on coarse-grained actions, existing few-shot recognition methods encounter two issues when handling fine-grained actions.
- Score: 51.03781020616402
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Fine-grained action recognition is attracting increasing attention due to the
emerging demand for specific action understanding in real-world applications,
whereas the data of rare fine-grained categories is very limited. Therefore, we
propose the few-shot fine-grained action recognition problem, aiming to
recognize novel fine-grained actions with only a few samples given for each
class. Although progress has been made on coarse-grained actions, existing
few-shot recognition methods encounter two issues when handling fine-grained
actions: the inability to capture subtle action details and the inadequacy in
learning from data with low inter-class variance. To tackle the first issue, a
human vision inspired bidirectional attention module (BAM) is proposed.
Combining top-down task-driven signals with bottom-up salient stimuli, BAM
captures subtle action details by accurately highlighting informative
spatio-temporal regions. To address the second issue, we introduce contrastive
meta-learning (CML). Compared with the widely adopted ProtoNet-based method,
CML generates more discriminative video representations for low inter-class
variance data, since it makes full use of potential contrastive pairs in each
training episode. Furthermore, to fairly compare different models, we establish
specific benchmark protocols on two large-scale fine-grained action recognition
datasets. Extensive experiments show that our method consistently achieves
state-of-the-art performance across evaluated tasks.
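The contrast the abstract draws between a ProtoNet-based objective and one that uses more of the contrastive pairs in an episode can be illustrated with a rough sketch. This is not the authors' CML: it pairs a standard prototypical-network loss with a generic supervised-contrastive loss over all embeddings in one episode, and the function names, the temperature value, and the use of cosine similarity are all assumptions made for illustration.

```python
import numpy as np


def log_softmax(logits, axis=-1):
    # Numerically stable log-softmax; entries set to -inf get probability 0.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def protonet_loss(support, support_labels, query, query_labels):
    """Prototypical loss: each query is compared only with class prototypes."""
    classes = np.unique(support_labels)
    # One prototype per class: the mean of that class's support embeddings.
    prototypes = np.stack(
        [support[support_labels == c].mean(axis=0) for c in classes]
    )
    # Squared Euclidean distance from every query to every prototype.
    dists = ((query[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=-1)
    log_probs = log_softmax(-dists, axis=1)
    # Map each query label to the row index of its prototype.
    targets = np.searchsorted(classes, query_labels)
    return -log_probs[np.arange(len(query)), targets].mean()


def episode_contrastive_loss(embeddings, labels, temperature=0.1):
    """Supervised-contrastive-style loss over ALL pairs in an episode.

    Hypothetical stand-in for an episode-level contrastive objective:
    every other sample with the same label is a positive, every sample
    with a different label is a negative.
    """
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T / temperature
    self_mask = np.eye(len(labels), dtype=bool)
    # Exclude each anchor's similarity with itself from the softmax.
    sim = np.where(self_mask, -np.inf, sim)
    log_probs = log_softmax(sim, axis=1)
    positives = (labels[:, None] == labels[None, :]) & ~self_mask
    pos_counts = positives.sum(axis=1)
    valid = pos_counts > 0  # anchors that have at least one positive
    pos_log_prob = np.where(positives, log_probs, 0.0).sum(axis=1)
    return -(pos_log_prob[valid] / pos_counts[valid]).mean()
```

In this sketch the prototypical loss produces one comparison per query and class, whereas the contrastive loss scores every ordered pair of samples in the episode, which is one plausible reading of "makes full use of potential contrastive pairs".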
Related papers
- FinePseudo: Improving Pseudo-Labelling through Temporal-Alignablity for Semi-Supervised Fine-Grained Action Recognition [57.17966905865054]
Real-life applications of action recognition often require a fine-grained understanding of subtle movements.
Existing semi-supervised action recognition has mainly focused on coarse-grained action recognition.
We propose an Alignability-Verification-based Metric learning technique to effectively discriminate between fine-grained action pairs.
arXiv Detail & Related papers (2024-09-02T20:08:06Z)
- An Information Compensation Framework for Zero-Shot Skeleton-based Action Recognition [49.45660055499103]
Zero-shot human skeleton-based action recognition aims to construct a model that can recognize actions outside the categories seen during training.
Previous research has focused on aligning sequences' visual and semantic spatial distributions.
We introduce a new loss function sampling method to obtain a tight and robust representation.
arXiv Detail & Related papers (2024-06-02T06:53:01Z)
- The impact of Compositionality in Zero-shot Multi-label action recognition for Object-based tasks [4.971065912401385]
We propose Dual-VCLIP, a unified approach for zero-shot multi-label action recognition.
Dual-VCLIP enhances VCLIP, a zero-shot action recognition method, with the DualCoOp method for multi-label image classification.
We validate our method on the Charades dataset that includes a majority of object-based actions.
arXiv Detail & Related papers (2024-05-14T15:28:48Z)
- DOAD: Decoupled One Stage Action Detection Network [77.14883592642782]
Localizing people and recognizing their actions from videos is a challenging task towards high-level video understanding.
Existing methods are mostly two-stage based, with one stage for person bounding box generation and the other stage for action recognition.
We present a decoupled one-stage network, dubbed DOAD, to improve the efficiency of spatio-temporal action detection.
arXiv Detail & Related papers (2023-04-01T08:06:43Z)
- Fine-grained Temporal Contrastive Learning for Weakly-supervised Temporal Action Localization [87.47977407022492]
This paper argues that learning by contextually comparing sequence-to-sequence distinctions offers an essential inductive bias in weakly-supervised action localization.
Under a differentiable dynamic programming formulation, two complementary contrastive objectives are designed, including Fine-grained Sequence Distance (FSD) contrasting and Longest Common Subsequence (LCS) contrasting.
Our method achieves state-of-the-art performance on two popular benchmarks.
arXiv Detail & Related papers (2022-03-31T05:13:50Z)
- Few-shot Action Recognition with Prototype-centered Attentive Learning [88.10852114988829]
We propose a Prototype-centered Attentive Learning (PAL) model composed of two novel components.
First, a prototype-centered contrastive learning loss is introduced to complement the conventional query-centered learning objective.
Second, PAL integrates an attentive hybrid learning mechanism that can minimize the negative impacts of outliers.
arXiv Detail & Related papers (2021-01-20T11:48:12Z)