Hierarchical Compositional Representations for Few-shot Action Recognition
- URL: http://arxiv.org/abs/2208.09424v3
- Date: Fri, 19 Jan 2024 05:32:54 GMT
- Title: Hierarchical Compositional Representations for Few-shot Action Recognition
- Authors: Changzhen Li, Jie Zhang, Shuzhe Wu, Xin Jin, and Shiguang Shan
- Abstract summary: We propose a novel hierarchical compositional representations (HCR) learning approach for few-shot action recognition.
We divide a complicated action into several sub-actions by carefully designed hierarchical clustering.
We also adopt the Earth Mover's Distance from the transportation problem to measure the similarity between video samples in terms of sub-action representations.
- Score: 51.288829293306335
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, action recognition has received increasing attention for its
broad practical applications in intelligent surveillance and human-computer
interaction. However, few-shot action recognition has not been well explored and
remains challenging because of data scarcity. In this paper, we propose a novel
hierarchical compositional representations (HCR) learning approach for few-shot
action recognition. Specifically, we divide a complicated action into several
sub-actions by carefully designed hierarchical clustering and further decompose
the sub-actions into more fine-grained spatially attentional sub-actions
(SAS-actions). Although there exist large differences between base classes and
novel classes, they can share similar patterns in sub-actions or SAS-actions.
Furthermore, we adopt the Earth Mover's Distance from the transportation problem
to measure the similarity between video samples in terms of sub-action
representations. It computes the optimal matching flows between sub-actions as
the distance metric, which is favorable for comparing fine-grained patterns.
Extensive experiments show that our method achieves state-of-the-art results on
the HMDB51, UCF101, and Kinetics datasets.
Related papers
- An Information Compensation Framework for Zero-Shot Skeleton-based Action Recognition [49.45660055499103]
Zero-shot human skeleton-based action recognition aims to construct a model that can recognize actions outside the categories seen during training.
Previous research has focused on aligning sequences' visual and semantic spatial distributions.
We introduce a new loss function sampling method to obtain a tight and robust representation.
arXiv Detail & Related papers (2024-06-02T06:53:01Z)
- Fine-grained Temporal Contrastive Learning for Weakly-supervised Temporal Action Localization [87.47977407022492]
This paper argues that learning by contextually comparing sequence-to-sequence distinctions offers an essential inductive bias in weakly-supervised action localization.
Under a differentiable dynamic programming formulation, two complementary contrastive objectives are designed, including Fine-grained Sequence Distance (FSD) contrasting and Longest Common Subsequence (LCS) contrasting.
Our method achieves state-of-the-art performance on two popular benchmarks.
arXiv Detail & Related papers (2022-03-31T05:13:50Z)
- Few-Shot Fine-Grained Action Recognition via Bidirectional Attention and Contrastive Meta-Learning [51.03781020616402]
Fine-grained action recognition is attracting increasing attention due to the emerging demand of specific action understanding in real-world applications.
We propose the few-shot fine-grained action recognition problem, aiming to recognize novel fine-grained actions with only a few samples given for each class.
Although progress has been made on coarse-grained actions, existing few-shot recognition methods encounter two issues when handling fine-grained actions.
arXiv Detail & Related papers (2021-08-15T02:21:01Z)
- Modeling long-term interactions to enhance action recognition [81.09859029964323]
We propose a new approach to understand actions in egocentric videos that exploits the semantics of object interactions at both frame and temporal levels.
We use a region-based approach that takes as input a primary region roughly corresponding to the user hands and a set of secondary regions potentially corresponding to the interacting objects.
The proposed approach outperforms the state-of-the-art in terms of action recognition on standard benchmarks.
arXiv Detail & Related papers (2021-04-23T10:08:15Z)
- Learning to Represent Action Values as a Hypergraph on the Action Vertices [17.811355496708728]
Action-value estimation is a critical component of reinforcement learning (RL) methods.
We conjecture that leveraging the structure of multi-dimensional action spaces is a key ingredient for learning good representations of action.
We show the effectiveness of our approach on a myriad of domains: illustrative prediction problems under minimal confounding effects, Atari 2600 games, and discretised physical control benchmarks.
arXiv Detail & Related papers (2020-10-28T00:19:13Z)
- Dynamic Semantic Matching and Aggregation Network for Few-shot Intent Detection [69.2370349274216]
Few-shot Intent Detection is challenging due to the scarcity of available annotated utterances.
Semantic components are distilled from utterances via multi-head self-attention.
Our method provides a comprehensive matching measure to enhance representations of both labeled and unlabeled instances.
arXiv Detail & Related papers (2020-10-06T05:16:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.