Elaborative Rehearsal for Zero-shot Action Recognition
- URL: http://arxiv.org/abs/2108.02833v1
- Date: Thu, 5 Aug 2021 20:02:46 GMT
- Title: Elaborative Rehearsal for Zero-shot Action Recognition
- Authors: Shizhe Chen and Dong Huang
- Abstract summary: ZSAR aims to recognize target (unseen) actions without training examples.
It remains challenging to semantically represent action classes and transfer knowledge from seen data.
We propose an ER-enhanced ZSAR model inspired by an effective human memory technique Elaborative Rehearsal.
- Score: 36.84404523161848
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The growing number of action classes has posed a new challenge for video
understanding, making Zero-Shot Action Recognition (ZSAR) a thriving direction.
The ZSAR task aims to recognize target (unseen) actions without training
examples by leveraging semantic representations to bridge seen and unseen
actions. However, due to the complexity and diversity of actions, it remains
challenging to semantically represent action classes and transfer knowledge
from seen data. In this work, we propose an ER-enhanced ZSAR model inspired by
an effective human memory technique, Elaborative Rehearsal (ER), which involves
elaborating a new concept and relating it to known concepts. Specifically, we
expand each action class as an Elaborative Description (ED) sentence, which is
more discriminative than a class name and less costly than manually defined
attributes. Besides directly aligning class semantics with videos, we
incorporate objects from the video as Elaborative Concepts (EC) to improve
video semantics and generalization from seen actions to unseen actions. Our
ER-enhanced ZSAR model achieves state-of-the-art results on three existing
benchmarks. Moreover, we propose a new ZSAR evaluation protocol on the Kinetics
dataset to overcome limitations of current benchmarks and demonstrate the first
case where ZSAR performance is comparable to few-shot learning baselines in
this more realistic setting. We will release our code and collected EDs at
https://github.com/DeLightCMU/ElaborativeRehearsal.
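To make the pipeline described in the abstract concrete, the sketch below shows one simplified way an ED/EC-based zero-shot classifier could score unseen classes: each class is represented by the text embedding of its Elaborative Description, and a video is scored both by its own embedding and by the objects (Elaborative Concepts) detected in it. The function names, the cosine-similarity scoring, and the fusion weight alpha are illustrative assumptions, not the authors' released implementation (see the repository linked above for the real code).

```python
# Minimal, illustrative sketch of ED/EC-based zero-shot scoring (assumptions only).
import numpy as np


def l2norm(x: np.ndarray) -> np.ndarray:
    """Row-wise L2 normalization."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)


def zsar_scores(video_emb, ed_embs, object_embs, alpha=0.5):
    """Score unseen action classes for a single video.

    video_emb   : (d,)   embedding of the video from a video encoder
    ed_embs     : (C, d) text embeddings of each class's Elaborative Description
    object_embs : (K, d) text embeddings of objects detected in the video (ECs)
    alpha       : assumed weight fusing the video and object alignment terms
    """
    v = l2norm(video_emb[None, :])        # (1, d)
    e = l2norm(ed_embs)                   # (C, d)
    o = l2norm(object_embs)               # (K, d)

    video_to_ed = (v @ e.T).ravel()       # (C,) video vs. class-ED similarity
    object_to_ed = (o @ e.T).max(axis=0)  # (C,) best-matching object per class
    return alpha * video_to_ed + (1.0 - alpha) * object_to_ed


# Toy usage: 3 unseen classes, 5 detected objects, 512-d embeddings.
rng = np.random.default_rng(0)
scores = zsar_scores(rng.normal(size=512),
                     rng.normal(size=(3, 512)),
                     rng.normal(size=(5, 512)))
print("predicted unseen class:", int(scores.argmax()))
```

In practice the ED embeddings would come from a text encoder and the ECs from an object recognizer applied to video frames; the random vectors above only exercise the scoring logic.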
Related papers
- Self-supervised Multi-actor Social Activity Understanding in Streaming Videos [6.4149117677272525]
Social Activity Recognition (SAR) is a critical component in real-world tasks like surveillance and assistive robotics.
Previous SAR research has relied heavily on densely annotated data, but privacy concerns limit their applicability in real-world settings.
We propose a self-supervised approach based on multi-actor predictive learning for SAR in streaming videos.
arXiv Detail & Related papers (2024-06-20T16:33:54Z)
- ActionHub: A Large-scale Action Video Description Dataset for Zero-shot Action Recognition [35.08592533014102]
Zero-shot action recognition (ZSAR) aims to learn an alignment model between videos and class descriptions of seen actions that is transferable to unseen actions.
We propose a novel Cross-modality and Cross-action Modeling (CoCo) framework for ZSAR, which consists of a Dual Cross-modality Alignment module and a Cross-action Invariance Mining module.
arXiv Detail & Related papers (2024-01-22T02:21:26Z)
- Language-free Compositional Action Generation via Decoupling Refinement [67.50452446686725]
We introduce a novel framework to generate compositional actions without reliance on language auxiliaries.
Our approach consists of three main components: Action Coupling, Conditional Action Generation, and Decoupling Refinement.
arXiv Detail & Related papers (2023-07-07T12:00:38Z)
- Cross-modal Representation Learning for Zero-shot Action Recognition [67.57406812235767]
We present a cross-modal Transformer-based framework, which jointly encodes video data and text labels for zero-shot action recognition (ZSAR).
Our model employs a conceptually new pipeline by which visual representations are learned in conjunction with visual-semantic associations in an end-to-end manner.
Experimental results show our model considerably improves upon the state of the art in ZSAR, reaching encouraging top-1 accuracy on the UCF101, HMDB51, and ActivityNet benchmark datasets.
arXiv Detail & Related papers (2022-05-03T17:39:27Z)
- Intent Contrastive Learning for Sequential Recommendation [86.54439927038968]
We introduce a latent variable to represent users' intents and learn the distribution function of the latent variable via clustering.
We propose to leverage the learned intents in sequential recommendation (SR) models via contrastive SSL, which maximizes the agreement between a view of a sequence and its corresponding intent.
Experiments conducted on four real-world datasets demonstrate the superiority of the proposed learning paradigm.
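As a rough illustration of the mechanism summarized above (not that paper's code), the snippet below clusters sequence embeddings into latent intents and applies an InfoNCE-style loss pulling each sequence view toward its assigned intent prototype; the function names, centroid-based intents, and temperature value are assumptions for exposition only.

```python
# Toy InfoNCE-style intent-contrastive loss (illustrative assumption).
import numpy as np


def intent_contrastive_loss(seq_embs, intent_centroids, assignments, tau=0.1):
    """seq_embs         : (N, d) sequence-view embeddings
    intent_centroids : (M, d) intent prototypes obtained by clustering
    assignments      : (N,)   index of the intent assigned to each sequence
    tau              : assumed softmax temperature
    """
    s = seq_embs / (np.linalg.norm(seq_embs, axis=1, keepdims=True) + 1e-8)
    c = intent_centroids / (np.linalg.norm(intent_centroids, axis=1, keepdims=True) + 1e-8)
    logits = s @ c.T / tau                                   # (N, M) similarities
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_prob[np.arange(len(s)), assignments].mean())


# Toy usage: 8 sequences, 4 intents, 64-d embeddings.
rng = np.random.default_rng(1)
loss = intent_contrastive_loss(rng.normal(size=(8, 64)),
                               rng.normal(size=(4, 64)),
                               rng.integers(0, 4, size=8))
```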
arXiv Detail & Related papers (2022-02-05T09:24:13Z)
- Self-Regulated Learning for Egocentric Video Activity Anticipation [147.9783215348252]
Self-Regulated Learning (SRL) aims to regulate the intermediate representation consecutively, producing a representation that emphasizes the novel information in the frame at the current time-stamp.
SRL sharply outperforms the existing state of the art in most cases on two egocentric video datasets and two third-person video datasets.
arXiv Detail & Related papers (2021-11-23T03:29:18Z)
- Few-Shot Fine-Grained Action Recognition via Bidirectional Attention and Contrastive Meta-Learning [51.03781020616402]
Fine-grained action recognition is attracting increasing attention due to the emerging demand for specific action understanding in real-world applications.
We propose a few-shot fine-grained action recognition problem, aiming to recognize novel fine-grained actions with only a few samples given for each class.
Although progress has been made on coarse-grained actions, existing few-shot recognition methods encounter two issues when handling fine-grained actions.
arXiv Detail & Related papers (2021-08-15T02:21:01Z)
- Home Action Genome: Cooperative Compositional Action Understanding [33.69990813932372]
Existing research on action recognition treats activities as monolithic events occurring in videos.
Cooperative Compositional Action Understanding (CCAU) is a cooperative learning framework for hierarchical action recognition.
We demonstrate the utility of co-learning compositions in few-shot action recognition by achieving 28.6% mAP with just a single sample.
arXiv Detail & Related papers (2021-05-11T17:42:47Z)
- Modular Action Concept Grounding in Semantic Video Prediction [28.917125574895422]
We introduce the task of semantic action-conditional video prediction, which uses semantic action labels to describe interactions.
Inspired by the idea of Mixture of Experts, we embody each abstract label by a structured combination of various visual concept learners.
Our method is evaluated on two newly designed synthetic datasets and one real-world dataset.
arXiv Detail & Related papers (2020-11-23T04:12:22Z)
- Learning to Represent Action Values as a Hypergraph on the Action Vertices [17.811355496708728]
Action-value estimation is a critical component of reinforcement learning (RL) methods.
We conjecture that leveraging the structure of multi-dimensional action spaces is a key ingredient for learning good representations of action.
We show the effectiveness of our approach on a myriad of domains: illustrative prediction problems under minimal confounding effects, Atari 2600 games, and discretised physical control benchmarks.
arXiv Detail & Related papers (2020-10-28T00:19:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences arising from its use.