Action Sensitivity Learning for Temporal Action Localization
- URL: http://arxiv.org/abs/2305.15701v2
- Date: Wed, 13 Sep 2023 11:52:30 GMT
- Title: Action Sensitivity Learning for Temporal Action Localization
- Authors: Jiayi Shao and Xiaohan Wang and Ruijie Quan and Junjun Zheng and Jiang Yang and Yi Yang
- Abstract summary: We propose an Action Sensitivity Learning framework (ASL) to tackle the task of temporal action localization.
We first introduce a lightweight Action Sensitivity Evaluator to learn action sensitivity at both the class level and the instance level.
Based on the action sensitivity of each frame, we design an Action Sensitive Contrastive Loss to enhance features, where action-aware frames are sampled as positive pairs and pushed away from action-irrelevant frames.
- Score: 35.65086250175736
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Temporal action localization (TAL), which involves recognizing and locating
action instances, is a challenging task in video understanding. Most existing
approaches directly predict action classes and regress offsets to boundaries,
while overlooking the varying importance of each frame. In this paper, we
propose an Action Sensitivity Learning framework (ASL) to tackle this task,
which aims to assess the value of each frame and then leverage the generated
action sensitivity to recalibrate the training procedure. We first introduce a
lightweight Action Sensitivity Evaluator to learn the action sensitivity at the
class level and instance level, respectively. The outputs of the two branches
are combined to reweight the gradients of the two sub-tasks. Moreover, based on
the action sensitivity of each frame, we design an Action Sensitive Contrastive
Loss to enhance features, where action-aware frames are sampled as positive
pairs and pushed away from action-irrelevant frames. Extensive studies on
various action localization benchmarks (i.e., MultiThumos, Charades,
Ego4D-Moment Queries v1.0, Epic-Kitchens 100, Thumos14 and ActivityNet1.3) show
that ASL surpasses the state of the art in terms of average-mAP under multiple
types of scenarios, e.g., single-labeled, densely-labeled and egocentric.
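The abstract describes the two mechanisms concretely enough to sketch. Below is a minimal PyTorch illustration, not the authors' implementation: per-frame weights, formed by combining class-level and instance-level sensitivity, modulate the classification and regression losses (the paper reweights gradients, which per-frame loss weighting approximates), and an InfoNCE-style contrastive term samples action-aware frames as positives while pushing action-irrelevant frames away. The function names, the additive combination of the two sensitivities, the 0.5 threshold, and the temperature are all assumptions made for this sketch.

```python
import torch
import torch.nn.functional as F

def asl_weighted_losses(cls_logits, cls_targets, reg_preds, reg_targets,
                        class_sens, inst_sens):
    # Hypothetical combination: sum the class-level and instance-level
    # sensitivities into one per-frame weight.
    sens = class_sens + inst_sens                                          # (T,)
    cls_loss = F.cross_entropy(cls_logits, cls_targets, reduction="none")  # (T,)
    reg_loss = F.l1_loss(reg_preds, reg_targets, reduction="none").sum(-1) # (T,)
    # Weighting the per-frame losses rescales their gradients accordingly.
    return (sens * cls_loss).mean(), (sens * reg_loss).mean()

def action_sensitive_contrastive_loss(feats, sens, tau=0.07, thresh=0.5):
    # Frames whose sensitivity exceeds `thresh` are treated as action-aware
    # positives; all remaining frames serve as negatives.
    feats = F.normalize(feats, dim=-1)        # (T, D) unit-norm frame features
    sim = feats @ feats.t() / tau             # (T, T) scaled similarities
    pos_mask = sens > thresh
    idx = pos_mask.nonzero(as_tuple=True)[0]
    if idx.numel() < 2:                       # need at least one positive pair
        return feats.new_zeros(())
    loss = feats.new_zeros(())
    for i in idx:
        keep = torch.ones_like(pos_mask)
        keep[i] = False                       # exclude self-similarity
        log_denom = torch.logsumexp(sim[i][keep], dim=0)
        # InfoNCE over the remaining action-aware frames as positives.
        loss = loss - (sim[i][keep & pos_mask] - log_denom).mean()
    return loss / idx.numel()

# Usage with random stand-in data: 128 frames, 256-d features, 20 classes.
T, D, C = 128, 256, 20
cls_l, reg_l = asl_weighted_losses(torch.randn(T, C), torch.randint(C, (T,)),
                                   torch.randn(T, 2), torch.randn(T, 2),
                                   torch.rand(T), torch.rand(T))
ctr_l = action_sensitive_contrastive_loss(torch.randn(T, D), torch.rand(T))
```

In this simplification, boundary regression targets are per-frame start/end offsets, so `reg_preds` and `reg_targets` have shape (T, 2); the actual head design and how the two sensitivity branches are trained are not specified by the abstract.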
Related papers
- FinePseudo: Improving Pseudo-Labelling through Temporal-Alignablity for Semi-Supervised Fine-Grained Action Recognition [57.17966905865054]
Real-life applications of action recognition often require a fine-grained understanding of subtle movements.
Existing semi-supervised action recognition has mainly focused on coarse-grained action recognition.
We propose an Alignability-Verification-based Metric learning technique to effectively discriminate between fine-grained action pairs.
arXiv Detail & Related papers (2024-09-02T20:08:06Z)
- The impact of Compositionality in Zero-shot Multi-label action recognition for Object-based tasks [4.971065912401385]
We propose Dual-VCLIP, a unified approach for zero-shot multi-label action recognition.
Dual-VCLIP enhances VCLIP, a zero-shot action recognition method, with the DualCoOp method for multi-label image classification.
We validate our method on the Charades dataset, which contains a majority of object-based actions.
arXiv Detail & Related papers (2024-05-14T15:28:48Z)
- Learning Disentangled Identifiers for Action-Customized Text-to-Image Generation [34.11373539564126]
This study focuses on a novel task in text-to-image (T2I) generation, namely action customization.
The objective of this task is to learn the co-existing action from limited data and generalize it to unseen humans or even animals.
arXiv Detail & Related papers (2023-11-27T14:07:13Z)
- Weakly-Supervised Temporal Action Localization with Bidirectional Semantic Consistency Constraint [83.36913240873236]
Weakly Supervised Temporal Action Localization (WTAL) aims to classify actions and localize their temporal boundaries in a video.
We propose a simple yet efficient method, named bidirectional semantic consistency constraint (Bi-SCC), to discriminate positive actions from co-scene actions.
Experimental results show that our approach outperforms the state-of-the-art methods on THUMOS14 and ActivityNet.
arXiv Detail & Related papers (2023-04-25T07:20:33Z)
- Multi-modal Prompting for Low-Shot Temporal Action Localization [95.19505874963751]
We consider the problem of temporal action localization under low-shot (zero-shot and few-shot) scenarios.
We adopt a Transformer-based two-stage action localization architecture with class-agnostic action proposals, followed by open-vocabulary classification.
arXiv Detail & Related papers (2023-03-21T10:40:13Z)
- Few-Shot Fine-Grained Action Recognition via Bidirectional Attention and Contrastive Meta-Learning [51.03781020616402]
Fine-grained action recognition is attracting increasing attention due to the emerging demand for specific action understanding in real-world applications.
We propose a few-shot fine-grained action recognition problem, aiming to recognize novel fine-grained actions with only a few samples given for each class.
Although progress has been made on coarse-grained actions, existing few-shot recognition methods encounter two issues when handling fine-grained actions.
arXiv Detail & Related papers (2021-08-15T02:21:01Z)
- Learning Action Completeness from Points for Weakly-supervised Temporal Action Localization [15.603643098270409]
We tackle the problem of localizing temporal intervals of actions with only a single frame label per action instance for training.
In this paper, we propose a novel framework, where dense pseudo-labels are generated to provide completeness guidance for the model.
arXiv Detail & Related papers (2021-08-11T04:54:39Z)
- Weakly Supervised Temporal Action Localization Through Learning Explicit Subspaces for Action and Context [151.23835595907596]
WS-TAL methods learn to localize the temporal starts and ends of action instances in a video under only video-level supervision.
We introduce a framework that learns two feature subspaces respectively for actions and their context.
The proposed approach outperforms state-of-the-art WS-TAL methods on three benchmarks.
arXiv Detail & Related papers (2021-03-30T08:26:53Z)
- FineGym: A Hierarchical Video Dataset for Fine-grained Action Understanding [118.32912239230272]
FineGym is a new action recognition dataset built on top of gymnastic videos.
It provides temporal annotations at both action and sub-action levels with a three-level semantic hierarchy.
This new level of granularity presents significant challenges for action recognition.
arXiv Detail & Related papers (2020-04-14T17:55:21Z)
- Weakly Supervised Temporal Action Localization Using Deep Metric Learning [12.49814373580862]
We propose a weakly supervised temporal action localization method that only requires video-level action instances as supervision during training.
We jointly optimize a balanced binary cross-entropy loss and a metric loss using a standard backpropagation algorithm.
Our approach improves the current state-of-the-art result for THUMOS14 by 6.5% mAP at IoU threshold 0.5, and achieves competitive performance for ActivityNet1.2.
arXiv Detail & Related papers (2020-01-21T22:01:17Z)