Weakly-supervised Action Localization via Hierarchical Mining
- URL: http://arxiv.org/abs/2206.11011v1
- Date: Wed, 22 Jun 2022 12:19:09 GMT
- Title: Weakly-supervised Action Localization via Hierarchical Mining
- Authors: Jia-Chang Feng, Fa-Ting Hong, Jia-Run Du, Zhongang Qi, Ying Shan,
Xiaohu Qie, Wei-Shi Zheng, Jianping Wu
- Abstract summary: Weakly-supervised action localization aims to temporally localize and classify action instances in given videos with only video-level categorical labels.
We propose a hierarchical mining strategy at the video level and the snippet level, i.e., hierarchical supervision and hierarchical consistency mining.
We show that HiM-Net outperforms existing methods on the THUMOS14 and ActivityNet1.3 datasets by large margins by hierarchically mining the supervision and consistency.
- Score: 76.00021423700497
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Weakly-supervised action localization aims to temporally localize and
classify action instances in given videos with only video-level categorical
labels. The crucial issue for existing weakly-supervised action localization
methods is therefore the limited supervision that the weak annotations provide
for precise predictions. In this work, we propose a hierarchical mining strategy
at the video level and the snippet level, i.e., hierarchical supervision and
hierarchical consistency mining, to maximize the usage of the given annotations
and prediction-wise consistency. To this end, a Hierarchical Mining Network
(HiM-Net) is proposed. Concretely, it mines hierarchical supervision for
classification in two grains: one is the video-level existence of the
ground-truth categories, captured by multiple instance learning; the other is
the snippet-level inexistence of each negative-labeled category, viewed from
the perspective of complementary labels and optimized by our proposed
complementary label learning. As for hierarchical consistency, HiM-Net explores
video-level co-action feature similarity and snippet-level
foreground-background opposition for discriminative representation learning
and consistent foreground-background separation. Specifically, prediction
variance is viewed as uncertainty to select pairs with high consensus for the
proposed foreground-background collaborative learning. Comprehensive
experimental results show that HiM-Net outperforms existing methods on the
THUMOS14 and ActivityNet1.3 datasets by large margins through hierarchically
mining the supervision and consistency. Code will be available on GitHub.
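To make the two supervision grains concrete, below is a minimal PyTorch sketch of video-level MIL pooling and a snippet-level complementary-label loss as described in the abstract. The tensor shapes, top-k ratio, and exact loss forms are assumptions rather than the released HiM-Net implementation, and the consistency-mining branch is omitted.

```python
# Hypothetical sketch of the two supervision grains: video-level MIL via top-k
# pooling, and a snippet-level complementary-label loss for categories that are
# absent from the video label. Not the authors' released code.
import torch
import torch.nn.functional as F

def video_level_mil_loss(snippet_logits, video_labels, k_ratio=0.125):
    """snippet_logits: (B, T, C) class activation sequence; video_labels: (B, C) multi-hot."""
    num_snippets = snippet_logits.shape[1]
    k = max(1, int(num_snippets * k_ratio))
    # Aggregate the top-k snippet scores per class into a video-level score (MIL pooling).
    topk_scores = snippet_logits.topk(k, dim=1).values.mean(dim=1)      # (B, C)
    return F.binary_cross_entropy(torch.sigmoid(topk_scores), video_labels)

def complementary_label_loss(snippet_logits, video_labels):
    """Push every snippet away from the categories that are absent at the video level."""
    snippet_probs = torch.sigmoid(snippet_logits)                       # (B, T, C)
    negative_mask = (1.0 - video_labels).unsqueeze(1).expand_as(snippet_probs)
    loss = -torch.log(1.0 - snippet_probs + 1e-6) * negative_mask       # penalize p(absent class)
    return loss.sum() / negative_mask.sum().clamp(min=1.0)

# Toy usage: 2 videos, 100 snippets, 20 classes.
logits = torch.randn(2, 100, 20)
labels = torch.zeros(2, 20)
labels[0, 3] = 1.0
labels[1, 7] = 1.0
total_loss = video_level_mil_loss(logits, labels) + complementary_label_loss(logits, labels)
```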
Related papers
- Hierarchical Action Recognition: A Contrastive Video-Language Approach with Hierarchical Interactions [19.741453194665276]
We formalize the novel task of hierarchical video recognition, and propose a video-language learning framework tailored for hierarchical recognition.
Specifically, our framework encodes dependencies between hierarchical category levels, and applies a top-down constraint to filter recognition predictions.
We demonstrate the efficacy of our approach for hierarchical recognition, significantly outperforming conventional methods.
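The top-down constraint mentioned above can be pictured as gating each fine-grained class score by the score of its coarse parent category. The sketch below is only an illustration of that idea, with an invented parent/child mapping; it is not the framework from the paper.

```python
import torch

def top_down_filter(parent_probs, child_probs, child_to_parent):
    """parent_probs: (B, P); child_probs: (B, C); child_to_parent: (C,) parent index per child."""
    gate = parent_probs[:, child_to_parent]      # each child's parent probability, shape (B, C)
    return child_probs * gate                    # children of unlikely parents are suppressed

# Made-up two-level hierarchy: parents = [ball sports, water sports],
# children = [basketball, tennis, swimming].
parent = torch.tensor([[0.9, 0.1]])
child = torch.tensor([[0.6, 0.7, 0.8]])
mapping = torch.tensor([0, 0, 1])
print(top_down_filter(parent, child, mapping))   # swimming (child of water sports) is down-weighted
```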
arXiv Detail & Related papers (2024-05-28T01:17:22Z)
- Revisiting Foreground and Background Separation in Weakly-supervised Temporal Action Localization: A Clustering-based Approach [48.684550829098534]
Weakly-supervised temporal action localization aims to localize action instances in videos with only video-level action labels.
We propose a novel clustering-based foreground and background (F&B) separation algorithm.
We evaluate our method on three benchmarks: THUMOS14, ActivityNet v1.2 and v1.3.
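The summary does not spell out the clustering step, so the sketch below is only one plausible minimal version under that assumption: snippet features are split into two k-means clusters, and the cluster with the higher average class-agnostic actionness is treated as foreground. The paper's actual algorithm may differ substantially.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_fb_split(snippet_feats, snippet_actionness):
    """snippet_feats: (T, D) features; snippet_actionness: (T,) class-agnostic scores."""
    assign = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(snippet_feats)
    # The cluster with the higher average actionness is labelled foreground.
    fg_cluster = int(snippet_actionness[assign == 0].mean() < snippet_actionness[assign == 1].mean())
    return assign == fg_cluster                  # boolean foreground mask of length T

feats = np.random.randn(100, 64)
scores = np.random.rand(100)
fg_mask = cluster_fb_split(feats, scores)
```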
arXiv Detail & Related papers (2023-12-21T18:57:12Z)
- Panoptic Out-of-Distribution Segmentation [11.388678390784195]
We propose Panoptic Out-of-Distribution Segmentation for joint pixel-level semantic in-distribution and out-of-distribution classification with instance prediction.
We make the dataset, code, and trained models publicly available at http://pods.cs.uni-freiburg.de.
arXiv Detail & Related papers (2023-10-18T08:38:31Z)
- Weakly-Supervised Action Localization by Hierarchically-structured Latent Attention Modeling [19.683714649646603]
Weakly-supervised action localization aims to recognize and localize action instances in untrimmed videos with only video-level labels.
Most existing models rely on multiple instance learning (MIL), where predictions of unlabeled instances are supervised by classifying labeled bags.
We propose a novel attention-based hierarchically-structured latent model to learn the temporal variations of feature semantics.
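As background for the MIL formulation mentioned above, here is a generic attention-based MIL pooling module in PyTorch: snippet-level attention weights aggregate features into a bag-level (video-level) prediction trained against the video label. The hierarchically-structured latent attention proposed in the paper is not reproduced; the module and its dimensions are illustrative only.

```python
import torch
import torch.nn as nn

class AttentionMIL(nn.Module):
    def __init__(self, feat_dim=2048, num_classes=20):
        super().__init__()
        # Per-snippet attention scores and a shared video-level classifier.
        self.attention = nn.Sequential(nn.Linear(feat_dim, 256), nn.Tanh(), nn.Linear(256, 1))
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, snippet_feats):                                 # (B, T, D)
        attn = torch.softmax(self.attention(snippet_feats), dim=1)    # (B, T, 1)
        video_feat = (attn * snippet_feats).sum(dim=1)                # (B, D)
        return self.classifier(video_feat), attn.squeeze(-1)          # bag logits, snippet attention

model = AttentionMIL()
logits, attn = model(torch.randn(2, 100, 2048))
```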
arXiv Detail & Related papers (2023-08-19T08:45:49Z)
- Proposal-Based Multiple Instance Learning for Weakly-Supervised Temporal Action Localization [98.66318678030491]
Weakly-supervised temporal action localization aims to localize and recognize actions in untrimmed videos with only video-level category labels during training.
We propose a novel Proposal-based Multiple Instance Learning (P-MIL) framework that directly classifies the candidate proposals in both the training and testing stages.
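A rough sketch of the proposal-based MIL idea: features are average-pooled inside each candidate proposal, every proposal is classified, and the top-scoring proposals are aggregated into a video-level prediction for training against the video label. Proposal generation is omitted and all names here are hypothetical, not the P-MIL implementation.

```python
import torch

def classify_proposals(snippet_feats, proposals, classifier):
    """snippet_feats: (T, D); proposals: list of (start, end) indices; classifier: D -> C module."""
    pooled = torch.stack([snippet_feats[s:e].mean(dim=0) for s, e in proposals])   # (P, D)
    return classifier(pooled)                                                      # (P, C)

def video_score_from_proposals(proposal_logits, k=3):
    # Top-k aggregation over proposals gives a video-level score for MIL supervision.
    k = min(k, proposal_logits.shape[0])
    return proposal_logits.topk(k, dim=0).values.mean(dim=0)                       # (C,)

feats = torch.randn(100, 2048)
clf = torch.nn.Linear(2048, 20)
proposal_logits = classify_proposals(feats, [(0, 10), (25, 60), (70, 90)], clf)
video_logits = video_score_from_proposals(proposal_logits)
```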
arXiv Detail & Related papers (2023-05-29T02:48:04Z)
- Fine-grained Temporal Contrastive Learning for Weakly-supervised Temporal Action Localization [87.47977407022492]
This paper argues that learning by contextually comparing sequence-to-sequence distinctions offers an essential inductive bias in weakly-supervised action localization.
Under a differentiable dynamic programming formulation, two complementary contrastive objectives are designed, including Fine-grained Sequence Distance (FSD) contrasting and Longest Common Subsequence (LCS) contrasting.
Our method achieves state-of-the-art performance on two popular benchmarks.
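To make the LCS contrasting objective less abstract, the snippet below shows the classical (hard) longest-common-subsequence dynamic program over two discretized snippet-label sequences; the paper uses a differentiable relaxation of such recursions, which is not reproduced here.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences (standard DP)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

# e.g. discretized pseudo-labels of two videos (0 = background, 1 = action)
print(lcs_length([0, 1, 1, 0, 1], [1, 1, 0, 0, 1]))   # -> 4
```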
arXiv Detail & Related papers (2022-03-31T05:13:50Z)
- Point-Level Temporal Action Localization: Bridging Fully-supervised Proposals to Weakly-supervised Losses [84.2964408497058]
Point-level temporal action localization (PTAL) aims to localize actions in untrimmed videos with only one timestamp annotation for each action instance.
Existing methods adopt the frame-level prediction paradigm to learn from the sparse single-frame labels.
This paper attempts to explore the proposal-based prediction paradigm for point-level annotations.
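One plausible way to bridge point-level labels to proposal-style training, sketched below as an assumption rather than the paper's scheme, is to expand each annotated timestamp into candidate segments at several temporal scales.

```python
def point_to_proposals(point, num_snippets, scales=(8, 16, 32)):
    """point: annotated snippet index; returns (start, end) candidates clipped to the video."""
    proposals = []
    for scale in scales:
        start = max(0, point - scale // 2)
        end = min(num_snippets, point + scale // 2)
        proposals.append((start, end))
    return proposals

print(point_to_proposals(point=50, num_snippets=100))   # [(46, 54), (42, 58), (34, 66)]
```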
arXiv Detail & Related papers (2020-12-15T12:11:48Z)
- Mixup-CAM: Weakly-supervised Semantic Segmentation via Uncertainty Regularization [73.03956876752868]
We propose a principled and end-to-end trainable framework that allows the network to pay attention to other parts of the object.
Specifically, we introduce the mixup data augmentation scheme into the classification network and design two uncertainty regularization terms to better interact with the mixup strategy.
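For reference, a standard mixup step and a simple entropy-style uncertainty penalty are sketched below; the two regularization terms used in Mixup-CAM are not specified in this summary, so the entropy term is only an illustrative stand-in.

```python
import torch
import torch.nn.functional as F

def mixup(images, labels, alpha=0.4):
    """Convexly combine a batch with a shuffled copy of itself (standard mixup)."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(images.size(0))
    mixed = lam * images + (1 - lam) * images[perm]
    return mixed, labels, labels[perm], lam

def mixup_loss(logits, y_a, y_b, lam):
    return lam * F.cross_entropy(logits, y_a) + (1 - lam) * F.cross_entropy(logits, y_b)

def entropy_regularizer(logits):
    # Negative entropy: adding this term to the loss discourages over-confident predictions.
    p = torch.softmax(logits, dim=1)
    return (p * torch.log(p + 1e-6)).sum(dim=1).mean()
```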
arXiv Detail & Related papers (2020-08-03T21:19:08Z)