Forcing the Whole Video as Background: An Adversarial Learning Strategy for Weakly Temporal Action Localization
- URL: http://arxiv.org/abs/2207.06659v1
- Date: Thu, 14 Jul 2022 05:13:50 GMT
- Title: Forcing the Whole Video as Background: An Adversarial Learning Strategy for Weakly Temporal Action Localization
- Authors: Ziqiang Li, Yongxin Ge, Jiaruo Yu, and Zhongming Chen
- Abstract summary: We present an adversarial learning strategy to break the limitation of mining pseudo background snippets.
A novel temporal enhancement network is designed to help the model construct temporal relations among affinity snippets.
- Score: 6.919243767837342
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With video-level labels, weakly supervised temporal action localization
(WTAL) applies a localization-by-classification paradigm to detect and classify
actions in untrimmed videos. Because of the nature of classification,
class-specific background snippets are inevitably mis-activated as the
classifier pursues greater discriminability in WTAL. To alleviate the
disturbance of background, existing methods try to enlarge the discrepancy
between action and background by modeling background snippets with
pseudo-snippet-level annotations, which rely heavily on artificial assumptions.
In contrast to previous works, we present an adversarial learning strategy that
breaks the limitation of mining pseudo background snippets. Concretely, the
background classification loss forces the whole video to be regarded as
background through a background gradient reinforcement strategy, confusing the
recognition model. Conversely, the foreground (action) loss guides the model to
focus on action snippets under these conditions. As a result, the competition
between the two classification losses drives the model to strengthen its action
modeling. Simultaneously, a novel temporal enhancement network is designed to
help the model construct temporal relations among affinity snippets under the
proposed strategy, further improving action localization. Finally, extensive
experiments on THUMOS14 and ActivityNet1.2 demonstrate the effectiveness of the
proposed method.
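To make the interplay of the two objectives concrete, below is a minimal, illustrative sketch in PyTorch of how a pair of competing snippet-level losses of this kind could be wired up. It is not the authors' implementation: the names (`SnippetClassifier`, `adversarial_losses`), the feature dimension, and the top-k aggregation of action scores are assumptions, and the paper's background gradient reinforcement strategy and temporal enhancement network are not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sketch of two competing foreground/background objectives,
# loosely following the abstract above; names and architecture are
# illustrative only, not the paper's actual method.
class SnippetClassifier(nn.Module):
    def __init__(self, feat_dim=2048, num_classes=20):
        super().__init__()
        # Per-snippet classifier: num_classes action classes + 1 background class.
        self.fc = nn.Linear(feat_dim, num_classes + 1)

    def forward(self, snippet_feats):            # (B, T, feat_dim)
        return self.fc(snippet_feats)            # (B, T, num_classes + 1)


def adversarial_losses(logits, video_labels, k=8):
    """Compute the two competing losses.

    logits:       (B, T, C+1) snippet-level class scores (last index = background)
    video_labels: (B, C) multi-hot video-level action labels
    k:            number of top-scoring snippets aggregated for the foreground loss
    """
    B, T, _ = logits.shape

    # Background loss: force EVERY snippet of the video to be classified
    # as background, confusing the recognition model.
    bg_index = logits.shape[-1] - 1
    bg_target = torch.full((B, T), bg_index, dtype=torch.long, device=logits.device)
    loss_bg = F.cross_entropy(logits.reshape(B * T, -1), bg_target.reshape(-1))

    # Foreground loss: aggregate top-k snippet scores per class into a
    # video-level prediction and match it to the video-level labels.
    action_logits = logits[..., :-1]                        # (B, T, C)
    topk = action_logits.topk(k, dim=1).values.mean(dim=1)  # (B, C)
    loss_fg = F.binary_cross_entropy_with_logits(topk, video_labels.float())

    # Competition between the two losses pushes the model to separate
    # action snippets from background snippets more sharply.
    return loss_fg + loss_bg
```

In a full system the two terms would typically be weighted against each other and combined with the additional components described in the abstract; the sketch only shows how a background loss applied to every snippet and a video-level foreground loss can pull the same snippet activations in opposite directions.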
Related papers
- Bayesian Learning-driven Prototypical Contrastive Loss for Class-Incremental Learning [42.14439854721613]
We propose a prototypical network with a Bayesian learning-driven contrastive loss (BLCL) tailored specifically for class-incremental learning scenarios.
Our approach dynamically adapts the balance between the cross-entropy and contrastive loss functions with a Bayesian learning technique.
arXiv Detail & Related papers (2024-05-17T19:49:02Z)
- SOAR: Scene-debiasing Open-set Action Recognition [81.8198917049666]
We propose Scene-debiasing Open-set Action Recognition (SOAR), which features an adversarial scene reconstruction module and an adaptive adversarial scene classification module.
The former prevents the decoder from reconstructing the video background given video features, and thus helps reduce the background information in feature learning.
The latter aims to confuse scene type classification given video features, with a specific emphasis on the action foreground, and helps to learn scene-invariant information.
arXiv Detail & Related papers (2023-09-03T20:20:48Z)
- Multi-modal Prompting for Low-Shot Temporal Action Localization [95.19505874963751]
We consider the problem of temporal action localization under low-shot (zero-shot & few-shot) scenario.
We adopt a Transformer-based two-stage action localization architecture with class-agnostic action proposal, followed by open-vocabulary classification.
arXiv Detail & Related papers (2023-03-21T10:40:13Z)
- Dilation-Erosion for Single-Frame Supervised Temporal Action Localization [28.945067347089825]
We present the Snippet Classification model and the Dilation-Erosion module.
The Dilation-Erosion module mines pseudo snippet-level ground-truth, hard backgrounds and evident backgrounds.
Experiments on THUMOS14 and ActivityNet 1.2 validate the effectiveness of the proposed method.
arXiv Detail & Related papers (2022-12-13T03:05:13Z)
- Zero-Shot Temporal Action Detection via Vision-Language Prompting [134.26292288193298]
We propose a novel zero-Shot Temporal Action detection model via Vision-LanguagE prompting (STALE).
Our model significantly outperforms state-of-the-art alternatives.
Our model also yields superior results on supervised TAD over recent strong competitors.
arXiv Detail & Related papers (2022-07-17T13:59:46Z)
- Background-Click Supervision for Temporal Action Localization [82.4203995101082]
Weakly supervised temporal action localization aims at learning the instance-level action pattern from the video-level labels, where a significant challenge is action-context confusion.
One recent work builds an action-click supervision framework.
It requires similar annotation costs but can steadily improve the localization performance when compared to the conventional weakly supervised methods.
In this paper, by revealing that the performance bottleneck of the existing approaches mainly comes from the background errors, we find that a stronger action localizer can be trained with labels on the background video frames rather than those on the action frames.
arXiv Detail & Related papers (2021-11-24T12:02:52Z)
- MCDAL: Maximum Classifier Discrepancy for Active Learning [74.73133545019877]
Recent state-of-the-art active learning methods have mostly leveraged Generative Adversarial Networks (GAN) for sample acquisition.
We propose in this paper a novel active learning framework that we call Maximum Classifier Discrepancy for Active Learning (MCDAL).
In particular, we utilize two auxiliary classification layers that learn tighter decision boundaries by maximizing the discrepancies among them.
arXiv Detail & Related papers (2021-07-23T06:57:08Z)
- D2-Net: Weakly-Supervised Action Localization via Discriminative Embeddings and Denoised Activations [172.05295776806773]
This work proposes a weakly-supervised temporal action localization framework, called D2-Net.
Our main contribution is the introduction of a novel loss formulation, which jointly enhances the discriminability of latent embeddings.
Our D2-Net performs favorably in comparison to the existing methods on two datasets.
arXiv Detail & Related papers (2020-12-11T16:01:56Z)