D2-Net: Weakly-Supervised Action Localization via Discriminative
Embeddings and Denoised Activations
- URL: http://arxiv.org/abs/2012.06440v1
- Date: Fri, 11 Dec 2020 16:01:56 GMT
- Title: D2-Net: Weakly-Supervised Action Localization via Discriminative
Embeddings and Denoised Activations
- Authors: Sanath Narayan, Hisham Cholakkal, Munawar Hayat, Fahad Shahbaz Khan,
Ming-Hsuan Yang, Ling Shao
- Abstract summary: This work proposes a weakly-supervised temporal action localization framework, called D2-Net.
Our main contribution is the introduction of a novel loss formulation, which jointly enhances the discriminability of latent embeddings and the robustness of the output temporal class activations.
Our D2-Net performs favorably in comparison to the existing methods on two datasets.
- Score: 172.05295776806773
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work proposes a weakly-supervised temporal action localization
framework, called D2-Net, which strives to temporally localize actions using
video-level supervision. Our main contribution is the introduction of a novel
loss formulation, which jointly enhances the discriminability of latent
embeddings and robustness of the output temporal class activations with respect
to foreground-background noise caused by weak supervision. The proposed
formulation comprises a discriminative and a denoising loss term for enhancing
temporal action localization. The discriminative term incorporates a
classification loss and utilizes a top-down attention mechanism to enhance the
separability of latent foreground-background embeddings. The denoising loss
term explicitly addresses the foreground-background noise in class activations
by simultaneously maximizing intra-video and inter-video mutual information
using a bottom-up attention mechanism. As a result, activations in the
foreground regions are emphasized whereas those in the background regions are
suppressed, thereby leading to more robust predictions. Comprehensive
experiments are performed on two benchmarks: THUMOS14 and ActivityNet1.2. Our
D2-Net performs favorably in comparison to the existing methods on both
datasets, achieving gains as high as 3.6% in terms of mean average precision on
THUMOS14.
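The abstract gives no implementation details, so the following is only a minimal sketch of the generic weakly-supervised setting it describes: per-snippet temporal class activations are pooled into a video-level score, trained against video-level labels, and then thresholded to localize actions. It is not the D2-Net loss. The function names, the top-k mean pooling, the binary cross-entropy objective, and the fixed threshold are all assumptions chosen for illustration.

```python
import math


def topk_mean(scores, k):
    """Video-level score for one class: mean of the k largest snippet activations.

    Top-k pooling is an assumed, commonly used choice, not taken from the abstract.
    """
    top = sorted(scores, reverse=True)[:k]
    return sum(top) / len(top)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def video_level_bce(tcas, labels, k):
    """Binary cross-entropy between pooled activations and video-level labels.

    tcas:   one list of per-snippet activations per class (hypothetical layout).
    labels: 0/1 video-level label per class; no snippet-level labels are used.
    """
    loss = 0.0
    for scores, y in zip(tcas, labels):
        p = sigmoid(topk_mean(scores, k))
        loss += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return loss / len(labels)


def localize(scores, threshold):
    """Return (start, end) snippet index pairs, end exclusive, where the
    activation stays above the threshold."""
    segments, start = [], None
    for t, s in enumerate(scores):
        if s > threshold and start is None:
            start = t                      # a segment opens here
        elif s <= threshold and start is not None:
            segments.append((start, t))    # the segment closes before t
            start = None
    if start is not None:                  # segment runs to the end of the video
        segments.append((start, len(scores)))
    return segments


# Toy activations for two classes over six snippets (made-up numbers).
tcas = [[-2.0, 3.0, 4.0, -1.0, 5.0, 5.0],
        [-3.0, -2.0, -4.0, -1.0, -2.0, -3.0]]
loss = video_level_bce(tcas, labels=[1, 0], k=2)
segments = localize(tcas[0], threshold=0.0)
```

In this toy setup, background snippets whose activation is wrongly high would be returned as action segments, which is the foreground-background noise the abstract says the denoising term is meant to suppress.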
Related papers
- Motion-Scenario Decoupling for Rat-Aware Video Position Prediction:
Strategy and Benchmark [49.58762201363483]
We introduce RatPose, a bio-robot motion prediction dataset constructed by considering the influence factors of individuals and environments.
We propose a Dual-stream Motion-Scenario Decoupling framework that effectively separates scenario-oriented and motion-oriented features.
We demonstrate significant performance improvements of the proposed DMSD framework on different difficulty-level tasks.
arXiv Detail & Related papers (2023-05-17T14:14:31Z)
- Multi-modal Prompting for Low-Shot Temporal Action Localization [95.19505874963751]
We consider the problem of temporal action localization under low-shot (zero-shot & few-shot) scenario.
We adopt a Transformer-based two-stage action localization architecture with class-agnostic action proposal, followed by open-vocabulary classification.
arXiv Detail & Related papers (2023-03-21T10:40:13Z)
- Dilation-Erosion for Single-Frame Supervised Temporal Action
Localization [28.945067347089825]
We present the Snippet Classification model and the Dilation-Erosion module.
The Dilation-Erosion module mines pseudo snippet-level ground-truth, hard backgrounds and evident backgrounds.
Experiments on THUMOS14 and ActivityNet 1.2 validate the effectiveness of the proposed method.
arXiv Detail & Related papers (2022-12-13T03:05:13Z)
- Locality-aware Attention Network with Discriminative Dynamics Learning
for Weakly Supervised Anomaly Detection [0.8883733362171035]
We propose a Discriminative Dynamics Learning (DDL) method with two objective functions, i.e., dynamics ranking loss and dynamics alignment loss.
A Locality-aware Attention Network (LA-Net) is constructed to capture global correlations and re-calibrate the location preference across snippets, followed by a multilayer perceptron with causal convolution to obtain anomaly scores.
arXiv Detail & Related papers (2022-08-11T04:27:33Z)
- Forcing the Whole Video as Background: An Adversarial Learning Strategy
for Weakly Temporal Action Localization [6.919243767837342]
We present an adversarial learning strategy to break the limitation of mining pseudo background snippets.
A novel temporal enhancement network is designed to help the model construct temporal relations among affinity snippets.
arXiv Detail & Related papers (2022-07-14T05:13:50Z)
- Fine-grained Temporal Contrastive Learning for Weakly-supervised
Temporal Action Localization [87.47977407022492]
This paper argues that learning by contextually comparing sequence-to-sequence distinctions offers an essential inductive bias in weakly-supervised action localization.
Under a differentiable dynamic programming formulation, two complementary contrastive objectives are designed, including Fine-grained Sequence Distance (FSD) contrasting and Longest Common Subsequence (LCS) contrasting.
Our method achieves state-of-the-art performance on two popular benchmarks.
arXiv Detail & Related papers (2022-03-31T05:13:50Z)
- Action Shuffling for Weakly Supervised Temporal Localization [22.43209053892713]
This paper analyzes the order-sensitive and location-insensitive properties of actions.
It embodies them into a self-augmented learning framework to improve the weakly supervised action localization performance.
arXiv Detail & Related papers (2021-05-10T09:05:58Z)
- Weakly Supervised Temporal Action Localization Through Learning Explicit
Subspaces for Action and Context [151.23835595907596]
Weakly-supervised temporal action localization (WS-TAL) methods learn to localize the temporal starts and ends of action instances in a video under only video-level supervision.
We introduce a framework that learns two feature subspaces respectively for actions and their context.
The proposed approach outperforms state-of-the-art WS-TAL methods on three benchmarks.
arXiv Detail & Related papers (2021-03-30T08:26:53Z)
- Two-Stream Consensus Network for Weakly-Supervised Temporal Action
Localization [94.37084866660238]
We present a Two-Stream Consensus Network (TSCN) to simultaneously address these challenges.
The proposed TSCN features an iterative refinement training method, where a frame-level pseudo ground truth is iteratively updated.
We propose a new attention normalization loss to encourage the predicted attention to act like a binary selection, and promote the precise localization of action instance boundaries.
arXiv Detail & Related papers (2020-10-22T10:53:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.