Foreground-Action Consistency Network for Weakly Supervised Temporal
Action Localization
- URL: http://arxiv.org/abs/2108.06524v1
- Date: Sat, 14 Aug 2021 12:34:44 GMT
- Title: Foreground-Action Consistency Network for Weakly Supervised Temporal
Action Localization
- Authors: Linjiang Huang, Liang Wang, Hongsheng Li
- Abstract summary: We present a framework named FAC-Net, to which three branches are appended: a class-wise foreground classification branch, a class-agnostic attention branch, and a multiple instance learning branch.
First, the class-wise foreground classification branch regularizes the relation between actions and foreground to maximize foreground-background separation.
In addition, the class-agnostic attention branch and the multiple instance learning branch regularize the foreground-action consistency and help to learn a meaningful foreground classifier.
- Score: 66.66545680550782
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As a challenging task of high-level video understanding, weakly
supervised temporal action localization has been attracting increasing
attention. With only video-level annotations, most existing methods handle
this task with a localization-by-classification framework, which generally
adopts a selector to choose snippets with high action probabilities, namely
the foreground. Nevertheless, existing foreground selection strategies have a
major limitation: they consider only the unilateral relation from foreground
to actions, which cannot guarantee foreground-action consistency. In this
paper, we present a framework named FAC-Net based on the I3D backbone, on
which three branches are appended: a class-wise foreground classification
branch, a class-agnostic attention branch, and a multiple instance learning
branch. First, the class-wise foreground classification branch regularizes
the relation between actions and foreground to maximize foreground-background
separation. In addition, the class-agnostic attention branch and the multiple
instance learning branch regularize the foreground-action consistency and
help to learn a meaningful foreground classifier. Within each branch, we
introduce a hybrid attention mechanism, which calculates multiple attention
scores for each snippet, to attend to both discriminative and
less-discriminative snippets and thus capture full action boundaries.
Experimental results on THUMOS14 and ActivityNet v1.3 demonstrate the
state-of-the-art performance of our method. Our code is available at
https://github.com/LeonHLJ/FAC-Net.
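To make the three-branch design concrete, here is a minimal PyTorch-style sketch of such a head over I3D snippet features. It is an illustration only: the module names, the number of attention heads, and the attention-weighted pooling are assumptions rather than the authors' implementation (see the linked repository for the real code).

```python
import torch
import torch.nn as nn

class HybridAttentionBranch(nn.Module):
    """Illustrative branch of a FAC-Net-style head (not the authors' code).
    It yields per-snippet class scores plus several attention scores per
    snippet, so both discriminative and less-discriminative snippets can
    contribute to the video-level prediction."""

    def __init__(self, feat_dim: int, num_classes: int, num_heads: int = 3):
        super().__init__()
        self.classifier = nn.Conv1d(feat_dim, num_classes, kernel_size=1)
        # "Hybrid attention": multiple attention scores for each snippet.
        self.attention = nn.Conv1d(feat_dim, num_heads, kernel_size=1)

    def forward(self, x):  # x: (B, feat_dim, T) snippet features
        cas = self.classifier(x)                 # (B, C, T) class activations
        attn = torch.sigmoid(self.attention(x))  # (B, H, T), each in [0, 1]
        fg = attn.mean(dim=1, keepdim=True)      # (B, 1, T) foreground weight
        # Attention-weighted pooling to a video-level class score.
        video_score = (cas * fg).sum(-1) / fg.sum(-1).clamp(min=1e-6)
        return cas, fg.squeeze(1), video_score   # video_score: (B, C)

class FACNetHead(nn.Module):
    """Three branches over shared I3D features: class-wise foreground
    classification, class-agnostic attention, and multiple instance
    learning. In training, each branch's video-level output would carry
    a loss that ties foreground and action predictions together."""

    def __init__(self, feat_dim: int = 2048, num_classes: int = 20):
        super().__init__()
        self.cls_fg = HybridAttentionBranch(feat_dim, num_classes)   # class-wise
        self.ca_attn = HybridAttentionBranch(feat_dim, num_classes)  # class-agnostic
        self.mil = HybridAttentionBranch(feat_dim, num_classes)      # MIL branch

    def forward(self, feats):  # feats: (B, feat_dim, T)
        return self.cls_fg(feats), self.ca_attn(feats), self.mil(feats)

# Usage with dummy features: 2 videos, 2048-d I3D features, 100 snippets.
head = FACNetHead()
outputs = head(torch.randn(2, 2048, 100))
```

In the paper the three branches are not identical copies; each combines its attention with the classification scores differently and carries its own loss, but the shared skeleton above shows where the hybrid attention and the video-level pooling sit.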
Related papers
- Focus-Consistent Multi-Level Aggregation for Compositional Zero-Shot Learning [34.133790456747626]
Our method incorporates a Multi-Level Feature Aggregation (MFA) module to generate personalized features for each branch based on the image content.
arXiv Detail & Related papers (2024-08-30T08:13:06Z)
- Revisiting Foreground and Background Separation in Weakly-supervised Temporal Action Localization: A Clustering-based Approach [48.684550829098534]
Weakly-supervised temporal action localization aims to localize action instances in videos with only video-level action labels.
We propose a novel clustering-based foreground-background (F&B) separation algorithm; a toy sketch of the general idea follows below.
We evaluate our method on three benchmarks: THUMOS14, ActivityNet v1.2 and v1.3.
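As a toy illustration of clustering-based F&B separation (a hedged sketch, not the paper's algorithm; the shapes and the foreground-selection rule are assumptions):

```python
import torch

# Cluster snippet features into two groups with a few Lloyd (k-means)
# iterations, then call the cluster with the higher mean class-activation
# score the foreground.
feats = torch.randn(100, 2048)  # T=100 snippets, 2048-d features
cas = torch.randn(100)          # per-snippet class activation scores

centers = feats[torch.randperm(len(feats))[:2]].clone()
for _ in range(10):
    assign = torch.cdist(feats, centers).argmin(dim=1)  # (T,) cluster ids
    for k in range(2):
        if (assign == k).any():
            centers[k] = feats[assign == k].mean(dim=0)

# The cluster whose snippets score higher on the class activations is
# treated as foreground; the other as background.
scores = [cas[assign == k].mean() if (assign == k).any() else cas.min()
          for k in range(2)]
foreground_mask = assign == int(scores[1] > scores[0])
```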
arXiv Detail & Related papers (2023-12-21T18:57:12Z)
- Proposal-based Temporal Action Localization with Point-level Supervision [29.98225940694062]
Point-level supervised temporal action localization (PTAL) aims at recognizing and localizing actions in untrimmed videos.
We propose a novel method that localizes actions by generating and evaluating action proposals of flexible duration.
Experiments show that our proposed method achieves performance competitive with or superior to state-of-the-art methods.
arXiv Detail & Related papers (2023-10-09T08:27:05Z)
- ACM-Net: Action Context Modeling Network for Weakly-Supervised Temporal Action Localization [18.56421375743287]
We propose an action-context modeling network termed ACM-Net.
It integrates a three-branch attention module that simultaneously measures the likelihood of each temporal point being an action instance, context, or non-action background; a minimal sketch of this idea follows below.
Our method outperforms current state-of-the-art methods and even achieves performance comparable to fully-supervised methods.
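As a rough illustration of the three-way attention idea (a hedged sketch with assumed shapes and layer choice, not ACM-Net's actual implementation):

```python
import torch
import torch.nn as nn

# Three-way attention over temporal points: a single 1x1 convolution
# emits three logits per snippet, and a softmax makes the action,
# context, and background likelihoods compete and sum to one.
attention = nn.Conv1d(2048, 3, kernel_size=1)   # 2048-d snippet features
feats = torch.randn(2, 2048, 100)               # (batch, dim, T) features
probs = torch.softmax(attention(feats), dim=1)  # (batch, 3, T)
action_w, context_w, background_w = probs.unbind(dim=1)  # each (batch, T)
```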
arXiv Detail & Related papers (2021-04-07T07:39:57Z)
- Weakly Supervised Temporal Action Localization Through Learning Explicit Subspaces for Action and Context [151.23835595907596]
WS-TAL methods learn to localize the temporal starts and ends of action instances in a video under only video-level supervision.
We introduce a framework that learns two feature subspaces respectively for actions and their context.
The proposed approach outperforms state-of-the-art WS-TAL methods on three benchmarks.
arXiv Detail & Related papers (2021-03-30T08:26:53Z)
- ACSNet: Action-Context Separation Network for Weakly Supervised Temporal Action Localization [148.55210919689986]
We introduce an Action-Context Separation Network (ACSNet) that takes into account context for accurate action localization.
ACSNet outperforms existing state-of-the-art WS-TAL methods by a large margin.
arXiv Detail & Related papers (2021-03-28T09:20:54Z)
- Learning Salient Boundary Feature for Anchor-free Temporal Action Localization [81.55295042558409]
Temporal action localization is an important yet challenging task in video understanding.
We propose the first purely anchor-free temporal localization method.
Our model includes (i) an end-to-end trainable basic predictor, (ii) a saliency-based refinement module, and (iii) several consistency constraints.
arXiv Detail & Related papers (2021-03-24T12:28:32Z)
- Weakly-Supervised Action Localization by Generative Attention Modeling [65.03548422403061]
Weakly-supervised temporal action localization is a problem of learning an action localization model with only video-level action labeling available.
We propose to model the class-agnostic frame-wise probability, conditioned on the frame attention, using a conditional Variational Auto-Encoder (VAE).
By maximizing the conditional probability with respect to the attention, the action and non-action frames are well separated.
arXiv Detail & Related papers (2020-03-27T14:02:56Z)