Weakly-Supervised Action Localization by Generative Attention Modeling
- URL: http://arxiv.org/abs/2003.12424v2
- Date: Mon, 30 Mar 2020 14:36:48 GMT
- Title: Weakly-Supervised Action Localization by Generative Attention Modeling
- Authors: Baifeng Shi, Qi Dai, Yadong Mu, Jingdong Wang
- Abstract summary: Weakly-supervised temporal action localization is a problem of learning an action localization model with only video-level action labeling available.
We propose to model the class-agnostic frame-wise probability conditioned on the frame attention using a conditional Variational Auto-Encoder (VAE).
By maximizing the conditional probability with respect to the attention, the action and non-action frames are well separated.
- Score: 65.03548422403061
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Weakly-supervised temporal action localization is a problem of learning an
action localization model with only video-level action labeling available. The
general framework largely relies on the classification activation, which
employs an attention model to identify the action-related frames and then
categorizes them into different classes. Such a method results in the
action-context confusion issue: context frames near action clips tend to be
recognized as action frames themselves, since they are closely related to the
specific classes. To solve the problem, in this paper we propose to model the
class-agnostic frame-wise probability conditioned on the frame attention using a
conditional Variational Auto-Encoder (VAE). With the observation that the
context exhibits notable difference from the action at representation level, a
probabilistic model, i.e., conditional VAE, is learned to model the likelihood
of each frame given the attention. By maximizing the conditional probability
with respect to the attention, the action and non-action frames are well
separated. Experiments on THUMOS14 and ActivityNet1.2 demonstrate the advantage of
our method and its effectiveness in handling the action-context confusion problem.
Code is now available on GitHub.
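To make the idea concrete, here is a minimal PyTorch sketch of the conditional-VAE formulation described above. It is not the authors' released code: the feature dimension, layer sizes, and the ConditionalVAE and elbo names are illustrative assumptions. It only shows how a per-frame ELBO on p(frame | attention) can be maximized with respect to a frame-wise attention variable so that action and context frames separate.

```python
# Minimal sketch of a conditional VAE modeling p(x | attention); dimensions and
# names are illustrative assumptions, not the authors' released implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionalVAE(nn.Module):
    def __init__(self, feat_dim=1024, latent_dim=128):
        super().__init__()
        # Encoder q(z | x, att): frame feature concatenated with its attention value.
        self.enc = nn.Sequential(nn.Linear(feat_dim + 1, 512), nn.ReLU())
        self.mu = nn.Linear(512, latent_dim)
        self.logvar = nn.Linear(512, latent_dim)
        # Decoder p(x | z, att): reconstruct the frame feature from z and attention.
        self.dec = nn.Sequential(
            nn.Linear(latent_dim + 1, 512), nn.ReLU(), nn.Linear(512, feat_dim)
        )

    def forward(self, x, att):
        h = self.enc(torch.cat([x, att], dim=-1))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        x_rec = self.dec(torch.cat([z, att], dim=-1))
        return x_rec, mu, logvar

def elbo(x, x_rec, mu, logvar):
    # Evidence lower bound on log p(x | att): reconstruction term minus KL(q || N(0, I)).
    rec = -F.mse_loss(x_rec, x, reduction="none").sum(-1)
    kl = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(-1)
    return rec - kl

# Separating action from context: treat the attention as a learnable quantity and
# push it toward values that maximize the conditional likelihood of each frame.
feats = torch.randn(32, 1024)                  # 32 frame features (dummy data)
att = torch.rand(32, 1, requires_grad=True)    # frame-wise attention in [0, 1]
vae = ConditionalVAE()
opt = torch.optim.Adam([att], lr=1e-2)
x_rec, mu, logvar = vae(feats, att)
loss = -elbo(feats, x_rec, mu, logvar).mean()  # maximize the ELBO w.r.t. the attention
opt.zero_grad()
loss.backward()
opt.step()
```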
Related papers
- Weakly-Supervised Temporal Action Localization with Bidirectional Semantic Consistency Constraint [83.36913240873236]
Weakly Supervised Temporal Action Localization (WTAL) aims to classify and localize the temporal boundaries of actions in a video.
We propose a simple yet efficient method, named bidirectional semantic consistency constraint (Bi-SCC), to discriminate the positive actions from co-scene actions.
Experimental results show that our approach outperforms the state-of-the-art methods on THUMOS14 and ActivityNet.
arXiv Detail & Related papers (2023-04-25T07:20:33Z)
- Background-Click Supervision for Temporal Action Localization [82.4203995101082]
Weakly supervised temporal action localization aims at learning the instance-level action pattern from the video-level labels, where a significant challenge is action-context confusion.
One recent work builds an action-click supervision framework.
It requires similar annotation costs but can steadily improve the localization performance compared to conventional weakly supervised methods.
In this paper, by revealing that the performance bottleneck of the existing approaches mainly comes from the background errors, we find that a stronger action localizer can be trained with labels on the background video frames rather than those on the action frames.
arXiv Detail & Related papers (2021-11-24T12:02:52Z)
- Foreground-Action Consistency Network for Weakly Supervised Temporal Action Localization [66.66545680550782]
We present a framework named FAC-Net, to which three branches are appended: a class-wise foreground classification branch, a class-agnostic attention branch, and a multiple instance learning branch.
First, our class-wise foreground classification branch regularizes the relation between actions and foreground to maximize the foreground-background separation.
Besides, the class-agnostic attention branch and multiple instance learning branch are adopted to regularize the foreground-action consistency and help to learn a meaningful foreground.
arXiv Detail & Related papers (2021-08-14T12:34:44Z)
- Weakly Supervised Action Selection Learning in Video [8.337649176647645]
Action Selection Learning (ASL) is proposed to capture the general concept of action, a property we refer to as "actionness".
We show that ASL outperforms leading baselines on two popular benchmarks THUMOS-14 and ActivityNet-1.2, with 10.3% and 5.7% relative improvement respectively.
arXiv Detail & Related papers (2021-05-06T04:39:29Z)
- ACM-Net: Action Context Modeling Network for Weakly-Supervised Temporal Action Localization [18.56421375743287]
We propose an action-context modeling network termed ACM-Net.
It integrates a three-branch attention module to measure the likelihood of each temporal point being an action instance, context, or non-action background, simultaneously.
Our method can outperform current state-of-the-art methods, and even achieve performance comparable to fully-supervised methods.
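As a rough illustration of the three-branch attention described in this entry, the snippet below scores each temporal point as action instance, context, or background with a softmax over three 1D-conv outputs. The layer sizes and names are assumptions made for the sketch, not ACM-Net's implementation.

```python
# Illustrative three-way (action / context / background) attention over temporal
# features; layer sizes and names are assumptions, not the paper's code.
import torch
import torch.nn as nn

class ThreeBranchAttention(nn.Module):
    def __init__(self, feat_dim=2048):
        super().__init__()
        # One score per branch at every temporal point, normalized across branches.
        self.score = nn.Conv1d(feat_dim, 3, kernel_size=1)

    def forward(self, feats):                  # feats: (B, T, feat_dim)
        logits = self.score(feats.transpose(1, 2)).transpose(1, 2)  # (B, T, 3)
        att = logits.softmax(dim=-1)
        # att[..., 0]: action-instance likelihood, att[..., 1]: context,
        # att[..., 2]: non-action background, per temporal point.
        return att

feats = torch.randn(4, 100, 2048)              # 4 videos, 100 snippets each (dummy)
att = ThreeBranchAttention()(feats)
fg_feats = att[..., 0:1] * feats               # action-weighted features for classification
```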
arXiv Detail & Related papers (2021-04-07T07:39:57Z)
- Weakly Supervised Temporal Action Localization Through Learning Explicit Subspaces for Action and Context [151.23835595907596]
WS-TAL methods learn to localize the temporal starts and ends of action instances in a video under only video-level supervision.
We introduce a framework that learns two feature subspaces respectively for actions and their context.
The proposed approach outperforms state-of-the-art WS-TAL methods on three benchmarks.
arXiv Detail & Related papers (2021-03-30T08:26:53Z)
- ACSNet: Action-Context Separation Network for Weakly Supervised Temporal Action Localization [148.55210919689986]
We introduce an Action-Context Separation Network (ACSNet) that takes into account context for accurate action localization.
ACSNet outperforms existing state-of-the-art WS-TAL methods by a large margin.
arXiv Detail & Related papers (2021-03-28T09:20:54Z)
- Action Localization through Continual Predictive Learning [14.582013761620738]
We present a new approach based on continual learning that uses feature-level predictions for self-supervision.
We use a stack of LSTMs coupled with a CNN encoder, along with novel attention mechanisms, to model the events in the video and to predict high-level features for the future frames.
This self-supervised framework is less complicated than other approaches, yet it is very effective in learning robust visual representations for both labeling and localization (see the sketch after this list).
arXiv Detail & Related papers (2020-03-26T23:32:43Z)
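Below is a rough sketch of the self-supervised feature-prediction idea from the last entry (Action Localization through Continual Predictive Learning): a small CNN encoder feeds a stacked LSTM that predicts the next frame's features, and the prediction error can serve as a localization signal. The encoder, sizes, and names are assumptions, not the paper's code.

```python
# Rough sketch of self-supervised future-feature prediction; the encoder choice,
# sizes, and names are assumptions, not the paper's implementation.
import torch
import torch.nn as nn

class PredictiveModel(nn.Module):
    def __init__(self, feat_dim=512, hidden=256, layers=2):
        super().__init__()
        # Frame encoder: a small CNN stand-in for the paper's CNN encoder.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, feat_dim),
        )
        # Stacked LSTM models the temporal evolution of frame features.
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=layers, batch_first=True)
        self.head = nn.Linear(hidden, feat_dim)  # predicts the next frame's feature

    def forward(self, frames):                   # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.encoder(frames.flatten(0, 1)).view(b, t, -1)
        hidden, _ = self.lstm(feats[:, :-1])     # encode frames 0..T-2
        pred = self.head(hidden)                 # predict features of frames 1..T-1
        return pred, feats[:, 1:]

model = PredictiveModel()
clip = torch.randn(2, 8, 3, 112, 112)            # two dummy 8-frame clips
pred, target = model(clip)
loss = nn.functional.mse_loss(pred, target.detach())  # self-supervised objective
# At test time, a high prediction error at a frame can signal an event boundary.
```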