ACSNet: Action-Context Separation Network for Weakly Supervised Temporal
Action Localization
- URL: http://arxiv.org/abs/2103.15088v1
- Date: Sun, 28 Mar 2021 09:20:54 GMT
- Title: ACSNet: Action-Context Separation Network for Weakly Supervised Temporal
Action Localization
- Authors: Ziyi Liu, Le Wang, Qilin Zhang, Wei Tang, Junsong Yuan, Nanning Zheng,
Gang Hua
- Abstract summary: We introduce an Action-Context Separation Network (ACSNet) that takes into account context for accurate action localization.
ACSNet outperforms existing state-of-the-art WS-TAL methods by a large margin.
- Score: 148.55210919689986
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The objective of Weakly-supervised Temporal Action Localization (WS-TAL) is to
localize all action instances in an untrimmed video with only video-level
supervision. Due to the lack of frame-level annotations during training,
current WS-TAL methods rely on attention mechanisms to localize the foreground
snippets or frames that contribute to the video-level classification task. This
strategy frequently confuses context with the actual action in the localization
results. Separating action and context is a core problem for precise WS-TAL, but
it is very challenging and has been largely ignored in the literature. In this
paper, we introduce an Action-Context Separation Network (ACSNet) that
explicitly takes into account context for accurate action localization. It
consists of two branches (i.e., the Foreground-Background branch and the
Action-Context branch). The Foreground-Background branch first distinguishes
foreground from background within the entire video, while the Action-Context
branch further separates the foreground into action and context. We associate
video snippets with two latent components (i.e., a positive component and a
negative component), and their different combinations can effectively
characterize foreground, action and context. Furthermore, we introduce extended
labels with auxiliary context categories to facilitate the learning of
action-context separation. Experiments on THUMOS14 and ActivityNet v1.2/v1.3
datasets demonstrate that ACSNet outperforms existing state-of-the-art WS-TAL
methods by a large margin.
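To make the two-branch design above concrete, here is a minimal PyTorch sketch (hypothetical module and layer names; a plausible reading of the abstract, not the authors' released code). It assumes snippet features of shape (batch, dim, time) and derives foreground, action, and context attentions from combinations of the positive and negative latent components; ACSNet's exact combination rules may differ.

```python
import torch
import torch.nn as nn

def _attn_head(feat_dim: int) -> nn.Module:
    # Small temporal conv head producing one score per snippet.
    return nn.Sequential(nn.Conv1d(feat_dim, 256, 3, padding=1),
                         nn.ReLU(),
                         nn.Conv1d(256, 1, 1))

class ActionContextSeparation(nn.Module):
    """Illustrative sketch of the two-branch idea (not the released code)."""

    def __init__(self, feat_dim: int = 2048):
        super().__init__()
        self.fg_branch = _attn_head(feat_dim)   # Foreground-Background branch
        self.pos_branch = _attn_head(feat_dim)  # Action-Context: positive component
        self.neg_branch = _attn_head(feat_dim)  # Action-Context: negative component

    def forward(self, x):                        # x: (B, D, T) snippet features
        fg = torch.sigmoid(self.fg_branch(x))    # foreground vs. background
        pos = torch.sigmoid(self.pos_branch(x))  # positive latent component
        neg = torch.sigmoid(self.neg_branch(x))  # negative latent component
        # One plausible combination: action = foreground backed by the positive
        # component, context = foreground backed by the negative component.
        action = fg * pos
        context = fg * neg
        return fg, action, context
```

Under the extended-label scheme mentioned in the abstract, the context attention would additionally be supervised through auxiliary context categories appended to the video-level action labels.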
Related papers
- Open-Vocabulary Spatio-Temporal Action Detection [59.91046192096296]
Open-vocabulary spatio-temporal action detection (OV-STAD) is an important fine-grained video understanding task.
OV-STAD requires training a model on a limited set of base classes with box and label supervision.
To better adapt the holistic VLM for the fine-grained action detection task, we carefully fine-tune it on the localized video region-text pairs.
arXiv Detail & Related papers (2024-05-17T14:52:47Z) - HTNet: Anchor-free Temporal Action Localization with Hierarchical
Transformers [19.48000379201692]
Temporal action localization (TAL) is a task of identifying a set of actions in a video.
We present a novel anchor-free framework, known as HTNet, which predicts a set of <start time, end time, class> triplets from a video.
We demonstrate that our method localizes accurate action instances and achieves state-of-the-art performance on two TAL benchmark datasets.
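The following is a loose sketch of such an anchor-free prediction head (illustrative only; HTNet's hierarchical Transformer backbone and actual head design are not reproduced here): each temporal position regresses its distances to an action's start and end and classifies the category, which yields the <start time, end time, class> triplets.

```python
import torch
import torch.nn as nn

class AnchorFreeHead(nn.Module):
    """Illustrative anchor-free localization head (not HTNet's code)."""

    def __init__(self, feat_dim: int = 512, num_classes: int = 20):
        super().__init__()
        self.reg = nn.Conv1d(feat_dim, 2, 1)  # (dist_to_start, dist_to_end)
        self.cls = nn.Conv1d(feat_dim, num_classes, 1)

    def forward(self, feats):                  # feats: (B, D, T)
        offsets = torch.relu(self.reg(feats))  # non-negative distances, (B, 2, T)
        logits = self.cls(feats)               # per-position class scores, (B, C, T)
        t = torch.arange(feats.size(-1), device=feats.device).float()
        starts = t - offsets[:, 0]             # predicted start per position, (B, T)
        ends = t + offsets[:, 1]               # predicted end per position, (B, T)
        return starts, ends, logits
```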
arXiv Detail & Related papers (2022-07-20T05:40:03Z) - Foreground-Action Consistency Network for Weakly Supervised Temporal
Action Localization [66.66545680550782]
We present a framework named FAC-Net, which appends three branches: a class-wise foreground classification branch, a class-agnostic attention branch, and a multiple instance learning branch.
First, our class-wise foreground classification branch regularizes the relation between actions and foreground to maximize the foreground-background separation.
In addition, the class-agnostic attention branch and the multiple instance learning branch are adopted to regularize foreground-action consistency and help learn a meaningful foreground.
arXiv Detail & Related papers (2021-08-14T12:34:44Z) - ACM-Net: Action Context Modeling Network for Weakly-Supervised Temporal
Action Localization [18.56421375743287]
We propose an action-context modeling network termed ACM-Net.
It integrates a three-branch attention module to simultaneously measure the likelihood of each temporal point being an action instance, context, or non-action background.
Our method can outperform current state-of-the-art methods, and even achieve comparable performance with fully-supervised methods.
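A minimal sketch of this kind of three-branch attention (hypothetical names; not the authors' implementation): a softmax over three logits per temporal point yields action-instance, context, and background likelihoods that sum to one.

```python
import torch
import torch.nn as nn

class ThreeBranchAttention(nn.Module):
    """Illustrative three-way attention in the spirit of ACM-Net."""

    def __init__(self, feat_dim: int = 2048):
        super().__init__()
        self.attn = nn.Sequential(nn.Conv1d(feat_dim, 256, 3, padding=1),
                                  nn.ReLU(),
                                  nn.Conv1d(256, 3, 1))

    def forward(self, x):                       # x: (B, D, T) snippet features
        probs = torch.softmax(self.attn(x), 1)  # (B, 3, T), sums to 1 per point
        action, context, background = probs.unbind(dim=1)
        return action, context, background
```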
arXiv Detail & Related papers (2021-04-07T07:39:57Z) - Weakly Supervised Temporal Action Localization Through Learning Explicit
Subspaces for Action and Context [151.23835595907596]
WS-TAL methods learn to localize the temporal starts and ends of action instances in a video under only video-level supervision.
We introduce a framework that learns two feature subspaces respectively for actions and their context.
The proposed approach outperforms state-of-the-art WS-TAL methods on three benchmarks.
arXiv Detail & Related papers (2021-03-30T08:26:53Z) - Learning to Localize Actions from Moments [153.54638582696128]
We introduce a new transfer learning design to learn action localization for a large set of action categories.
We present Action Herald Networks (AherNet), which integrate this design into a one-stage action localization framework.
arXiv Detail & Related papers (2020-08-31T16:03:47Z) - Weakly-Supervised Action Localization by Generative Attention Modeling [65.03548422403061]
Weakly-supervised temporal action localization is a problem of learning an action localization model with only video-level action labeling available.
We propose to model the class-agnostic frame-wise probability conditioned on the frame attention using a conditional Variational Auto-Encoder (VAE).
By maximizing the conditional probability with respect to the attention, the action and non-action frames are well separated.
arXiv Detail & Related papers (2020-03-27T14:02:56Z)