ASM-Loc: Action-aware Segment Modeling for Weakly-Supervised Temporal
Action Localization
- URL: http://arxiv.org/abs/2203.15187v1
- Date: Tue, 29 Mar 2022 01:59:26 GMT
- Title: ASM-Loc: Action-aware Segment Modeling for Weakly-Supervised Temporal
Action Localization
- Authors: Bo He, Xitong Yang, Le Kang, Zhiyu Cheng, Xin Zhou, Abhinav
Shrivastava
- Abstract summary: Weakly-supervised temporal action localization aims to recognize and localize action segments in untrimmed videos given only video-level action labels for training.
We propose system, a novel WTAL framework that enables explicit, action-aware segment modeling beyond standard MIL-based methods.
Our framework entails three segment-centric components: (i) dynamic segment sampling for compensating the contribution of short actions; (ii) intra- and inter-segment attention for modeling action dynamics and capturing temporal dependencies; (iii) pseudo instance-level supervision for improving action boundary prediction.
- Score: 36.90693762365237
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Weakly-supervised temporal action localization aims to recognize and localize
action segments in untrimmed videos given only video-level action labels for
training. Without the boundary information of action segments, existing methods
mostly rely on multiple instance learning (MIL), where the predictions of
unlabeled instances (i.e., video snippets) are supervised by classifying
labeled bags (i.e., untrimmed videos). However, this formulation typically
treats snippets in a video as independent instances, ignoring the underlying
temporal structures within and across action segments. To address this problem,
we propose \system, a novel WTAL framework that enables explicit, action-aware
segment modeling beyond standard MIL-based methods. Our framework entails three
segment-centric components: (i) dynamic segment sampling for compensating the
contribution of short actions; (ii) intra- and inter-segment attention for
modeling action dynamics and capturing temporal dependencies; (iii) pseudo
instance-level supervision for improving action boundary prediction.
Furthermore, a multi-step refinement strategy is proposed to progressively
improve action proposals along the model training process. Extensive
experiments on THUMOS-14 and ActivityNet-v1.3 demonstrate the effectiveness of
our approach, establishing new state of the art on both datasets. The code and
models are publicly available at~\url{https://github.com/boheumd/ASM-Loc}.
Related papers
- POTLoc: Pseudo-Label Oriented Transformer for Point-Supervised Temporal Action Localization [26.506893363676678]
This paper proposes POTLoc, a Pseudo-label Oriented Transformer for weakly-supervised Action localization.
POTLoc is designed to identify and track continuous action structures via a self-training strategy.
It outperforms the state-of-the-art point-supervised methods on THUMOS'14 and ActivityNet-v1.2 datasets.
arXiv Detail & Related papers (2023-10-20T15:28:06Z) - BIT: Bi-Level Temporal Modeling for Efficient Supervised Action
Segmentation [34.88225099758585]
supervised action segmentation aims to partition a video into non-overlapping segments, each representing a different action.
Recent works apply transformers to perform temporal modeling at the frame-level, which suffer from high computational cost.
We propose an efficient BI-level Temporal modeling framework that learns explicit action tokens to represent action segments.
arXiv Detail & Related papers (2023-08-28T20:59:15Z) - Weakly-Supervised Action Localization by Hierarchically-structured
Latent Attention Modeling [19.683714649646603]
Weakly-supervised action localization aims to recognize and localize action instancese in untrimmed videos with only video-level labels.
Most existing models rely on multiple instance learning(MIL), where predictions of unlabeled instances are supervised by classifying labeled bags.
We propose a novel attention-based hierarchically-structured latent model to learn the temporal variations of feature semantics.
arXiv Detail & Related papers (2023-08-19T08:45:49Z) - Diffusion Action Segmentation [63.061058214427085]
We propose a novel framework via denoising diffusion models, which shares the same inherent spirit of such iterative refinement.
In this framework, action predictions are iteratively generated from random noise with input video features as conditions.
arXiv Detail & Related papers (2023-03-31T10:53:24Z) - Temporal Segment Transformer for Action Segmentation [54.25103250496069]
We propose an attention based approach which we call textittemporal segment transformer, for joint segment relation modeling and denoising.
The main idea is to denoise segment representations using attention between segment and frame representations, and also use inter-segment attention to capture temporal correlations between segments.
We show that this novel architecture achieves state-of-the-art accuracy on the popular 50Salads, GTEA and Breakfast benchmarks.
arXiv Detail & Related papers (2023-02-25T13:05:57Z) - Graph Convolutional Module for Temporal Action Localization in Videos [142.5947904572949]
We claim that the relations between action units play an important role in action localization.
A more powerful action detector should not only capture the local content of each action unit but also allow a wider field of view on the context related to it.
We propose a general graph convolutional module (GCM) that can be easily plugged into existing action localization methods.
arXiv Detail & Related papers (2021-12-01T06:36:59Z) - EAN: Event Adaptive Network for Enhanced Action Recognition [66.81780707955852]
We propose a unified action recognition framework to investigate the dynamic nature of video content.
First, when extracting local cues, we generate the spatial-temporal kernels of dynamic-scale to adaptively fit the diverse events.
Second, to accurately aggregate these cues into a global video representation, we propose to mine the interactions only among a few selected foreground objects by a Transformer.
arXiv Detail & Related papers (2021-07-22T15:57:18Z) - Action Shuffle Alternating Learning for Unsupervised Action Segmentation [38.32743770719661]
We train an RNN to recognize positive and negative action sequences, and the RNN's hidden layer is taken as our new action-level feature embedding.
As supervision of actions is not available, we specify an HMM that explicitly models action lengths, and infer a MAP action segmentation with the Viterbi algorithm.
The resulting action segmentation is used as pseudo-ground truth for estimating our action-level feature embedding and updating the HMM.
arXiv Detail & Related papers (2021-04-05T18:58:57Z) - Weakly Supervised Temporal Action Localization Through Learning Explicit
Subspaces for Action and Context [151.23835595907596]
Methods learn to localize temporal starts and ends of action instances in a video under only video-level supervision.
We introduce a framework that learns two feature subspaces respectively for actions and their context.
The proposed approach outperforms state-of-the-art WS-TAL methods on three benchmarks.
arXiv Detail & Related papers (2021-03-30T08:26:53Z) - Modeling Multi-Label Action Dependencies for Temporal Action
Localization [53.53490517832068]
Real-world videos contain many complex actions with inherent relationships between action classes.
We propose an attention-based architecture that models these action relationships for the task of temporal action localization in unoccurrence videos.
We show improved performance over state-of-the-art methods on multi-label action localization benchmarks.
arXiv Detail & Related papers (2021-03-04T13:37:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.