Weakly-Supervised Temporal Action Localization by Inferring Salient
Snippet-Feature
- URL: http://arxiv.org/abs/2303.12332v3
- Date: Sun, 24 Dec 2023 05:29:23 GMT
- Title: Weakly-Supervised Temporal Action Localization by Inferring Salient
Snippet-Feature
- Authors: Wulian Yun, Mengshi Qi, Chuanming Wang, Huadong Ma
- Abstract summary: Weakly-supervised temporal action localization aims to locate action regions and identify action categories in untrimmed videos simultaneously.
Pseudo label generation is a promising strategy to solve the challenging problem, but the current methods ignore the natural temporal structure of the video.
We propose a novel weakly-supervised temporal action localization method by inferring salient snippet-feature.
- Score: 26.7937345622207
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Weakly-supervised temporal action localization aims to locate action regions
and identify action categories in untrimmed videos simultaneously by taking
only video-level labels as the supervision. Pseudo label generation is a
promising strategy to solve the challenging problem, but the current methods
ignore the natural temporal structure of the video that can provide rich
information to assist such a generation process. In this paper, we propose a
novel weakly-supervised temporal action localization method by inferring
salient snippet-feature. First, we design a saliency inference module that
exploits the variation relationship between temporal neighbor snippets to
discover salient snippet-features, which can reflect the significant dynamic
change in the video. Secondly, we introduce a boundary refinement module that
enhances salient snippet-features through the information interaction unit.
Then, a discrimination enhancement module is introduced to enhance the
discriminative nature of snippet-features. Finally, we adopt the refined
snippet-features to produce high-fidelity pseudo labels, which could be used to
supervise the training of the action localization network. Extensive
experiments on two publicly available datasets, i.e., THUMOS14 and ActivityNet
v1.3, demonstrate our proposed method achieves significant improvements
compared to the state-of-the-art methods.
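The paper gives no code, but the core idea of the saliency inference module, scoring each snippet by how much its feature changes relative to its temporal neighbors, and the pseudo-label step can be sketched minimally. Everything below is an illustrative assumption (function names, the neighbor-difference scoring, and the fixed threshold are not from the paper):

```python
import numpy as np

def saliency_scores(features: np.ndarray) -> np.ndarray:
    """Score each snippet by the magnitude of feature change relative to its
    temporal neighbors -- a proxy for 'significant dynamic change' in the video.

    features: (T, D) array of per-snippet features.
    Returns a (T,) array of saliency scores normalized to [0, 1].
    """
    # L2 distance between each pair of adjacent snippet features: (T-1,)
    neighbor_diff = np.linalg.norm(features[1:] - features[:-1], axis=1)
    scores = np.zeros(len(features))
    scores[0] = neighbor_diff[0]            # boundary snippets have one neighbor
    scores[-1] = neighbor_diff[-1]
    scores[1:-1] = 0.5 * (neighbor_diff[:-1] + neighbor_diff[1:])
    rng = scores.max() - scores.min()
    return (scores - scores.min()) / rng if rng > 0 else scores

def pseudo_labels(scores: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Binarize saliency scores into snippet-level pseudo labels (0/1)."""
    return (scores >= threshold).astype(int)
```

In the paper these scores are further refined by boundary-refinement and discrimination-enhancement modules before the pseudo labels supervise the localization network; the sketch covers only the initial inference step.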
Related papers
- Unsupervised Temporal Action Localization via Self-paced Incremental
Learning [57.55765505856969]
We present a novel self-paced incremental learning model to enhance clustering and localization training simultaneously.
We design two (constant- and variable-speed) incremental instance learning strategies for easy-to-hard model training, thus ensuring the reliability of these video pseudo-labels.
arXiv Detail & Related papers (2023-12-12T16:00:55Z)
- Cross-Video Contextual Knowledge Exploration and Exploitation for
Ambiguity Reduction in Weakly Supervised Temporal Action Localization [23.94629999419033]
Weakly supervised temporal action localization (WSTAL) aims to localize actions in untrimmed videos using video-level labels.
Our work addresses this from a novel perspective, by exploring and exploiting the cross-video contextual knowledge within the dataset.
Our method outperforms the state-of-the-art methods, and can be easily plugged into other WSTAL methods.
arXiv Detail & Related papers (2023-08-24T07:19:59Z)
- Weakly-Supervised Action Localization by Hierarchically-structured
Latent Attention Modeling [19.683714649646603]
Weakly-supervised action localization aims to recognize and localize action instances in untrimmed videos with only video-level labels.
Most existing models rely on multiple instance learning (MIL), where predictions of unlabeled instances are supervised by classifying labeled bags.
We propose a novel attention-based hierarchically-structured latent model to learn the temporal variations of feature semantics.
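In the MIL formulation mentioned above, a video is a labeled "bag" of unlabeled snippets: per-snippet class scores are pooled into a single video-level prediction that the video label supervises. A common pooling choice (illustrative only, not the hierarchical attention model of this paper) is top-k mean pooling:

```python
import numpy as np

def video_level_scores(snippet_scores: np.ndarray, k: int) -> np.ndarray:
    """MIL pooling: aggregate per-snippet class scores (T, C) into a
    video-level score per class (C,) by averaging each class's top-k snippets.
    """
    T, _ = snippet_scores.shape
    k = min(k, T)
    # Sort ascending along the snippet axis and keep the k largest per class.
    topk = np.sort(snippet_scores, axis=0)[-k:]  # (k, C)
    return topk.mean(axis=0)
```

The video-level scores can then be trained with an ordinary classification loss against the video label, which is what makes snippet-level localization learnable from video-level supervision alone.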
arXiv Detail & Related papers (2023-08-19T08:45:49Z)
- Learning to Refactor Action and Co-occurrence Features for Temporal
Action Localization [74.74339878286935]
Action features and co-occurrence features often dominate the actual action content in videos.
We develop a novel auxiliary task by decoupling these two types of features within a video snippet.
We term our method RefactorNet, which first explicitly factorizes the action content and regularizes its co-occurrence features.
arXiv Detail & Related papers (2022-06-23T06:30:08Z)
- Weakly Supervised Temporal Action Localization via Representative
Snippet Knowledge Propagation [36.86505596138256]
Weakly supervised temporal action localization aims to localize temporal boundaries of actions and simultaneously identify their categories with only video-level category labels.
Many existing methods seek to generate pseudo labels for bridging the discrepancy between classification and localization, but usually only make use of limited contextual information for pseudo label generation.
Our method seeks to mine the representative snippets in each video for propagating information between video snippets to generate better pseudo labels.
arXiv Detail & Related papers (2022-03-06T09:53:55Z)
- Background-Click Supervision for Temporal Action Localization [82.4203995101082]
Weakly supervised temporal action localization aims at learning the instance-level action pattern from the video-level labels, where a significant challenge is action-context confusion.
One recent work builds an action-click supervision framework.
It requires similar annotation costs but can steadily improve the localization performance when compared to the conventional weakly supervised methods.
In this paper, by revealing that the performance bottleneck of the existing approaches mainly comes from the background errors, we find that a stronger action localizer can be trained with labels on the background video frames rather than those on the action frames.
arXiv Detail & Related papers (2021-11-24T12:02:52Z)
- Efficient Modelling Across Time of Human Actions and Interactions [92.39082696657874]
We argue that the current fixed-size temporal kernels in 3D convolutional neural networks (CNNs) can be improved to better deal with temporal variations in the input.
We study how to better distinguish between classes of actions by enhancing their feature differences over different layers of the architecture.
The proposed approaches are evaluated on several benchmark action recognition datasets and show competitive results.
arXiv Detail & Related papers (2021-10-05T15:39:11Z)
- Action Shuffling for Weakly Supervised Temporal Localization [22.43209053892713]
This paper analyzes the order-sensitive and location-insensitive properties of actions.
It embodies them into a self-augmented learning framework to improve the weakly supervised action localization performance.
arXiv Detail & Related papers (2021-05-10T09:05:58Z)
- Weakly Supervised Temporal Action Localization Through Learning Explicit
Subspaces for Action and Context [151.23835595907596]
WS-TAL methods learn to localize the temporal starts and ends of action instances in a video under only video-level supervision.
We introduce a framework that learns two feature subspaces respectively for actions and their context.
The proposed approach outperforms state-of-the-art WS-TAL methods on three benchmarks.
arXiv Detail & Related papers (2021-03-30T08:26:53Z)
- Intra- and Inter-Action Understanding via Temporal Action Parsing [118.32912239230272]
We construct a new dataset of sports videos with manual annotations of sub-actions, and conduct a study of temporal action parsing on top of it.
Our study shows that a sport activity usually consists of multiple sub-actions and that awareness of such temporal structures is beneficial to action recognition.
We also investigate a number of temporal parsing methods, and thereon devise an improved method that is capable of mining sub-actions from training data without knowing their labels.
arXiv Detail & Related papers (2020-05-20T17:45:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.