Action Unit Memory Network for Weakly Supervised Temporal Action
Localization
- URL: http://arxiv.org/abs/2104.14135v1
- Date: Thu, 29 Apr 2021 06:19:44 GMT
- Title: Action Unit Memory Network for Weakly Supervised Temporal Action
Localization
- Authors: Wang Luo, Tianzhu Zhang, Wenfei Yang, Jingen Liu, Tao Mei, Feng Wu,
Yongdong Zhang
- Abstract summary: Weakly supervised temporal action localization aims to detect and localize actions in untrimmed videos with only video-level labels during training.
We present an Action Unit Memory Network (AUMN) for weakly supervised temporal action localization.
- Score: 124.61981738536642
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Weakly supervised temporal action localization aims to detect and localize
actions in untrimmed videos with only video-level labels during training.
However, without frame-level annotations, it is challenging to achieve
localization completeness and relieve background interference. In this paper,
we present an Action Unit Memory Network (AUMN) for weakly supervised temporal
action localization, which can mitigate the above two challenges by learning an
action unit memory bank. In the proposed AUMN, two attention modules are
designed to update the memory bank adaptively and learn action units specific
classifiers. Furthermore, three effective mechanisms (diversity, homogeneity
and sparsity) are designed to guide the updating of the memory network. To the
best of our knowledge, this is the first work to explicitly model the action
units with a memory network. Extensive experimental results on two standard
benchmarks (THUMOS14 and ActivityNet) demonstrate that our AUMN performs
favorably against state-of-the-art methods. Specifically, the average mAP of
IoU thresholds from 0.1 to 0.5 on the THUMOS14 dataset is significantly
improved from 47.0% to 52.1%.
Related papers
- Weakly-Supervised Action Localization by Hierarchically-structured
Latent Attention Modeling [19.683714649646603]
Weakly-supervised action localization aims to recognize and localize action instancese in untrimmed videos with only video-level labels.
Most existing models rely on multiple instance learning(MIL), where predictions of unlabeled instances are supervised by classifying labeled bags.
We propose a novel attention-based hierarchically-structured latent model to learn the temporal variations of feature semantics.
arXiv Detail & Related papers (2023-08-19T08:45:49Z) - Weakly-Supervised Temporal Action Localization by Inferring Salient
Snippet-Feature [26.7937345622207]
Weakly-supervised temporal action localization aims to locate action regions and identify action categories in unsupervised videos simultaneously.
Pseudo label generation is a promising strategy to solve the challenging problem, but the current methods ignore the natural temporal structure of the video.
We propose a novel weakly-supervised temporal action localization method by inferring salient snippet-feature.
arXiv Detail & Related papers (2023-03-22T06:08:34Z) - Self-Attentive Pooling for Efficient Deep Learning [6.822466048176652]
We propose a novel non-local self-attentive pooling method that can be used as a drop-in replacement to the standard pooling layers.
We surpass the test accuracy of existing pooling techniques on different variants of MobileNet-V2 on ImageNet by an average of 1.2%.
Our approach achieves 1.43% higher test accuracy compared to SOTA techniques with iso-memory footprints.
arXiv Detail & Related papers (2022-09-16T00:35:14Z) - Adversarial Memory Networks for Action Prediction [95.09968654228372]
Action prediction aims to infer the forthcoming human action with partially-observed videos.
We propose adversarial memory networks (AMemNet) to generate the "full video" feature conditioning on a partial video query.
arXiv Detail & Related papers (2021-12-18T08:16:21Z) - Activation to Saliency: Forming High-Quality Labels for Unsupervised
Salient Object Detection [54.92703325989853]
We propose a two-stage Activation-to-Saliency (A2S) framework that effectively generates high-quality saliency cues.
No human annotations are involved in our framework during the whole training process.
Our framework reports significant performance compared with existing USOD methods.
arXiv Detail & Related papers (2021-12-07T11:54:06Z) - Cross-modal Consensus Network for Weakly Supervised Temporal Action
Localization [74.34699679568818]
Weakly supervised temporal action localization (WS-TAL) is a challenging task that aims to localize action instances in the given video with video-level categorical supervision.
We propose a cross-modal consensus network (CO2-Net) to tackle this problem.
arXiv Detail & Related papers (2021-07-27T04:21:01Z) - Weakly Supervised Temporal Action Localization Through Learning Explicit
Subspaces for Action and Context [151.23835595907596]
Methods learn to localize temporal starts and ends of action instances in a video under only video-level supervision.
We introduce a framework that learns two feature subspaces respectively for actions and their context.
The proposed approach outperforms state-of-the-art WS-TAL methods on three benchmarks.
arXiv Detail & Related papers (2021-03-30T08:26:53Z) - Learning Comprehensive Motion Representation for Action Recognition [124.65403098534266]
2D CNN-based methods are efficient but may yield redundant features due to applying the same 2D convolution kernel to each frame.
Recent efforts attempt to capture motion information by establishing inter-frame connections while still suffering the limited temporal receptive field or high latency.
We propose a Channel-wise Motion Enhancement (CME) module to adaptively emphasize the channels related to dynamic information with a channel-wise gate vector.
We also propose a Spatial-wise Motion Enhancement (SME) module to focus on the regions with the critical target in motion, according to the point-to-point similarity between adjacent feature maps.
arXiv Detail & Related papers (2021-03-23T03:06:26Z) - A Hybrid Attention Mechanism for Weakly-Supervised Temporal Action
Localization [12.353250130848044]
We present a novel framework named HAM-Net with a hybrid attention mechanism which includes temporal soft, semi-soft and hard attentions.
Our proposed approach outperforms recent state-of-the-art methods by at least 2.2% mAP at IoU threshold 0.5 on the THUMOS14 dataset.
arXiv Detail & Related papers (2021-01-03T03:08:18Z) - Weakly Supervised Temporal Action Localization Using Deep Metric
Learning [12.49814373580862]
We propose a weakly supervised temporal action localization method that only requires video-level action instances as supervision during training.
We jointly optimize a balanced binary cross-entropy loss and a metric loss using a standard backpropagation algorithm.
Our approach improves the current state-of-the-art result for THUMOS14 by 6.5% mAP at IoU threshold 0.5, and achieves competitive performance for ActivityNet1.2.
arXiv Detail & Related papers (2020-01-21T22:01:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.