Structured Attention Composition for Temporal Action Localization
- URL: http://arxiv.org/abs/2205.09956v1
- Date: Fri, 20 May 2022 04:32:09 GMT
- Title: Structured Attention Composition for Temporal Action Localization
- Authors: Le Yang, Junwei Han, Tao Zhao, Nian Liu, Dingwen Zhang
- Abstract summary: We study temporal action localization from the perspective of multi-modality feature learning.
Unlike conventional attention, the proposed module does not infer frame attention and modality attention independently.
The proposed structured attention composition module can be deployed as a plug-and-play module into existing action localization frameworks.
- Score: 99.66510088698051
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Temporal action localization aims at localizing action instances from
untrimmed videos. Existing works have designed various effective modules to
precisely localize action instances based on appearance and motion features.
However, by treating these two kinds of features with equal importance,
previous works cannot take full advantage of each modality, leaving the
learned model sub-optimal. To tackle this issue, we make an early effort
to study temporal action localization from the perspective of multi-modality
feature learning, based on the observation that different actions exhibit
specific preferences to appearance or motion modality. Specifically, we build a
novel structured attention composition module. Unlike conventional attention,
the proposed module does not infer frame attention and modality attention
independently. Instead, by casting the relationship between the modality
attention and the frame attention as an attention assignment process, the
structured attention composition module learns to encode the frame-modality
structure and uses it to regularize the inferred frame attention and modality
attention, respectively, based on optimal transport theory. The final
frame-modality attention is obtained by the composition of the two individual
attentions. The proposed structured attention composition module can be
deployed as a plug-and-play module into existing action localization
frameworks. Extensive experiments on two widely used benchmarks show that the
proposed structured attention composition consistently improves four
state-of-the-art temporal action localization methods and builds new
state-of-the-art performance on THUMOS14. Code is available at
https://github.com/VividLe/Online-Action-Detection.
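To make the attention assignment idea concrete, below is a minimal sketch of how a frame attention and a modality attention could be coupled through entropy-regularized optimal transport (Sinkhorn iterations). This is not the authors' implementation: the scorer networks, the cosine-based cost, and all names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def sinkhorn(cost, row_marginal, col_marginal, eps=0.1, n_iters=50):
    """Entropy-regularized optimal transport via Sinkhorn iterations.

    cost: (R, C) cost matrix; row_marginal: (R,); col_marginal: (C,).
    Returns a transport plan whose marginals match the given attentions.
    """
    K = torch.exp(-cost / eps)                      # Gibbs kernel
    u = torch.ones_like(row_marginal)
    for _ in range(n_iters):
        v = col_marginal / (K.t() @ u)              # scale columns
        u = row_marginal / (K @ v)                  # scale rows
    return u.unsqueeze(1) * K * v.unsqueeze(0)      # diag(u) @ K @ diag(v)


def composed_attention(app_feat, mot_feat, frame_scorer, modality_scorer):
    """Couple a frame attention (T,) and a modality attention (2,) by an
    attention assignment, returning a joint (2, T) frame-modality attention.

    app_feat, mot_feat: (T, D) per-frame appearance / motion features.
    frame_scorer, modality_scorer: hypothetical nets giving scalar scores.
    """
    feats = torch.stack([app_feat, mot_feat])                 # (2, T, D)
    frame_att = frame_scorer(feats.mean(0)).squeeze(-1).softmax(0)        # (T,)
    modality_att = modality_scorer(feats.mean(1)).squeeze(-1).softmax(0)  # (2,)
    # Assignment cost: dissimilarity between modality prototypes and frames.
    proto = F.normalize(feats.mean(1), dim=-1)                # (2, D)
    frames = F.normalize(feats.mean(0), dim=-1)               # (T, D)
    cost = 1.0 - proto @ frames.t()                           # (2, T)
    # The joint attention is constrained so its marginals reproduce the two
    # individually inferred attentions, rather than being inferred freely.
    return sinkhorn(cost, modality_att, frame_att)
```

The property the sketch tries to preserve is the paper's central one: the joint modality-by-frame attention is not free, since its row and column marginals are regularized to agree with the two individually inferred attentions.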
Related papers
- Learning Correlation Structures for Vision Transformers [93.22434535223587]
We introduce a new attention mechanism, dubbed structural self-attention (StructSA).
We generate attention maps by recognizing space-time structures of key-query correlations via convolution.
This effectively leverages rich structural patterns in images and videos such as scene layouts, object motion, and inter-object relations.
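As a rough illustration of the idea (not the paper's actual StructSA design), the sketch below treats each query's correlation map over the key grid as an image and convolves it before normalization; the class name and the single-channel convolution are assumptions.

```python
import torch
import torch.nn as nn


class StructSASketch(nn.Module):
    """Hypothetical sketch: convolve each query's key-correlation map so the
    attention weights can react to spatial structure, not just similarity."""

    def __init__(self, dim, kernel_size=3):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.struct_conv = nn.Conv2d(1, 1, kernel_size,
                                     padding=kernel_size // 2)

    def forward(self, x, h, w):
        # x: (N, dim) flattened spatial tokens with N == h * w.
        n, d = x.shape
        corr = self.q(x) @ self.k(x).t() / d ** 0.5   # (N, N) correlations
        maps = corr.view(n, 1, h, w)                  # one 2-D map per query
        attn = self.struct_conv(maps).view(n, n).softmax(dim=-1)
        return attn @ self.v(x)                       # structure-aware mixing
```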
arXiv Detail & Related papers (2024-04-05T07:13:28Z)
- Spatiotemporal Multi-scale Bilateral Motion Network for Gait Recognition [3.1240043488226967]
Motivated by optical flow, this paper proposes bilateral motion-oriented features.
We develop a set of multi-scale temporal representations that force the motion context to be richly described at various levels of temporal resolution.
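A hedged sketch of one way such multi-scale temporal representations can be built, assuming simple average pooling and linear upsampling (the paper's actual operators may differ):

```python
import torch
import torch.nn.functional as F


def multiscale_temporal(feat, scales=(1, 2, 4)):
    """Hypothetical sketch: describe motion context at several temporal
    resolutions by pooling a (T, D) sequence with different windows and
    upsampling each result back to length T before concatenation."""
    t, _ = feat.shape
    x = feat.t().unsqueeze(0)                        # (1, D, T) for 1-D ops
    outs = []
    for s in scales:
        pooled = F.avg_pool1d(x, kernel_size=s, stride=s, ceil_mode=True)
        outs.append(F.interpolate(pooled, size=t, mode="linear",
                                  align_corners=False))
    return torch.cat(outs, dim=1).squeeze(0).t()     # (T, D * len(scales))
```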
arXiv Detail & Related papers (2022-09-26T01:36:22Z)
- ASM-Loc: Action-aware Segment Modeling for Weakly-Supervised Temporal Action Localization [36.90693762365237]
Weakly-supervised temporal action localization aims to recognize and localize action segments in untrimmed videos given only video-level action labels for training.
We propose ASM-Loc, a novel WTAL framework that enables explicit, action-aware segment modeling beyond standard multiple instance learning (MIL) based methods.
Our framework entails three segment-centric components: (i) dynamic segment sampling for compensating the contribution of short actions; (ii) intra- and inter-segment attention for modeling action dynamics and capturing temporal dependencies; (iii) pseudo instance-level supervision for improving action boundary prediction.
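Component (ii) can be illustrated with a small sketch: self-attention run only within each sampled segment. This is an assumption-laden illustration, not the ASM-Loc code; the function and its signature are hypothetical.

```python
import torch
import torch.nn as nn


def intra_segment_attention(feat, segments, attn):
    """Hypothetical sketch of component (ii): run self-attention only inside
    each sampled segment so action dynamics are modeled locally.

    feat: (T, D) frame features; segments: list of (start, end) indices;
    attn: an nn.MultiheadAttention(D, num_heads, batch_first=True)."""
    out = feat.clone()
    for s, e in segments:
        seg = feat[s:e].unsqueeze(0)        # (1, L, D) tokens of one segment
        refined, _ = attn(seg, seg, seg)    # attend within the segment only
        out[s:e] = refined.squeeze(0)
    return out


# Example wiring (shapes only): 128 frames, 256-d features, two segments.
feat = torch.randn(128, 256)
attn = nn.MultiheadAttention(256, num_heads=4, batch_first=True)
refined = intra_segment_attention(feat, [(10, 40), (70, 100)], attn)
```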
arXiv Detail & Related papers (2022-03-29T01:59:26Z)
- Background-Click Supervision for Temporal Action Localization [82.4203995101082]
Weakly supervised temporal action localization aims at learning the instance-level action pattern from the video-level labels, where a significant challenge is action-context confusion.
One recent work builds an action-click supervision framework.
It requires a similar annotation cost but steadily improves localization performance compared to conventional weakly supervised methods.
In this paper, by revealing that the performance bottleneck of the existing approaches mainly comes from the background errors, we find that a stronger action localizer can be trained with labels on the background video frames rather than those on the action frames.
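A minimal sketch of the idea, combining a standard video-level loss with explicit supervision on clicked background frames; the mean pooling, the extra background class, and all names are assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F


def background_click_loss(frame_logits, video_label, bg_clicks):
    """Hypothetical sketch: supplement the usual video-level loss with
    explicit supervision on clicked background frames.

    frame_logits: (T, C + 1) per-frame scores, last class = background.
    video_label: (C,) multi-hot video label; bg_clicks: (K,) frame indices
    annotated as background."""
    num_cls = frame_logits.shape[1]
    # Video-level term: mean-pool frame scores (a simple aggregation choice).
    video_logits = frame_logits[:, :-1].mean(dim=0)
    video_loss = F.binary_cross_entropy_with_logits(video_logits, video_label)
    # Background term: clicked frames must predict the background class.
    bg_target = torch.full((len(bg_clicks),), num_cls - 1, dtype=torch.long)
    bg_loss = F.cross_entropy(frame_logits[bg_clicks], bg_target)
    return video_loss + bg_loss
```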
arXiv Detail & Related papers (2021-11-24T12:02:52Z)
- Revisiting spatio-temporal layouts for compositional action recognition [63.04778884595353]
We take an object-centric approach to action recognition.
The main focus of this paper is compositional/few-shot action recognition.
We demonstrate how to improve the performance of appearance-based models by fusion with layout-based models.
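As a sketch of the simplest such fusion, assuming late fusion of class probabilities (the paper may use a different scheme):

```python
import torch


def late_fuse(appearance_logits, layout_logits, alpha=0.5):
    """Hypothetical sketch of the simplest fusion: average the class
    distributions of the appearance-based and layout-based models."""
    p_app = appearance_logits.softmax(dim=-1)
    p_layout = layout_logits.softmax(dim=-1)
    return alpha * p_app + (1.0 - alpha) * p_layout
```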
arXiv Detail & Related papers (2021-11-02T23:04:39Z)
- Cross-modal Consensus Network for Weakly Supervised Temporal Action Localization [74.34699679568818]
Weakly supervised temporal action localization (WS-TAL) is a challenging task that aims to localize action instances in the given video with video-level categorical supervision.
We propose a cross-modal consensus network (CO2-Net) to tackle this problem.
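One plausible reading of such a consensus step, sketched below under the assumption that global context from one modality gates the per-frame features of the other; the module and its gating design are hypothetical, not CO2-Net's exact architecture.

```python
import torch
import torch.nn as nn


class CrossModalGateSketch(nn.Module):
    """Hypothetical sketch of a consensus step: global context from one
    modality gates the per-frame features of the other, suppressing
    frame-level information the two streams do not agree on."""

    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, main, aux):
        # main, aux: (T, D) per-frame features of the two modalities.
        context = aux.mean(dim=0, keepdim=True).expand_as(main)
        g = self.gate(torch.cat([main, context], dim=-1))   # (T, D) in [0, 1]
        return main * g                                     # filtered stream
```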
arXiv Detail & Related papers (2021-07-27T04:21:01Z)
- Weakly Supervised Temporal Action Localization Through Learning Explicit Subspaces for Action and Context [151.23835595907596]
Weakly supervised methods learn to localize the temporal starts and ends of action instances in a video under only video-level supervision.
We introduce a framework that learns two feature subspaces respectively for actions and their context.
The proposed approach outperforms state-of-the-art WS-TAL methods on three benchmarks.
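A minimal sketch of the subspace idea, assuming two linear projections from a shared feature; the framework's actual heads and losses are richer.

```python
import torch
import torch.nn as nn


class ActionContextSubspaces(nn.Module):
    """Hypothetical sketch: project shared frame features into two learned
    subspaces, one for actions and one for their context."""

    def __init__(self, dim, sub_dim):
        super().__init__()
        self.to_action = nn.Linear(dim, sub_dim)
        self.to_context = nn.Linear(dim, sub_dim)

    def forward(self, feat):
        # feat: (T, D) frame features -> two (T, sub_dim) embeddings.
        return self.to_action(feat), self.to_context(feat)
```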
arXiv Detail & Related papers (2021-03-30T08:26:53Z)