PcmNet: Position-Sensitive Context Modeling Network for Temporal Action
Localization
- URL: http://arxiv.org/abs/2103.05270v1
- Date: Tue, 9 Mar 2021 07:34:01 GMT
- Title: PcmNet: Position-Sensitive Context Modeling Network for Temporal Action
Localization
- Authors: Xin Qin, Hanbin Zhao, Guangchen Lin, Hao Zeng, Songcen Xu, Xi Li
- Abstract summary: We propose a temporal-position-sensitive context modeling approach to incorporate both positional and semantic information for more precise action localization.
We achieve state-of-the-art performance on two challenging datasets, THUMOS-14 and ActivityNet-1.3.
- Score: 11.685362686431446
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Temporal action localization is an important and challenging task that aims
to locate temporal regions in real-world untrimmed videos where actions occur
and recognize their classes. It is widely acknowledged that video context is a
critical cue for video understanding, and exploiting the context has become an
important strategy to boost localization performance. However, previous
state-of-the-art methods focus more on exploring semantic context which
captures the feature similarity among frames or proposals, and neglect
positional context which is vital for temporal localization. In this paper, we
propose a temporal-position-sensitive context modeling approach to incorporate
both positional and semantic information for more precise action localization.
Specifically, we first augment feature representations with directed temporal
positional encoding, and then conduct attention-based information propagation
at both the frame level and the proposal level. Consequently, the generated feature
representations are significantly empowered with the discriminative capability
of encoding the position-aware context information, and thus benefit boundary
detection and proposal evaluation. We achieve state-of-the-art performance on
two challenging datasets, THUMOS-14 and ActivityNet-1.3, demonstrating the
effectiveness and generalization ability of our method.
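As a concrete illustration of the two steps named in the abstract (directed temporal positional encoding followed by attention-based information propagation), below is a minimal PyTorch-style sketch at the frame level. The module name, shapes, and the relative-offset embedding are assumptions made for illustration only, not the authors' released implementation; the same pattern would apply at the proposal level with candidate segments in place of frames.

    # Minimal sketch: (1) encode the signed (directed) frame offset j - i,
    # (2) propagate information with attention that mixes content similarity
    # and the directed positional bias. All names/shapes are illustrative.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class PositionSensitiveFrameContext(nn.Module):
        def __init__(self, dim, max_len=256):
            super().__init__()
            # learnable embedding of the signed offset, shifted into [0, 2*max_len]
            self.rel_pos = nn.Embedding(2 * max_len + 1, dim)
            self.q = nn.Linear(dim, dim)
            self.k = nn.Linear(dim, dim)
            self.v = nn.Linear(dim, dim)
            self.max_len = max_len
            self.scale = dim ** -0.5

        def forward(self, x):
            # x: (B, T, dim) frame-level features
            B, T, D = x.shape
            idx = torch.arange(T, device=x.device)
            offset = (idx[None, :] - idx[:, None]).clamp(-self.max_len, self.max_len)
            pos = self.rel_pos(offset + self.max_len)            # (T, T, D)

            q, k, v = self.q(x), self.k(x), self.v(x)
            content = torch.einsum('btd,bsd->bts', q, k)         # semantic similarity
            position = torch.einsum('btd,tsd->bts', q, pos)      # directed positional bias
            attn = F.softmax((content + position) * self.scale, dim=-1)
            out = torch.einsum('bts,bsd->btd', attn, v)
            return x + out                                       # residual propagation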
Related papers
- Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding [112.3913646778859]
We propose a simple yet effective video-language modeling framework, S-ViLM.
It includes two novel designs, inter-clip spatial grounding and intra-clip temporal grouping, to promote learning region-object alignment and temporal-aware features.
S-ViLM surpasses the state-of-the-art methods substantially on four representative downstream tasks.
arXiv Detail & Related papers (2023-03-28T22:45:07Z) - Spatio-temporal Relation Modeling for Few-shot Action Recognition [100.3999454780478]
We propose a few-shot action recognition framework, STRM, which enhances class-specific feature discriminability while simultaneously learning higher-order temporal representations.
Our approach achieves an absolute gain of 3.5% in classification accuracy, as compared to the best existing method in the literature.
arXiv Detail & Related papers (2021-12-09T18:59:14Z) - Action Shuffling for Weakly Supervised Temporal Localization [22.43209053892713]
This paper analyzes the order-sensitive and location-insensitive properties of actions.
It embodies them into a self-augmented learning framework to improve the weakly supervised action localization performance.
arXiv Detail & Related papers (2021-05-10T09:05:58Z) - Modeling long-term interactions to enhance action recognition [81.09859029964323]
We propose a new approach to understand actions in egocentric videos that exploits the semantics of object interactions at both frame and temporal levels.
We use a region-based approach that takes as input a primary region roughly corresponding to the user hands and a set of secondary regions potentially corresponding to the interacting objects.
The proposed approach outperforms the state-of-the-art in terms of action recognition on standard benchmarks.
arXiv Detail & Related papers (2021-04-23T10:08:15Z) - Weakly Supervised Temporal Action Localization Through Learning Explicit
Subspaces for Action and Context [151.23835595907596]
Weakly supervised temporal action localization (WS-TAL) methods learn to localize the temporal starts and ends of action instances in a video under only video-level supervision.
We introduce a framework that learns two feature subspaces respectively for actions and their context.
The proposed approach outperforms state-of-the-art WS-TAL methods on three benchmarks.
arXiv Detail & Related papers (2021-03-30T08:26:53Z) - Context-aware Biaffine Localizing Network for Temporal Sentence
Grounding [61.18824806906945]
This paper addresses the problem of temporal sentence grounding (TSG).
TSG aims to identify the temporal boundary of a specific segment from an untrimmed video by a sentence query.
We propose a novel localization framework that scores all pairs of start and end indices within the video simultaneously with a biaffine mechanism (a generic sketch of such pairwise scoring follows this list).
arXiv Detail & Related papers (2021-03-22T03:13:05Z) - Fine-grained Iterative Attention Network for Temporal Language
Localization in Videos [63.94898634140878]
Temporal language localization in videos aims to ground one video segment in an untrimmed video based on a given sentence query.
We propose a Fine-grained Iterative Attention Network (FIAN) that consists of an iterative attention module for bilateral query-video information extraction.
We evaluate the proposed method on three challenging public benchmarks: ActivityNet Captions, TACoS, and Charades-STA.
arXiv Detail & Related papers (2020-08-06T04:09:03Z)
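For the biaffine localizing entry above, the following is a generic sketch, under assumed shapes and module names, of how a biaffine mechanism can score every (start, end) index pair of a video at once; it illustrates the general technique rather than that paper's exact model.

    # Generic biaffine pair scorer: entry (i, j) of the output scores the
    # segment starting at frame i and ending at frame j. Dimensions and names
    # are assumptions for illustration only.
    import torch
    import torch.nn as nn

    class BiaffinePairScorer(nn.Module):
        def __init__(self, dim, hidden=128):
            super().__init__()
            self.start_proj = nn.Linear(dim, hidden)
            self.end_proj = nn.Linear(dim, hidden)
            # bilinear interaction term plus a linear term on the concatenated pair
            self.W = nn.Parameter(torch.randn(hidden, hidden) * 0.02)
            self.b = nn.Linear(2 * hidden, 1)

        def forward(self, x):
            # x: (B, T, dim) frame features; returns a (B, T, T) score map
            s = torch.relu(self.start_proj(x))                   # (B, T, H)
            e = torch.relu(self.end_proj(x))                     # (B, T, H)
            bilinear = torch.einsum('bih,hk,bjk->bij', s, self.W, e)
            pair = torch.cat([s.unsqueeze(2).expand(-1, -1, e.size(1), -1),
                              e.unsqueeze(1).expand(-1, s.size(1), -1, -1)], dim=-1)
            linear = self.b(pair).squeeze(-1)                    # (B, T, T)
            return bilinear + linear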