Enriching Local and Global Contexts for Temporal Action Localization
- URL: http://arxiv.org/abs/2107.12960v1
- Date: Tue, 27 Jul 2021 17:25:40 GMT
- Title: Enriching Local and Global Contexts for Temporal Action Localization
- Authors: Zixin Zhu (Xi'an Jiaotong University), Wei Tang (University of
Illinois at Chicago), Le Wang (Xi'an Jiaotong University), Nanning Zheng
(Xi'an Jiaotong University), Gang Hua (Wormpex AI Research)
- Abstract summary: We enrich both the local and global contexts in the popular two-stage temporal localization framework.
Our proposed model, dubbed ContextLoc, can be divided into three sub-networks: L-Net, G-Net and P-Net.
The efficacy of our proposed method is validated by experimental results on the THUMOS14 (54.3% at IoU@0.5) and ActivityNet v1.3 (51.24% at IoU@0.5) datasets.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Effectively tackling the problem of temporal action localization (TAL)
necessitates a visual representation that jointly pursues two confounding
goals, i.e., fine-grained discrimination for temporal localization and
sufficient visual invariance for action classification. We address this
challenge by enriching both the local and global contexts in the popular
two-stage temporal localization framework, where action proposals are first
generated followed by action classification and temporal boundary regression.
Our proposed model, dubbed ContextLoc, can be divided into three sub-networks:
L-Net, G-Net and P-Net. L-Net enriches the local context via fine-grained
modeling of snippet-level features, which is formulated as a
query-and-retrieval process. G-Net enriches the global context via higher-level
modeling of the video-level representation. In addition, we introduce a novel
context adaptation module to adapt the global context to different proposals.
P-Net further models the context-aware inter-proposal relations. We explore two
existing models to be the P-Net in our experiments. The efficacy of our
proposed method is validated by experimental results on the THUMOS14 (54.3% at
IoU@0.5) and ActivityNet v1.3 (51.24% at IoU@0.5) datasets, outperforming
recent state-of-the-art methods.
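The reported numbers (e.g. 54.3% at IoU@0.5) are measured with temporal intersection-over-union between predicted and ground-truth action segments: a proposal counts as correct at IoU@0.5 when its overlap ratio with a ground-truth instance is at least 0.5. A minimal sketch of that standard computation (the helper name `temporal_iou` is illustrative, not from the paper):

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) segments given in seconds.

    A proposal is a true positive at IoU@0.5 when this value is >= 0.5.
    """
    # Overlap length, clamped at zero for disjoint segments.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    # Union = sum of lengths minus the overlap counted twice.
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

print(temporal_iou((2.0, 8.0), (4.0, 10.0)))  # 4s overlap / 8s union = 0.5
```

Mean average precision (mAP) at a given IoU threshold then averages precision over classes, using this overlap test to match predictions to ground truth.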
Related papers
- BCLNet: Bilateral Consensus Learning for Two-View Correspondence Pruning [26.400567961735234]
Correspondence pruning aims to establish reliable correspondences between two related images.
Existing approaches often employ a progressive strategy to handle the local and global contexts.
We propose a parallel context learning strategy that involves acquiring bilateral consensus for the two-view correspondence pruning task.
arXiv Detail & Related papers (2024-01-07T11:38:15Z)
- HTNet: Anchor-free Temporal Action Localization with Hierarchical Transformers [19.48000379201692]
Temporal action localization (TAL) is a task of identifying a set of actions in a video.
We present a novel anchor-free framework, known as HTNet, which predicts a set of <start time, end time, class> triplets from a video.
We demonstrate how our method localizes accurate action instances and state-of-the-art performance on two TAL benchmark datasets.
arXiv Detail & Related papers (2022-07-20T05:40:03Z)
- PRA-Net: Point Relation-Aware Network for 3D Point Cloud Analysis [56.91758845045371]
We propose a novel framework named Point Relation-Aware Network (PRA-Net)
It is composed of an Intra-region Structure Learning (ISL) module and an Inter-region Relation Learning (IRL) module.
Experiments on several 3D benchmarks covering shape classification, keypoint estimation, and part segmentation have verified the effectiveness of PRA-Net.
arXiv Detail & Related papers (2021-12-09T13:24:43Z)
- Cross-modal Consensus Network for Weakly Supervised Temporal Action Localization [74.34699679568818]
Weakly supervised temporal action localization (WS-TAL) is a challenging task that aims to localize action instances in the given video with video-level categorical supervision.
We propose a cross-modal consensus network (CO2-Net) to tackle this problem.
arXiv Detail & Related papers (2021-07-27T04:21:01Z)
- Weakly Supervised Temporal Action Localization Through Learning Explicit Subspaces for Action and Context [151.23835595907596]
WS-TAL methods learn to localize the temporal starts and ends of action instances in a video under only video-level supervision.
We introduce a framework that learns two feature subspaces respectively for actions and their context.
The proposed approach outperforms state-of-the-art WS-TAL methods on three benchmarks.
arXiv Detail & Related papers (2021-03-30T08:26:53Z)
- Temporal Context Aggregation Network for Temporal Action Proposal Refinement [93.03730692520999]
Temporal action proposal generation is a challenging yet important task in the video understanding field.
Current methods still suffer from inaccurate temporal boundaries and inferior confidence used for retrieval.
We propose TCANet to generate high-quality action proposals through "local and global" temporal context aggregation.
arXiv Detail & Related papers (2021-03-24T12:34:49Z)
- PcmNet: Position-Sensitive Context Modeling Network for Temporal Action Localization [11.685362686431446]
We propose a temporal-position-sensitive context modeling approach to incorporate both positional and semantic information for more precise action localization.
We achieve state-of-the-art performance on two challenging datasets, THUMOS-14 and ActivityNet-1.3.
arXiv Detail & Related papers (2021-03-09T07:34:01Z)
- BSN++: Complementary Boundary Regressor with Scale-Balanced Relation Modeling for Temporal Action Proposal Generation [85.13713217986738]
We present BSN++, a new framework which exploits complementary boundary regressor and relation modeling for temporal proposal generation.
Not surprisingly, the proposed BSN++ ranked 1st place in the CVPR19 - ActivityNet challenge leaderboard on temporal action localization task.
arXiv Detail & Related papers (2020-09-15T07:08:59Z)
- Complementary Boundary Generator with Scale-Invariant Relation Modeling for Temporal Action Localization: Submission to ActivityNet Challenge 2020 [66.4527310659592]
This report presents an overview of our solution used in the submission to ActivityNet Challenge 2020 Task 1.
We decouple the temporal action localization task into two stages (i.e. proposal generation and classification) and enrich the proposal diversity.
Our proposed scheme achieves state-of-the-art performance on the temporal action localization task with 42.26 average mAP on the challenge testing set.
arXiv Detail & Related papers (2020-07-20T04:35:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.