Adaptive Perception Transformer for Temporal Action Localization
- URL: http://arxiv.org/abs/2208.11908v1
- Date: Thu, 25 Aug 2022 07:42:48 GMT
- Title: Adaptive Perception Transformer for Temporal Action Localization
- Authors: Yizheng Ouyang, Tianjin Zhang, Weibo Gu, Hongfa Wang, Liming Wang, and
Xiaojie Guo
- Abstract summary: This paper proposes a novel end-to-end model, called the adaptive perception transformer (AdaPerFormer).
One branch handles global perception attention, which models entire video sequences and aggregates globally relevant context.
The other branch concentrates on a local convolutional shift that aggregates intra-frame and inter-frame information.
- Score: 13.735402329482719
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Temporal action localization aims to predict the boundary and category of
each action instance in untrimmed long videos. Most previous methods based
on anchors or proposals neglect the global-local context interaction in entire
video sequences. Moreover, their multi-stage designs cannot generate action
boundaries and categories straightforwardly. To address the above issues, this
paper proposes a novel end-to-end model, called adaptive perception transformer
(AdaPerFormer for short). Specifically, AdaPerFormer explores a dual-branch
multi-head self-attention mechanism. One branch handles global perception
attention, which models entire video sequences and aggregates globally
relevant context, while the other branch concentrates on a local
convolutional shift that aggregates intra-frame and inter-frame information
through our bidirectional shift operation. The end-to-end design produces the
boundaries and categories of video actions without extra steps. Extensive
experiments together with ablation studies are provided to reveal the
effectiveness of our design. Our method achieves a state-of-the-art accuracy on
the THUMOS14 dataset (65.8% in terms of mAP@0.5, 42.6% mAP@0.7, and 62.7%
mAP@Avg), and obtains competitive performance on the ActivityNet-1.3 dataset
with an average mAP of 36.1%. The code and models are available at
https://github.com/SouperO/AdaPerFormer.
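To make the dual-branch idea concrete, here is a minimal PyTorch sketch of one such block, not the authors' implementation: the global branch is plain multi-head self-attention over the snippet sequence, and the local branch applies a bidirectional temporal shift followed by a depthwise convolution. All module names, fold sizes, and dimensions are illustrative assumptions; the real model is in the repository above.

```python
import torch
import torch.nn as nn


def bidirectional_shift(x: torch.Tensor, fold: int = 8) -> torch.Tensor:
    """Shift one channel fold forward and one backward along the time axis.

    x: (batch, time, channels); channels outside the two folds are untouched.
    Fold size is an assumption borrowed from temporal-shift-style modules.
    """
    _, _, c = x.shape
    k = c // fold
    out = torch.zeros_like(x)
    out[:, 1:, :k] = x[:, :-1, :k]            # "past" fold shifted forward
    out[:, :-1, k:2 * k] = x[:, 1:, k:2 * k]  # "future" fold shifted backward
    out[:, :, 2 * k:] = x[:, :, 2 * k:]       # remaining channels unchanged
    return out


class DualBranchBlock(nn.Module):
    """Global self-attention branch + local shift-conv branch, fused residually."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        # Global branch: vanilla multi-head self-attention over all snippets.
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Local branch: depthwise temporal convolution after the shift.
        self.local = nn.Conv1d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)                     # x: (batch, time, dim)
        glob, _ = self.attn(h, h, h)         # global context over the sequence
        loc = self.local(bidirectional_shift(h).transpose(1, 2)).transpose(1, 2)
        return x + glob + loc                # residual fusion of both branches


feats = torch.randn(2, 128, 256)  # 2 videos, 128 snippets, 256-d features each
print(DualBranchBlock()(feats).shape)  # torch.Size([2, 128, 256])
```

The residual fusion at the end keeps both the globally attended context and the locally shifted context available to later layers, which is one plausible reading of the dual-branch design described in the abstract.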
Related papers
- Appearance-Based Refinement for Object-Centric Motion Segmentation [85.2426540999329]
We introduce an appearance-based refinement method that leverages temporal consistency in video streams to correct inaccurate flow-based proposals.
Our approach involves a sequence-level selection mechanism that identifies accurate flow-predicted masks as exemplars.
Its performance is evaluated on multiple video segmentation benchmarks, including DAVIS, YouTube, SegTrackv2, and FBMS-59.
arXiv Detail & Related papers (2023-12-18T18:59:51Z)
- Temporal Action Localization with Enhanced Instant Discriminability [66.76095239972094]
Temporal action detection (TAD) aims to detect all action boundaries and their corresponding categories in an untrimmed video.
We propose a one-stage framework named TriDet to resolve imprecise predictions of action boundaries by existing methods.
Experimental results demonstrate the robustness of TriDet and its state-of-the-art performance on multiple TAD datasets.
arXiv Detail & Related papers (2023-09-11T16:17:50Z)
- Temporal Action Localization with Multi-temporal Scales [54.69057924183867]
We propose to predict actions on a feature space of multi-temporal scales.
Specifically, we use refined feature pyramids of different scales to pass semantics from high-level scales to low-level scales.
The proposed method achieves improvements of 12.6%, 17.4%, and 2.2%, respectively.
arXiv Detail & Related papers (2022-08-16T01:48:23Z)
- HTNet: Anchor-free Temporal Action Localization with Hierarchical Transformers [19.48000379201692]
Temporal action localization (TAL) is a task of identifying a set of actions in a video.
We present a novel anchor-free framework, known as HTNet, which predicts a set of <start time, end time, class> triplets from a video.
We demonstrate that our method localizes accurate action instances and achieves state-of-the-art performance on two TAL benchmark datasets.
arXiv Detail & Related papers (2022-07-20T05:40:03Z)
- Towards High-Quality Temporal Action Detection with Sparse Proposals [14.923321325749196]
Temporal Action Detection aims to localize the temporal segments containing human action instances and predict the action categories.
We introduce Sparse Proposals to interact with the hierarchical features.
Experiments demonstrate the effectiveness of our method, especially under high tIoU thresholds (see the tIoU sketch after this list).
arXiv Detail & Related papers (2021-09-18T06:15:19Z)
- EAN: Event Adaptive Network for Enhanced Action Recognition [66.81780707955852]
We propose a unified action recognition framework to investigate the dynamic nature of video content.
First, when extracting local cues, we generate dynamic-scale spatio-temporal kernels to adaptively fit diverse events.
Second, to accurately aggregate these cues into a global video representation, we propose to mine interactions among only a few selected foreground objects with a Transformer.
arXiv Detail & Related papers (2021-07-22T15:57:18Z)
- Unsupervised Action Segmentation with Self-supervised Feature Learning and Co-occurrence Parsing [32.66011849112014]
Temporal action segmentation is the task of classifying each frame in a video with an action label.
In this work we explore a self-supervised method that operates on a corpus of unlabeled videos and predicts a likely set of temporal segments across the videos.
We develop CAP, a novel co-occurrence action parsing algorithm that can not only capture the correlation among sub-actions underlying the structure of activities, but also estimate the temporal trajectory of the sub-actions in an accurate and general way.
arXiv Detail & Related papers (2021-05-29T00:29:40Z)
- Learning Salient Boundary Feature for Anchor-free Temporal Action Localization [81.55295042558409]
Temporal action localization is an important yet challenging task in video understanding.
We propose the first purely anchor-free temporal localization method.
Our model includes (i) an end-to-end trainable basic predictor, (ii) a saliency-based refinement module, and (iii) several consistency constraints.
arXiv Detail & Related papers (2021-03-24T12:28:32Z)
- Temporally-Weighted Hierarchical Clustering for Unsupervised Action Segmentation [96.67525775629444]
Action segmentation refers to inferring boundaries of semantically consistent visual concepts in videos.
We present a fully automatic and unsupervised approach for segmenting actions in a video that does not require any training.
Our proposal is an effective temporally-weighted hierarchical clustering algorithm that can group semantically consistent frames of the video.
arXiv Detail & Related papers (2021-03-20T23:30:01Z)
- A Hybrid Attention Mechanism for Weakly-Supervised Temporal Action Localization [12.353250130848044]
We present a novel framework named HAM-Net with a hybrid attention mechanism which includes temporal soft, semi-soft and hard attentions.
Our proposed approach outperforms recent state-of-the-art methods by at least 2.2% mAP at IoU threshold 0.5 on the THUMOS14 dataset.
arXiv Detail & Related papers (2021-01-03T03:08:18Z)
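Several entries above report mAP at temporal-IoU (tIoU) thresholds. As a quick reference, here is a minimal sketch of the standard tIoU computation and the greedy matching step behind such metrics; it is not any single paper's evaluation script, and the (start, end)-in-seconds segment format and dict fields are assumptions:

```python
def tiou(pred, gt):
    """Temporal IoU of two 1-D segments given as (start, end) pairs."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def match_at_threshold(preds, gts, tau=0.5):
    """Greedily match score-sorted predictions to ground truth at threshold tau.

    preds: list of {"segment": (start, end), "score": float} for one class;
    gts: list of (start, end) segments of the same class. Returns TP flags
    in score order, from which precision/recall and AP are computed.
    """
    unmatched = list(gts)
    flags = []
    for p in sorted(preds, key=lambda p: -p["score"]):
        best = max(unmatched, key=lambda g: tiou(p["segment"], g), default=None)
        if best is not None and tiou(p["segment"], best) >= tau:
            unmatched.remove(best)   # each ground truth matches at most once
            flags.append(True)       # true positive at this threshold
        else:
            flags.append(False)      # false positive
    return flags
```

Average precision at a given threshold is computed from these flags per class and averaged over classes; mAP@Avg then averages that result over a set of thresholds (e.g., 0.3 to 0.7 in steps of 0.1 on THUMOS14).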