Augmented Transformer with Adaptive Graph for Temporal Action Proposal
Generation
- URL: http://arxiv.org/abs/2103.16024v1
- Date: Tue, 30 Mar 2021 02:01:03 GMT
- Title: Augmented Transformer with Adaptive Graph for Temporal Action Proposal
Generation
- Authors: Shuning Chang, Pichao Wang, Fan Wang, Hao Li, Jiashi Feng
- Abstract summary: We present an augmented transformer with adaptive graph network (ATAG) to exploit both long-range and local temporal contexts for TAPG.
Specifically, we enhance the vanilla transformer by equipping it with a snippet actionness loss and a front block, dubbed the augmented transformer.
An adaptive graph convolutional network (GCN) is proposed to build local temporal context by mining the position information and difference between adjacent features.
- Score: 79.98992138865042
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Temporal action proposal generation (TAPG) is a fundamental and challenging
task in video understanding, especially in temporal action detection. Most
previous works focus on capturing the local temporal context and can well
locate simple action instances with clean frames and clear boundaries. However,
they generally fail in complicated scenarios where interested actions involve
irrelevant frames and background clutters, and the local temporal context
becomes less effective. To deal with these problems, we present an augmented
transformer with adaptive graph network (ATAG) to exploit both long-range and
local temporal contexts for TAPG. Specifically, we enhance the vanilla
transformer by equipping it with a snippet actionness loss and a front block,
dubbed the augmented transformer, which improves its ability to capture
long-range dependencies and learn robust features for noisy action instances.
Moreover,
an adaptive graph convolutional network (GCN) is proposed to build local
temporal context by mining the position information and difference between
adjacent features. The features from the two modules carry rich semantic
information of the video, and are fused for effective sequential proposal
generation. Extensive experiments are conducted on two challenging datasets,
THUMOS14 and ActivityNet1.3, and the results demonstrate that our method
outperforms state-of-the-art TAPG methods. Our code will be released soon.
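The abstract outlines a two-branch design: a transformer branch supervised with an auxiliary snippet actionness loss for long-range context, and an adaptive GCN branch that builds a local temporal graph from differences between adjacent snippet features, with the two feature streams fused for proposal generation. Since the code has not yet been released, the following is only a minimal PyTorch-style sketch of that idea; the module names, feature sizes, edge-weighting form, and concatenation-based fusion are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdaptiveLocalGCN(nn.Module):
    """Local graph convolution over adjacent snippets; edge weights are predicted
    from the difference between neighbouring features (assumed form)."""

    def __init__(self, dim):
        super().__init__()
        self.edge_fc = nn.Linear(dim, 1)   # scores each adjacent-feature difference
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                          # x: (B, T, C) snippet features
        diff = x[:, 1:] - x[:, :-1]                # (B, T-1, C) adjacent differences
        w = torch.sigmoid(self.edge_fc(diff))      # (B, T-1, 1) adaptive edge weights
        left = F.pad(w, (0, 0, 1, 0))              # weight of edge to previous snippet
        right = F.pad(w, (0, 0, 0, 1))             # weight of edge to next snippet
        prev = F.pad(x[:, :-1], (0, 0, 1, 0))      # previous-snippet features
        nxt = F.pad(x[:, 1:], (0, 0, 0, 1))        # next-snippet features
        agg = x + left * prev + right * nxt        # aggregate the local neighbourhood
        return F.relu(self.proj(agg))


class ATAGSketch(nn.Module):
    """Hypothetical two-branch model: augmented transformer + adaptive local GCN."""

    def __init__(self, dim=256, heads=8, layers=3):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                               batch_first=True)
        self.transformer = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.actionness_head = nn.Linear(dim, 1)   # auxiliary snippet actionness
        self.local_gcn = AdaptiveLocalGCN(dim)
        self.fuse = nn.Linear(2 * dim, dim)        # assumed fusion by concatenation
        self.boundary_head = nn.Linear(dim, 2)     # per-snippet start/end probabilities

    def forward(self, feats):                      # feats: (B, T, C) snippet features
        global_feat = self.transformer(feats)      # long-range temporal context
        actionness = torch.sigmoid(self.actionness_head(global_feat)).squeeze(-1)
        local_feat = self.local_gcn(feats)         # local temporal context
        fused = self.fuse(torch.cat([global_feat, local_feat], dim=-1))
        boundaries = torch.sigmoid(self.boundary_head(fused))
        # boundaries feed proposal generation; actionness supplies the auxiliary loss
        return boundaries, actionness
```

In this sketch, a (B, T, C) tensor of pre-extracted snippet features yields per-snippet start/end probabilities that candidate proposals could be scored against, while the actionness output would be trained with a binary snippet-level loss; the actual front block, graph construction, and proposal scoring are described only at a high level in the abstract and are not modelled here.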
Related papers
- Introducing Gating and Context into Temporal Action Detection [0.8987776881291144]
Temporal Action Detection (TAD) remains challenging due to action overlaps and variable action durations.
Recent findings suggest that TAD performance is dependent on the structural design of transformers rather than on the self-attention mechanism.
We propose a refined feature extraction process through lightweight, yet effective operations.
arXiv Detail & Related papers (2024-09-06T11:52:42Z)
- Transform-Equivariant Consistency Learning for Temporal Sentence Grounding [66.10949751429781]
We introduce a novel Equivariant Consistency Regulation Learning framework to learn more discriminative representations for each video.
Our motivation comes from that the temporal boundary of the query-guided activity should be consistently predicted.
In particular, we devise a self-supervised consistency loss module to enhance the completeness and smoothness of the augmented video.
arXiv Detail & Related papers (2023-05-06T19:29:28Z)
- Deeply-Coupled Convolution-Transformer with Spatial-temporal Complementary Learning for Video-based Person Re-identification [91.56939957189505]
We propose a novel spatial-temporal complementary learning framework named Deeply-Coupled Convolution-Transformer (DCCT) for high-performance video-based person Re-ID.
Our framework could attain better performances than most state-of-the-art methods.
arXiv Detail & Related papers (2023-04-27T12:16:44Z)
- HTNet: Anchor-free Temporal Action Localization with Hierarchical Transformers [19.48000379201692]
Temporal action localization (TAL) is a task of identifying a set of actions in a video.
We present a novel anchor-free framework, known as HTNet, which predicts a set of <start time, end time, class> triplets from a video.
We demonstrate that our method localizes accurate action instances and achieves state-of-the-art performance on two TAL benchmark datasets.
arXiv Detail & Related papers (2022-07-20T05:40:03Z)
- Transformers in Action: Weakly Supervised Action Segmentation [81.18941007536468]
We show how to apply transformers to improve action alignment accuracy over the equivalent RNN-based models.
We also propose a supplemental transcript embedding approach to select transcripts more quickly at inference-time.
We evaluate our proposed methods across the benchmark datasets to better understand the applicability of transformers.
arXiv Detail & Related papers (2022-01-14T21:15:58Z)
- End-to-end Temporal Action Detection with Transformer [86.80289146697788]
Temporal action detection (TAD) aims to determine the semantic label and the boundaries of every action instance in an untrimmed video.
Here, we construct an end-to-end framework for TAD upon Transformer, termed TadTR.
Our method achieves state-of-the-art performance on HACS Segments and THUMOS14 and competitive performance on ActivityNet-1.3.
arXiv Detail & Related papers (2021-06-18T17:58:34Z)
- Temporal Action Proposal Generation with Transformers [25.66256889923748]
This paper intuitively presents a unified temporal action proposal generation framework with original Transformers.
The Boundary Transformer captures long-term temporal dependencies to predict precise boundary information.
The Proposal Transformer learns the rich inter-proposal relationships for reliable confidence evaluation.
arXiv Detail & Related papers (2021-05-25T16:22:12Z)
- Temporal Context Aggregation Network for Temporal Action Proposal Refinement [93.03730692520999]
Temporal action proposal generation is a challenging yet important task in the video understanding field.
Current methods still suffer from inaccurate temporal boundaries and inferior confidence used for retrieval.
We propose TCANet to generate high-quality action proposals through "local and global" temporal context aggregation.
arXiv Detail & Related papers (2021-03-24T12:34:49Z)
- Relaxed Transformer Decoders for Direct Action Proposal Generation [30.516462193231888]
This paper presents a simple and end-to-end learnable framework (RTD-Net) for direct action proposal generation.
To tackle the essential visual difference between time and space, we make three important improvements over the original transformer detection framework (DETR).
Experiments on THUMOS14 and ActivityNet-1.3 benchmarks demonstrate the effectiveness of RTD-Net.
arXiv Detail & Related papers (2021-02-03T06:29:28Z)