Relaxed Transformer Decoders for Direct Action Proposal Generation
- URL: http://arxiv.org/abs/2102.01894v1
- Date: Wed, 3 Feb 2021 06:29:28 GMT
- Title: Relaxed Transformer Decoders for Direct Action Proposal Generation
- Authors: Jing Tan, Jiaqi Tang, Limin Wang, Gangshan Wu
- Abstract summary: This paper presents a simple and end-to-end learnable framework (RTD-Net) for direct action proposal generation.
To tackle the essential visual difference between time and space, we make three important improvements over the original transformer detection framework (DETR).
Experiments on THUMOS14 and ActivityNet-1.3 benchmarks demonstrate the effectiveness of RTD-Net.
- Score: 30.516462193231888
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Temporal action proposal generation is an important and challenging task in
video understanding, which aims at detecting all temporal segments containing
action instances of interest. The existing proposal generation approaches are
generally based on pre-defined anchor windows or heuristic bottom-up boundary
matching strategies. This paper presents a simple and end-to-end learnable
framework (RTD-Net) for direct action proposal generation, by re-purposing a
Transformer-alike architecture. To tackle the essential visual difference
between time and space, we make three important improvements over the original
transformer detection framework (DETR). First, to deal with the slowness prior in
videos, we replace the original Transformer encoder with a boundary attentive
module to better capture temporal information. Second, due to the ambiguous
temporal boundary and relatively sparse annotations, we present a relaxed
matching loss to relieve the strict criterion of single assignment to each
ground truth. Finally, we devise a three-branch head to further improve the
proposal confidence estimation by explicitly predicting its completeness.
Extensive experiments on THUMOS14 and ActivityNet-1.3 benchmarks demonstrate
the effectiveness of RTD-Net, on both tasks of temporal action proposal
generation and temporal action detection. Moreover, due to its simplicity in
design, our RTD-Net is more efficient than previous proposal generation methods
without non-maximum suppression post-processing. The code will be available at
https://github.com/MCG-NJU/RTD-Action.
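The relaxed matching loss lends itself to a concrete reading: on top of DETR-style one-to-one Hungarian assignment, any proposal whose temporal IoU with some ground-truth segment is high enough is also kept as a positive, instead of being forced to predict "no action". Below is a minimal sketch of that reading; the tIoU threshold tau, the tensor shapes, and the exact relaxation criterion are illustrative assumptions rather than the paper's precise formulation.

```python
# Sketch of a relaxed one-to-many matching on top of Hungarian assignment.
# Assumption: segments are (start, end) pairs; tau is a hypothetical threshold.
import torch
from scipy.optimize import linear_sum_assignment


def temporal_iou(pred, gt):
    """Pairwise tIoU between 1-D segments.

    pred: (N, 2) proposals, gt: (M, 2) ground-truth segments -> (N, M) tIoU.
    """
    inter_start = torch.max(pred[:, None, 0], gt[None, :, 0])
    inter_end = torch.min(pred[:, None, 1], gt[None, :, 1])
    inter = (inter_end - inter_start).clamp(min=0)
    union = (pred[:, 1] - pred[:, 0])[:, None] \
        + (gt[:, 1] - gt[:, 0])[None, :] - inter
    return inter / union.clamp(min=1e-6)


def relaxed_match(pred, gt, tau=0.7):
    """Hungarian assignment, relaxed so extra high-tIoU proposals stay positive.

    Returns sorted (proposal_idx, gt_idx) positive pairs.
    """
    tiou = temporal_iou(pred, gt)  # (N, M)
    # One-to-one assignment; Hungarian minimizes cost, so negate tIoU.
    rows, cols = linear_sum_assignment(-tiou.numpy())
    pairs = set(zip(rows.tolist(), cols.tolist()))
    # Relaxation: also treat any proposal with tIoU > tau as a positive,
    # instead of pushing it toward the "no action" class.
    extra = (tiou > tau).nonzero(as_tuple=False)
    pairs.update((int(i), int(j)) for i, j in extra)
    return sorted(pairs)


if __name__ == "__main__":
    proposals = torch.tensor([[0.0, 10.0], [1.0, 9.5], [20.0, 30.0]])
    ground_truth = torch.tensor([[0.5, 9.0], [21.0, 29.0]])
    print(relaxed_match(proposals, ground_truth))  # [(0, 0), (1, 0), (2, 1)]
```

Under strict single assignment, the second-best proposal for a ground truth is penalized as background even when it is nearly as accurate; keeping such proposals as positives is what the abstract means by relieving the strict criterion.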
Related papers
- Faster Learning of Temporal Action Proposal via Sparse Multilevel Boundary Generator [9.038216757761955]
Temporal action localization in videos presents significant challenges in the field of computer vision.
We propose a novel framework, Sparse Multilevel Boundary Generator (SMBG), which enhances the boundary-sensitive method with boundary classification and action completeness regression.
Our method is evaluated on two popular benchmarks, ActivityNet-1.3 and THUMOS14, and is shown to achieve state-of-the-art performance with faster inference (2.47x the speed of BSN++ and 2.12x that of DBG).
arXiv Detail & Related papers (2023-03-06T14:26:56Z)
- An Efficient Spatio-Temporal Pyramid Transformer for Action Detection [40.68615998427292]
We present an efficient hierarchical Spatio-Temporal Pyramid Transformer (STPT) video framework for action detection.
Specifically, we propose to use local window attention to encode rich local spatio-temporal representations in the early stages, while applying global attention to capture long-term space-time dependencies in the later stages (a sketch of this windowed attention appears after this list).
In this way, our STPT can encode both locality and dependency with largely reduced redundancy, delivering a promising trade-off between accuracy and efficiency.
arXiv Detail & Related papers (2022-07-21T12:38:05Z)
- SegTAD: Precise Temporal Action Detection via Semantic Segmentation [65.01826091117746]
We formulate the task of temporal action detection from the novel perspective of semantic segmentation.
Owing to the 1-dimensional property of TAD, we are able to convert the coarse-grained detection annotations to fine-grained semantic segmentation annotations for free (a sketch of this conversion appears after this list).
We propose an end-to-end framework, SegTAD, composed of a 1D semantic segmentation network (1D-SSN) and a proposal detection network (PDN).
arXiv Detail & Related papers (2022-03-03T06:52:13Z)
- TransVOD: End-to-end Video Object Detection with Spatial-Temporal Transformers [96.981282736404]
We present TransVOD, the first end-to-end video object detection system based on spatial-temporal Transformer architectures.
Our proposed TransVOD++ sets a new state-of-the-art record in terms of accuracy on ImageNet VID with 90.0% mAP.
Our proposed TransVOD Lite also achieves the best speed and accuracy trade-off with 83.7% mAP while running at around 30 FPS.
arXiv Detail & Related papers (2022-01-13T16:17:34Z)
- Adaptive Proposal Generation Network for Temporal Sentence Localization in Videos [58.83440885457272]
We address the problem of temporal sentence localization in videos (TSLV).
Traditional methods follow a top-down framework which localizes the target segment with pre-defined segment proposals.
We propose an Adaptive Proposal Generation Network (APGN) to maintain the segment-level interaction while speeding up the efficiency.
arXiv Detail & Related papers (2021-09-14T02:02:36Z)
- End-to-end Temporal Action Detection with Transformer [86.80289146697788]
Temporal action detection (TAD) aims to determine the semantic label and the boundaries of every action instance in an untrimmed video.
Here, we construct an end-to-end framework for TAD upon Transformer, termed TadTR.
Our method achieves state-of-the-art performance on HACS Segments and THUMOS14 and competitive performance on ActivityNet-1.3.
arXiv Detail & Related papers (2021-06-18T17:58:34Z)
- Temporal Action Proposal Generation with Transformers [25.66256889923748]
This paper presents a unified temporal action proposal generation framework built on original Transformers.
The Boundary Transformer captures long-term temporal dependencies to predict precise boundary information.
The Proposal Transformer learns the rich inter-proposal relationships for reliable confidence evaluation.
arXiv Detail & Related papers (2021-05-25T16:22:12Z)
- Augmented Transformer with Adaptive Graph for Temporal Action Proposal Generation [79.98992138865042]
We present an augmented transformer with adaptive graph network (ATAG) to exploit both long-range and local temporal contexts for TAPG.
Specifically, we enhance the vanilla transformer by equipping it with a snippet actionness loss and a front block, dubbed the augmented transformer.
An adaptive graph convolutional network (GCN) is proposed to build local temporal context by mining the position information and difference between adjacent features.
arXiv Detail & Related papers (2021-03-30T02:01:03Z)
- Temporal Context Aggregation Network for Temporal Action Proposal Refinement [93.03730692520999]
Temporal action proposal generation is a challenging yet important task in the video understanding field.
Current methods still suffer from inaccurate temporal boundaries and inferior confidence used for retrieval.
We propose TCANet to generate high-quality action proposals through "local and global" temporal context aggregation.
arXiv Detail & Related papers (2021-03-24T12:34:49Z)
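The STPT entry above describes restricting early-stage self-attention to local temporal windows, reserving global attention for later stages. Below is a minimal sketch of the local-window half, folding non-overlapping windows into the batch dimension; the window size, feature width, and padding policy are illustrative assumptions, not the paper's configuration.

```python
# Sketch of local window attention over a 1-D snippet sequence.
# Assumption: hypothetical dims (256-d features, window of 16 snippets).
import torch
import torch.nn as nn


class LocalWindowAttention(nn.Module):
    def __init__(self, dim=256, num_heads=8, window=16):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        # x: (batch, time, dim); time must divide evenly into windows here.
        b, t, d = x.shape
        w = self.window
        assert t % w == 0, "pad the sequence to a multiple of the window size"
        # Fold each window into the batch dimension and attend within it.
        x = x.reshape(b * t // w, w, d)
        out, _ = self.attn(x, x, x)
        return out.reshape(b, t, d)


if __name__ == "__main__":
    tokens = torch.randn(2, 64, 256)  # two clips, 64 snippets, 256-d features
    print(LocalWindowAttention()(tokens).shape)  # torch.Size([2, 64, 256])
```

Because each token attends only within its window, the attention cost grows linearly with sequence length rather than quadratically, which is where the claimed accuracy/efficiency trade-off comes from.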
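The SegTAD entry above notes that, because TAD is 1-dimensional, segment-level detection annotations can be converted to per-snippet semantic segmentation labels for free. A minimal sketch of that conversion follows; the uniform snippet grid and the background-is-zero label convention are illustrative assumptions.

```python
# Sketch of converting segment annotations to 1-D per-snippet labels.
# Assumption: snippets are uniformly spaced; class 0 denotes background.
import numpy as np


def segments_to_snippet_labels(segments, num_snippets, duration):
    """segments: list of (start_sec, end_sec, class_id) with class_id >= 1.

    Returns an int array of length num_snippets, 0 for background.
    """
    labels = np.zeros(num_snippets, dtype=np.int64)
    snippet_len = duration / num_snippets
    centers = (np.arange(num_snippets) + 0.5) * snippet_len
    for start, end, class_id in segments:
        # Label every snippet whose center falls inside the segment.
        labels[(centers >= start) & (centers < end)] = class_id
    return labels


if __name__ == "__main__":
    # A 60 s video sampled into 100 snippets, with two annotated actions.
    print(segments_to_snippet_labels([(5.0, 12.0, 1), (40.0, 50.0, 2)], 100, 60.0))
```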