End-to-end Temporal Action Detection with Transformer
- URL: http://arxiv.org/abs/2106.10271v1
- Date: Fri, 18 Jun 2021 17:58:34 GMT
- Title: End-to-end Temporal Action Detection with Transformer
- Authors: Xiaolong Liu, Qimeng Wang, Yao Hu, Xu Tang, Song Bai, Xiang Bai
- Abstract summary: Temporal action detection (TAD) aims to determine the semantic label and the boundaries of every action instance in an untrimmed video.
Here, we construct an end-to-end framework for TAD upon Transformer, termed TadTR.
Our method achieves state-of-the-art performance on HACS Segments and THUMOS14 and competitive performance on ActivityNet-1.3.
- Score: 86.80289146697788
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Temporal action detection (TAD) aims to determine the semantic label and the
boundaries of every action instance in an untrimmed video. It is a fundamental
task in video understanding and significant progress has been made in TAD.
Previous methods involve multiple stages or networks and hand-designed rules or
operations, which fall short in efficiency and flexibility. Here, we construct
an end-to-end framework for TAD upon Transformer, termed TadTR, which
simultaneously predicts all action instances as a set of labels and temporal
locations in parallel. TadTR is able to adaptively extract temporal context
information needed for making action predictions, by selectively attending to a
number of snippets in a video. It greatly simplifies the pipeline of TAD and
runs much faster than previous detectors. Our method achieves state-of-the-art
performance on HACS Segments and THUMOS14 and competitive performance on
ActivityNet-1.3. Our code will be made available at
https://github.com/xlliu7/TadTR.
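For readers who want a concrete picture of the set-prediction idea, the following is a minimal sketch of a DETR-style decoding head for TAD, assuming pre-extracted snippet features; the module names, layer sizes, and query count are illustrative and are not taken from the TadTR implementation.

```python
import torch
import torch.nn as nn

class SetPredictionTADHead(nn.Module):
    """Hypothetical DETR-style head for temporal action detection.

    Learnable action queries cross-attend to snippet features, and every
    query predicts one (class, segment) pair; all instances come out in
    parallel, mirroring the set-prediction idea in the abstract.
    """

    def __init__(self, feat_dim=256, num_queries=40, num_classes=20):
        super().__init__()
        self.queries = nn.Embedding(num_queries, feat_dim)
        layer = nn.TransformerDecoderLayer(d_model=feat_dim, nhead=8,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        # +1 logit for the "no action" (background) class
        self.cls_head = nn.Linear(feat_dim, num_classes + 1)
        # (center, width), normalized to [0, 1] over the video length
        self.seg_head = nn.Linear(feat_dim, 2)

    def forward(self, snippet_feats):
        # snippet_feats: (batch, num_snippets, feat_dim)
        b = snippet_feats.size(0)
        q = self.queries.weight.unsqueeze(0).expand(b, -1, -1)
        h = self.decoder(q, snippet_feats)  # attend to video snippets
        return self.cls_head(h), self.seg_head(h).sigmoid()
```

In this sketch each query's attention over the snippet sequence plays the role of the adaptive temporal-context extraction the abstract describes; the real model additionally uses task-specific components not shown here.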
Related papers
- Technical Report for ActivityNet Challenge 2022 -- Temporal Action Localization [20.268572246761895]
We propose to locate the temporal boundaries of each action and predict the action class in untrimmed videos.
Faster-TAD simplifies the pipeline of TAD and gets remarkable performance.
arXiv Detail & Related papers (2024-10-31T14:16:56Z)
- Harnessing Temporal Causality for Advanced Temporal Action Detection [53.654457142657236]
We introduce CausalTAD, which combines causal attention and causal Mamba to achieve state-of-the-art performance on benchmarks.
We ranked 1st in the Action Recognition, Action Detection, and Audio-Based Interaction Detection tracks at the EPIC-Kitchens Challenge 2024, and 1st in the Moment Queries track at the Ego4D Challenge 2024.
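The summary above names causal attention as one ingredient; as a reminder of what the causal constraint means for a snippet sequence, here is a toy sketch (CausalTAD's actual blocks, including the Mamba branch, are more elaborate and are not reproduced here):

```python
import torch
import torch.nn.functional as F

def causal_self_attention(x):
    """Toy causal attention over a snippet sequence.

    x: (batch, time, dim). Each position may attend only to itself and
    earlier positions, which is the "causal" constraint the summary
    refers to. No learned projections, for brevity.
    """
    t = x.size(1)
    scores = x @ x.transpose(1, 2) / x.size(-1) ** 0.5  # (b, t, t)
    mask = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))    # hide the future
    return F.softmax(scores, dim=-1) @ x
```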
arXiv Detail & Related papers (2024-07-25T06:03:02Z)
- One-Stage Open-Vocabulary Temporal Action Detection Leveraging Temporal Multi-scale and Action Label Features [2.8266810371534152]
Open-vocabulary Temporal Action Detection (Open-vocab TAD) is an advanced video analysis approach.
The proposed method achieves superior results over existing methods in both Open-vocab and Closed-vocab settings.
arXiv Detail & Related papers (2024-04-30T13:14:28Z)
- TE-TAD: Towards Full End-to-End Temporal Action Detection via Time-Aligned Coordinate Expression [25.180317527112372]
Normalized coordinate expression is identified as a key factor behind the reliance on hand-crafted components in query-based detectors for temporal action detection (TAD).
We propose TE-TAD, a full end-to-end temporal action detection transformer that integrates time-aligned coordinate expression.
Our approach not only simplifies the TAD process by eliminating the need for hand-crafted components but also significantly improves the performance of query-based detectors.
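To make the coordinate-expression distinction concrete: conventional query-based detectors regress segments normalized to [0, 1], which must then be rescaled to seconds. A minimal helper showing that conversion (illustrative only; TE-TAD's actual time-aligned parameterization differs) might be:

```python
def normalized_to_seconds(center, width, video_duration):
    """Convert a normalized (center, width) prediction in [0, 1] to an
    absolute (start, end) segment in seconds.

    A time-aligned expression, as TE-TAD advocates, works directly in
    units of time instead of requiring this post-hoc rescaling.
    """
    start = (center - width / 2.0) * video_duration
    end = (center + width / 2.0) * video_duration
    return max(0.0, start), min(video_duration, end)
```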
arXiv Detail & Related papers (2024-04-03T02:16:30Z)
- HTNet: Anchor-free Temporal Action Localization with Hierarchical Transformers [19.48000379201692]
Temporal action localization (TAL) is a task of identifying a set of actions in a video.
We present a novel anchor-free framework, known as HTNet, which predicts a set of <start time, end time, class> triplets from a video.
We demonstrate that our method localizes action instances accurately and achieves state-of-the-art performance on two TAL benchmark datasets.
arXiv Detail & Related papers (2022-07-20T05:40:03Z)
- An Empirical Study of End-to-End Temporal Action Detection [82.64373812690127]
Temporal action detection (TAD) is an important yet challenging task in video understanding.
Rather than end-to-end learning, most existing methods adopt a head-only learning paradigm.
We validate the advantage of end-to-end learning over head-only learning and observe up to 11% performance improvement.
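A hedged sketch of the two paradigms compared in the study, assuming a PyTorch backbone/head split (names are illustrative, not from the paper's code):

```python
import torch.nn as nn

def configure_training(backbone: nn.Module, head: nn.Module,
                       end_to_end: bool):
    """Toggle between the two learning paradigms.

    Head-only learning: the video backbone is frozen (features are
    effectively pre-extracted) and only the detection head trains.
    End-to-end learning: gradients also flow into the backbone.
    """
    for p in backbone.parameters():
        p.requires_grad = end_to_end
    for p in head.parameters():
        p.requires_grad = True
    # return the trainable parameters for the optimizer
    return [p for m in (backbone, head)
            for p in m.parameters() if p.requires_grad]
```

Passing end_to_end=True is what lets the backbone adapt to the detection task, which is where the up-to-11% gain reported above comes from.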
arXiv Detail & Related papers (2022-04-06T16:46:30Z)
- SegTAD: Precise Temporal Action Detection via Semantic Segmentation [65.01826091117746]
We formulate the task of temporal action detection in a novel perspective of semantic segmentation.
Owing to the 1-dimensional property of TAD, we are able to convert the coarse-grained detection annotations to fine-grained semantic segmentation annotations for free.
We propose SegTAD, an end-to-end framework composed of a 1D semantic segmentation network (1D-SSN) and a proposal detection network (PDN).
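The "for free" conversion can be pictured as rasterizing 1D intervals into a dense per-snippet label sequence. A minimal sketch, assuming (start, end, class) annotations in seconds (overlap handling is simplified; SegTAD's exact scheme may differ):

```python
def intervals_to_frame_labels(intervals, num_snippets, duration,
                              background=0):
    """Rasterize (start_sec, end_sec, class_id) action intervals into a
    per-snippet label sequence.

    Because TAD annotations are 1-dimensional intervals, this dense
    conversion needs no extra labeling effort. Overlapping intervals
    are resolved naively here: the last one wins.
    """
    labels = [background] * num_snippets
    for start, end, cls in intervals:
        i0 = max(0, int(start / duration * num_snippets))
        i1 = min(num_snippets, int(end / duration * num_snippets) + 1)
        for i in range(i0, i1):
            labels[i] = cls
    return labels
```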
arXiv Detail & Related papers (2022-03-03T06:52:13Z)
- Augmented Transformer with Adaptive Graph for Temporal Action Proposal Generation [79.98992138865042]
We present an augmented transformer with adaptive graph network (ATAG) to exploit both long-range and local temporal contexts for TAPG.
Specifically, we enhance the vanilla transformer by equipping it with a snippet actionness loss and a front block, dubbed augmented transformer.
An adaptive graph convolutional network (GCN) is proposed to build local temporal context by mining the position information and difference between adjacent features.
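One way to picture the "difference between adjacent features" idea is an adjacency whose edge weights shrink as neighboring snippet features diverge. This toy sketch is a loose reading, not ATAG's actual adaptive GCN:

```python
import torch

def adjacent_difference_adjacency(feats):
    """Build a simple temporal adjacency weighted by the similarity of
    adjacent snippet features.

    feats: (time, dim). Edges connect neighbors t and t+1; the weight
    decays as the feature difference grows, so the graph emphasizes
    locally coherent context.
    """
    t = feats.size(0)
    adj = torch.zeros(t, t)
    diff = (feats[1:] - feats[:-1]).norm(dim=-1)  # (t-1,) gap sizes
    w = torch.exp(-diff)                          # similar -> weight near 1
    idx = torch.arange(t - 1)
    adj[idx, idx + 1] = w
    adj[idx + 1, idx] = w
    return adj
```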
arXiv Detail & Related papers (2021-03-30T02:01:03Z)
- End-to-End Video Instance Segmentation with Transformers [84.17794705045333]
Video instance segmentation (VIS) is the task that requires simultaneously classifying, segmenting and tracking object instances of interest in video.
Here, we propose a new video instance segmentation framework built upon Transformers, termed VisTR, which views the VIS task as a direct end-to-end parallel sequence decoding/prediction problem.
For the first time, we demonstrate a much simpler and faster video instance segmentation framework built upon Transformers, achieving competitive accuracy.
arXiv Detail & Related papers (2020-11-30T02:03:50Z)