End-to-end Temporal Action Detection with Transformer
- URL: http://arxiv.org/abs/2106.10271v1
- Date: Fri, 18 Jun 2021 17:58:34 GMT
- Title: End-to-end Temporal Action Detection with Transformer
- Authors: Xiaolong Liu, Qimeng Wang, Yao Hu, Xu Tang, Song Bai, Xiang Bai
- Abstract summary: Temporal action detection (TAD) aims to determine the semantic label and the boundaries of every action instance in an untrimmed video.
Here, we construct an end-to-end framework for TAD upon Transformer, termed TadTR.
Our method achieves state-of-the-art performance on HACS Segments and THUMOS14 and competitive performance on ActivityNet-1.3.
- Score: 86.80289146697788
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Temporal action detection (TAD) aims to determine the semantic label and the
boundaries of every action instance in an untrimmed video. It is a fundamental
task in video understanding and significant progress has been made in TAD.
Previous methods involve multiple stages or networks and hand-designed rules or
operations, which fall short in efficiency and flexibility. Here, we construct
an end-to-end framework for TAD upon Transformer, termed TadTR, which
simultaneously predicts all action instances as a set of labels and temporal
locations in parallel. TadTR is able to adaptively extract temporal context
information needed for making action predictions, by selectively attending to a
number of snippets in a video. It greatly simplifies the pipeline of TAD and
runs much faster than previous detectors. Our method achieves state-of-the-art
performance on HACS Segments and THUMOS14 and competitive performance on
ActivityNet-1.3. Our code will be made available at
https://github.com/xlliu7/TadTR.
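For readers who want a concrete picture of the set-prediction idea, the following is a minimal sketch of a DETR-style decoding head for TAD, assuming pre-extracted snippet features; the module names, layer sizes, and query count are illustrative and are not taken from the TadTR implementation.

```python
import torch
import torch.nn as nn

class SetPredictionTADHead(nn.Module):
    """Hypothetical DETR-style head for temporal action detection.

    Learnable action queries cross-attend to snippet features, and every
    query predicts one (class, segment) pair; all instances come out in
    parallel, mirroring the set-prediction idea in the abstract.
    """

    def __init__(self, feat_dim=256, num_queries=40, num_classes=20):
        super().__init__()
        self.queries = nn.Embedding(num_queries, feat_dim)
        layer = nn.TransformerDecoderLayer(d_model=feat_dim, nhead=8,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        # +1 logit for the "no action" (background) class
        self.cls_head = nn.Linear(feat_dim, num_classes + 1)
        # (center, width), normalized to [0, 1] over the video length
        self.seg_head = nn.Linear(feat_dim, 2)

    def forward(self, snippet_feats):
        # snippet_feats: (batch, num_snippets, feat_dim)
        b = snippet_feats.size(0)
        q = self.queries.weight.unsqueeze(0).expand(b, -1, -1)
        h = self.decoder(q, snippet_feats)  # attend to video snippets
        return self.cls_head(h), self.seg_head(h).sigmoid()
```

In this sketch each query's attention over the snippet sequence plays the role of the adaptive temporal-context extraction the abstract describes; the real model additionally uses task-specific components not shown here.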
Related papers
- Technical Report for ActivityNet Challenge 2022 -- Temporal Action Localization [20.268572246761895]
We propose to locate the temporal boundaries of each action and predict the action class in untrimmed videos.
Faster-TAD simplifies the pipeline of TAD and gets remarkable performance.
arXiv Detail & Related papers (2024-10-31T14:16:56Z)
- Harnessing Temporal Causality for Advanced Temporal Action Detection [53.654457142657236]
We introduce CausalTAD, which combines causal attention and causal Mamba to achieve state-of-the-art performance on benchmarks.
We ranked 1st in the Action Recognition, Action Detection, and Audio-Based Interaction Detection tracks at the EPIC-Kitchens Challenge 2024, and 1st in the Moment Queries track at the Ego4D Challenge 2024.
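The summary above names causal attention as one ingredient; as a reminder of what the causal constraint means for a snippet sequence, here is a toy sketch (CausalTAD's actual blocks, including the Mamba branch, are more elaborate and are not reproduced here):

```python
import torch
import torch.nn.functional as F

def causal_self_attention(x):
    """Toy causal attention over a snippet sequence.

    x: (batch, time, dim). Each position may attend only to itself and
    earlier positions, which is the "causal" constraint the summary
    refers to. No learned projections, for brevity.
    """
    t = x.size(1)
    scores = x @ x.transpose(1, 2) / x.size(-1) ** 0.5  # (b, t, t)
    mask = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))    # hide the future
    return F.softmax(scores, dim=-1) @ x
```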
arXiv Detail & Related papers (2024-07-25T06:03:02Z)
- One-Stage Open-Vocabulary Temporal Action Detection Leveraging Temporal Multi-scale and Action Label Features [2.8266810371534152]
Open-vocabulary Temporal Action Detection (Open-vocab TAD) is an advanced video analysis approach.
The proposed method achieves superior results over existing methods in both Open-vocab and Closed-vocab settings.
arXiv Detail & Related papers (2024-04-30T13:14:28Z)
- TE-TAD: Towards Full End-to-End Temporal Action Detection via Time-Aligned Coordinate Expression [25.180317527112372]
Normalized coordinate expression is identified as a key factor behind the reliance on hand-crafted components in query-based detectors for temporal action detection (TAD).
We propose TE-TAD, a full end-to-end temporal action detection transformer that integrates time-aligned coordinate expression.
Our approach not only simplifies the TAD process by eliminating the need for hand-crafted components but also significantly improves the performance of query-based detectors.
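To make the coordinate-expression distinction concrete: conventional query-based detectors regress segments normalized to [0, 1], which must then be rescaled to seconds. A minimal helper showing that conversion (illustrative only; TE-TAD's actual time-aligned parameterization differs) might be:

```python
def normalized_to_seconds(center, width, video_duration):
    """Convert a normalized (center, width) prediction in [0, 1] to an
    absolute (start, end) segment in seconds.

    A time-aligned expression, as TE-TAD advocates, works directly in
    units of time instead of requiring this post-hoc rescaling.
    """
    start = (center - width / 2.0) * video_duration
    end = (center + width / 2.0) * video_duration
    return max(0.0, start), min(video_duration, end)
```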
arXiv Detail & Related papers (2024-04-03T02:16:30Z)
- HTNet: Anchor-free Temporal Action Localization with Hierarchical Transformers [19.48000379201692]
Temporal action localization (TAL) is a task of identifying a set of actions in a video.
We present a novel anchor-free framework, known as HTNet, which predicts a set of <start time, end time, class> triplets from a video.
We demonstrate that our method localizes action instances accurately and achieves state-of-the-art performance on two TAL benchmark datasets.
arXiv Detail & Related papers (2022-07-20T05:40:03Z)
- An Empirical Study of End-to-End Temporal Action Detection [82.64373812690127]
Temporal action detection (TAD) is an important yet challenging task in video understanding.
Rather than end-to-end learning, most existing methods adopt a head-only learning paradigm.
We validate the advantage of end-to-end learning over head-only learning and observe up to 11% performance improvement.
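A hedged sketch of the two paradigms compared in the study, assuming a PyTorch backbone/head split (names are illustrative, not from the paper's code):

```python
import torch.nn as nn

def configure_training(backbone: nn.Module, head: nn.Module,
                       end_to_end: bool):
    """Toggle between the two learning paradigms.

    Head-only learning: the video backbone is frozen (features are
    effectively pre-extracted) and only the detection head trains.
    End-to-end learning: gradients also flow into the backbone.
    """
    for p in backbone.parameters():
        p.requires_grad = end_to_end
    for p in head.parameters():
        p.requires_grad = True
    # return the trainable parameters for the optimizer
    return [p for m in (backbone, head)
            for p in m.parameters() if p.requires_grad]
```

Passing end_to_end=True is what lets the backbone adapt to the detection task, which is where the up-to-11% gain reported above comes from.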
arXiv Detail & Related papers (2022-04-06T16:46:30Z)
- SegTAD: Precise Temporal Action Detection via Semantic Segmentation [65.01826091117746]
We formulate the task of temporal action detection in a novel perspective of semantic segmentation.
Owing to the 1-dimensional property of TAD, we are able to convert the coarse-grained detection annotations to fine-grained semantic segmentation annotations for free.
We propose SegTAD, an end-to-end framework composed of a 1D semantic segmentation network (1D-SSN) and a proposal detection network (PDN).
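The "for free" conversion can be pictured as rasterizing 1D intervals into a dense per-snippet label sequence. A minimal sketch, assuming (start, end, class) annotations in seconds (overlap handling is simplified; SegTAD's exact scheme may differ):

```python
def intervals_to_frame_labels(intervals, num_snippets, duration,
                              background=0):
    """Rasterize (start_sec, end_sec, class_id) action intervals into a
    per-snippet label sequence.

    Because TAD annotations are 1-dimensional intervals, this dense
    conversion needs no extra labeling effort. Overlapping intervals
    are resolved naively here: the last one wins.
    """
    labels = [background] * num_snippets
    for start, end, cls in intervals:
        i0 = max(0, int(start / duration * num_snippets))
        i1 = min(num_snippets, int(end / duration * num_snippets) + 1)
        for i in range(i0, i1):
            labels[i] = cls
    return labels
```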
arXiv Detail & Related papers (2022-03-03T06:52:13Z)
- Augmented Transformer with Adaptive Graph for Temporal Action Proposal Generation [79.98992138865042]
We present an augmented transformer with adaptive graph network (ATAG) to exploit both long-range and local temporal contexts for TAPG.
Specifically, we enhance the vanilla transformer by equipping it with a snippet actionness loss and a front block, dubbed augmented transformer.
An adaptive graph convolutional network (GCN) is proposed to build local temporal context by mining the position information and difference between adjacent features.
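One way to picture the "difference between adjacent features" idea is an adjacency whose edge weights shrink as neighboring snippet features diverge. This toy sketch is a loose reading, not ATAG's actual adaptive GCN:

```python
import torch

def adjacent_difference_adjacency(feats):
    """Build a simple temporal adjacency weighted by the similarity of
    adjacent snippet features.

    feats: (time, dim). Edges connect neighbors t and t+1; the weight
    decays as the feature difference grows, so the graph emphasizes
    locally coherent context.
    """
    t = feats.size(0)
    adj = torch.zeros(t, t)
    diff = (feats[1:] - feats[:-1]).norm(dim=-1)  # (t-1,) gap sizes
    w = torch.exp(-diff)                          # similar -> weight near 1
    idx = torch.arange(t - 1)
    adj[idx, idx + 1] = w
    adj[idx + 1, idx] = w
    return adj
```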
arXiv Detail & Related papers (2021-03-30T02:01:03Z)
- End-to-End Video Instance Segmentation with Transformers [84.17794705045333]
Video instance segmentation (VIS) is the task that requires simultaneously classifying, segmenting and tracking object instances of interest in video.
Here, we propose a new video instance segmentation framework built upon Transformers, termed VisTR, which views the VIS task as a direct end-to-end parallel sequence decoding/prediction problem.
For the first time, we demonstrate a much simpler and faster video instance segmentation framework built upon Transformers, achieving competitive accuracy.
arXiv Detail & Related papers (2020-11-30T02:03:50Z)