BasicTAD: an Astounding RGB-Only Baseline for Temporal Action Detection
- URL: http://arxiv.org/abs/2205.02717v3
- Date: Mon, 10 Apr 2023 14:57:34 GMT
- Title: BasicTAD: an Astounding RGB-Only Baseline for Temporal Action Detection
- Authors: Min Yang, Guo Chen, Yin-Dong Zheng, Tong Lu, Limin Wang
- Abstract summary: We study a simple, straightforward, yet must-know baseline, given the current state of complex designs and low detection efficiency in TAD.
We extensively investigate existing techniques for each component of this baseline and, more importantly, perform end-to-end training over the entire pipeline.
This simple BasicTAD yields an astounding, real-time RGB-only baseline that comes very close to state-of-the-art methods with two-stream inputs.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Temporal action detection (TAD) is extensively studied in the video
understanding community by generally following the object detection pipeline in
images. However, complex designs are not uncommon in TAD, such as two-stream
feature extraction, multi-stage training, complex temporal modeling, and global
context fusion. In this paper, we do not aim to introduce any novel technique
for TAD. Instead, we study a simple, straightforward, yet must-know baseline,
given the current state of complex designs and low detection efficiency in TAD.
In our simple baseline (termed BasicTAD), we decompose the TAD pipeline into
several essential components: data sampling, backbone design, neck
construction, and detection head. We extensively investigate the existing
techniques in each component for this baseline, and more importantly, perform
end-to-end training over the entire pipeline thanks to the simplicity of
design. As a result, this simple BasicTAD yields an astounding, real-time
RGB-only baseline that comes very close to state-of-the-art methods with
two-stream inputs. In addition, we further improve BasicTAD by preserving more
temporal and spatial information in the network representation (termed PlusTAD).
Empirical results demonstrate that PlusTAD is very efficient and significantly
outperforms previous methods on the THUMOS14 and FineAction datasets. We also
perform in-depth visualization and error analysis of the proposed method to
provide further insights into the TAD problem. Our approach can serve as a
strong baseline for future TAD research.
The code and model will be released at https://github.com/MCG-NJU/BasicTAD.
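
To make the pipeline decomposition described in the abstract concrete, the following is a minimal, hypothetical sketch of an RGB-only detector composed of a backbone, a temporal neck, and a detection head in PyTorch. The module names, feature shapes, and layer choices are illustrative assumptions for exposition only and are not taken from the released BasicTAD code.

```python
import torch.nn as nn


class TemporalNeck(nn.Module):
    """Builds a small temporal feature pyramid with strided 1D convolutions."""

    def __init__(self, channels: int, levels: int = 4):
        super().__init__()
        self.stages = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size=3, stride=2, padding=1)
            for _ in range(levels)
        )

    def forward(self, x):                      # x: (B, C, T)
        feats = [x]
        for stage in self.stages:
            feats.append(stage(feats[-1]))     # halve the temporal length per level
        return feats


class DetectionHead(nn.Module):
    """Per-location class scores and (start, end) boundary offsets."""

    def __init__(self, channels: int, num_classes: int):
        super().__init__()
        self.cls = nn.Conv1d(channels, num_classes, kernel_size=3, padding=1)
        self.reg = nn.Conv1d(channels, 2, kernel_size=3, padding=1)

    def forward(self, pyramid):
        return [(self.cls(f), self.reg(f)) for f in pyramid]


class SimpleTAD(nn.Module):
    """Sampled RGB frames -> backbone -> neck -> head, trainable end to end."""

    def __init__(self, backbone: nn.Module, channels: int, num_classes: int):
        super().__init__()
        self.backbone = backbone               # e.g. a 3D CNN over the sampled clip
        self.neck = TemporalNeck(channels)
        self.head = DetectionHead(channels, num_classes)

    def forward(self, frames):                 # frames: (B, 3, T, H, W)
        clip_feat = self.backbone(frames)      # assumed to return (B, C, T')
        return self.head(self.neck(clip_feat))
```

The point of the sketch is the structural decomposition: because data sampling feeds plain, stacked modules, the whole pipeline can be trained end to end directly from RGB frames, which is the property the abstract emphasizes.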
Related papers
- Exploring Plain ViT Reconstruction for Multi-class Unsupervised Anomaly Detection [128.40330044868293]
Vision Transformer (ViT), with its more straightforward architecture, has proven effective in multiple domains.
ViTAD achieves state-of-the-art results and efficiency on MVTec AD, VisA, and Uni-Medical datasets.
arXiv Detail & Related papers (2023-12-12T18:28:59Z)
- Few Clicks Suffice: Active Test-Time Adaptation for Semantic Segmentation [14.112999441288615]
Test-time adaptation (TTA) adapts pre-trained models during inference using unlabeled test data.
There is still a significant performance gap between TTA approaches and their supervised counterparts.
We propose the ATASeg framework, which consists of two parts: a model adapter and a label annotator.
arXiv Detail & Related papers (2023-12-04T12:16:02Z)
- Contrastive Transformer Learning with Proximity Data Generation for Text-Based Person Search [60.626459715780605]
Given a descriptive text query, text-based person search aims to retrieve the best-matched target person from an image gallery.
Such a cross-modal retrieval task is quite challenging due to the significant modality gap, fine-grained differences, and the scarcity of annotated data.
In this paper, we propose a simple yet effective dual Transformer model for text-based person search.
arXiv Detail & Related papers (2023-11-15T16:26:49Z)
- ALSO: Automotive Lidar Self-supervision by Occupancy estimation [70.70557577874155]
We propose a new self-supervised method for pre-training the backbone of deep perception models operating on point clouds.
The core idea is to train the model on a pretext task which is the reconstruction of the surface on which the 3D points are sampled.
The intuition is that if the network is able to reconstruct the scene surface, given only sparse input points, then it probably also captures some fragments of semantic information.
arXiv Detail & Related papers (2022-12-12T13:10:19Z)
- Densely Nested Top-Down Flows for Salient Object Detection [137.74130900326833]
This paper revisits the role of top-down modeling in salient object detection.
It designs a novel densely nested top-down flows (DNTDF)-based framework.
In every stage of DNTDF, features from higher levels are read in via the progressive compression shortcut paths (PCSP).
arXiv Detail & Related papers (2021-02-18T03:14:02Z)
- The Little W-Net That Could: State-of-the-Art Retinal Vessel Segmentation with Minimalistic Models [19.089445797922316]
We show that a minimalistic version of a standard U-Net with several orders of magnitude less parameters closely approximates the performance of current best techniques.
We also propose a simple extension, dubbed W-Net, which reaches outstanding performance on several popular datasets.
We also test our approach on the Artery/Vein segmentation problem, where we again achieve results well-aligned with the state-of-the-art.
arXiv Detail & Related papers (2020-09-03T19:59:51Z)
- End-to-End Object Detection with Transformers [88.06357745922716]
We present a new method that views object detection as a direct set prediction problem.
Our approach streamlines the detection pipeline, effectively removing the need for many hand-designed components.
The main ingredients of the new framework, called DEtection TRansformer or DETR, are a set-based global loss that forces unique predictions via bipartite matching, and a transformer encoder-decoder architecture (a simplified sketch of the matching step follows this list).
arXiv Detail & Related papers (2020-05-26T17:06:38Z)
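
The set prediction formulation in the DETR entry above hinges on a one-to-one bipartite matching between predictions and ground-truth objects before the loss is computed. Below is a minimal, illustrative sketch of that matching step using SciPy's Hungarian solver; the cost used here (negative class probability plus an L1 box distance) is a simplified stand-in for the full DETR matching cost, which also includes a generalized IoU term.

```python
# Simplified sketch of set-based bipartite matching via the Hungarian algorithm.
# Cost terms are illustrative; the actual DETR cost also uses generalized IoU.
import torch
from scipy.optimize import linear_sum_assignment


def match_predictions(pred_logits, pred_boxes, gt_labels, gt_boxes):
    """Return a one-to-one assignment of predictions to ground-truth objects.

    pred_logits: (N, num_classes) raw scores, pred_boxes: (N, 4)
    gt_labels:   (M,) class indices,          gt_boxes:   (M, 4)
    """
    prob = pred_logits.softmax(-1)                      # (N, num_classes)
    cost_class = -prob[:, gt_labels]                    # (N, M): cheaper when the right class is confident
    cost_box = torch.cdist(pred_boxes, gt_boxes, p=1)   # (N, M): L1 distance between boxes
    cost = (cost_class + cost_box).detach().cpu().numpy()
    pred_idx, gt_idx = linear_sum_assignment(cost)      # minimum-cost one-to-one matching
    return torch.as_tensor(pred_idx), torch.as_tensor(gt_idx)
```

Once matched, classification and box losses are computed only on the assigned pairs, with unmatched predictions supervised as "no object", which is what removes the need for hand-designed anchors and non-maximum suppression.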
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.