OpenTAD: A Unified Framework and Comprehensive Study of Temporal Action Detection
- URL: http://arxiv.org/abs/2502.20361v1
- Date: Thu, 27 Feb 2025 18:32:27 GMT
- Title: OpenTAD: A Unified Framework and Comprehensive Study of Temporal Action Detection
- Authors: Shuming Liu, Chen Zhao, Fatimah Zohra, Mattia Soldan, Alejandro Pardo, Mengmeng Xu, Lama Alssum, Merey Ramazanova, Juan León Alcázar, Anthony Cioppa, Silvio Giancola, Carlos Hinojosa, Bernard Ghanem
- Abstract summary: Temporal action detection (TAD) is a fundamental video understanding task that aims to identify human actions and localize their temporal boundaries in videos. We propose OpenTAD, a unified TAD framework consolidating 16 different TAD methods and 9 standard datasets into a modular framework. Minimal effort is required to replace one module with a different design, train a feature-based TAD model in end-to-end mode, or switch between the two.
- Score: 86.30994231610651
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Temporal action detection (TAD) is a fundamental video understanding task that aims to identify human actions and localize their temporal boundaries in videos. Although this field has achieved remarkable progress in recent years, further progress and real-world applications are impeded by the absence of a standardized framework. Currently, different methods are compared under different implementation settings, evaluation protocols, etc., making it difficult to assess the real effectiveness of a specific technique. To address this issue, we propose \textbf{OpenTAD}, a unified TAD framework consolidating 16 different TAD methods and 9 standard datasets into a modular codebase. In OpenTAD, minimal effort is required to replace one module with a different design, train a feature-based TAD model in end-to-end mode, or switch between the two. OpenTAD also facilitates straightforward benchmarking across various datasets and enables fair and in-depth comparisons among different methods. With OpenTAD, we comprehensively study how innovations in different network components affect detection performance and identify the most effective design choices through extensive experiments. This study has led to a new state-of-the-art TAD method built upon existing techniques for each component. We have made our code and models available at https://github.com/sming256/OpenTAD.
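To illustrate the kind of modularity the abstract describes, below is a minimal, hypothetical Python sketch of a registry-driven detector assembly, where swapping a backbone or switching between feature-based and end-to-end inputs is a one-line config change. The registry, module names, and `build_detector` helper are assumptions made for illustration only and do not reflect OpenTAD's actual API.

```python
# Hypothetical sketch of a config-driven, modular TAD pipeline.
# TADRegistry, build_detector, and all module names are illustrative,
# not OpenTAD's real interface.
from dataclasses import dataclass
from typing import Callable, Dict


class TADRegistry:
    """Maps component names to constructors so modules can be swapped via config."""
    _components: Dict[str, Callable] = {}

    @classmethod
    def register(cls, name: str):
        def deco(fn):
            cls._components[name] = fn
            return fn
        return deco

    @classmethod
    def build(cls, name: str, **kwargs):
        return cls._components[name](**kwargs)


@TADRegistry.register("tsn_features")
def tsn_feature_backbone(dim: int = 2048):
    return f"frozen TSN features (dim={dim})"            # feature-based input


@TADRegistry.register("videomae_e2e")
def videomae_backbone(dim: int = 768):
    return f"end-to-end VideoMAE backbone (dim={dim})"    # end-to-end input


@TADRegistry.register("actionformer_head")
def actionformer_head(num_classes: int):
    return f"ActionFormer-style head ({num_classes} classes)"


@dataclass
class Detector:
    backbone: str
    head: str


def build_detector(cfg: dict) -> Detector:
    """Assemble a detector from a config; each module is looked up by name."""
    return Detector(
        backbone=TADRegistry.build(cfg["backbone"]["type"], **cfg["backbone"]["args"]),
        head=TADRegistry.build(cfg["head"]["type"], **cfg["head"]["args"]),
    )


if __name__ == "__main__":
    # Feature-based configuration ...
    cfg = {
        "backbone": {"type": "tsn_features", "args": {"dim": 2048}},
        "head": {"type": "actionformer_head", "args": {"num_classes": 200}},
    }
    print(build_detector(cfg))
    # ... switched to end-to-end training by editing only the backbone entry.
    cfg["backbone"] = {"type": "videomae_e2e", "args": {"dim": 768}}
    print(build_detector(cfg))
```

Under this kind of design, benchmarking a new component across datasets reduces to registering one constructor and editing one config entry, which is the fair-comparison workflow the abstract emphasizes.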
Related papers
- SM3Det: A Unified Model for Multi-Modal Remote Sensing Object Detection [73.49799596304418]
This paper introduces a new task called Multi-Modal datasets and Multi-Task Object Detection (M2Det) for remote sensing. It is designed to accurately detect horizontal or oriented objects from any sensor modality. This task poses challenges due to 1) the trade-offs involved in managing multi-modal modelling and 2) the complexities of multi-task optimization.
arXiv Detail & Related papers (2024-12-30T02:47:51Z)
- Adapting Vision-Language Models to Open Classes via Test-Time Prompt Tuning [50.26965628047682]
Adapting pre-trained models to open classes is a challenging problem in machine learning.
In this paper, we consider combining the advantages of both and propose a test-time prompt tuning approach.
Our proposed method outperforms all comparison methods on average considering both base and new classes.
arXiv Detail & Related papers (2024-08-29T12:34:01Z)
- Simplifying Source-Free Domain Adaptation for Object Detection: Effective Self-Training Strategies and Performance Insights [8.725446812770791]
This paper focuses on source-free domain adaptation for object detection in computer vision.
Recent research has proposed various solutions for Source-Free Object Detection (SFOD).
arXiv Detail & Related papers (2024-07-10T12:18:38Z)
- XTrack: Multimodal Training Boosts RGB-X Video Object Trackers [88.72203975896558]
It is crucial to ensure that knowledge gained from multimodal sensing is effectively shared. Similar samples across different modalities have more knowledge to share than otherwise. We propose a method for RGB-X tracking during inference, with an average +3% precision improvement over the current SOTA.
arXiv Detail & Related papers (2024-05-28T03:00:58Z)
- AMFD: Distillation via Adaptive Multimodal Fusion for Multispectral Pedestrian Detection [23.91870504363899]
Double-stream networks in multispectral detection employ two separate feature extraction branches for multi-modal data.
This two-branch design has hindered the widespread deployment of multispectral pedestrian detection on embedded devices for autonomous systems.
We introduce the Adaptive Modal Fusion Distillation (AMFD) framework, which can fully utilize the original modal features of the teacher network.
arXiv Detail & Related papers (2024-05-21T17:17:17Z)
- Unified-modal Salient Object Detection via Adaptive Prompt Learning [18.90181500147265]
We propose a unified framework called UniSOD to address both single-modal and multi-modal SOD tasks.
UniSOD learns modality-aware prompts with task-specific hints through adaptive prompt learning.
Our method achieves overall performance improvement on 14 benchmark datasets for RGB, RGB-D, and RGB-T SOD.
arXiv Detail & Related papers (2023-11-28T14:51:08Z)
- Few-shot Event Detection: An Empirical Study and a Unified View [28.893154182743643]
Few-shot event detection (ED) has been widely studied, yet this breadth has introduced noticeable discrepancies across works.
This paper presents a thorough empirical study, a unified view of ED models, and a better unified baseline.
arXiv Detail & Related papers (2023-05-03T05:31:48Z)
- BasicTAD: an Astounding RGB-Only Baseline for Temporal Action Detection [46.37418710853632]
We study a simple, straightforward, yet must-know baseline, given the current status of complex designs and low detection efficiency in TAD.
We extensively investigate the existing techniques in each component for this baseline, and more importantly, perform end-to-end training over the entire pipeline.
This simple BasicTAD yields an astounding, real-time RGB-only baseline that comes very close to state-of-the-art methods with two-stream inputs.
arXiv Detail & Related papers (2022-05-05T15:42:56Z)
- Benchmarking Deep Models for Salient Object Detection [67.07247772280212]
We construct a general SALient Object Detection (SALOD) benchmark to conduct a comprehensive comparison among several representative SOD methods.
In the above experiments, we find that existing loss functions are usually specialized for some metrics but report inferior results on the others.
We propose a novel Edge-Aware (EA) loss that promotes deep networks to learn more discriminative features by integrating both pixel- and image-level supervision signals.
arXiv Detail & Related papers (2022-02-07T03:43:16Z)
- Visual Transformer for Task-aware Active Learning [49.903358393660724]
We present a novel pipeline for pool-based Active Learning.
Our method exploits accessible unlabelled examples during training to estimate their correlation with the labelled examples.
A Visual Transformer models non-local visual concept dependencies between labelled and unlabelled examples.
arXiv Detail & Related papers (2021-06-07T17:13:59Z)