Dual DETRs for Multi-Label Temporal Action Detection
- URL: http://arxiv.org/abs/2404.00653v1
- Date: Sun, 31 Mar 2024 11:43:39 GMT
- Title: Dual DETRs for Multi-Label Temporal Action Detection
- Authors: Yuhan Zhu, Guozhen Zhang, Jing Tan, Gangshan Wu, Limin Wang
- Abstract summary: Temporal Action Detection (TAD) aims to identify the action boundaries and the corresponding category within untrimmed videos.
We propose a new Dual-level query-based TAD framework, namely DualDETR, to detect actions at both the instance level and the boundary level.
We evaluate DualDETR on three challenging multi-label TAD benchmarks.
- Score: 46.05173000284639
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Temporal Action Detection (TAD) aims to identify the action boundaries and the corresponding category within untrimmed videos. Inspired by the success of DETR in object detection, several methods have adapted the query-based framework to the TAD task. However, these approaches primarily followed DETR to predict actions at the instance level (i.e., identify each action by its center point), leading to sub-optimal boundary localization. To address this issue, we propose a new Dual-level query-based TAD framework, namely DualDETR, to detect actions at both the instance level and the boundary level. Decoding at different levels requires semantics of different granularity, so we introduce a two-branch decoding structure. This structure builds distinctive decoding processes for different levels, facilitating explicit capture of temporal cues and semantics at each level. On top of the two-branch design, we present a joint query initialization strategy to align queries from both levels. Specifically, we leverage encoder proposals to match queries from each level in a one-to-one manner. Then, the matched queries are initialized using position and content priors from the matched action proposal. The aligned dual-level queries can refine the matched proposal with complementary cues during subsequent decoding. We evaluate DualDETR on three challenging multi-label TAD benchmarks. The experimental results demonstrate that DualDETR outperforms existing state-of-the-art methods, achieving a substantial improvement under det-mAP and delivering impressive results under seg-mAP.
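The joint query initialization described in the abstract can be illustrated with a rough sketch. This is not the authors' implementation: the `Proposal` and `Query` types, the top-k selection by score, the normalized `[0, 1]` time axis, and the choice to copy the same feature vector into every matched query are all assumptions made for illustration. The sketch only shows the idea that one encoder proposal seeds an instance-level query (center, width) and a pair of boundary-level queries (start, end) that stay aligned through a shared proposal index.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Proposal:
    """Hypothetical encoder proposal on a normalized [0, 1] time axis."""
    center: float
    width: float
    score: float
    feature: Tuple[float, ...]


@dataclass
class Query:
    """Hypothetical decoder query with a position prior and a content prior."""
    level: str                    # "instance", "start", or "end"
    position: Tuple[float, ...]   # (center, width) or a single boundary time
    content: Tuple[float, ...]    # content prior copied from the proposal
    proposal_index: int           # proposal this query is matched to


def to_boundaries(center: float, width: float) -> Tuple[float, float]:
    """Convert an instance-level (center, width) pair to clipped (start, end)."""
    start = max(0.0, center - width / 2.0)
    end = min(1.0, center + width / 2.0)
    return start, end


def init_dual_queries(
    proposals: Sequence[Proposal], num_queries: int
) -> Tuple[List[Query], List[Tuple[Query, Query]]]:
    """Seed aligned instance-level and boundary-level queries from proposals.

    Assumption: the highest-scoring proposals are kept, and each one is
    matched one-to-one with an instance query and a (start, end) query pair.
    """
    ranked = sorted(
        range(len(proposals)), key=lambda i: proposals[i].score, reverse=True
    )[:num_queries]

    instance_queries: List[Query] = []
    boundary_queries: List[Tuple[Query, Query]] = []
    for idx in ranked:
        p = proposals[idx]
        start, end = to_boundaries(p.center, p.width)
        instance_queries.append(
            Query("instance", (p.center, p.width), p.feature, idx)
        )
        boundary_queries.append(
            (
                Query("start", (start,), p.feature, idx),
                Query("end", (end,), p.feature, idx),
            )
        )
    return instance_queries, boundary_queries
```

Calling `init_dual_queries(proposals, num_queries=2)` returns two lists of equal length. Entry `i` of each list refers to the same proposal, which is what allows the two decoding branches to refine the same action with complementary cues.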
Related papers
- HM-Conformer: A Conformer-based audio deepfake detection system with
hierarchical pooling and multi-level classification token aggregation methods [34.83806360076228]
HM-Conformer is designed for sequence-to-sequence tasks.
It can efficiently detect spoofing evidence by processing various sequence lengths and aggregating them.
In experimental results, HM-Conformer achieved a 15.71% EER, showing competitive performance compared to recent systems.
arXiv Detail & Related papers (2023-09-15T07:18:30Z) - Semi-DETR: Semi-Supervised Object Detection with Detection Transformers [105.45018934087076]
We analyze the DETR-based framework on semi-supervised object detection (SSOD).
We present Semi-DETR, the first transformer-based end-to-end semi-supervised object detector.
Our method outperforms all state-of-the-art methods by clear margins.
arXiv Detail & Related papers (2023-07-16T16:32:14Z) - Proposal-Based Multiple Instance Learning for Weakly-Supervised Temporal
Action Localization [98.66318678030491]
Weakly-supervised temporal action localization aims to localize and recognize actions in untrimmed videos with only video-level category labels during training.
We propose a novel Proposal-based Multiple Instance Learning (P-MIL) framework that directly classifies the candidate proposals in both the training and testing stages.
arXiv Detail & Related papers (2023-05-29T02:48:04Z) - PointTAD: Multi-Label Temporal Action Detection with Learnable Query
Points [28.607690605262878]
Temporal action detection (TAD) usually handles untrimmed videos with a small number of action instances from a single label.
In this paper, we focus on the task of multi-label temporal action detection that aims to localize all action instances from a multi-label untrimmed video.
We extend the sparse query-based detection paradigm from the traditional TAD and propose the multi-label TAD framework of PointTAD.
arXiv Detail & Related papers (2022-10-20T06:08:03Z) - ReAct: Temporal Action Detection with Relational Queries [84.76646044604055]
This work aims at advancing temporal action detection (TAD) using an encoder-decoder framework with action queries.
We first propose a relational attention mechanism in the decoder, which guides the attention among queries based on their relations.
Lastly, we propose to predict the localization quality of each action query at inference in order to distinguish high-quality queries.
arXiv Detail & Related papers (2022-07-14T17:46:37Z) - Hierarchical Modeling for Task Recognition and Action Segmentation in
Weakly-Labeled Instructional Videos [6.187780920448871]
This paper focuses on task recognition and action segmentation in weakly-labeled instructional videos.
We propose a two-stream framework, which exploits semantic and temporal hierarchies to recognize top-level tasks in instructional videos.
We present a novel top-down weakly-supervised action segmentation approach, where the predicted task is used to constrain the inference of fine-grained action sequences.
arXiv Detail & Related papers (2021-10-12T02:32:15Z) - Weakly-Supervised Spatio-Temporal Anomaly Detection in Surveillance
Video [128.41392860714635]
We introduce Weakly-Supervised Spatio-Temporal Anomaly Detection (WSSTAD) in surveillance video.
WSSTAD aims to localize a spatio-temporal tube (i.e., a sequence of bounding boxes at consecutive times) that encloses the abnormal event.
We propose a dual-branch network which takes as input proposals with multiple granularities in both the spatial and temporal domains.
arXiv Detail & Related papers (2021-08-09T06:11:14Z) - VL-NMS: Breaking Proposal Bottlenecks in Two-Stage Visual-Language
Matching [75.71523183166799]
The prevailing framework for matching multimodal inputs is based on a two-stage process.
We argue that these methods overlook an obvious mismatch between the roles of proposals in the two stages.
We propose VL-NMS, which is the first method to yield query-aware proposals at the first stage.
arXiv Detail & Related papers (2021-05-12T13:05:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.