Related papers: TE-TAD: Towards Full End-to-End Temporal Action Detection via Time-Aligned Coordinate Expression

TE-TAD: Towards Full End-to-End Temporal Action Detection via Time-Aligned Coordinate Expression

URL: http://arxiv.org/abs/2404.02405v2
Date: Thu, 4 Apr 2024 02:56:00 GMT
Title: TE-TAD: Towards Full End-to-End Temporal Action Detection via Time-Aligned Coordinate Expression
Authors: Ho-Joong Kim, Jung-Ho Hong, Heejo Kong, Seong-Whan Lee,
Abstract summary: normalized coordinate expression is a key factor as reliance on hand-crafted components in query-based detectors for temporal action detection (TAD) We propose modelname, a full end-to-end temporal action detection transformer that integrates time-aligned coordinate expression. Our approach not only simplifies the TAD process by eliminating the need for hand-crafted components but also significantly improves the performance of query-based detectors.
Score: 25.180317527112372
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In this paper, we investigate that the normalized coordinate expression is a key factor as reliance on hand-crafted components in query-based detectors for temporal action detection (TAD). Despite significant advancements towards an end-to-end framework in object detection, query-based detectors have been limited in achieving full end-to-end modeling in TAD. To address this issue, we propose \modelname{}, a full end-to-end temporal action detection transformer that integrates time-aligned coordinate expression. We reformulate coordinate expression utilizing actual timeline values, ensuring length-invariant representations from the extremely diverse video duration environment. Furthermore, our proposed adaptive query selection dynamically adjusts the number of queries based on video length, providing a suitable solution for varying video durations compared to a fixed query set. Our approach not only simplifies the TAD process by eliminating the need for hand-crafted components but also significantly improves the performance of query-based detectors. Our TE-TAD outperforms the previous query-based detectors and achieves competitive performance compared to state-of-the-art methods on popular benchmark datasets. Code is available at: https://github.com/Dotori-HJ/TE-TAD

Related papers

Action tube generation by person query matching for spatio-temporal action detection [0.0]
Method generates action tubes from the original video without relying on post-processing steps such as IoU-based linking and clip splitting. Our approach applies query-based detection (DETR) to each frame and matches DETR queries to link the same person across frames. Action classes are predicted using the sequence of queries obtained from QMM matching, allowing for variable-length inputs from videos longer than a single clip.
arXiv Detail & Related papers (2025-03-17T09:26:06Z)
A Flexible and Scalable Framework for Video Moment Search [51.47907684209207]
This paper introduces a flexible framework for retrieving a ranked list of moments from collection of videos in any length to match a text query. Our framework, called Segment-Proposal-Ranking (SPR), simplifies the search process into three independent stages: segment retrieval, proposal generation, and moment refinement with re-ranking. Evaluations on the TVR-Ranking dataset demonstrate that our framework achieves state-of-the-art performance with significant reductions in computational cost and processing time.
arXiv Detail & Related papers (2025-01-09T08:54:19Z)
Query matching for spatio-temporal action detection with query-based object detector [0.0]
We propose a method that extends the query-based object detection model, DETR, to maintain temporal consistency in videos. Our method applies DETR to each frame and uses feature shift to incorporate temporal information. To overcome this issue, we propose query matching across different frames, ensuring that queries for the same object are matched and used for the feature shift.
arXiv Detail & Related papers (2024-09-27T02:54:24Z)
Spatial-Temporal Graph Enhanced DETR Towards Multi-Frame 3D Object Detection [54.041049052843604]
We present STEMD, a novel end-to-end framework that enhances the DETR-like paradigm for multi-frame 3D object detection. First, to model the inter-object spatial interaction and complex temporal dependencies, we introduce the spatial-temporal graph attention network. Finally, it poses a challenge for the network to distinguish between the positive query and other highly similar queries that are not the best match.
arXiv Detail & Related papers (2023-07-01T13:53:14Z)
Transform-Equivariant Consistency Learning for Temporal Sentence Grounding [66.10949751429781]
We introduce a novel Equivariant Consistency Regulation Learning framework to learn more discriminative representations for each video. Our motivation comes from that the temporal boundary of the query-guided activity should be consistently predicted. In particular, we devise a self-supervised consistency loss module to enhance the completeness and smoothness of the augmented video.
arXiv Detail & Related papers (2023-05-06T19:29:28Z)
Query-Dependent Video Representation for Moment Retrieval and Highlight Detection [8.74967598360817]
Key objective of MR/HD is to localize the moment and estimate clip-wise accordance level, i.e., saliency score, to a given text query. Recent transformer-based models do not fully exploit the information of a given query. We introduce Query-Dependent DETR (QD-DETR), a detection transformer tailored for MR/HD.
arXiv Detail & Related papers (2023-03-24T09:32:50Z)
TransVOD: End-to-end Video Object Detection with Spatial-Temporal Transformers [96.981282736404]
We present TransVOD, the first end-to-end video object detection system based on spatial-temporal Transformer architectures. Our proposed TransVOD++ sets a new state-of-the-art record in terms of accuracy on ImageNet VID with 90.0% mAP. Our proposed TransVOD Lite also achieves the best speed and accuracy trade-off with 83.7% mAP while running at around 30 FPS.
arXiv Detail & Related papers (2022-01-13T16:17:34Z)
Towards High-Quality Temporal Action Detection with Sparse Proposals [14.923321325749196]
Temporal Action Detection aims to localize the temporal segments containing human action instances and predict the action categories. We introduce Sparse Proposals to interact with the hierarchical features. Experiments demonstrate the effectiveness of our method, especially under high tIoU thresholds.
arXiv Detail & Related papers (2021-09-18T06:15:19Z)
End-to-end Temporal Action Detection with Transformer [86.80289146697788]
Temporal action detection (TAD) aims to determine the semantic label and the boundaries of every action instance in an untrimmed video. Here, we construct an end-to-end framework for TAD upon Transformer, termed textitTadTR. Our method achieves state-of-the-art performance on HACS Segments and THUMOS14 and competitive performance on ActivityNet-1.3.
arXiv Detail & Related papers (2021-06-18T17:58:34Z)
End-to-End Object Detection with Transformers [88.06357745922716]
We present a new method that views object detection as a direct set prediction problem. Our approach streamlines the detection pipeline, effectively removing the need for many hand-designed components. The main ingredients of the new framework, called DEtection TRansformer or DETR, are a set-based global loss.
arXiv Detail & Related papers (2020-05-26T17:06:38Z)
Query Resolution for Conversational Search with Limited Supervision [63.131221660019776]
We propose QuReTeC (Query Resolution by Term Classification), a neural query resolution model based on bidirectional transformers. We show that QuReTeC outperforms state-of-the-art models, and furthermore, that our distant supervision method can be used to substantially reduce the amount of human-curated data required to train QuReTeC.
arXiv Detail & Related papers (2020-05-24T11:37:22Z)

This list is automatically generated from the titles and abstracts of the papers in this site.