Harnessing Temporal Causality for Advanced Temporal Action Detection
- URL: http://arxiv.org/abs/2407.17792v2
- Date: Fri, 26 Jul 2024 01:16:07 GMT
- Title: Harnessing Temporal Causality for Advanced Temporal Action Detection
- Authors: Shuming Liu, Lin Sui, Chen-Lin Zhang, Fangzhou Mu, Chen Zhao, Bernard Ghanem
- Abstract summary: We introduce CausalTAD, which combines causal attention and causal Mamba to achieve state-of-the-art performance on multiple benchmarks.
With CausalTAD, we ranked 1st in the Action Recognition, Action Detection, and Audio-Based Interaction Detection tracks at the EPIC-Kitchens Challenge 2024, and 1st in the Moment Queries track at the Ego4D Challenge 2024.
- Score: 53.654457142657236
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As a fundamental task in long-form video understanding, temporal action detection (TAD) aims to capture inherent temporal relations in untrimmed videos and identify candidate actions with precise boundaries. Over the years, various networks, including convolutions, graphs, and transformers, have been explored for effective temporal modeling for TAD. However, these modules typically treat past and future information equally, overlooking the crucial fact that changes in action boundaries are essentially causal events. Inspired by this insight, we propose leveraging the temporal causality of actions to enhance TAD representation by restricting the model's access to only past or future context. We introduce CausalTAD, which combines causal attention and causal Mamba to achieve state-of-the-art performance on multiple benchmarks. Notably, with CausalTAD, we ranked 1st in the Action Recognition, Action Detection, and Audio-Based Interaction Detection tracks at the EPIC-Kitchens Challenge 2024, as well as 1st in the Moment Queries track at the Ego4D Challenge 2024. Our code is available at https://github.com/sming256/OpenTAD/.
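The core idea, restricting temporal modeling to a single direction, reduces to masking the attention matrix. Below is a minimal PyTorch sketch of such a causal (past-only or future-only) self-attention step; the function name, tensor shapes, and the `direction` flag are illustrative assumptions, not the paper's actual OpenTAD implementation.

```python
import torch
import torch.nn.functional as F

def causal_attention(q, k, v, direction="past"):
    """Scaled dot-product attention restricted to one temporal direction.

    q, k, v: (batch, time, dim) snippet features from a video backbone.
    direction: "past" lets each timestep attend only to itself and earlier
    timesteps; "future" allows only itself and later timesteps.
    Hypothetical sketch: shapes and the direction flag are assumptions,
    not the paper's exact interface.
    """
    t = q.size(1)
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5  # (batch, time, time)
    keep = torch.tril(torch.ones(t, t, dtype=torch.bool, device=q.device))
    if direction == "future":
        keep = keep.T  # upper-triangular mask: future-only context
    scores = scores.masked_fill(~keep, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

# Toy usage: 2 clips, 8 timesteps, 64-d features.
x = torch.randn(2, 8, 64)
past_only = causal_attention(x, x, x, direction="past")
future_only = causal_attention(x, x, x, direction="future")
```

In the paper's framing, past-only and future-only branches can then be combined so the overall model still sees both directions while each branch remains causal; the same masking idea carries over to the causal Mamba blocks.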
Related papers
- Technical Report for ActivityNet Challenge 2022 -- Temporal Action Localization [20.268572246761895]
We propose to locate the temporal boundaries of each action and predict its class in untrimmed videos.
Faster-TAD simplifies the TAD pipeline and achieves remarkable performance.
arXiv Detail & Related papers (2024-10-31T14:16:56Z) - DeTra: A Unified Model for Object Detection and Trajectory Forecasting [68.85128937305697]
Our approach formulates the union of the two tasks as a trajectory refinement problem.
To tackle this unified task, we design a refinement transformer that infers the presence, pose, and multi-modal future behaviors of objects.
In our experiments, we observe that our model outperforms the state of the art on the Argoverse 2 Sensor and Waymo Open datasets.
arXiv Detail & Related papers (2024-06-06T18:12:04Z) - One-Stage Open-Vocabulary Temporal Action Detection Leveraging Temporal Multi-scale and Action Label Features [2.8266810371534152]
Open-vocabulary Temporal Action Detection (Open-vocab TAD) extends TAD to action categories beyond a fixed training vocabulary.
The proposed method achieves superior results compared with existing methods in both Open-vocab and Closed-vocab settings.
arXiv Detail & Related papers (2024-04-30T13:14:28Z) - JOADAA: joint online action detection and action anticipation [2.7792814152937027]
Action anticipation involves forecasting future actions by connecting past events to future ones.
Online action detection is the task of predicting actions in a streaming manner.
By combining action anticipation and online action detection, our approach can capture the otherwise missing dependencies on future information.
arXiv Detail & Related papers (2023-09-12T11:17:25Z) - Actor-identified Spatiotemporal Action Detection -- Detecting Who Is Doing What in Videos [29.5205455437899]
Temporal Action Detection (TAD) has been investigated for estimating the start and end time for each action in videos.
Spatiotemporal Action Detection (SAD) has been studied for localizing the action both spatially and temporally in videos.
We propose a novel task, Actor-identified Spatiotemporal Action Detection (ASAD), to bridge the gap between SAD and actor identification.
arXiv Detail & Related papers (2022-08-27T06:51:12Z) - Video Action Detection: Analysing Limitations and Challenges [70.01260415234127]
We analyze existing datasets on video action detection and discuss their limitations.
We perform a bias study that analyzes a key property differentiating videos from static images: the temporal aspect.
These extreme experiments show the existence of biases that have crept into existing methods in spite of careful modeling.
arXiv Detail & Related papers (2022-04-17T00:42:14Z) - SegTAD: Precise Temporal Action Detection via Semantic Segmentation [65.01826091117746]
We formulate the task of temporal action detection from a novel perspective of semantic segmentation.
Owing to the 1-dimensional property of TAD, we are able to convert the coarse-grained detection annotations to fine-grained semantic segmentation annotations for free (a toy version of this conversion is sketched after this list).
We propose an end-to-end framework SegTAD composed of a 1D semantic segmentation network (1D-SSN) and a proposal detection network (PDN).
arXiv Detail & Related papers (2022-03-03T06:52:13Z) - Exploring Motion and Appearance Information for Temporal Sentence Grounding [52.01687915910648]
We propose a Motion-Appearance Reasoning Network (MARN) to solve temporal sentence grounding.
We develop separate motion and appearance branches to learn motion-guided and appearance-guided object relations.
Our proposed MARN outperforms previous state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2021-12-09T09:00:05Z) - Progressive Attention on Multi-Level Dense Difference Maps for Generic Event Boundary Detection [35.16241630620967]
Generic event boundary detection is an important yet challenging task in video understanding.
This paper presents an effective, end-to-end learnable framework (DDM-Net) to tackle the diversity and complex semantics of event boundaries.
arXiv Detail & Related papers (2021-12-09T09:00:05Z) - End-to-end Temporal Action Detection with Transformer [86.80289146697788]
Temporal action detection (TAD) aims to determine the semantic label and the boundaries of every action instance in an untrimmed video.
Here, we construct an end-to-end framework for TAD upon Transformer, termed TadTR.
Our method achieves state-of-the-art performance on HACS Segments and THUMOS14 and competitive performance on ActivityNet-1.3.
arXiv Detail & Related papers (2021-06-18T17:58:34Z)
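As a note on the SegTAD entry above, the "for free" annotation conversion it mentions amounts to painting each labeled interval onto a 1-D frame axis. A minimal sketch follows; the (start, end, class) segment format is an assumption for illustration and may differ from SegTAD's actual data layout.

```python
import numpy as np

def intervals_to_frame_labels(segments, num_frames, bg=0):
    """Turn TAD segment annotations into per-frame segmentation labels.

    segments: iterable of (start_frame, end_frame, class_id) with class_id >= 1;
    this tuple format is an assumption for illustration, not SegTAD's schema.
    Returns a (num_frames,) array where unlabeled frames keep the background id.
    """
    labels = np.full(num_frames, bg, dtype=np.int64)
    for start, end, cls in segments:
        labels[max(0, start):min(num_frames, end + 1)] = cls  # end inclusive
    return labels

# Toy example: a 10-frame clip with one action (class 2) spanning frames 3-6.
print(intervals_to_frame_labels([(3, 6, 2)], num_frames=10))
# -> [0 0 0 2 2 2 2 0 0 0]
```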