JOADAA: joint online action detection and action anticipation
- URL: http://arxiv.org/abs/2309.06130v1
- Date: Tue, 12 Sep 2023 11:17:25 GMT
- Title: JOADAA: joint online action detection and action anticipation
- Authors: Mohammed Guermal, Francois Bremond, Rui Dai, Abid Ali
- Abstract summary: Action anticipation involves forecasting future actions by connecting past events to future ones.
Online action detection is the task of predicting actions in a streaming manner.
By combining action anticipation and online action detection, our approach can cover the missing dependencies of future information.
- Score: 2.7792814152937027
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Action anticipation involves forecasting future actions by connecting past
events to future ones. However, this reasoning ignores the real-life hierarchy
of events, which is considered to be composed of three main parts: past,
present, and future. We argue that considering these three main parts and their
dependencies could improve performance. On the other hand, online action
detection is the task of predicting actions in a streaming manner. In this
case, one has access only to the past and present information. Therefore, in
online action detection (OAD), the existing approaches miss semantics or future
information, which limits their performance. To sum up, for both of these tasks,
the complete set of knowledge (past-present-future) is missing, which makes it
challenging to infer action dependencies and therefore leads to lower performance. To
address this limitation, we propose to fuse both tasks into a single uniform
architecture. By combining action anticipation and online action detection, our
approach can cover the missing dependencies of future information in online
action detection. This method, referred to as JOADAA, presents a uniform model
that jointly performs action anticipation and online action detection. We
validate our proposed model on three challenging datasets: THUMOS'14, which is
a sparsely annotated dataset with one action per time step, CHARADES, and
Multi-THUMOS, two densely annotated datasets with more complex scenarios.
JOADAA achieves SOTA results on these benchmarks for both tasks.
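The abstract does not come with code, but the core fusion idea can be illustrated with a minimal PyTorch-style sketch, assuming a shared temporal encoder over the streamed features and two heads: one classifying the current (present) step for online detection and one scoring a short horizon of anticipated (future) steps, so the anticipation branch supplies the future context that online detection alone lacks. All names, dimensions, and the learned-query design below are illustrative assumptions, not the authors' JOADAA architecture.

```python
# Minimal sketch (not the authors' code): a shared encoder over streamed
# features with two heads, one for online detection of the present step and
# one for anticipating a short future horizon. Sizes are illustrative.
import torch
import torch.nn as nn


class JointDetectAnticipate(nn.Module):
    def __init__(self, feat_dim=2048, d_model=512, num_classes=65, horizon=8):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=3)
        # learned queries standing in for the yet-unseen future steps
        self.future_queries = nn.Parameter(torch.randn(horizon, d_model))
        self.detect_head = nn.Linear(d_model, num_classes)      # present step
        self.anticipate_head = nn.Linear(d_model, num_classes)  # future steps

    def forward(self, past_feats):
        # past_feats: (batch, T, feat_dim) -- frame features seen so far
        # (positional encodings omitted for brevity)
        x = self.proj(past_feats)
        b, T, _ = x.shape
        queries = self.future_queries.unsqueeze(0).expand(b, -1, -1)
        # joint encoding lets the inferred "future" tokens inform the present one
        h = self.encoder(torch.cat([x, queries], dim=1))
        present_logits = self.detect_head(h[:, T - 1])   # (batch, num_classes)
        future_logits = self.anticipate_head(h[:, T:])   # (batch, horizon, num_classes)
        return present_logits, future_logits
```

On densely annotated, multi-label benchmarks such as CHARADES and Multi-THUMOS, both heads would typically be trained with per-class sigmoid (binary cross-entropy) losses rather than a single softmax.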
Related papers
- About Time: Advances, Challenges, and Outlooks of Action Understanding [57.76390141287026]
This survey comprehensively reviews advances in uni- and multi-modal action understanding across a range of tasks.
We focus on prevalent challenges, overview widely adopted datasets, and survey seminal works with an emphasis on recent advances.
arXiv Detail & Related papers (2024-11-22T18:09:27Z) - From Recognition to Prediction: Leveraging Sequence Reasoning for Action Anticipation [30.161471749050833]
We propose a novel end-to-end video modeling architecture that utilizes attention mechanisms, named Anticipation via Recognition and Reasoning (ARR).
ARR decomposes the action anticipation task into action recognition and reasoning tasks, and effectively learns the statistical relationship between actions by next action prediction (NAP).
In addition, to address the challenge of relationship modeling that requires extensive training data, we propose an innovative approach for the unsupervised pre-training of the decoder.
arXiv Detail & Related papers (2024-08-05T18:38:29Z) - Harnessing Temporal Causality for Advanced Temporal Action Detection [53.654457142657236]
We introduce CausalTAD, which combines causal attention and causal Mamba to achieve state-of-the-art performance on benchmarks.
We ranked 1st in the Action Recognition, Action Detection, and Audio-Based Interaction Detection tracks at the EPIC-Kitchens Challenge 2024, and 1st in the Moment Queries track at the Ego4D Challenge 2024.
arXiv Detail & Related papers (2024-07-25T06:03:02Z) - DeTra: A Unified Model for Object Detection and Trajectory Forecasting [68.85128937305697]
Our approach formulates the union of the two tasks as a trajectory refinement problem.
To tackle this unified task, we design a refinement transformer that infers the presence, pose, and multi-modal future behaviors of objects.
In our experiments, we observe that our model outperforms the state-of-the-art on the Argoverse 2 Sensor and Open datasets.
arXiv Detail & Related papers (2024-06-06T18:12:04Z) - Deciphering Movement: Unified Trajectory Generation Model for Multi-Agent [53.637837706712794]
We propose a Unified Trajectory Generation model, UniTraj, that processes arbitrary trajectories as masked inputs.
Specifically, we introduce a Ghost Spatial Masking (GSM) module embedded within a Transformer encoder for spatial feature extraction.
We benchmark three practical sports game datasets, Basketball-U, Football-U, and Soccer-U, for evaluation.
arXiv Detail & Related papers (2024-05-27T22:15:23Z) - PALM: Predicting Actions through Language Models [74.10147822693791]
We introduce PALM, an approach that tackles the task of long-term action anticipation.
Our method incorporates an action recognition model to track previous action sequences and a vision-language model to articulate relevant environmental details.
Our experimental results demonstrate that PALM surpasses the state-of-the-art methods in the task of long-term action anticipation.
arXiv Detail & Related papers (2023-11-29T02:17:27Z) - Actor-identified Spatiotemporal Action Detection -- Detecting Who Is Doing What in Videos [29.5205455437899]
Temporal Action Detection (TAD) has been investigated for estimating the start and end time for each action in videos.
Spatiotemporal Action Detection (SAD) has been studied for localizing the action both spatially and temporally in videos.
We propose a novel task, Actor-identified Spatiotemporal Action Detection (ASAD), to bridge the gap between SAD and actor identification.
arXiv Detail & Related papers (2022-08-27T06:51:12Z) - You Mostly Walk Alone: Analyzing Feature Attribution in Trajectory Prediction [52.442129609979794]
Recent deep learning approaches for trajectory prediction show promising performance.
It remains unclear which features such black-box models actually learn to use for making predictions.
This paper proposes a procedure that quantifies the contributions of different cues to model performance.
arXiv Detail & Related papers (2021-10-11T14:24:15Z) - Rolling-Unrolling LSTMs for Action Anticipation from First-Person Video [27.391434284586985]
Rolling-Unrolling LSTM is a learning architecture for anticipating actions from egocentric videos (a rough sketch of the rolling/unrolling idea follows this list).
The proposed approach is validated on EPIC-Kitchens, EGTEA Gaze+ and ActivityNet.
arXiv Detail & Related papers (2020-05-04T14:13:41Z)
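For contrast with the transformer-style sketch above, the rolling-unrolling idea from the last entry can be written as two recurrent stages: a "rolling" LSTM summarizes the observed features, and an "unrolling" LSTM, initialized from that summary, steps forward a few times without new observations to score anticipated actions. This is a hedged illustration of the general scheme rather than the original implementation; feature size, hidden size, horizon, and class count are assumptions.

```python
# Illustrative rolling/unrolling anticipator (not the original implementation;
# all sizes are assumptions).
import torch
import torch.nn as nn


class RollingUnrolling(nn.Module):
    def __init__(self, feat_dim=1024, hidden=512, num_classes=100, horizon=4):
        super().__init__()
        self.rolling = nn.LSTMCell(feat_dim, hidden)    # encodes the observed past
        self.unrolling = nn.LSTMCell(feat_dim, hidden)  # steps into the future
        self.classifier = nn.Linear(hidden, num_classes)
        self.horizon = horizon

    def forward(self, obs_feats):
        # obs_feats: (batch, T, feat_dim) -- features of the observed video
        b, T, _ = obs_feats.shape
        h = obs_feats.new_zeros(b, self.rolling.hidden_size)
        c = obs_feats.new_zeros(b, self.rolling.hidden_size)
        for t in range(T):                     # "rolling": summarize what was seen
            h, c = self.rolling(obs_feats[:, t], (h, c))
        preds, last = [], obs_feats[:, -1]
        for _ in range(self.horizon):          # "unrolling": anticipate ahead
            h, c = self.unrolling(last, (h, c))
            preds.append(self.classifier(h))
        return torch.stack(preds, dim=1)       # (batch, horizon, num_classes)
```

The published model also fuses multiple input modalities, which this sketch omits.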