Video Action Detection: Analysing Limitations and Challenges
- URL: http://arxiv.org/abs/2204.07892v1
- Date: Sun, 17 Apr 2022 00:42:14 GMT
- Title: Video Action Detection: Analysing Limitations and Challenges
- Authors: Rajat Modi, Aayush Jung Rana, Akash Kumar, Praveen Tirupattur, Shruti
Vyas, Yogesh Singh Rawat, Mubarak Shah
- Abstract summary: We analyze existing datasets on video action detection and discuss their limitations.
We perform a bias study which analyzes a key property differentiating videos from static images: the temporal aspect.
Such extreme experiments show the existence of biases which have crept into existing methods in spite of careful modeling.
- Score: 70.01260415234127
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Beyond being large enough to feed data-hungry machines (e.g.,
transformers), what attributes measure the quality of a dataset? Assuming that
such attributes can be defined, how do we quantify their relative presence? Our
work explores these questions for video action detection. The task aims to
spatio-temporally localize an actor and assign a relevant action class. We
first analyze the existing datasets on video action detection and discuss
their limitations. Next, we propose a new dataset, Multi Actor Multi Action
(MAMA), which overcomes these limitations and is more suitable for real-world
applications. In addition, we perform a bias study which analyzes a key
property differentiating videos from static images: the temporal aspect. This
reveals whether the actions in these datasets really need the motion
information of an actor, or whether an action's occurrence can be predicted
even from a single frame. Finally, we investigate the widely held assumptions
on the importance of temporal ordering: is temporal ordering important for
detecting these actions? Such extreme experiments show the existence of biases
which have crept into existing methods in spite of careful modeling.
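The single-frame and frame-shuffling probes described above can be made concrete with a short sketch. The snippet below (a minimal PyTorch illustration of the general idea, not the paper's exact protocol) builds the three input variants for one clip; feeding each variant to a frozen detector and comparing the scores indicates how much the model actually relies on motion and on temporal ordering:

```python
import torch

def bias_probes(clip: torch.Tensor, seed: int = 0) -> dict:
    """Build temporal-bias probe inputs from a clip of shape (T, C, H, W).

    - "original":     the clip as-is (motion and ordering intact)
    - "shuffled":     frames randomly permuted (content kept, ordering destroyed)
    - "single_frame": the middle frame repeated T times (no motion at all)
    """
    g = torch.Generator().manual_seed(seed)
    t = clip.shape[0]
    return {
        "original": clip,
        "shuffled": clip[torch.randperm(t, generator=g)],
        "single_frame": clip[t // 2].unsqueeze(0).expand(t, -1, -1, -1),
    }

# Usage with a dummy 16-frame RGB clip:
probes = bias_probes(torch.rand(16, 3, 224, 224))
# If a detector's scores barely change across the three variants, the actions
# can be recognized without motion information, i.e. a static appearance bias.
```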
Related papers
- Harnessing Temporal Causality for Advanced Temporal Action Detection [53.654457142657236]
We introduce CausalTAD, which combines causal attention and causal Mamba to achieve state-of-the-art performance on benchmarks.
We ranked 1st in the Action Recognition, Action Detection, and Audio-Based Interaction Detection tracks at the EPIC-Kitchens Challenge 2024, and 1st in the Moment Queries track at the Ego4D Challenge 2024.
arXiv Detail & Related papers (2024-07-25T06:03:02Z) - DeTra: A Unified Model for Object Detection and Trajectory Forecasting [68.85128937305697]
Our approach formulates the union of the two tasks as a trajectory refinement problem.
To tackle this unified task, we design a refinement transformer that infers the presence, pose, and multi-modal future behaviors of objects.
In our experiments, we observe that our model outperforms the state of the art on the Argoverse 2 Sensor and Waymo Open datasets.
arXiv Detail & Related papers (2024-06-06T18:12:04Z) - Boundary-Denoising for Video Activity Localization [57.9973253014712]
We study the video activity localization problem from a denoising perspective.
Specifically, we propose an encoder-decoder model named DenoiseLoc.
Experiments show that DenoiseLoc advances performance on several video activity understanding tasks.
arXiv Detail & Related papers (2023-04-06T08:48:01Z) - A Multi-Person Video Dataset Annotation Method of Spatio-Temporally
Actions [4.49302950538123]
We use ffmpeg to crop the videos and split them into frames; we then use YOLOv5 to detect the humans in each frame, and DeepSORT to track each person's ID across frames.
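For illustration, the described crop, detect, and track pipeline might look roughly like the sketch below, assuming OpenCV for frame access, the ultralytics/yolov5 torch.hub model, and the deep-sort-realtime package; the paper's exact tooling and settings are not specified here:

```python
# Hypothetical sketch of a crop -> detect -> track annotation pipeline.
import cv2
import torch
from deep_sort_realtime.deepsort_tracker import DeepSort

model = torch.hub.load("ultralytics/yolov5", "yolov5s")  # pretrained COCO detector
tracker = DeepSort(max_age=30)

cap = cv2.VideoCapture("clip.mp4")  # e.g., a segment pre-cropped with ffmpeg
while True:
    ok, frame = cap.read()
    if not ok:
        break
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)  # yolov5 expects RGB input
    detections = [
        ([x1, y1, x2 - x1, y2 - y1], conf, "person")  # (ltwh box, score, class)
        for x1, y1, x2, y2, conf, cls in model(rgb).xyxy[0].tolist()
        if int(cls) == 0  # COCO class 0 = "person"
    ]
    # DeepSORT assigns a persistent ID to each detected person across frames.
    for track in tracker.update_tracks(detections, frame=frame):
        if track.is_confirmed():
            print(track.track_id, track.to_ltrb())  # ID + (l, t, r, b) box
cap.release()
```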
arXiv Detail & Related papers (2022-04-21T15:14:02Z) - Sequence-to-Sequence Modeling for Action Identification at High Temporal
Resolution [9.902223920743872]
We introduce a new action-recognition benchmark that includes subtle short-duration actions labeled at a high temporal resolution.
We show that current state-of-the-art models based on segmentation produce noisy predictions when applied to these data.
We propose a novel approach for high-resolution action identification, inspired by speech-recognition techniques.
arXiv Detail & Related papers (2021-11-03T21:06:36Z) - Spot What Matters: Learning Context Using Graph Convolutional Networks
for Weakly-Supervised Action Detection [0.0]
We introduce an architecture based on self-attention and Graph Convolutional Networks to improve human action detection in video.
Our model aids explainability by visualizing the learned context as an attention map, even for actions and objects unseen during training.
Experimental results show that our contextualized approach outperforms a baseline action detection approach by more than 2 points in Video-mAP.
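As background, Video-mAP scores a detection by the spatio-temporal overlap between predicted and ground-truth action tubes. Below is a minimal sketch of one commonly used tube-IoU definition (temporal IoU multiplied by the mean spatial IoU over the overlapping frames); this is a generic formulation, not necessarily the exact variant used in the paper:

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def tube_iou(tube_a, tube_b):
    """Spatio-temporal IoU between two action tubes (dicts: frame index -> box)."""
    overlap = set(tube_a) & set(tube_b)
    if not overlap:
        return 0.0
    t_iou = len(overlap) / len(set(tube_a) | set(tube_b))  # temporal IoU
    s_iou = sum(box_iou(tube_a[f], tube_b[f]) for f in overlap) / len(overlap)
    return t_iou * s_iou  # a detection counts as correct if this passes a threshold
```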
arXiv Detail & Related papers (2021-07-28T21:37:18Z) - FineAction: A Fine-Grained Video Dataset for Temporal Action Localization [60.90129329728657]
FineAction is a new large-scale fine-grained video dataset collected from existing video datasets and web videos.
This dataset contains 139K fine-grained action instances densely annotated in almost 17K untrimmed videos spanning 106 action categories.
Experimental results reveal that FineAction brings new challenges for action localization on fine-grained and multi-label instances with shorter duration.
arXiv Detail & Related papers (2021-05-24T06:06:32Z) - Activity Graph Transformer for Temporal Action Localization [41.69734359113706]
We introduce Activity Graph Transformer, an end-to-end learnable model for temporal action localization.
In this work, we capture the non-linear temporal structure of activities by reasoning over videos as non-sequential entities in the form of graphs.
Our results show that our proposed model outperforms the state-of-the-art by a considerable margin.
arXiv Detail & Related papers (2021-01-21T10:42:48Z) - Frame-wise Cross-modal Matching for Video Moment Retrieval [32.68921139236391]
Video moment retrieval aims to retrieve a moment in a video for a given language query.
The challenges of this task include 1) the requirement of localizing the relevant moment in an untrimmed video, and 2) bridging the semantic gap between textual query and video contents.
We propose an Attentive Cross-modal Relevance Matching model which predicts the temporal boundaries based on interaction modeling between the video and the query.
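As a generic illustration of frame-wise cross-modal matching (a hypothetical sketch of the overall idea, not the ACRM architecture itself), each frame feature can be scored against a query embedding and a moment read off the resulting relevance curve:

```python
import torch
import torch.nn.functional as F

def frame_relevance(frame_feats: torch.Tensor, query_feat: torch.Tensor):
    """Score each frame against the language query.

    frame_feats: (T, D) per-frame video features; query_feat: (D,) query embedding.
    Returns per-frame relevance scores in [0, 1].
    """
    sim = F.cosine_similarity(frame_feats, query_feat.unsqueeze(0), dim=-1)  # (T,)
    return (sim + 1) / 2  # map cosine range [-1, 1] into [0, 1]

def predict_moment(relevance: torch.Tensor, threshold: float = 0.5):
    """Read a (start, end) frame span off the relevance curve by thresholding."""
    hits = (relevance >= threshold).nonzero().flatten()
    if hits.numel() == 0:
        return None  # no frame deemed relevant to the query
    return int(hits.min()), int(hits.max())
```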
arXiv Detail & Related papers (2020-09-22T10:25:41Z) - Gabriella: An Online System for Real-Time Activity Detection in
Untrimmed Security Videos [72.50607929306058]
We propose a real-time online system to perform activity detection on untrimmed security videos.
The proposed method consists of three stages: tubelet extraction, activity classification and online tubelet merging.
We demonstrate the effectiveness of the proposed approach in terms of speed (100 fps) and performance, achieving state-of-the-art results.
arXiv Detail & Related papers (2020-04-23T22:20:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.