The Solution for Temporal Action Localisation Task of Perception Test Challenge 2024
- URL: http://arxiv.org/abs/2410.09088v1
- Date: Tue, 8 Oct 2024 01:07:21 GMT
- Title: The Solution for Temporal Action Localisation Task of Perception Test Challenge 2024
- Authors: Yinan Han, Qingyuan Jiang, Hongming Mei, Yang Yang, Jinhui Tang
- Abstract summary: TAL focuses on identifying and classifying actions within specific time intervals throughout a video sequence.
We employ a data augmentation technique by expanding the training dataset using overlapping labels from the Something-SomethingV2 dataset.
For feature extraction, we utilize state-of-the-art models, including UMT and VideoMAEv2 for video features, and BEATs and CAV-MAE for audio features.
- Score: 27.30100635072298
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This report presents our method for Temporal Action Localisation (TAL), which focuses on identifying and classifying actions within specific time intervals throughout a video sequence. We employ a data augmentation technique by expanding the training dataset using overlapping labels from the Something-SomethingV2 dataset, enhancing the model's ability to generalize across various action classes. For feature extraction, we utilize state-of-the-art models, including UMT and VideoMAEv2 for video features, and BEATs and CAV-MAE for audio features. Our approach involves training both multimodal (video and audio) and unimodal (video only) models, followed by combining their predictions using the Weighted Box Fusion (WBF) method. This fusion strategy ensures robust action localisation. Our overall approach achieves a score of 0.5498, securing first place in the competition.
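The abstract gives no implementation details for the WBF step, so below is a minimal sketch of how Weighted Box Fusion might be adapted to 1D temporal segments: detections from all models are clustered by temporal IoU within the same action class, and each cluster's boundaries are averaged weighted by confidence. The `(start, end, score, label)` tuple format, the per-model weights, and the 0.55 IoU threshold are illustrative assumptions, not values from the report.

```python
def iou_1d(a, b):
    """Temporal IoU between two (start, end, ...) segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def wbf_1d(model_preds, model_weights=None, iou_thr=0.55):
    """Fuse temporal detections from several models (1D analogue of WBF).

    model_preds: one list per model of (start, end, score, label) tuples.
    Returns a list of fused (start, end, score, label) tuples.
    """
    if model_weights is None:
        model_weights = [1.0] * len(model_preds)
    # Flatten all detections, weighting each score by its model's weight.
    dets = [(s, e, sc * w, lb)
            for preds, w in zip(model_preds, model_weights)
            for s, e, sc, lb in preds]
    dets.sort(key=lambda d: d[2], reverse=True)

    clusters, fused = [], []
    for d in dets:
        for i, f in enumerate(fused):
            if f[3] == d[3] and iou_1d(f, d) >= iou_thr:
                clusters[i].append(d)
                tot = sum(x[2] for x in clusters[i])
                # Confidence-weighted average of boundaries; mean score.
                fused[i] = (sum(x[0] * x[2] for x in clusters[i]) / tot,
                            sum(x[1] * x[2] for x in clusters[i]) / tot,
                            tot / len(clusters[i]), d[3])
                break
        else:
            clusters.append([d])
            fused.append(d)
    return fused
```

Unlike NMS, WBF merges overlapping detections instead of suppressing them, which is what makes it well suited to fusing heterogeneous (multimodal and unimodal) models.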
Related papers
- JARViS: Detecting Actions in Video Using Unified Actor-Scene Context Relation Modeling [8.463489896549161]
Two-stage video action detection (VAD) is a formidable task that involves the localization and classification of actions within the spatial and temporal dimensions of a video clip.
We propose a two-stage VAD framework called Joint Actor-scene context Relation modeling (JARViS).
JARViS consolidates cross-modal action semantics distributed globally across spatial and temporal dimensions using Transformer attention.
arXiv Detail & Related papers (2024-08-07T08:08:08Z)
- Open-Vocabulary Spatio-Temporal Action Detection [59.91046192096296]
Open-vocabulary spatio-temporal action detection (OV-STAD) is an important fine-grained video understanding task.
OV-STAD requires training a model on a limited set of base classes with box and label supervision.
To better adapt the holistic VLM for the fine-grained action detection task, we carefully fine-tune it on the localized video region-text pairs.
arXiv Detail & Related papers (2024-05-17T14:52:47Z)
- Multimodal Fusion Method with Spatiotemporal Sequences and Relationship Learning for Valence-Arousal Estimation [9.93719767430551]
This paper presents our approach to the VA (Valence-Arousal) estimation task in the ABAW6 competition.
We devised a comprehensive model by preprocessing video frames and audio segments to extract visual and audio features.
We employed a Transformer encoder structure to learn long-range dependencies, thereby enhancing the model's performance and generalization ability.
arXiv Detail & Related papers (2024-03-19T04:25:54Z)
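As a rough, generic illustration of the Transformer-encoder idea in the valence-arousal entry above, the sketch below encodes a sequence of fused audio-visual features and regresses per-frame valence and arousal. The feature dimension, layer counts, and tanh output range are assumptions for illustration, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

class VATransformer(nn.Module):
    """Sketch: Transformer encoder over fused audio-visual features
    for valence-arousal regression. Input: (batch, seq_len, feat_dim)."""

    def __init__(self, feat_dim=512, n_heads=8, n_layers=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(feat_dim, 2)   # valence, arousal

    def forward(self, x):
        h = self.encoder(x)                  # long-range temporal context
        return torch.tanh(self.head(h))      # per-frame VA in [-1, 1]

preds = VATransformer()(torch.randn(2, 100, 512))  # shape: (2, 100, 2)
```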
- Adversarial Augmentation Training Makes Action Recognition Models More Robust to Realistic Video Distribution Shifts [13.752169303624147]
Action recognition models often lack robustness when faced with natural distribution shifts between training and test data.
We propose two novel evaluation methods to assess model resilience to such distribution disparity.
We experimentally demonstrate the superior performance of the proposed adversarial augmentation approach over baselines across three state-of-the-art action recognition models.
arXiv Detail & Related papers (2024-01-21T05:50:39Z)
- Diffusion Action Segmentation [63.061058214427085]
We propose a novel framework via denoising diffusion models, which shares the same inherent spirit of such iterative refinement.
In this framework, action predictions are iteratively generated from random noise with input video features as conditions.
arXiv Detail & Related papers (2023-03-31T10:53:24Z)
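The diffusion entry above describes action predictions being iteratively refined from random noise conditioned on video features. A minimal, generic sketch of such a sampling loop follows; the `model(x, feats, t)` interface and the simplified deterministic update are placeholders, not the paper's actual sampler.

```python
import torch

@torch.no_grad()
def diffusion_segment(model, video_feats, num_classes, steps=25):
    """Sketch of iterative refinement for action segmentation: per-frame
    action logits are denoised from pure noise, conditioned on video
    features. `model(x, feats, t)` predicting clean logits is a
    hypothetical interface."""
    T = video_feats.shape[0]
    x = torch.randn(T, num_classes)            # start from random noise
    for t in reversed(range(1, steps + 1)):
        x0_hat = model(x, video_feats, t)      # estimate of clean labels
        # Blend the noisy state toward the estimate (a DDIM-like
        # deterministic step, simplified for illustration).
        w = (t - 1) / steps
        x = w * x + (1 - w) * x0_hat
    return x.argmax(dim=-1)                    # per-frame class indices
```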
- Learning from Temporal Spatial Cubism for Cross-Dataset Skeleton-based Action Recognition [88.34182299496074]
Action labels are only available on a source dataset, but unavailable on a target dataset in the training stage.
We utilize a self-supervision scheme to reduce the domain shift between two skeleton-based action datasets.
By segmenting and permuting temporal segments or human body parts, we design two self-supervised learning classification tasks.
arXiv Detail & Related papers (2022-07-17T07:05:39Z)
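For the skeleton entry above, a toy version of the segment-permutation pretext task might look like the following: a sequence is cut into temporal chunks, shuffled, and the index of the applied permutation becomes a self-supervised classification label. The three-segment setup and tensor shapes are assumptions for illustration.

```python
import itertools
import random
import torch

# Pretext task: shuffle a sequence's temporal segments and ask the model
# to classify which permutation was applied (3 segments -> 6 classes).
PERMS = list(itertools.permutations(range(3)))

def make_permutation_sample(seq):
    """seq: (T, J, C) skeleton sequence (frames, joints, channels).
    Returns the permuted sequence and the permutation's class index."""
    T = seq.shape[0]
    segs = torch.chunk(seq[: T - T % 3], 3)   # three equal temporal chunks
    label = random.randrange(len(PERMS))
    shuffled = torch.cat([segs[i] for i in PERMS[label]])
    return shuffled, label
```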
- End-to-End Semi-Supervised Learning for Video Action Detection [23.042410033982193]
We propose a simple end-to-end approach that effectively utilizes the unlabeled data.
Video action detection requires both action class prediction and spatio-temporal consistency.
We demonstrate the effectiveness of the proposed approach on two different action detection benchmark datasets.
arXiv Detail & Related papers (2022-03-08T18:11:25Z)
- Learning Salient Boundary Feature for Anchor-free Temporal Action Localization [81.55295042558409]
Temporal action localization is an important yet challenging task in video understanding.
We propose the first purely anchor-free temporal localization method.
Our model includes (i) an end-to-end trainable basic predictor, (ii) a saliency-based refinement module, and (iii) several consistency constraints.
arXiv Detail & Related papers (2021-03-24T12:28:32Z)
- Learning Modality Interaction for Temporal Sentence Localization and Event Captioning in Videos [76.21297023629589]
We propose a novel method for learning pairwise modality interactions in order to better exploit complementary information for each pair of modalities in videos.
Our method turns out to achieve state-of-the-art performances on four standard benchmark datasets.
arXiv Detail & Related papers (2020-07-28T12:40:59Z)
- Temporal Fusion Network for Temporal Action Localization: Submission to ActivityNet Challenge 2020 (Task E) [45.3218136336925]
This report analyzes a temporal action localization method we used in the HACS competition, which is hosted in the ActivityNet Challenge 2020.
The goal of the task is to locate the start and end times of actions in untrimmed videos and to predict the action category.
By fusing the results of multiple models, our method obtains 40.55% on the validation set and 40.53% on the test set in terms of mAP, and achieves Rank 1 in this challenge.
arXiv Detail & Related papers (2020-06-13T00:33:00Z)
- Fast Template Matching and Update for Video Object Tracking and Segmentation [56.465510428878]
The main task we aim to tackle is the multi-instance semi-supervised video object segmentation across a sequence of frames.
The challenges lie in selecting a matching method to predict the result and in deciding whether to update the target template.
We propose a novel approach which utilizes reinforcement learning to make these two decisions at the same time.
arXiv Detail & Related papers (2020-04-16T08:58:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.