Hear Me Out: Fusional Approaches for Audio Augmented Temporal Action
Localization
- URL: http://arxiv.org/abs/2106.14118v1
- Date: Sun, 27 Jun 2021 00:49:02 GMT
- Title: Hear Me Out: Fusional Approaches for Audio Augmented Temporal Action
Localization
- Authors: Anurag Bagchi, Jazib Mahmood, Dolton Fernandes, Ravi Kiran
Sarvadevabhatla
- Abstract summary: We propose simple but effective fusion-based approaches for TAL.
We experimentally show that our schemes consistently improve performance for state of the art video-only TAL approaches.
- Score: 7.577219401804674
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: State of the art architectures for untrimmed video Temporal Action
Localization (TAL) have only considered RGB and Flow modalities, leaving the
information-rich audio modality totally unexploited. Audio fusion has been
explored for the related but arguably easier problem of trimmed (clip-level)
action recognition. However, TAL poses a unique set of challenges. In this
paper, we propose simple but effective fusion-based approaches for TAL. To the
best of our knowledge, our work is the first to jointly consider audio and
video modalities for supervised TAL. We experimentally show that our schemes
consistently improve performance for state of the art video-only TAL
approaches. Specifically, they help achieve new state of the art performance on
large-scale benchmark datasets - ActivityNet-1.3 (52.73 mAP@0.5) and THUMOS14
(57.18 mAP@0.5). Our experiments include ablations involving multiple fusion
schemes, modality combinations and TAL architectures. Our code, models and
associated data will be made available.
Related papers
- Efficient Audio-Visual Fusion for Video Classification [6.106447284305316]
We present Attend-Fusion, a novel and efficient approach for audio-visual fusion in video classification tasks.
Our method addresses the challenge of exploiting both audio and visual modalities while maintaining a compact model architecture.
arXiv Detail & Related papers (2024-11-08T14:47:28Z)
- The Solution for Temporal Action Localisation Task of Perception Test Challenge 2024 [27.30100635072298]
TAL focuses on identifying and classifying actions within specific time intervals throughout a video sequence.
We employ a data augmentation technique by expanding the training dataset using overlapping labels from the Something-SomethingV2 dataset.
For feature extraction, we utilize state-of-the-art models, including UMT, VideoMAEv2 for video features, and BEATs and CAV-MAE for audio features.
arXiv Detail & Related papers (2024-10-08T01:07:21Z)
- Centre Stage: Centricity-based Audio-Visual Temporal Action Detection [26.42447737005981]
We explore strategies to incorporate the audio modality, using multi-scale cross-attention to fuse the two modalities.
We propose a novel network head to estimate the closeness of timesteps to the action centre, which we call the centricity score.
arXiv Detail & Related papers (2023-11-28T03:02:00Z)
- ONE-PEACE: Exploring One General Representation Model Toward Unlimited
Modalities [71.15303690248021]
We release ONE-PEACE, a highly extensible model with 4B parameters that can seamlessly align and integrate representations across vision, audio, and language modalities.
The architecture of ONE-PEACE comprises modality adapters, shared self-attention layers, and modality FFNs.
With the scaling-friendly architecture and pretraining tasks, ONE-PEACE has the potential to expand to unlimited modalities.
arXiv Detail & Related papers (2023-05-18T17:59:06Z)
- Learnable Irrelevant Modality Dropout for Multimodal Action Recognition
on Modality-Specific Annotated Videos [10.478479158063982]
We propose a novel framework to effectively leverage the audio modality in vision-specific annotated videos for action recognition.
We build a semantic audio-video label dictionary (SAVLD) that maps each video label to its most K-relevant audio labels.
We also present a new two-stream video Transformer for efficiently modeling the visual modalities.
arXiv Detail & Related papers (2022-03-06T17:31:06Z)
- OWL (Observe, Watch, Listen): Localizing Actions in Egocentric Video via
Audiovisual Temporal Context [58.932717614439916]
We take a deep look into the effectiveness of audio in detecting actions in egocentric videos.
We propose a transformer-based model to incorporate temporal audio-visual context.
Our approach achieves state-of-the-art performance on EPIC-KITCHENS-100.
arXiv Detail & Related papers (2022-02-10T10:50:52Z)
- With a Little Help from my Temporal Context: Multimodal Egocentric
Action Recognition [95.99542238790038]
We propose a method that learns to attend to surrounding actions in order to improve recognition performance.
To incorporate the temporal context, we propose a transformer-based multimodal model that ingests video and audio as input modalities.
We test our approach on EPIC-KITCHENS and EGTEA datasets reporting state-of-the-art performance.
arXiv Detail & Related papers (2021-11-01T15:27:35Z)
- EAN: Event Adaptive Network for Enhanced Action Recognition [66.81780707955852]
We propose a unified action recognition framework to investigate the dynamic nature of video content.
First, when extracting local cues, we generate the spatial-temporal kernels of dynamic-scale to adaptively fit the diverse events.
Second, to accurately aggregate these cues into a global video representation, we propose to mine the interactions only among a few selected foreground objects by a Transformer.
arXiv Detail & Related papers (2021-07-22T15:57:18Z)
- Attention Bottlenecks for Multimodal Fusion [90.75885715478054]
Machine perception models are typically modality-specific and optimised for unimodal benchmarks.
We introduce a novel transformer-based architecture that uses 'fusion bottlenecks' for modality fusion at multiple layers.
We conduct thorough ablation studies, and achieve state-of-the-art results on multiple audio-visual classification benchmarks.
arXiv Detail & Related papers (2021-06-30T22:44:12Z)
- Mutual Modality Learning for Video Action Classification [74.83718206963579]
We show how to embed multi-modality into a single model for video action classification.
We achieve state-of-the-art results in the Something-Something-v2 benchmark.
arXiv Detail & Related papers (2020-11-04T21:20:08Z)