Investigating Modality Bias in Audio Visual Video Parsing
- URL: http://arxiv.org/abs/2203.16860v1
- Date: Thu, 31 Mar 2022 07:43:01 GMT
- Title: Investigating Modality Bias in Audio Visual Video Parsing
- Authors: Piyush Singh Pasi, Shubham Nemani, Preethi Jyothi, Ganesh Ramakrishnan
- Abstract summary: We focus on the audio-visual video parsing (AVVP) problem that involves detecting audio and visual event labels with temporal boundaries.
An existing state-of-the-art model for AVVP uses a hybrid attention network (HAN) to generate cross-modal features for both audio and visual modalities.
We propose a variant of feature aggregation in HAN that leads to an absolute gain in F-scores of about 2% and 1.6% for visual and audio-visual events at both segment-level and event-level.
- Score: 31.83076679253096
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We focus on the audio-visual video parsing (AVVP) problem that involves
detecting audio and visual event labels with temporal boundaries. The task is
especially challenging since it is weakly supervised with only event labels
available as a bag of labels for each video. An existing state-of-the-art model
for AVVP uses a hybrid attention network (HAN) to generate cross-modal features
for both audio and visual modalities, and an attentive pooling module that
aggregates predicted audio and visual segment-level event probabilities to
yield video-level event probabilities. We provide a detailed analysis of
modality bias in the existing HAN architecture, where a modality is completely
ignored during prediction. We also propose a variant of feature aggregation in
HAN that leads to an absolute gain in F-scores of about 2% and 1.6% for visual
and audio-visual events at both segment-level and event-level, in comparison to
the existing HAN model.
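As a rough illustration of the architecture the abstract describes, the sketch below combines HAN-style feature aggregation (residual self-attention plus cross-modal attention) with an attentive pooling step that turns segment-level event probabilities into video-level ones. Layer sizes, tensor shapes, and module names are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of HAN-style cross-modal aggregation and attentive pooling.
# Dimensions, class count, and gating layers are illustrative assumptions.
import torch
import torch.nn as nn

class HybridAttentionSketch(nn.Module):
    def __init__(self, d_model=512, n_heads=4, n_classes=25):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.classifier = nn.Linear(d_model, n_classes)
        self.temporal_gate = nn.Linear(d_model, n_classes)  # attention over segments
        self.modality_gate = nn.Linear(d_model, n_classes)  # attention over audio/visual

    def aggregate(self, x, other):
        # Baseline-style aggregation: residual + self-attention + cross-modal attention.
        return x + self.self_attn(x, x, x)[0] + self.cross_attn(x, other, other)[0]

    def forward(self, audio, visual):
        # audio, visual: (batch, T segments, d_model) features
        a = self.aggregate(audio, visual)
        v = self.aggregate(visual, audio)
        feats = torch.stack([a, v], dim=2)                # (B, T, 2, d)
        seg_prob = torch.sigmoid(self.classifier(feats))  # segment-level event probabilities
        # Attentive pooling: per-class softmax over time and over modality.
        w_t = torch.softmax(self.temporal_gate(feats), dim=1)
        w_m = torch.softmax(self.modality_gate(feats), dim=2)
        video_prob = (w_t * w_m * seg_prob).sum(dim=(1, 2))  # (B, n_classes)
        return seg_prob, video_prob
```

In a sketch like this, the modality bias discussed in the paper can surface through the modality attention weights: if one stream is consistently down-weighted, that modality is effectively ignored during prediction.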
Related papers
- Label-anticipated Event Disentanglement for Audio-Visual Video Parsing [61.08434062821899]
We introduce a new decoding paradigm, label semantic-based projection (LEAP).
LEAP works by iteratively projecting encoded latent features of audio/visual segments onto semantically independent label embeddings.
To facilitate the LEAP paradigm, we propose a semantic-aware optimization strategy, which includes a novel audio-visual semantic similarity loss function.
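A minimal sketch of the projection idea described above, assuming per-class label embeddings and a simple similarity-based projection; the audio-visual semantic similarity loss shown here is only one plausible instantiation, not the paper's exact formulation.

```python
# Illustrative sketch of projecting segment features onto label embeddings (LEAP-style).
import torch
import torch.nn.functional as F

def project_onto_labels(segment_feats, label_emb):
    """segment_feats: (B, T, d); label_emb: (C, d) learnable per-class embeddings."""
    seg = F.normalize(segment_feats, dim=-1)
    lab = F.normalize(label_emb, dim=-1)
    scores = seg @ lab.t()                  # (B, T, C) similarity to each label
    weights = scores.softmax(dim=-1)
    return weights @ lab, scores            # label-space projection, raw scores

def av_semantic_similarity_loss(audio_scores, visual_scores):
    # Assumed form: encourage audio and visual segments to agree in label space.
    return F.mse_loss(audio_scores.softmax(-1), visual_scores.softmax(-1))
```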
arXiv Detail & Related papers (2024-07-11T01:57:08Z)
- CoLeaF: A Contrastive-Collaborative Learning Framework for Weakly Supervised Audio-Visual Video Parsing [23.100602876056165]
Weakly supervised audio-visual video parsing methods aim to detect audible-only, visible-only, and audible-visible events using only video-level labels.
We propose CoLeaF, a novel learning framework that optimizes the integration of cross-modal context in the embedding space.
Our experiments demonstrate that CoLeaF significantly improves the state-of-the-art results by an average of 1.9% and 2.4% F-score on the datasets.
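The summary does not spell out the objective, so the snippet below shows a generic cross-modal InfoNCE-style contrastive loss as one plausible way to optimize cross-modal integration in the embedding space; it is an assumption for illustration, not CoLeaF's actual loss.

```python
# Generic cross-modal contrastive loss (illustrative assumption, not CoLeaF's objective).
import torch
import torch.nn.functional as F

def cross_modal_infonce(audio_emb, visual_emb, temperature=0.07):
    """audio_emb, visual_emb: (N, d) paired clip-level embeddings."""
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(visual_emb, dim=-1)
    logits = a @ v.t() / temperature                       # (N, N) pairwise similarities
    targets = torch.arange(a.size(0), device=a.device)
    # Matched audio-visual pairs are positives; all other pairings act as negatives.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```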
arXiv Detail & Related papers (2024-05-17T10:51:15Z)
- Rethink Cross-Modal Fusion in Weakly-Supervised Audio-Visual Video Parsing [58.9467115916639]
We propose a messenger-guided mid-fusion transformer to reduce the uncorrelated cross-modal context in the fusion.
The messengers condense the full cross-modal context into a compact representation to only preserve useful cross-modal information.
We thus propose cross-audio prediction consistency to suppress the impact of irrelevant audio information on visual event prediction.
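A minimal sketch of messenger-style mid-fusion, assuming a small set of learned messenger tokens that first read the other modality and then write a condensed summary back into the target modality; token count and layer choices are illustrative assumptions.

```python
# Illustrative messenger-guided mid-fusion: messengers condense cross-modal context.
import torch
import torch.nn as nn

class MessengerFusionSketch(nn.Module):
    def __init__(self, d_model=512, n_heads=4, n_messengers=4):
        super().__init__()
        self.messengers = nn.Parameter(torch.randn(n_messengers, d_model))
        self.read = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.write = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, target, source):
        # target, source: (B, T, d) segment features from the two modalities.
        b = target.size(0)
        msg = self.messengers.unsqueeze(0).expand(b, -1, -1)
        msg, _ = self.read(msg, source, source)    # condense the source-modality context
        fused, _ = self.write(target, msg, msg)    # inject only the condensed context
        return target + fused
```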
arXiv Detail & Related papers (2023-11-14T13:27:03Z)
- Leveraging Foundation models for Unsupervised Audio-Visual Segmentation [49.94366155560371]
Audio-Visual Segmentation (AVS) aims to precisely outline audible objects in a visual scene at the pixel level.
Existing AVS methods require fine-grained annotations of audio-mask pairs in a supervised learning fashion.
We introduce unsupervised audio-visual segmentation with no need for task-specific data annotations or model training.
arXiv Detail & Related papers (2023-09-13T05:05:47Z)
- Towards Video Anomaly Retrieval from Video Anomaly Detection: New Benchmarks and Model [70.97446870672069]
Video anomaly detection (VAD) has received increasing attention due to its potential applications.
Video Anomaly Retrieval (VAR) aims to pragmatically retrieve relevant anomalous videos via cross-modal queries.
We present two benchmarks, UCFCrime-AR and XD-Violence, constructed on top of prevalent anomaly datasets.
arXiv Detail & Related papers (2023-07-24T06:22:37Z)
- Dense-Localizing Audio-Visual Events in Untrimmed Videos: A Large-Scale Benchmark and Baseline [53.07236039168652]
We focus on the task of dense-localizing audio-visual events, which aims to jointly localize and recognize all audio-visual events occurring in an untrimmed video.
We introduce the first Untrimmed Audio-Visual dataset, which contains 10K untrimmed videos with over 30K audio-visual events.
Next, we formulate the task using a new learning-based framework, which is capable of fully integrating audio and visual modalities to localize audio-visual events with various lengths and capture dependencies between them in a single pass.
arXiv Detail & Related papers (2023-03-22T22:00:17Z)
- Audio-visual Representation Learning for Anomaly Events Detection in Crowds [119.72951028190586]
This paper attempts to exploit multi-modal learning for modeling the audio and visual signals simultaneously.
We conduct the experiments on the SHADE dataset, a synthetic audio-visual dataset in surveillance scenes.
We find that introducing audio signals effectively improves anomaly event detection and outperforms other state-of-the-art methods.
arXiv Detail & Related papers (2021-10-28T02:42:48Z)
- Cross-Modal learning for Audio-Visual Video Parsing [30.331280948237428]
We present a novel approach to the audio-visual video parsing (AVVP) task that demarcates events from a video separately for audio and visual modalities.
We show how AVVP can benefit from several techniques geared towards effective cross-modal learning.
arXiv Detail & Related papers (2021-04-03T07:07:21Z)
- Positive Sample Propagation along the Audio-Visual Event Line [29.25572713908162]
Visual and audio signals often coexist in natural environments, forming audio-visual events (AVEs).
We propose a new positive sample propagation (PSP) module to discover and exploit closely related audio-visual pairs.
We perform extensive experiments on the public AVE dataset and achieve new state-of-the-art accuracy in both fully and weakly supervised settings.
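A minimal sketch of positive-sample-propagation-style aggregation, assuming that "closely related" pairs are selected by thresholding normalized audio-visual segment similarities; the threshold and normalization details are assumptions, not the paper's exact module.

```python
# Illustrative positive-pair selection and propagation between audio and visual segments.
import torch
import torch.nn.functional as F

def positive_sample_propagation(audio, visual, threshold=0.1):
    """audio, visual: (B, T, d) segment features."""
    sim = torch.bmm(F.normalize(audio, dim=-1),
                    F.normalize(visual, dim=-1).transpose(1, 2)).clamp(min=0)  # (B, Ta, Tv)
    sim = torch.where(sim >= threshold, sim, torch.zeros_like(sim))  # keep only strong pairs
    w_av = sim / sim.sum(dim=-1, keepdim=True).clamp(min=1e-6)
    w_va = sim.transpose(1, 2) / sim.transpose(1, 2).sum(dim=-1, keepdim=True).clamp(min=1e-6)
    audio_out = audio + torch.bmm(w_av, visual)    # audio segments gather visual evidence
    visual_out = visual + torch.bmm(w_va, audio)   # and vice versa
    return audio_out, visual_out
```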
arXiv Detail & Related papers (2021-04-01T03:53:57Z)