DAVE: Diagnostic benchmark for Audio Visual Evaluation
- URL: http://arxiv.org/abs/2503.09321v1
- Date: Wed, 12 Mar 2025 12:12:46 GMT
- Title: DAVE: Diagnostic benchmark for Audio Visual Evaluation
- Authors: Gorjan Radevski, Teodora Popordanoska, Matthew B. Blaschko, Tinne Tuytelaars
- Abstract summary: We introduce DAVE, a novel benchmark dataset designed to systematically evaluate audio-visual models. DAVE alleviates existing limitations by (i) ensuring both modalities are necessary to answer correctly and (ii) decoupling evaluation into atomic subcategories. Our detailed analysis of state-of-the-art models reveals specific failure modes and provides targeted insights for improvement.
- Score: 43.54781776394087
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Audio-visual understanding is a rapidly evolving field that seeks to integrate and interpret information from both auditory and visual modalities. Despite recent advances in multi-modal learning, existing benchmarks often suffer from strong visual bias -- where answers can be inferred from visual data alone -- and provide only aggregate scores that conflate multiple sources of error. This makes it difficult to determine whether models struggle with visual understanding, audio interpretation, or audio-visual alignment. In this work, we introduce DAVE (Diagnostic Audio Visual Evaluation), a novel benchmark dataset designed to systematically evaluate audio-visual models across controlled challenges. DAVE alleviates existing limitations by (i) ensuring both modalities are necessary to answer correctly and (ii) decoupling evaluation into atomic subcategories. Our detailed analysis of state-of-the-art models reveals specific failure modes and provides targeted insights for improvement. By offering this standardized diagnostic framework, we aim to facilitate more robust development of audio-visual models. The dataset is released: https://github.com/gorjanradevski/dave
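The decoupled, per-subcategory scoring that DAVE advocates can be illustrated with a short sketch. The annotation field names below ("question_id", "subcategory", "answer") are illustrative assumptions, not the released schema; the actual format is documented in the linked repository.

```python
# Minimal sketch of decoupled diagnostic scoring: instead of one aggregate accuracy,
# report accuracy per atomic subcategory. Field names are assumptions for illustration.
import json
from collections import defaultdict

def per_subcategory_accuracy(annotations_path: str, predictions: dict[str, str]) -> dict[str, float]:
    """Return accuracy broken down by diagnostic subcategory."""
    with open(annotations_path) as f:
        annotations = json.load(f)  # assumed: list of {"question_id", "subcategory", "answer"}

    correct, total = defaultdict(int), defaultdict(int)
    for item in annotations:
        sub = item["subcategory"]  # e.g. audio understanding, visual understanding, alignment
        total[sub] += 1
        if predictions.get(item["question_id"]) == item["answer"]:
            correct[sub] += 1

    return {sub: correct[sub] / total[sub] for sub in total}

# Usage: scores = per_subcategory_accuracy("dave_annotations.json", model_predictions)
```

A model that looks strong in aggregate may still score near chance on the audio-dependent subcategories, which is exactly the failure mode this kind of decomposition isolates.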
Related papers
- $C^2$AV-TSE: Context and Confidence-aware Audio Visual Target Speaker Extraction [80.57232374640911]
We propose a model-agnostic strategy called Mask-And-Recover (MAR).
MAR integrates both inter- and intra-modality contextual correlations to enable global inference within extraction modules.
To better target challenging parts within each sample, we introduce a Fine-grained Confidence Score (FCS) model.
arXiv Detail & Related papers (2025-04-01T13:01:30Z)
- FortisAVQA and MAVEN: a Benchmark Dataset and Debiasing Framework for Robust Multimodal Reasoning [31.61978841892981]
We introduce a novel dataset, FortisAVQA, constructed in two stages.
The first stage expands the test space with greater diversity, while the second enables a refined robustness evaluation.
Our architecture achieves state-of-the-art performance on FortisAVQA, with a notable improvement of 7.81%.
arXiv Detail & Related papers (2025-04-01T07:23:50Z)
- Robust Audio-Visual Segmentation via Audio-Guided Visual Convergent Alignment [26.399212357764576]
Accurately localizing audible objects based on audio-visual cues is the core objective of audio-visual segmentation.
We propose a novel framework with two primary components: an audio-guided modality alignment (AMA) module and an uncertainty estimation (UE) module.
AMA performs audio-visual interactions within multiple groups and consolidates group features into compact representations based on their responsiveness to audio cues.
UE integrates spatial and temporal information to identify high-uncertainty regions caused by frequent changes in sound state.
arXiv Detail & Related papers (2025-03-17T05:48:22Z)
- Unveiling Visual Biases in Audio-Visual Localization Benchmarks [52.76903182540441]
We identify a significant issue in existing benchmarks.
The sounding objects are often easily recognized based solely on visual cues, which we refer to as visual bias.
Our findings suggest that existing AVSL benchmarks need further refinement to facilitate audio-visual learning.
arXiv Detail & Related papers (2024-08-25T04:56:08Z)
- A Study of Dropout-Induced Modality Bias on Robustness to Missing Video Frames for Audio-Visual Speech Recognition [53.800937914403654]
Advanced Audio-Visual Speech Recognition (AVSR) systems have been observed to be sensitive to missing video frames.
While applying the dropout technique to the video modality enhances robustness to missing frames, it simultaneously results in a performance loss when dealing with complete data input.
We propose a novel Multimodal Distribution Approximation with Knowledge Distillation (MDA-KD) framework to reduce over-reliance on the audio modality.
arXiv Detail & Related papers (2024-03-07T06:06:55Z)
- AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models [92.92233932921741]
We propose the AV-SUPERB benchmark that enables general-purpose evaluation of unimodal audio/visual and bimodal fusion representations.
We evaluate 5 recent self-supervised models and show that none of these models generalize to all tasks.
We show that representations can be improved with intermediate-task fine-tuning, and that audio event classification on AudioSet serves as a strong intermediate task.
arXiv Detail & Related papers (2023-09-19T17:35:16Z)
- Modality-Independent Teachers Meet Weakly-Supervised Audio-Visual Event Parser [34.19935635508947]
We investigate the under-explored unaligned setting, where the goal is to recognize audio and visual events in a video with only weak labels observed.
To enhance learning in this challenging setting, we incorporate large-scale contrastively pre-trained models as the modality teachers.
A simple, effective, and generic method, termed Visual-Audio Label Elaboration (VALOR), is introduced to harvest modality labels for the training events.
arXiv Detail & Related papers (2023-05-27T02:57:39Z)
- Unraveling Instance Associations: A Closer Look for Audio-Visual Segmentation [18.001730255429347]
Audio-visual segmentation (AVS) is a challenging task that involves accurately segmenting sounding objects based on audio-visual cues.
We propose a new cost-effective strategy to build challenging and relatively unbiased high-quality audio-visual segmentation benchmarks.
Experiments conducted on existing AVS datasets and on our new benchmark show that our method achieves state-of-the-art (SOTA) segmentation accuracy.
arXiv Detail & Related papers (2023-04-06T09:54:06Z)
- AV-Gaze: A Study on the Effectiveness of Audio Guided Visual Attention Estimation for Non-Profilic Faces [28.245662058349854]
In this paper, we explore whether audio-guided coarse head-pose can further enhance visual attention estimation performance for non-prolific faces.
We use off-the-shelf state-of-the-art models to facilitate cross-modal weak-supervision.
Our model can utilize any of the available modalities for task-specific inference.
arXiv Detail & Related papers (2022-07-07T02:23:02Z)
- Perceptual Score: What Data Modalities Does Your Model Perceive? [73.75255606437808]
We introduce the perceptual score, a metric that assesses the degree to which a model relies on the different subsets of the input features.
We find that recent, more accurate multi-modal models for visual question-answering tend to perceive the visual data less than their predecessors.
Using the perceptual score also helps to analyze model biases by decomposing the score into data subset contributions.
arXiv Detail & Related papers (2021-10-27T12:19:56Z)
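As a closing illustration, the permutation idea behind the perceptual score (last entry above) can be sketched as follows. The `model(audio, video)` signature and the simple relative-drop normalisation are assumptions for the sketch; the paper's exact estimator differs in detail.

```python
# Rough sketch, in the spirit of the perceptual score: shuffle one modality across the
# evaluation batch and measure how much accuracy drops. A drop near zero suggests the
# model largely ignores that modality.
import torch

@torch.no_grad()
def modality_reliance(model, audio, video, labels, modality: str = "audio") -> float:
    """Relative accuracy drop when one modality is permuted across samples."""
    def accuracy(a, v):
        preds = model(a, v).argmax(dim=-1)  # assumed signature: model(audio, video) -> logits
        return (preds == labels).float().mean().item()

    base = accuracy(audio, video)
    perm = torch.randperm(audio.shape[0])
    if modality == "audio":
        permuted = accuracy(audio[perm], video)  # audio no longer matches the video/labels
    else:
        permuted = accuracy(audio, video[perm])
    return (base - permuted) / max(base, 1e-8)
```

A reliance score near zero for the audio stream is one quantitative symptom of the visual bias that DAVE and the AVSL-bias study above are designed to expose.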