Related papers: VGGSounder: Audio-Visual Evaluations for Foundation Models

VGGSounder: Audio-Visual Evaluations for Foundation Models

URL: http://arxiv.org/abs/2508.08237v3
Date: Sat, 18 Oct 2025 12:43:59 GMT
Title: VGGSounder: Audio-Visual Evaluations for Foundation Models
Authors: Daniil Zverev, Thaddäus Wiedemer, Ameya Prabhu, Matthias Bethge, Wieland Brendel, A. Sophia Koepke,
Abstract summary: VGGSound dataset is commonly used as a benchmark for evaluation audio-visual classification.<n>We introduce VGGSounder, a re-annotated, multi-label test set that extends VGGSound and is specifically designed to evaluate audio-visual foundation models.
Score: 45.98240369469408
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The emergence of audio-visual foundation models underscores the importance of reliably assessing their multi-modal understanding. The VGGSound dataset is commonly used as a benchmark for evaluation audio-visual classification. However, our analysis identifies several limitations of VGGSound, including incomplete labelling, partially overlapping classes, and misaligned modalities. These lead to distorted evaluations of auditory and visual capabilities. To address these limitations, we introduce VGGSounder, a comprehensively re-annotated, multi-label test set that extends VGGSound and is specifically designed to evaluate audio-visual foundation models. VGGSounder features detailed modality annotations, enabling precise analyses of modality-specific performance. Furthermore, we reveal model limitations by analysing performance degradation when adding another input modality with our new modality confusion metric.

Related papers

Revisiting Audio-Visual Segmentation with Vision-Centric Transformer [60.83798235788669]
Audio-Visual (AVS) aims to segment sound-producing objects in video frames based on the associated audio signal.<n>We propose a new Vision-Centric Transformer framework that leverages vision-derived queries to iteratively fetch corresponding audio and visual information.<n>Our framework achieves new state-of-the-art performances on three subsets of the AVSBench dataset.
arXiv Detail & Related papers (2025-06-30T08:40:36Z)
$C^2$AV-TSE: Context and Confidence-aware Audio Visual Target Speaker Extraction [80.57232374640911]
We propose a model-agnostic strategy called the Mask-And-Recover (MAR)<n>MAR integrates both inter- and intra-modality contextual correlations to enable global inference within extraction modules.<n>To better target challenging parts within each sample, we introduce a Fine-grained Confidence Score (FCS) model.
arXiv Detail & Related papers (2025-04-01T13:01:30Z)
DAVE: Diagnostic benchmark for Audio Visual Evaluation [43.54781776394087]
We introduce DAVE, a novel benchmark dataset designed to systematically evaluate audio-visual models.<n>DAVE alleviates existing limitations by (i) ensuring both modalities are necessary to answer correctly and (ii) decoupling evaluation into atomic subcategories.<n>Our detailed analysis of state-of-the-art models reveals specific failure modes and provides targeted insights for improvement.
arXiv Detail & Related papers (2025-03-12T12:12:46Z)
Evaluation of Deep Audio Representations for Hearables [1.5646349560044959]
This dataset includes 1,158 audio tracks, each 30 seconds long, created by spatially mixing proprietary monologues with high-quality recordings of everyday acoustic scenes.<n>Our benchmark encompasses eight tasks that assess the general context, speech sources, and technical acoustic properties of the audio scenes.<n>This superiority underscores the advantage of models trained on diverse audio collections, confirming their applicability to a wide array of auditory tasks, including encoding the environment properties necessary for hearable steering.
arXiv Detail & Related papers (2025-02-10T16:51:11Z)
Learning Self-Supervised Audio-Visual Representations for Sound Recommendations [0.0]
We propose a novel self-supervised approach for learning audio and visual representations from unlabeled videos.<n>The approach uses an attention mechanism to learn the relative importance of convolutional features extracted at different resolutions from the audio and visual streams.<n>We evaluate the representations learned by the model to classify audio-visual correlation as well as to recommend sound effects for visual scenes.
arXiv Detail & Related papers (2024-12-10T10:56:02Z)
Unveiling Visual Biases in Audio-Visual Localization Benchmarks [52.76903182540441]
We identify a significant issue in existing benchmarks. The sounding objects are often easily recognized based solely on visual cues, which we refer to as visual bias. Our findings suggest that existing AVSL benchmarks need further refinement to facilitate audio-visual learning.
arXiv Detail & Related papers (2024-08-25T04:56:08Z)
AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models [92.92233932921741]
We propose the AV-SUPERB benchmark that enables general-purpose evaluation of unimodal audio/visual and bimodal fusion representations. We evaluate 5 recent self-supervised models and show that none of these models generalize to all tasks. We show that representations may be improved with intermediate-task fine-tuning and audio event classification with AudioSet serves as a strong intermediate task.
arXiv Detail & Related papers (2023-09-19T17:35:16Z)
Unraveling Instance Associations: A Closer Look for Audio-Visual Segmentation [18.001730255429347]
Audio-visual segmentation (AVS) is a challenging task that involves accurately segmenting sounding objects based on audio-visual cues. We propose a new cost-effective strategy to build challenging and relatively unbiased high-quality audio-visual segmentation benchmarks. Experiments conducted on existing AVS datasets and on our new benchmark show that our method achieves state-of-the-art (SOTA) segmentation accuracy.
arXiv Detail & Related papers (2023-04-06T09:54:06Z)
Epic-Sounds: A Large-scale Dataset of Actions That Sound [64.24297230981168]
EPIC-SOUNDS includes 78.4k categorised segments of audible events and actions, distributed across 44 classes as well as 39.2k non-categorised segments.<n>We train and evaluate state-of-the-art audio recognition and detection models on our dataset, for both audio-only and audio-visual methods.
arXiv Detail & Related papers (2023-02-01T18:19:37Z)
Audio-Visual Synchronisation in the wild [149.84890978170174]
We identify and curate a test set with high audio-visual correlation, namely VGG-Sound Sync. We compare a number of transformer-based architectural variants specifically designed to model audio and visual signals of arbitrary length. We set the first benchmark for general audio-visual synchronisation with over 160 diverse classes in the new VGG-Sound Sync video dataset.
arXiv Detail & Related papers (2021-12-08T17:50:26Z)

This list is automatically generated from the titles and abstracts of the papers in this site.