Seeing Sound, Hearing Sight: Uncovering Modality Bias and Conflict of AI models in Sound Localization
- URL: http://arxiv.org/abs/2505.11217v1
- Date: Fri, 16 May 2025 13:13:25 GMT
- Title: Seeing Sound, Hearing Sight: Uncovering Modality Bias and Conflict of AI models in Sound Localization
- Authors: Yanhao Jia, Ji Xie, S Jivaganesh, Hao Li, Xu Wu, Mengmi Zhang
- Abstract summary: Humans consistently outperform AI, demonstrating superior resilience to conflicting or missing visuals by relying on auditory information. We finetune a state-of-the-art model using a stereo audio-image dataset generated via 3D simulations.
- Score: 9.726581483075938
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Imagine hearing a dog bark and turning toward the sound only to see a parked car, while the real, silent dog sits elsewhere. Such sensory conflicts test perception, yet humans reliably resolve them by prioritizing sound over misleading visuals. Despite advances in multimodal AI integrating vision and audio, little is known about how these systems handle cross-modal conflicts or whether they favor one modality. In this study, we systematically examine modality bias and conflict resolution in AI sound localization. We assess leading multimodal models and benchmark them against human performance in psychophysics experiments across six audiovisual conditions, including congruent, conflicting, and absent cues. Humans consistently outperform AI, demonstrating superior resilience to conflicting or missing visuals by relying on auditory information. In contrast, AI models often default to visual input, degrading performance to near chance levels. To address this, we finetune a state-of-the-art model using a stereo audio-image dataset generated via 3D simulations. Even with limited training data, the refined model surpasses existing benchmarks. Notably, it also mirrors human-like horizontal localization bias favoring left-right precision, likely due to the stereo audio structure reflecting human ear placement. These findings underscore how sensory input quality and system architecture shape multimodal representation accuracy.
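The stereo cue the abstract alludes to (left-right localization driven by the two-channel structure mirroring human ear placement) can be illustrated with a classical signal-processing baseline: estimating azimuth from the interaural time difference (ITD). The sketch below is illustrative only, not the paper's method; the function name, ear spacing, and sign convention are assumptions.

```python
import numpy as np

def estimate_azimuth_itd(left, right, sr, ear_distance=0.18, speed_of_sound=343.0):
    """Estimate horizontal source angle (degrees) from the interaural time
    difference between two stereo channels via cross-correlation.

    Convention here: positive azimuth means the source is to the right
    (the right channel leads). ear_distance is an assumed head width in m.
    """
    corr = np.correlate(left, right, mode="full")
    lag = np.argmax(corr) - (len(right) - 1)  # lag > 0: left channel lags
    itd = lag / sr                            # delay in seconds
    # Far-field approximation: itd = (ear_distance / c) * sin(azimuth)
    s = np.clip(itd * speed_of_sound / ear_distance, -1.0, 1.0)
    return float(np.degrees(np.arcsin(s)))
```

A source delayed by 4 samples at 16 kHz in the left channel yields an ITD of 0.25 ms, i.e. an azimuth of roughly 28 degrees to the right under these assumptions.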
Related papers
- Meta Audiobox Aesthetics: Unified Automatic Quality Assessment for Speech, Music, and Sound [46.7144966835279]
This paper addresses the need for automated systems capable of predicting audio aesthetics without human intervention. We propose new annotation guidelines that decompose human listening perspectives into four distinct axes. We develop and train no-reference, per-item prediction models that offer a more nuanced assessment of audio quality.
arXiv Detail & Related papers (2025-02-07T18:15:57Z)
- VAGUE: Visual Contexts Clarify Ambiguous Expressions [15.140825578254324]
We introduce VAGUE, a benchmark evaluating multimodal AI systems' ability to integrate visual context to infer intent. VAGUE consists of 1.6K ambiguous textual expressions, each paired with an image and multiple-choice interpretations. Our experiments reveal that existing multimodal AI models struggle to infer the speaker's true intent.
arXiv Detail & Related papers (2024-11-21T14:01:42Z)
- Where are we in audio deepfake detection? A systematic analysis over generative and detection models [59.09338266364506]
SONAR is a synthetic AI-Audio Detection Framework and Benchmark. It provides a comprehensive evaluation for distinguishing cutting-edge AI-synthesized auditory content. It is the first framework to uniformly benchmark AI-audio detection across both traditional and foundation model-based detection systems.
arXiv Detail & Related papers (2024-10-06T01:03:42Z)
- A Study of Dropout-Induced Modality Bias on Robustness to Missing Video Frames for Audio-Visual Speech Recognition [53.800937914403654]
Advanced Audio-Visual Speech Recognition (AVSR) systems have been observed to be sensitive to missing video frames.
While applying the dropout technique to the video modality enhances robustness to missing frames, it simultaneously results in a performance loss when dealing with complete data input.
We propose a novel Multimodal Distribution Approximation with Knowledge Distillation (MDA-KD) framework to reduce over-reliance on the audio modality.
arXiv Detail & Related papers (2024-03-07T06:06:55Z)
- Anomalous Sound Detection using Audio Representation with Machine ID based Contrastive Learning Pretraining [52.191658157204856]
This paper uses contrastive learning to refine audio representations for each machine ID, rather than for each audio sample.
The proposed two-stage method uses contrastive learning to pretrain the audio representation model.
Experiments show that our method outperforms the state-of-the-art methods using contrastive learning or self-supervised classification.
arXiv Detail & Related papers (2023-04-07T11:08:31Z)
- Sound Localization from Motion: Jointly Learning Sound Direction and Camera Rotation [26.867430697990674]
We use images and sounds that undergo subtle but geometrically consistent changes as we rotate our heads to estimate camera rotation and localize sound sources.
A visual model predicts camera rotation from a pair of images, while an audio model predicts the direction of sound sources from sounds.
We train these models to generate predictions that agree with one another.
Our model can successfully estimate rotations on both real and synthetic scenes, and localize sound sources with accuracy competitive with state-of-the-art self-supervised approaches.
arXiv Detail & Related papers (2023-03-20T17:59:55Z)
- Self-supervised models of audio effectively explain human cortical responses to speech [71.57870452667369]
We capitalize on the progress of self-supervised speech representation learning to create new state-of-the-art models of the human auditory system.
These results show that self-supervised models effectively capture the hierarchy of information relevant to different stages of speech processing in human cortex.
arXiv Detail & Related papers (2022-05-27T22:04:02Z)
- Audio-visual Representation Learning for Anomaly Events Detection in Crowds [119.72951028190586]
This paper attempts to exploit multi-modal learning for modeling the audio and visual signals simultaneously.
We conduct the experiments on SHADE dataset, a synthetic audio-visual dataset in surveillance scenes.
We find that introducing audio signals effectively improves anomaly event detection, outperforming other state-of-the-art methods.
arXiv Detail & Related papers (2021-10-28T02:42:48Z)
- Learning Audio-Visual Dereverberation [87.52880019747435]
Reverberation from audio reflecting off surfaces and objects in the environment not only degrades the quality of speech for human perception, but also severely impacts the accuracy of automatic speech recognition.
Our idea is to learn to dereverberate speech from audio-visual observations.
We introduce Visually-Informed Dereverberation of Audio (VIDA), an end-to-end approach that learns to remove reverberation based on both the observed sounds and visual scene.
arXiv Detail & Related papers (2021-06-14T20:01:24Z)
- Exploiting Audio-Visual Consistency with Partial Supervision for Spatial Audio Generation [45.526051369551915]
We propose an audio spatialization framework that converts the monaural audio of a video into spatial audio by exploiting the relationship across audio and visual components.
Experiments on benchmark datasets confirm the effectiveness of our proposed framework in both semi-supervised and fully supervised scenarios.
arXiv Detail & Related papers (2021-05-03T09:34:11Z)
- Speech Prediction in Silent Videos using Variational Autoencoders [29.423462898526605]
We present a model for generating speech in a silent video.
The proposed model combines recurrent neural networks and variational deep generative models to learn the conditional distribution of the audio signal.
We demonstrate the performance of our model on the GRID dataset based on standard benchmarks.
arXiv Detail & Related papers (2020-11-14T17:09:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.