When Audio and Text Disagree: Revealing Text Bias in Large Audio-Language Models
- URL: http://arxiv.org/abs/2508.15407v1
- Date: Thu, 21 Aug 2025 09:58:24 GMT
- Title: When Audio and Text Disagree: Revealing Text Bias in Large Audio-Language Models
- Authors: Cheng Wang, Gelei Deng, Xianglin Yang, Han Qiu, Tianwei Zhang
- Abstract summary: MCR-BENCH is the first benchmark designed to evaluate how LALMs prioritize information when presented with inconsistent audio-text pairs. We reveal a concerning phenomenon: when inconsistencies exist between modalities, LALMs display a significant bias toward textual input. This tendency leads to substantial performance degradation in audio-centric tasks and raises important reliability concerns for real-world applications.
- Score: 18.160420407067743
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Audio-Language Models (LALMs) are enhanced with audio perception capabilities, enabling them to effectively process and understand multimodal inputs that combine audio and text. However, their performance in handling conflicting information between the audio and text modalities remains largely unexamined. This paper introduces MCR-BENCH, the first comprehensive benchmark specifically designed to evaluate how LALMs prioritize information when presented with inconsistent audio-text pairs. Through extensive evaluation across diverse audio understanding tasks, we reveal a concerning phenomenon: when inconsistencies exist between modalities, LALMs display a significant bias toward textual input, frequently disregarding audio evidence. This tendency leads to substantial performance degradation in audio-centric tasks and raises important reliability concerns for real-world applications. We further investigate the factors that drive text bias, explore mitigation strategies through supervised fine-tuning, and analyze model confidence patterns, which reveal persistent overconfidence even under contradictory inputs. These findings underscore the need for better modality balance during training and more sophisticated fusion mechanisms to improve robustness when handling conflicting multimodal inputs. The project is available at https://github.com/WangCheng0116/MCR-BENCH.
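As a rough illustration of the kind of probe the abstract describes (not the authors' released code), the sketch below pairs each audio clip with a caption that contradicts its ground-truth label and counts how often the model's answer follows the text rather than the audio. The `ConflictExample` structure, the `query_lalm` wrapper, and the substring-matching heuristic are all hypothetical placeholders standing in for whichever LALM interface and scoring rule are actually used.

```python
# Minimal sketch (assumptions, not the MCR-BENCH implementation): measure how
# often a model follows a contradictory caption instead of the audio evidence.
from dataclasses import dataclass


@dataclass
class ConflictExample:
    audio_path: str        # e.g. a clip of a dog barking
    audio_label: str       # ground truth from the audio ("dog bark")
    conflicting_text: str  # caption asserting something else ("a cat is meowing")
    text_label: str        # label implied by the caption ("cat meow")


def query_lalm(audio_path: str, prompt: str) -> str:
    """Hypothetical wrapper: send (audio, prompt) to the LALM under test, return its answer."""
    raise NotImplementedError


def measure_text_bias(examples: list[ConflictExample]) -> dict:
    """Count how often the model sides with the text versus the audio when they disagree."""
    follows_text = follows_audio = other = 0
    for ex in examples:
        prompt = (
            f"Context: {ex.conflicting_text}\n"
            "Question: What sound event is present in the audio? Answer briefly."
        )
        answer = query_lalm(ex.audio_path, prompt).lower()
        if ex.text_label.lower() in answer:
            follows_text += 1      # model trusted the contradictory caption
        elif ex.audio_label.lower() in answer:
            follows_audio += 1     # model trusted the audio evidence
        else:
            other += 1             # answer matched neither label
    n = max(len(examples), 1)
    return {
        "text_follow_rate": follows_text / n,   # a high value indicates text bias
        "audio_follow_rate": follows_audio / n,
        "other_rate": other / n,
    }
```

In practice the matching step would be replaced by the benchmark's own scoring (e.g. multiple-choice options or an answer verifier), but the overall loop, conflicting pairs in, per-modality follow rates out, conveys the evaluation the paper reports.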
Related papers
- CORD: Bridging the Audio-Text Reasoning Gap via Weighted On-policy Cross-modal Distillation [32.72685791637924]
We propose CORD, a unified alignment framework that performs online cross-modal self-distillation. Specifically, it aligns audio-conditioned reasoning with its text-conditioned counterpart within a unified model. Empirical results across multiple benchmarks demonstrate that CORD consistently enhances audio-conditioned reasoning.
arXiv Detail & Related papers (2026-01-23T08:31:24Z) - AEQ-Bench: Measuring Empathy of Omni-Modal Large Models [55.722881748046895]
We introduce AEQ-Bench, a novel benchmark to assess two core empathetic capabilities of omni-modal large models (OLMs). AEQ-Bench incorporates two novel settings that vary in context specificity and speech tone. Comprehensive assessment across linguistic and paralinguistic metrics reveals that OLMs trained with audio output capabilities generally outperformed models with text-only outputs.
arXiv Detail & Related papers (2026-01-15T15:39:50Z) - When Silence Matters: The Impact of Irrelevant Audio on Text Reasoning in Large Audio-Language Models [48.94367629342966]
We find that even non-informative audio reduces accuracy and increases prediction volatility. Silence, often assumed neutral, destabilizes outputs as strongly as synthetic noise. Our results reveal cross-modal interference as a key challenge and highlight the need for efficient fusion strategies.
arXiv Detail & Related papers (2025-10-01T07:59:45Z) - From Alignment to Advancement: Bootstrapping Audio-Language Alignment with Synthetic Data [55.2480439325792]
Audio-aware large language models (ALLMs) have recently made great strides in understanding and processing audio inputs. These models are typically adapted from text-based large language models (LLMs) through additional training on audio-related tasks. We propose a data generation framework that produces contrastive-like training data, designed to enhance ALLMs' ability to differentiate between present and absent sounds.
arXiv Detail & Related papers (2025-05-26T16:08:41Z) - AVadCLIP: Audio-Visual Collaboration for Robust Video Anomaly Detection [57.649223695021114]
We present a novel weakly supervised framework that leverages audio-visual collaboration for robust video anomaly detection. Our framework demonstrates superior performance across multiple benchmarks, with audio integration significantly boosting anomaly detection accuracy.
arXiv Detail & Related papers (2025-04-06T13:59:16Z) - Robust Audio-Visual Segmentation via Audio-Guided Visual Convergent Alignment [26.399212357764576]
Accurately localizing audible objects based on audio-visual cues is the core objective of audio-visual segmentation. We propose a novel framework with two primary components: an audio-guided modality alignment (AMA) module and an uncertainty estimation (UE) module. AMA performs audio-visual interactions within multiple groups and consolidates group features into compact representations based on their responsiveness to audio cues. UE integrates spatial and temporal information to identify high-uncertainty regions caused by frequent changes in sound state.
arXiv Detail & Related papers (2025-03-17T05:48:22Z) - Can Large Audio-Language Models Truly Hear? Tackling Hallucinations with Multi-Task Assessment and Stepwise Audio Reasoning [55.2480439325792]
Large audio-language models (LALMs) have shown impressive capabilities in understanding and reasoning about audio and speech information. These models still face challenges, including hallucinating non-existent sound events, misidentifying the order of sound events, and incorrectly attributing sound sources.
arXiv Detail & Related papers (2024-10-21T15:55:27Z) - A Study of Dropout-Induced Modality Bias on Robustness to Missing Video Frames for Audio-Visual Speech Recognition [53.800937914403654]
Advanced Audio-Visual Speech Recognition (AVSR) systems have been observed to be sensitive to missing video frames.
While applying the dropout technique to the video modality enhances robustness to missing frames, it simultaneously results in a performance loss when dealing with complete data input.
We propose a novel Multimodal Distribution Approximation with Knowledge Distillation (MDA-KD) framework to reduce over-reliance on the audio modality.
arXiv Detail & Related papers (2024-03-07T06:06:55Z) - Multi-Modal Perception Attention Network with Self-Supervised Learning for Audio-Visual Speaker Tracking [18.225204270240734]
We propose a novel Multi-modal Perception Tracker (MPT) for speaker tracking using both audio and visual modalities.
MPT achieves 98.6% and 78.3% tracking accuracy on the standard and occluded datasets, respectively.
arXiv Detail & Related papers (2021-12-14T14:14:17Z)