When Silence Matters: The Impact of Irrelevant Audio on Text Reasoning in Large Audio-Language Models
- URL: http://arxiv.org/abs/2510.00626v1
- Date: Wed, 01 Oct 2025 07:59:45 GMT
- Title: When Silence Matters: The Impact of Irrelevant Audio on Text Reasoning in Large Audio-Language Models
- Authors: Chen-An Li, Tzu-Han Lin, Hung-yi Lee
- Abstract summary: We find that even non-informative audio reduces accuracy and increases prediction volatility. Silence, often assumed neutral, destabilizes outputs as strongly as synthetic noise. Our results reveal cross-modal interference as a key challenge and highlight the need for efficient fusion strategies.
- Score: 48.94367629342966
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large audio-language models (LALMs) unify speech and text processing, but their robustness in noisy real-world settings remains underexplored. We investigate how irrelevant audio, such as silence, synthetic noise, and environmental sounds, affects text reasoning tasks where audio is unnecessary. Across three text-based benchmarks, we find that even non-informative audio reduces accuracy and increases prediction volatility; the severity of interference scales with longer durations, higher amplitudes, and elevated decoding temperatures. Silence, often assumed neutral, destabilizes outputs as strongly as synthetic noise. While larger models show greater resilience, vulnerabilities persist across all evaluated systems. We further test mitigation strategies and find that prompting shows limited effectiveness, whereas self-consistency improves stability at the cost of increased computation. Our results reveal cross-modal interference as a key robustness challenge and highlight the need for efficient fusion strategies that preserve reasoning performance in the presence of irrelevant inputs.
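To make the setup concrete, here is a minimal sketch, not the authors' code: it builds an irrelevant audio clip (silence or white noise) of a chosen duration and amplitude, then applies self-consistency, the mitigation the abstract finds most effective, by sampling several answers at a fixed decoding temperature and majority-voting. The `query_lalm` stub and all names are hypothetical stand-ins for a real LALM inference API.

```python
# Illustrative sketch (not the paper's code): pair a text-only question with
# irrelevant audio, then stabilize the answer via self-consistency voting.
from collections import Counter

import numpy as np

def make_irrelevant_audio(kind: str, seconds: float, amplitude: float = 0.1,
                          sr: int = 16000) -> np.ndarray:
    """Return silence or white noise of the given duration and amplitude."""
    n = int(seconds * sr)
    if kind == "silence":
        return np.zeros(n, dtype=np.float32)
    if kind == "noise":
        return (amplitude * np.random.randn(n)).astype(np.float32)
    raise ValueError(f"unknown kind: {kind}")

def query_lalm(text: str, audio: np.ndarray, temperature: float) -> str:
    """Hypothetical stand-in for a real LALM inference call."""
    raise NotImplementedError("wire up an actual audio-language model here")

def self_consistent_answer(text: str, audio: np.ndarray,
                           k: int = 8, temperature: float = 0.7) -> str:
    """Sample k answers and return the majority vote (k times the compute)."""
    votes = Counter(query_lalm(text, audio, temperature) for _ in range(k))
    return votes.most_common(1)[0][0]

# One condition from the paper's design space: longer, louder noise at a
# higher temperature should interfere more than a short, quiet clip.
audio = make_irrelevant_audio("noise", seconds=10.0, amplitude=0.1)
# answer = self_consistent_answer("What is 17 * 24?", audio, k=8, temperature=0.7)
```

The cost trade-off named in the abstract falls out directly: the vote runs k forward passes per question in exchange for lower prediction volatility.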
Related papers
- When Denoising Hinders: Revisiting Zero-Shot ASR with SAM-Audio and Whisper [0.0]
We present a systematic empirical study of the impact of Segment Anything Model Audio (SAM-Audio) by Meta AI when used as a preprocessing step for zero-shot transcription with Whisper. Contrary to common intuition, our results show that SAM-Audio preprocessing consistently degrades ASR performance. These findings expose a fundamental mismatch: audio that is perceptually cleaner to human listeners is not necessarily better suited for machine recognition.
arXiv Detail & Related papers (2026-03-05T01:20:11Z) - Lost in the Noise: How Reasoning Models Fail with Contextual Distractors [57.31788955167306]
Recent advances in reasoning models and agentic AI systems have led to an increased reliance on diverse external information. We introduce NoisyBench, a comprehensive benchmark that systematically evaluates model robustness across 11 datasets in RAG, reasoning, alignment, and tool-use tasks. Our evaluation reveals a catastrophic performance drop of up to 80% in state-of-the-art models when faced with contextual distractors.
arXiv Detail & Related papers (2026-01-12T05:43:51Z) - Thinking with Sound: Audio Chain-of-Thought Enables Multimodal Reasoning in Large Audio-Language Models [49.097347801692166]
We introduce Thinking-with-Sound (TwS), a framework that equips large audio-language models with Audio CoT. TwS enables models to actively think with audio signals, performing numerical analysis and digital manipulation through multimodal reasoning. Experiments reveal that state-of-the-art LALMs suffer dramatic performance degradation on MELD-Hard1k, with accuracy dropping by more than 50% compared to clean audio.
arXiv Detail & Related papers (2025-09-26T01:27:59Z) - SVeritas: Benchmark for Robust Speaker Verification under Diverse Conditions [54.34001921326444]
Speaker verification (SV) models are increasingly integrated into security, personalization, and access control systems. Existing benchmarks evaluate only subsets of real-world stress conditions, missing others entirely. We introduce SVeritas, a comprehensive speaker verification benchmark suite that assesses SV systems under stressors such as recording duration, spontaneity, content, noise, microphone distance, reverberation, channel mismatch, audio bandwidth, codecs, speaker age, and susceptibility to spoofing and adversarial attacks.
arXiv Detail & Related papers (2025-09-21T14:11:16Z) - When Audio and Text Disagree: Revealing Text Bias in Large Audio-Language Models [18.160420407067743]
MCR-BENCH is the first benchmark designed to evaluate how LALMs prioritize information when presented with inconsistent audio-text pairs. We reveal a concerning phenomenon: when inconsistencies exist between modalities, LALMs display a significant bias toward textual input. This tendency leads to substantial performance degradation in audio-centric tasks and raises important reliability concerns for real-world applications.
arXiv Detail & Related papers (2025-08-21T09:58:24Z) - Hidden in the Noise: Unveiling Backdoors in Audio LLMs Alignment through Latent Acoustic Pattern Triggers [40.4026420070893]
We introduce Hidden in the Noise (HIN), a novel backdoor attack framework designed to exploit subtle, audio-specific features. HIN applies acoustic modifications to raw audio waveforms, such as alterations to temporal dynamics and strategic injection of spectrally tailored noise. To evaluate ALLM robustness against audio-feature-based triggers, we develop the AudioSafe benchmark, assessing nine distinct risk types.
arXiv Detail & Related papers (2025-08-04T08:15:16Z) - Autoregressive Speech Enhancement via Acoustic Tokens [12.77742493025067]
We study the performance of acoustic tokens for speech enhancement and introduce a novel transducer-based autoregressive architecture. Experiments on the VoiceBank and Libri1 datasets show that acoustic tokens outperform semantic tokens in terms of preserving speaker identity.
arXiv Detail & Related papers (2025-07-17T06:32:22Z) - Measuring the Robustness of Audio Deepfake Detectors [59.09338266364506]
This work systematically evaluates the robustness of 10 audio deepfake detection models against 16 common corruptions. Using both traditional deep learning models and state-of-the-art foundation models, we make four unique observations.
arXiv Detail & Related papers (2025-03-21T23:21:17Z) - Challenge on Sound Scene Synthesis: Evaluating Text-to-Audio Generation [8.170174172545831]
This paper addresses these issues through the Sound Scene Synthesis challenge held as part of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2024.
We present an evaluation protocol combining an objective metric, the Fréchet Audio Distance (FAD), with perceptual assessments, using a structured prompt format to enable diverse captions and effective evaluation (a sketch of the FAD computation appears after this list).
arXiv Detail & Related papers (2024-10-23T06:35:41Z) - Can Large Audio-Language Models Truly Hear? Tackling Hallucinations with Multi-Task Assessment and Stepwise Audio Reasoning [55.2480439325792]
Large audio-language models (LALMs) have shown impressive capabilities in understanding and reasoning about audio and speech information. However, these models still face challenges, including hallucinating non-existent sound events, misidentifying the order of sound events, and incorrectly attributing sound sources.
arXiv Detail & Related papers (2024-10-21T15:55:27Z) - An Investigation of Noise Robustness for Flow-Matching-Based Zero-Shot TTS [43.84833978193758]
Zero-shot text-to-speech (TTS) systems are capable of synthesizing any speaker's voice from a short audio prompt.
The quality of the generated speech significantly deteriorates when the audio prompt contains noise.
In this paper, we explore various strategies to enhance the quality of audio generated from noisy audio prompts.
arXiv Detail & Related papers (2024-06-09T08:51:50Z)
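As referenced in the Sound Scene Synthesis entry above: the Fréchet Audio Distance compares reference and generated audio by fitting a Gaussian to each set of embeddings and taking the Fréchet distance between the two. The sketch below is a standard formulation under that assumption, with the embedding model (e.g., VGGish) assumed to run upstream; it is not the challenge's official implementation.

```python
# Minimal sketch of the Fréchet Audio Distance (FAD) between two embedding
# sets:  FAD = ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^{1/2}).
import numpy as np
from scipy.linalg import sqrtm

def frechet_audio_distance(emb_ref: np.ndarray, emb_gen: np.ndarray) -> float:
    """emb_*: (n_samples, dim) arrays of audio embeddings."""
    mu_r, mu_g = emb_ref.mean(axis=0), emb_gen.mean(axis=0)
    cov_r = np.cov(emb_ref, rowvar=False)
    cov_g = np.cov(emb_gen, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # sqrtm can return tiny imaginary residue
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```

Lower values indicate that the generated audio's embedding statistics are closer to the reference distribution.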