Do Audio-Visual Large Language Models Really See and Hear?
Abstract Overview
This paper presents what the authors describe as the first mechanistic interpretability study of Audio-Visual Large Language Models (AVLLMs), analyzing how audio and visual representations evolve and fuse across transformer layers during caption generation. Using a curated evaluation set of 500 factual and counterfactual audio-visual samples sourced from AudioCaps, the authors demonstrate that audio understanding degrades by up to 56% when audio conflicts with vision, revealing a strong visual bias. Attention analysis shows that AVLLMs attend substantially to audio in early layers (40-50% of attention in layers 0-5), but this share drops to near zero in deeper layers while visual attention steadily increases. Logit-lens probing of intermediate representations reveals meaningful latent audio semantics that fail to surface in final text outputs, and causal attention knockout interventions show that blocking visual pathways in deep layers recovers approximately 50% relative audio performance. Comparison of output token distributions with the base vision-language model (Qwen2.5-VL) indicates that the AVLLM's generation remains heavily vision-driven, suggesting the bias stems from inherited training priors.
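The per-layer attention percentages quoted above can be read as the fraction of attention mass that a generated token places on the tokens of each modality. A minimal sketch of that bookkeeping follows, assuming the attention weights have already been extracted from the model and averaged over heads. The function names and the toy weights are hypothetical and are not taken from the paper.

```python
def modality_attention_share(attn_row, modality_of_token):
    """Sum the attention mass one query position gives to each modality.

    attn_row: attention weights over all key positions (assumed to sum to 1).
    modality_of_token: one label per key position, e.g. "audio", "video", "text".
    """
    if len(attn_row) != len(modality_of_token):
        raise ValueError("one modality label is needed per key position")
    shares = {}
    for weight, modality in zip(attn_row, modality_of_token):
        shares[modality] = shares.get(modality, 0.0) + weight
    return shares


def layerwise_share(attn_by_layer, modality_of_token, modality):
    """Return the share of attention placed on `modality` at every layer."""
    return [
        modality_attention_share(row, modality_of_token).get(modality, 0.0)
        for row in attn_by_layer
    ]


# Toy data (made up for illustration): 2 layers, 5 key positions.
labels = ["audio", "audio", "video", "video", "text"]
attn_by_layer = [
    [0.25, 0.25, 0.10, 0.10, 0.30],  # early layer: audio-heavy
    [0.00, 0.00, 0.40, 0.30, 0.30],  # deep layer: audio ignored
]
audio_curve = layerwise_share(attn_by_layer, labels, "audio")
video_curve = layerwise_share(attn_by_layer, labels, "video")
print(audio_curve, video_curve)
```

In a real analysis the same sum would presumably also be averaged over generated tokens and samples before plotting it against layer index.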
Novelty
The paper is presented as the first mechanistic interpretability analysis of AVLLMs. Its distinctive contribution is to combine attention pattern analysis, logit-lens probing of audio token representations, causal attention knockout interventions, and distributional comparison with the base vision-language model to systematically trace where audio information is encoded, preserved, and ultimately suppressed by visual priors during generation.
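Of the techniques listed above, attention knockout is the most mechanical: selected key positions are masked out before the softmax so that a query can no longer read from them. The sketch below shows that masking step for a single query in a single head, applied only from a chosen layer onward. It is an illustration under stated assumptions, not the authors' implementation: `run_layers`, `start_layer`, and the toy scores are hypothetical, and a real intervention would run inside the model's forward pass so that later layers are recomputed from the altered activations.

```python
import math

NEG_INF = float("-inf")


def softmax(scores):
    """Numerically stable softmax that tolerates -inf (blocked) entries."""
    finite = [s for s in scores if s != NEG_INF]
    if not finite:
        raise ValueError("every key position is blocked")
    peak = max(finite)
    exps = [0.0 if s == NEG_INF else math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def knockout_attention(scores, blocked_positions):
    """Attention weights for one query after blocking the given key positions."""
    masked = [
        NEG_INF if i in blocked_positions else s for i, s in enumerate(scores)
    ]
    return softmax(masked)


def run_layers(scores_by_layer, blocked_positions, start_layer):
    """Apply the knockout only in layers with index >= start_layer."""
    weights = []
    for layer, scores in enumerate(scores_by_layer):
        blocked = blocked_positions if layer >= start_layer else set()
        weights.append(knockout_attention(scores, blocked))
    return weights


# Toy example (made-up scores): block visual keys in the second layer only.
modalities = ["video", "video", "audio", "text"]
visual_positions = {i for i, m in enumerate(modalities) if m == "video"}
scores_by_layer = [[2.0, 1.0, 0.0, 1.0], [2.0, 1.0, 0.0, 1.0]]
weights = run_layers(scores_by_layer, visual_positions, start_layer=1)
print(weights)
```

With the visual keys blocked, the attention mass in the affected layer is redistributed over the remaining audio and text positions, which is the effect the intervention relies on.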
Results
The study reports that AVLLM audio understanding drops by up to 56% on counterfactual samples where audio and vision conflict. For Qwen2.5-Omni, latent audio understanding from probed internal representations reaches 61.4% even though audio caption fidelity on counterfactual samples is only 23%, and blocking visual pathways in deep layers recovers approximately 50% relative audio performance. The AVLLM's output distribution remains close to its base LVLM (KL divergence of 0.4), with 85.36% of audio-related tokens falling within the top three ranks of the vision-only model's predictions, supporting the conclusion that generation remains heavily vision-driven.
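The two distributional statistics in this section, a KL divergence between next-token distributions and the share of tokens that fall within the base model's top three ranks, can be sketched as follows. The summary does not state the direction of the KL divergence or how it is averaged, so the choice of KL(AVLLM || base) here is an assumption, and all function names and toy numbers are hypothetical.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions over the same vocabulary."""
    return sum(
        pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0
    )


def rank_of(token_id, probs):
    """1-based rank of token_id under probs (rank 1 = most probable)."""
    target = probs[token_id]
    return 1 + sum(1 for x in probs if x > target)


def top_k_coverage(token_ids, base_dists, k=3):
    """Fraction of tokens whose rank under the base model's distribution is <= k."""
    hits = sum(
        1 for tok, dist in zip(token_ids, base_dists) if rank_of(tok, dist) <= k
    )
    return hits / len(token_ids)


# Toy example (made-up numbers): two generated tokens, vocabulary of size 5.
generated = [0, 3]
base_dists = [
    [0.40, 0.30, 0.20, 0.05, 0.05],  # token 0 is rank 1 under the base model
    [0.40, 0.30, 0.20, 0.06, 0.04],  # token 3 is rank 4 under the base model
]
coverage = top_k_coverage(generated, base_dists, k=3)
divergence = kl_divergence([0.5, 0.5], [0.9, 0.1])
print(coverage, divergence)
```

A coverage value near 1 means the base model alone would have proposed most of the tokens the AVLLM emitted, which is how the 85.36% figure is interpreted above.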
Key Points
- AVLLMs allocate high attention to audio tokens in early layers (40-50% in layers 0-5), but this attention drops to near-zero in deeper layers while visual token attention steadily increases to 20-40% in layers 15-30, creating a systematic cross-modal asymmetry.
- Logit-lens probing reveals that intermediate audio representations decode into meaningful sound-related concepts (e.g., sound sources and events), achieving 61.4% latent audio understanding even when final caption audio fidelity is only 23%, indicating the failure lies in generation rather than representation.
- Comparison with the base vision-language model (Qwen2.5-VL) shows that 85.36% of audio-related tokens generated by the AVLLM fall within the top three predictions of the vision-only model, suggesting the observed visual dominance likely originates from inherited training priors or vision-heavy alignment data.
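The logit-lens probing referred to in the key points decodes an intermediate hidden state directly through the model's output (unembedding) matrix and inspects the highest-scoring tokens. A minimal sketch, assuming a tiny hand-written vocabulary and unembedding matrix; the names and numbers are hypothetical, and the final normalization layer that is usually applied before unembedding is omitted for brevity.

```python
def logit_lens(hidden, unembed, vocab, k=2):
    """Project a hidden state onto the vocabulary and return the top-k tokens.

    hidden: hidden state of one token position at some intermediate layer.
    unembed: one row of weights per vocabulary entry (vocab_size x hidden_dim).
    vocab: token strings aligned with the rows of `unembed`.
    """
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in unembed]
    ranked = sorted(range(len(vocab)), key=lambda i: logits[i], reverse=True)
    return [vocab[i] for i in ranked[:k]]


# Toy setup (made up for illustration): 4-token vocabulary, 2-dim hidden states.
vocab = ["dog", "bark", "car", "engine"]
unembed = [
    [1.0, 0.0],  # dog
    [0.9, 0.1],  # bark
    [0.0, 1.0],  # car
    [0.1, 0.9],  # engine
]
audio_token_hidden = [2.0, 0.1]  # hypothetical hidden state at an audio position
top_tokens = logit_lens(audio_token_hidden, unembed, vocab, k=2)
print(top_tokens)  # ['dog', 'bark']
```

If the decoded tokens at audio positions name the correct sound source or event while the final caption does not, the information is present in the representation but lost during generation, which is the gap between the 61.4% and 23% figures.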