Exploring Audio Hallucination in Egocentric Video Understanding
Abstract Overview
This paper studies audio hallucination in egocentric video understanding, where audio-visual language models (AV-LLMs) describe sounds that are visually implied but not actually present in the audio. The authors introduce a systematic evaluation framework based on targeted question answering, using a curated benchmark of 300 Ego4D video clips and 1,000 manually reviewed sound-focused Q/A pairs. Their framework distinguishes foreground action sounds produced by the camera wearer from background ambient sounds, enabling fine-grained analysis of hallucination behavior. Experiments on four AV-LLMs show that these models often rely on visual context rather than grounding their responses in the audio signal, with hallucination detection accuracy substantially lower than factual Q/A accuracy across all models tested.
Novelty
The paper presents the first taxonomy-driven evaluation specifically focused on audio hallucinations in egocentric videos. Its main distinctive contribution is a source-grounded benchmark and Q/A protocol that separates foreground action sounds from background ambient sounds while explicitly testing for plausible but absent audio events through dedicated hallucination Q/A pairs.
Results
Across four AV-LLMs, hallucination detection accuracy is substantially lower than factual Q/A accuracy. Qwen2.5 Omni, the strongest model tested, achieves only 27.3% and 39.5% accuracy on foreground and background hallucination Q/A, respectively, against factual accuracies of 56.2% and 63.4% on the same two categories. Qualitative analysis identifies two recurring failure modes: imprecise grounding of sounds that are actually present, and cross-modal hallucination, in which visual context leads models to fabricate plausible but nonexistent sound sources.
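To make the size of the gap concrete, the short sketch below subtracts the hallucination accuracy from the factual accuracy for each sound category. The numbers are the Qwen2.5 Omni figures quoted above; the dictionary layout and the `accuracy_gap` helper are illustrative names, not something taken from the paper.

```python
# Accuracies (%) for Qwen2.5 Omni, as quoted in the Results text.
# The dictionary layout is an illustrative assumption.
reported = {
    "foreground": {"factual": 56.2, "hallucination": 27.3},
    "background": {"factual": 63.4, "hallucination": 39.5},
}


def accuracy_gap(scores):
    """Return factual minus hallucination accuracy, in percentage points."""
    return round(scores["factual"] - scores["hallucination"], 1)


gaps = {category: accuracy_gap(scores) for category, scores in reported.items()}

for category, gap in gaps.items():
    print(f"{category}: hallucination accuracy is {gap} points below factual accuracy")
```

Under these figures the drop is 28.9 points for foreground sounds and 23.9 points for background sounds.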
Key Points
- The authors build a benchmark from 300 egocentric Ego4D clips and 1,000 manually reviewed sound-focused Q/A pairs, using a two-stage pipeline (sound-source matching followed by Q/A generation) to evaluate audio hallucination systematically.
- They propose a grounded taxonomy that separates foreground user-generated action sounds from background ambient sounds, enabling finer analysis of model errors and revealing that background sounds consistently yield higher accuracy than foreground sounds across all models.
- Experiments on four AV-LLMs show that even the best-performing model (Qwen2.5 Omni) exhibits high hallucination rates, with all models performing substantially worse on hallucination detection than on factual Q/A, indicating strong reliance on visual priors rather than the actual audio track.
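The breakdown described in these points (foreground vs. background sounds, factual vs. hallucination questions) can be sketched as a simple grouped accuracy computation. This is a minimal illustration, not the authors' evaluation code: the `QAResult` record, its field names, the exact-match scoring rule, and the toy answers are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class QAResult:
    """One scored Q/A pair (hypothetical record layout)."""
    sound_type: str      # "foreground" or "background"
    question_type: str   # "factual" or "hallucination"
    expected: str        # reference answer
    predicted: str       # model answer


def accuracy_by_group(results):
    """Accuracy per (sound_type, question_type), using case-insensitive exact match.

    Exact match is an assumed scoring rule chosen to keep the sketch simple.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in results:
        key = (r.sound_type, r.question_type)
        total[key] += 1
        if r.predicted.strip().lower() == r.expected.strip().lower():
            correct[key] += 1
    return {key: correct[key] / total[key] for key in total}


# Toy, made-up answers purely to exercise the function; not benchmark data.
toy_results = [
    QAResult("foreground", "factual", "yes", "yes"),
    QAResult("foreground", "hallucination", "no", "yes"),  # model affirms an absent sound
    QAResult("foreground", "hallucination", "no", "no"),
    QAResult("background", "factual", "yes", "yes"),
    QAResult("background", "hallucination", "no", "No"),
]

scores = accuracy_by_group(toy_results)
for (sound_type, question_type), acc in sorted(scores.items()):
    print(f"{sound_type:10s} {question_type:13s} accuracy = {acc:.2f}")
```

Keeping the two axes as a single composite key makes it easy to compare hallucination accuracy against factual accuracy within the same sound category, which is the comparison the summary emphasizes.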