FuguReport

Exploring Audio Hallucination in Egocentric Video Understanding

Authors Ashish Seth, Xinhao Mei, Changsheng Zhao, Varun Nagaraja, Ernie Chang, Gregory P. Meyer, Gael Le Lan, Yunyang Xiong, Vikas Chandra, Yangyang Shi, Dinesh Manocha, Zhipeng Cai
Affiliations Meta / University of Maryland
Categories Task / Audio-Visual Understanding / Audio inference from video, Method / Multimodal Language Modeling / Audio-visual language model techniques, Evaluation / Model Behavior Analysis / Hallucination in audio generation
License CC BY 4.0

Abstract Overview

This paper studies audio hallucination in egocentric video understanding, where audio-visual language models (AV-LLMs) describe sounds that are visually implied but not actually present in the audio. The authors introduce a systematic evaluation framework based on targeted question answering, using a curated benchmark of 300 Ego4D video clips and 1,000 manually reviewed sound-focused Q/A pairs. Their framework distinguishes foreground action sounds produced by the camera wearer from background ambient sounds, enabling fine-grained analysis of hallucination behavior. Experiments on four AV-LLMs show that these models often rely on visual context rather than grounding their responses in the audio signal, with hallucination detection accuracy substantially lower than factual Q/A accuracy across all models tested.
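To make the evaluation setup concrete, here is a minimal sketch of how a sound-focused Q/A item and its per-category scoring might be represented. It is not the authors' code. The `QAItem` fields, the yes/no answer format, the example questions, and the `accuracy_by_bucket` helper are all assumptions introduced for illustration; the summary only states that Q/A pairs are split by sound source (foreground vs. background) and by whether they probe real or absent sounds.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class QAItem:
    """One sound-focused question (hypothetical schema, not from the paper)."""
    clip_id: str
    question: str
    answer: str   # ground-truth answer; assumed here to be "yes" or "no"
    source: str   # "foreground" (camera-wearer action) or "background" (ambient)
    kind: str     # "factual" (sound is present) or "hallucination" (sound is absent)


def accuracy_by_bucket(items, predictions):
    """Return accuracy for each (source, kind) bucket."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item, pred in zip(items, predictions):
        key = (item.source, item.kind)
        total[key] += 1
        # Normalize the model output before comparing with the ground truth.
        if pred.strip().lower() == item.answer:
            correct[key] += 1
    return {key: correct[key] / total[key] for key in total}


# Toy data, invented purely for illustration.
items = [
    QAItem("clip_a", "Is the sound of chopping audible?", "yes", "foreground", "factual"),
    QAItem("clip_a", "Is the sound of running water audible?", "no", "foreground", "hallucination"),
    QAItem("clip_b", "Is traffic noise audible?", "yes", "background", "factual"),
    QAItem("clip_b", "Is birdsong audible?", "no", "background", "hallucination"),
]
predictions = ["yes", "yes", "Yes", "no"]  # hypothetical model outputs

scores = accuracy_by_bucket(items, predictions)
for bucket, acc in sorted(scores.items()):
    print(bucket, acc)
```

In this toy run the model answers "yes" to the absent running-water sound, so the foreground hallucination bucket scores 0.0 while the other buckets score 1.0. Reporting the four buckets separately is what lets hallucination accuracy be compared directly with factual accuracy.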

Novelty

The paper presents the first taxonomy-driven evaluation specifically focused on audio hallucinations in egocentric videos. Its main distinctive contribution is a source-grounded benchmark and Q/A protocol that separates foreground action sounds from background ambient sounds while explicitly testing for plausible but absent audio events through dedicated hallucination Q/A pairs.

Results

Across four AV-LLMs, hallucination detection accuracy is substantially lower than factual Q/A accuracy. Qwen2.5 Omni, the strongest model tested, achieves only 27.3% and 39.5% accuracy on foreground and background hallucination Q/A, respectively, despite higher factual accuracies of 56.2% and 63.4%. Qualitative analysis identifies two recurring failure modes: imprecise grounding of sounds that are actually present, and cross-modal hallucination, in which visual context leads models to fabricate plausible but nonexistent sound sources.
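The gap between factual and hallucination accuracy for Qwen2.5 Omni can be worked out directly from the four figures reported above. The short sketch below does only that arithmetic. The dictionary layout and variable names are illustrative, and pairing the figures in "foreground, background" order is an assumption based on the word "respectively" in the text.

```python
# Accuracies (%) reported for Qwen2.5 Omni in the summary above.
reported = {
    "foreground": {"factual": 56.2, "hallucination": 27.3},
    "background": {"factual": 63.4, "hallucination": 39.5},
}

# Percentage-point drop from factual Q/A to hallucination Q/A, per sound source.
gaps = {
    source: round(acc["factual"] - acc["hallucination"], 1)
    for source, acc in reported.items()
}

for source, gap in gaps.items():
    print(f"{source}: {gap} percentage points")
```

The drop is 28.9 percentage points for foreground sounds and 23.9 for background sounds, so under this reading the shortfall on absent-sound questions is somewhat larger for sounds attributed to the camera wearer's own actions.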

Key Points

  1. The authors build a benchmark from 300 egocentric Ego4D clips and 1,000 manually reviewed sound-focused Q/A pairs, using a two-stage pipeline (sound-source matching followed by Q/A generation) to evaluate audio hallucination systematically.
  2. They propose a grounded taxonomy that separates foreground user-generated action sounds from background ambient sounds, enabling finer analysis of model errors and revealing that background sounds consistently yield higher accuracy than foreground sounds across all models.
  3. Experiments on four AV-LLMs show that even the best-performing model (Qwen2.5 Omni) exhibits high hallucination rates, with all models performing substantially worse on hallucination detection than on factual Q/A, indicating strong reliance on visual priors rather than the actual audio track.
