EGOILLUSION: Benchmarking Hallucinations in Egocentric Video Understanding
- URL: http://arxiv.org/abs/2508.12687v2
- Date: Sat, 23 Aug 2025 08:06:02 GMT
- Title: EGOILLUSION: Benchmarking Hallucinations in Egocentric Video Understanding
- Authors: Ashish Seth, Utkarsh Tyagi, Ramaneswaran Selvakumar, Nishit Anand, Sonal Kumar, Sreyan Ghosh, Ramani Duraiswami, Chirag Agarwal, Dinesh Manocha
- Abstract summary: EgoIllusion is the first benchmark to evaluate hallucinations of MLLMs in egocentric videos. EgoIllusion comprises 1,400 videos paired with 8,000 human-annotated open- and closed-ended questions. Evaluations across ten MLLMs reveal significant challenges, with even powerful models like GPT-4o and Gemini achieving only 59% accuracy.
- Score: 80.64794443484698
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal Large Language Models (MLLMs) have demonstrated remarkable performance on complex multimodal tasks. While MLLMs excel at visual perception and reasoning in third-person and egocentric videos, they are prone to hallucinations, generating coherent yet inaccurate responses. We present EgoIllusion, the first benchmark to evaluate MLLM hallucinations in egocentric videos. EgoIllusion comprises 1,400 videos paired with 8,000 human-annotated open- and closed-ended questions designed to trigger hallucinations grounded in both visual and auditory cues in egocentric videos. Evaluations across ten MLLMs reveal significant challenges, with even powerful models like GPT-4o and Gemini achieving only 59% accuracy. EgoIllusion lays the foundation for developing robust benchmarks that evaluate MLLM hallucination and spurs the development of better egocentric MLLMs with reduced hallucination rates. Our benchmark will be open-sourced for reproducibility.
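The abstract reports a single accuracy number over the benchmark's closed-ended questions. As a rough illustration of how such a score could be computed, here is a minimal evaluation-loop sketch; the file format, the field names (`video`, `question`, `answer`), and the `query_model` callable are hypothetical placeholders, since the benchmark's actual schema and protocol are not described here.

```python
# Minimal sketch of a closed-ended (yes/no) hallucination evaluation loop.
# All file/field names and `query_model` are hypothetical placeholders;
# EgoIllusion's real format and protocol may differ.
import json

def normalize(answer: str) -> str:
    """Collapse a free-form model response to a yes/no label."""
    return "yes" if answer.strip().lower().startswith("yes") else "no"

def evaluate(benchmark_path: str, query_model) -> float:
    """Return accuracy over closed-ended questions."""
    with open(benchmark_path) as f:
        samples = json.load(f)  # assumed: list of {video, question, answer}
    correct = sum(
        normalize(query_model(video=s["video"], question=s["question"]))
        == s["answer"].strip().lower()
        for s in samples
    )
    return correct / len(samples)
```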
Related papers
- EgoSound: Benchmarking Sound Understanding in Egocentric Videos [68.1897133235638]
We introduce EgoSound, the first benchmark designed to evaluate egocentric sound understanding in MLLMs. EgoSound unifies data from Ego4D and EgoBlind, encompassing both sighted and sound-dependent experiences. It defines a seven-task taxonomy spanning intrinsic sound perception, spatial localization, causal inference, and cross-modal reasoning.
arXiv Detail & Related papers (2026-02-15T12:46:35Z)
- Exploring Hallucination of Large Multimodal Models in Video Understanding: Benchmark, Analysis and Mitigation [49.885797244626694]
Hallucination in large multimodal models (LMMs) yields responses that appear correct but are actually incorrect. This paper studies the hallucination problem of LMMs in the video modality, which is dynamic and more challenging than static modalities like images and text.
arXiv Detail & Related papers (2025-03-25T13:12:17Z)
- Exo2Ego: Exocentric Knowledge Guided MLLM for Egocentric Video Understanding [69.96199605596138]
Current MLLMs primarily focus on third-person (exocentric) vision, overlooking the unique aspects of first-person (egocentric) videos. We propose learning the mapping between exocentric and egocentric domains to enhance egocentric video understanding. We introduce Ego-ExoClip, a pre-training dataset comprising 1.1M synchronized ego-exo clip-text pairs.
arXiv Detail & Related papers (2025-03-12T08:10:33Z)
- EgoBlind: Towards Egocentric Visual Assistance for the Blind [69.6161191190939]
EgoBlind is the first egocentric VideoQA dataset collected from blind individuals. It comprises 1,392 videos that record the daily lives of real blind users from a first-person perspective. It also features 5,311 questions directly posed or generated by blind individuals to reflect their in-situation needs for visual assistance.
arXiv Detail & Related papers (2025-03-11T09:40:31Z)
- VidHalluc: Evaluating Temporal Hallucinations in Multimodal Large Language Models for Video Understanding [1.1834200163382398]
We introduce VidHalluc, the largest benchmark designed to examine hallucinations in MLLMs for video understanding. VidHalluc assesses hallucinations across three critical dimensions: (1) action, (2) temporal sequence, and (3) scene transition. We propose DINO-HEAL, a training-free method that reduces hallucinations by incorporating spatial saliency from DINOv2 to reweight visual features during inference.
arXiv Detail & Related papers (2024-12-04T22:03:19Z)
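The VidHalluc entry describes DINO-HEAL only at a high level: spatial saliency from DINOv2 is used to reweight visual features at inference time, with no training. Below is a minimal sketch of that general idea, assuming per-patch saliency scores (e.g., from DINOv2 CLS-to-patch attention) are already available; the paper's exact formulation may differ.

```python
# Sketch of saliency-based reweighting of visual patch features, in the
# spirit of DINO-HEAL as summarized above. The weighting scheme here is
# illustrative, not the paper's exact method.
import torch

def reweight_patch_features(patch_feats: torch.Tensor,
                            saliency: torch.Tensor) -> torch.Tensor:
    """
    patch_feats: (num_patches, dim) visual features fed to the MLLM.
    saliency:    (num_patches,) nonnegative scores, assumed to come from
                 DINOv2 attention over the same patch grid.
    """
    # Normalize to mean 1 so the overall feature magnitude is preserved,
    # while salient patches are amplified and background is suppressed.
    weights = saliency / (saliency.mean() + 1e-6)
    return patch_feats * weights.unsqueeze(-1)
```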
- Explore the Hallucination on Low-level Perception for MLLMs [83.12180878559295]
We aim to define and evaluate the self-awareness of MLLMs in low-level visual perception and understanding tasks.
We present QL-Bench, a benchmark setting to simulate human responses to low-level vision.
We demonstrate that while some models exhibit robust low-level visual capabilities, their self-awareness remains relatively underdeveloped.
arXiv Detail & Related papers (2024-09-15T14:38:29Z)
- Look Within, Why LLMs Hallucinate: A Causal Perspective [16.874588396996764]
Large language models (LLMs) are a milestone in generative artificial intelligence, achieving significant success in text comprehension and generation tasks.
LLMs suffer from severe hallucination problems, posing significant challenges to their practical application.
We propose a method that intervenes in LLMs' self-attention layers while keeping their structures and sizes intact.
arXiv Detail & Related papers (2024-07-14T10:47:44Z)
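The summary says only that the intervention edits self-attention layers without changing their structure or size. One standard way to do this in PyTorch is with forward hooks; the sketch below applies an illustrative uniform scaling to attention outputs in models built from `nn.MultiheadAttention` (the paper's actual intervention and target modules are assumptions here).

```python
# Sketch: perturb self-attention outputs via forward hooks, leaving the
# modules' structures and sizes intact. The uniform `scale` is purely
# illustrative; the paper's specific intervention is not given above.
import torch.nn as nn

def add_attention_interventions(model: nn.Module, scale: float = 0.9):
    """Register hooks that scale every nn.MultiheadAttention output."""
    def hook(module, inputs, output):
        attn_out, attn_weights = output  # MultiheadAttention returns a pair
        return attn_out * scale, attn_weights

    handles = [
        m.register_forward_hook(hook)
        for m in model.modules()
        if isinstance(m, nn.MultiheadAttention)
    ]
    return handles  # call h.remove() on each handle to undo
```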
- VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models [59.05674402770661]
This work introduces VideoHallucer, the first comprehensive benchmark for hallucination detection in large video-language models (LVLMs).
VideoHallucer categorizes hallucinations into two main types: intrinsic and extrinsic, offering further subcategories for detailed analysis.
arXiv Detail & Related papers (2024-06-24T06:21:59Z)