MESH -- Understanding Videos Like Human: Measuring Hallucinations in Large Video Models
- URL: http://arxiv.org/abs/2509.08538v2
- Date: Thu, 11 Sep 2025 11:14:00 GMT
- Title: MESH -- Understanding Videos Like Human: Measuring Hallucinations in Large Video Models
- Authors: Garry Yang, Zizhe Chen, Man Hon Wong, Haoyu Lei, Yongqiang Chen, Zhenguo Li, Kaiwen Zhou, James Cheng
- Abstract summary: We introduce MESH, a benchmark designed to evaluate hallucinations in LVMs systematically. MESH uses a Question-Answering framework with binary and multi-choice formats incorporating target and trap instances. We demonstrate that MESH offers an effective and comprehensive approach for identifying hallucinations in videos.
- Score: 56.49314029765706
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Video Models (LVMs) build on the semantic capabilities of Large Language Models (LLMs) and vision modules by integrating temporal information to better understand dynamic video content. Despite their progress, LVMs are prone to hallucinations, producing inaccurate or irrelevant descriptions. Current benchmarks for video hallucination depend heavily on manual categorization of video content, neglecting the perception-based processes through which humans naturally interpret videos. We introduce MESH, a benchmark designed to evaluate hallucinations in LVMs systematically. MESH uses a Question-Answering framework with binary and multi-choice formats incorporating target and trap instances. It follows a bottom-up approach, evaluating basic objects, coarse-to-fine subject features, and subject-action pairs, aligning with human video understanding. We demonstrate that MESH offers an effective and comprehensive approach for identifying hallucinations in videos. Our evaluations show that while LVMs excel at recognizing basic objects and features, their susceptibility to hallucinations increases markedly when handling fine details or aligning multiple actions involving various subjects in longer videos.
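To make the target/trap design concrete, below is a minimal sketch of how a binary-format probe could be scored. The pairing rule used here (a model must accept the target question and reject the trap question on the same video) is an illustrative assumption, not a confirmed detail of MESH; the `BinaryProbe` fields and the stub model are likewise hypothetical.

```python
# Sketch of a MESH-style binary probe with target and trap instances.
# The paired scoring rule below is an assumption for illustration.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class BinaryProbe:
    video_id: str
    target_q: str  # asks about something actually present in the video
    trap_q: str    # asks about a plausible but absent distractor

def paired_accuracy(probes: List[BinaryProbe],
                    answer: Callable[[str, str], bool]) -> float:
    """Fraction of probes where the model answers 'yes' to the target
    question and 'no' to the trap question for the same video."""
    hits = sum(
        1 for p in probes
        if answer(p.video_id, p.target_q) and not answer(p.video_id, p.trap_q)
    )
    return hits / len(probes) if probes else 0.0

if __name__ == "__main__":
    probes = [BinaryProbe("v1", "Is a dog in the video?", "Is a cat in the video?")]
    # A stub model that always answers 'yes' falls into every trap.
    print(paired_accuracy(probes, lambda vid, q: True))  # -> 0.0
```

An always-affirmative model scores zero under this pairing, which is exactly the failure mode that trap instances are meant to expose.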
Related papers
- V-HUB: A Visual-Centric Humor Understanding Benchmark for Video LLMs [72.59885036868499]
v-HUB is a visual-centric video humor understanding benchmark. Each video clip is paired with rich annotations, including captions, descriptions, and explanations. We evaluate a diverse set of MLLMs, from specialized Video-LLMs to versatile OmniLLMs that can process audio.
arXiv Detail & Related papers (2025-09-30T04:33:52Z) - ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Long Video Understanding [61.526407756322264]
We introduce ELV-Halluc, the first benchmark dedicated to long-video hallucination. We find that models are more prone to semantic aggregation hallucinations (SAH) on rapidly changing semantics. We also achieve improvements on both ELV-Halluc and Video-MME.
arXiv Detail & Related papers (2025-08-29T10:25:03Z) - ARGUS: Hallucination and Omission Evaluation in Video-LLMs [86.73977434293973]
ARGUS is a VideoLLM benchmark that measures free-form video captioning performance. By comparing VideoLLM outputs to human ground-truth captions, ARGUS quantifies both hallucination and omission rates.
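As a rough illustration of the dual-metric idea, the sketch below scores a generated caption against a human reference for both unsupported claims (hallucinations) and uncovered reference facts (omissions). The sentence-level matcher is a placeholder; ARGUS's actual matching procedure is not reproduced here.

```python
# Toy dual-metric scoring: hallucination rate = generated claims with no
# support in the reference; omission rate = reference claims not covered
# by the generation. The `supports` matcher is a stand-in for a real
# entailment model.
from typing import Callable, List, Tuple

def dual_metrics(generated: List[str], reference: List[str],
                 supports: Callable[[str, str], bool]) -> Tuple[float, float]:
    hallucinated = sum(
        1 for g in generated if not any(supports(r, g) for r in reference)
    )
    omitted = sum(
        1 for r in reference if not any(supports(g, r) for g in generated)
    )
    return (hallucinated / len(generated) if generated else 0.0,
            omitted / len(reference) if reference else 0.0)

if __name__ == "__main__":
    exact = lambda a, b: a == b  # trivial matcher, demonstration only
    print(dual_metrics(["a man runs", "a dog barks"], ["a man runs"], exact))
    # -> (0.5, 0.0): one unsupported claim, nothing omitted
```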
arXiv Detail & Related papers (2025-06-09T02:42:13Z) - Video Summarization with Large Language Models [41.51242348081083]
We propose a new video summarization framework that leverages the capabilities of recent Large Language Models (LLMs). Our method, dubbed LLM-based Video Summarization (LLMVS), translates video frames into a sequence of captions using a Multi-modal Large Language Model (MLLM). Our experimental results demonstrate the superiority of the proposed method over existing ones on standard benchmarks.
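The two-stage pipeline lends itself to a short schematic: an MLLM captions each frame, then an LLM scores each frame's importance from the captions in a local window. Both model calls are stubbed below, and the window size and selection rule are assumptions rather than details taken from the paper.

```python
# Schematic of an LLMVS-style pipeline: caption frames with an MLLM,
# score each frame from its local caption window with an LLM, keep the
# top-scoring frames as the summary. Model calls are injected as stubs.
from typing import Callable, List

def summarize(frames: List[object],
              caption_fn: Callable[[object], str],          # stands in for the MLLM
              score_fn: Callable[[List[str], int], float],  # stands in for the LLM
              window: int = 2, keep: int = 3) -> List[int]:
    """Return indices of the `keep` highest-scoring frames."""
    captions = [caption_fn(f) for f in frames]
    scores = []
    for i in range(len(captions)):
        ctx = captions[max(0, i - window): i + window + 1]
        scores.append(score_fn(ctx, min(i, window)))  # center index in ctx
    ranked = sorted(range(len(frames)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])

if __name__ == "__main__":
    idx = summarize(list(range(6)),
                    caption_fn=lambda f: f"caption of frame {f}",
                    score_fn=lambda ctx, c: len(ctx),  # dummy importance score
                    keep=2)
    print(idx)  # frames with the largest caption context win under this stub
```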
arXiv Detail & Related papers (2025-04-15T13:56:14Z) - Towards a Systematic Evaluation of Hallucinations in Large-Vision Language Models [57.58426038241812]
Large Vision-Language Models (LVLMs) have demonstrated remarkable performance in complex multimodal tasks. These models still suffer from hallucinations when required to implicitly recognize or infer diverse visual entities from images. We propose a novel visual question answering (VQA) benchmark that employs contextual reasoning prompts as hallucination attacks.
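A toy rendering of the attack idea: wrap a question about an absent entity in context that presupposes its presence, so a hallucination-prone model answers as if the entity existed. The template wording here is invented for illustration; the benchmark's actual prompts are more elaborate.

```python
# Hypothetical 'hallucination attack' prompt: the question presupposes
# an object that is not in the image, inviting a fabricated answer.
def attack_prompt(absent_object: str, scene_hint: str) -> str:
    return (f"The image shows {scene_hint}. Focusing on the {absent_object} "
            f"in the scene, describe its color and position.")

print(attack_prompt("umbrella", "a sunny, empty beach"))
```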
arXiv Detail & Related papers (2024-12-29T23:56:01Z) - VidHalluc: Evaluating Temporal Hallucinations in Multimodal Large Language Models for Video Understanding [1.1834200163382398]
We introduce VidHalluc, the largest benchmark designed to examine hallucinations in MLLMs for video understanding. VidHalluc assesses hallucinations across three critical dimensions: (1) action, (2) temporal sequence, and (3) scene transition. We propose DINO-HEAL, a training-free method that reduces hallucinations by incorporating spatial saliency from DINOv2 to reweight visual features during inference.
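The saliency-reweighting step can be sketched in a few lines: normalize a patch-level saliency map and scale the corresponding visual features before they reach the language model. Real DINOv2 saliency extraction is replaced by an input array here, and the exact reweighting formula is an assumption, not the paper's.

```python
# Minimal sketch of saliency-based feature reweighting (DINO-HEAL-like).
# `saliency` stands in for a map derived from DINOv2 attention.
import numpy as np

def reweight_features(features: np.ndarray, saliency: np.ndarray) -> np.ndarray:
    """features: (num_patches, dim); saliency: (num_patches,)."""
    s = saliency - saliency.min()
    s = s / (s.max() + 1e-8)               # normalize to [0, 1]
    return features * (1.0 + s)[:, None]   # boost salient patches

feats = np.random.randn(16, 8).astype(np.float32)
sal = np.random.rand(16).astype(np.float32)
print(reweight_features(feats, sal).shape)  # (16, 8)
```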
arXiv Detail & Related papers (2024-12-04T22:03:19Z) - VidHal: Benchmarking Temporal Hallucinations in Vision LLMs [9.392258475822915]
Vision Large Language Models (VLLMs) are widely acknowledged to be prone to hallucinations. We introduce VidHal, a benchmark specially designed to evaluate video-based hallucinations involving temporal dynamics. A defining feature of our benchmark lies in the careful creation of captions which represent varying levels of hallucination associated with each video.
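Since each video carries captions at graded hallucination levels, one natural protocol is to ask the model to rank them and score the ranking against the gold order. The pairwise-agreement scorer below is my assumption about how such a ranking could be graded, not VidHal's published metric.

```python
# Score a model's caption ranking (best-to-worst) by pairwise agreement
# with the ground-truth ordering of hallucination levels.
from itertools import combinations
from typing import List

def pairwise_order_score(predicted: List[int], gold: List[int]) -> float:
    pos_pred = {c: i for i, c in enumerate(predicted)}
    pairs = list(combinations(gold, 2))  # gold is already best-to-worst
    agree = sum(1 for a, b in pairs if pos_pred[a] < pos_pred[b])
    return agree / len(pairs)

print(pairwise_order_score([2, 0, 1], [0, 1, 2]))  # -> 0.333...
```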
arXiv Detail & Related papers (2024-11-25T06:17:23Z) - VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models [59.05674402770661]
This work introduces VideoHallucer, the first comprehensive benchmark for hallucination detection in large video-language models (LVLMs).
VideoHallucer categorizes hallucinations into two main types: intrinsic and extrinsic, offering further subcategories for detailed analysis.
arXiv Detail & Related papers (2024-06-24T06:21:59Z) - Evaluating Object Hallucination in Large Vision-Language Models [122.40337582958453]
This work presents the first systematic study on object hallucination of large vision-language models (LVLMs).
We find that LVLMs tend to generate objects that are inconsistent with the target images in the descriptions.
We propose a polling-based query method called POPE to evaluate the object hallucination.
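POPE's polling idea compresses well into code: ask balanced yes/no questions about objects that are and are not present, then report accuracy together with the model's overall 'yes' ratio as a bias indicator. The question template below is illustrative; POPE additionally draws its negative objects with random, popular, and adversarial sampling strategies.

```python
# POPE-style polling: balanced yes/no object-presence questions.
from typing import Callable, List, Tuple

def pope_poll(present: List[str], absent: List[str],
              ask: Callable[[str], bool]) -> Tuple[float, float]:
    items = [(o, True) for o in present] + [(o, False) for o in absent]
    answers = [(ask(f"Is there a {o} in the image?"), label)
               for o, label in items]
    acc = sum(pred == label for pred, label in answers) / len(answers)
    yes_ratio = sum(pred for pred, _ in answers) / len(answers)
    return acc, yes_ratio

acc, yes = pope_poll(["dog", "car"], ["cat", "boat"], lambda q: True)
print(acc, yes)  # 0.5 1.0 -- an always-'yes' model is exposed
```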
arXiv Detail & Related papers (2023-05-17T16:34:01Z) - Thinking Hallucination for Video Captioning [0.76146285961466]
In video captioning, there are two kinds of hallucination: object and action hallucination.
We identify three main factors: (i) inadequate visual features extracted from pre-trained models, (ii) improper influences of source and target contexts during multi-modal fusion, and (iii) exposure bias in the training strategy.
Our method achieves state-of-the-art performance on the MSR-Video to Text (MSR-VTT) and the Microsoft Research Video Description Corpus (MSVD) datasets.
arXiv Detail & Related papers (2022-09-28T06:15:42Z)