Related papers: EventHallusion: Diagnosing Event Hallucinations in Video LLMs

URL: http://arxiv.org/abs/2409.16597v2
Date: Fri, 03 Jan 2025 10:57:17 GMT
Title: EventHallusion: Diagnosing Event Hallucinations in Video LLMs
Authors: Jiacheng Zhang, Yang Jiao, Shaoxiang Chen, Na Zhao, Jingjing Chen,
Abstract summary: Multimodal Large Language Models (MLLMs) have made significant progress in the video comprehension field. We propose EventHallusion, a novel benchmark that focuses on assessing VideoLLMs' hallucination toward events. We also propose a simple yet effective method, called Temporal Contrastive Decoding (TCD), to tackle the hallucination problems of VideoLLMs.
Score: 39.65906480963502
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recently, Multimodal Large Language Models (MLLMs) have made significant progress in the video comprehension field. Despite remarkable content reasoning and instruction following capabilities they demonstrated, the hallucination problem of these VideoLLMs is less explored compared with its counterpart in the image domain. To mitigate this gap, we propose EventHallusion, a novel benchmark that focuses on assessing the VideoLLMs' hallucination toward event, the crux of video analysis. From a hallucination attribution perspective, our EventHallusion benchmark is curated to assess a VideoLLM's susceptibility toward language priors and vision-language biases. On the other hand, we also propose a simple yet effective method, called Temporal Contrastive Decoding (TCD), to tackle the hallucination problems of VideoLLMs. The proposed TCD method rectifies the model's bias toward its priors during the decoding stage by comparing the original video with a modified version, in which temporal cues are disrupted. Through comprehensive evaluation of eight open-source and two closed-source VideoLLMs on the proposed EventHallusion benchmark, we observe that the open-source models suffer significantly from hallucination problems, whereas the closed-source ones perform markedly better. By further equipping open-source VideoLLMs with the proposed TCD approach, evident performance improvements are achieved across most metrics in the EventHallusion benchmark. Our codes and benchmark data are available at https://github.com/Stevetich/EventHallusion.
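
The abstract describes TCD only at a high level: next-token predictions for the original video are compared with predictions for a temporally disrupted copy. The sketch below illustrates that idea using the combination rule and plausibility mask commonly used in visual contrastive decoding; whether TCD uses exactly this rule is an assumption, and the function names, `alpha`, `beta`, and the toy logits are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_logits(logits_original, logits_disrupted, alpha=1.0, beta=0.1):
    """Combine two sets of next-token logits (assumed VCD-style rule).

    logits_original:  logits conditioned on the unmodified video.
    logits_disrupted: logits conditioned on a temporally disrupted video,
                      which should mostly reflect the model's priors.
    alpha: contrast strength (hypothetical default).
    beta:  plausibility cutoff relative to the top original probability.
    """
    probs = softmax(logits_original)
    cutoff = beta * max(probs)
    combined = []
    for lo, ld, p in zip(logits_original, logits_disrupted, probs):
        if p < cutoff:
            # Implausible under the original input: never promote it.
            combined.append(float("-inf"))
        else:
            # Amplify what the intact video supports, subtract the prior.
            combined.append((1 + alpha) * lo - alpha * ld)
    return combined


def greedy_pick(logits):
    """Index of the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Toy three-token vocabulary; the numbers are made up for illustration.
original = [2.0, 2.2, -3.0]
disrupted = [0.5, 2.4, -3.0]
plain_choice = greedy_pick(original)
contrastive_choice = greedy_pick(contrastive_logits(original, disrupted))
```

In this toy case the second token stays likely even when temporal cues are destroyed, so it is treated as prior-driven and the contrast shifts the choice to the first token, whose score depends on the intact video.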

Related papers

SEASON: Mitigating Temporal Hallucination in Video Large Language Models via Self-Diagnostic Contrastive Decoding [30.820850789099932]
We propose a training-free method that adaptively enhances temporal and spatial faithfulness for each output token. SEASON outperforms all existing training-free hallucination mitigation approaches on three hallucination examination benchmarks.
arXiv Detail & Related papers (2025-12-04T10:17:20Z)
Alternating Perception-Reasoning for Hallucination-Resistant Video Understanding [35.20942192333083]
We introduce a new framework that integrates a loop-based paradigm with an anti-hallucination reward. Instead of describing the whole video at once, each loop requires the model to describe a video segment with precise timestamps. To address the risk of hallucinations, the Factual-Aware Evaluator scores each perception result, providing a reliable anti-hallucination reward.
arXiv Detail & Related papers (2025-11-23T14:14:14Z)
NOAH: Benchmarking Narrative Prior driven Hallucination and Omission in Video Large Language Models [8.6767620170781]
Video large language models (Video LLMs) have recently achieved strong performance on tasks such as captioning, summarization, and question answering. Many models and training methods explicitly encourage continuity across events to enhance narrative coherence. We identify this bias, which we call narrative prior, as a key driver of two errors: hallucinations, where non-existent events are introduced or existing ones are misinterpreted, and omissions, where factual events are suppressed because they are misaligned with surrounding context.
arXiv Detail & Related papers (2025-11-09T17:41:11Z)
ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Long Video Understanding [61.526407756322264]
We introduce ELV-Halluc, the first benchmark dedicated to long-video hallucination. We find that models are more prone to semantic aggregation hallucination (SAH) on rapidly changing semantics. We also achieve improvements on both ELV-Halluc and Video-MME.
arXiv Detail & Related papers (2025-08-29T10:25:03Z)
ASCD: Attention-Steerable Contrastive Decoding for Reducing Hallucination in MLLM [12.091189146069198]
Multimodal Large Language Models (MLLMs) often suffer from hallucinations. They over-rely on partial cues and generate incorrect responses. Recent methods like Visual Contrastive Decoding (VCD) and Instruction Contrastive Decoding (ICD) have been proposed to mitigate hallucinations.
arXiv Detail & Related papers (2025-06-17T17:58:11Z)
ARGUS: Hallucination and Omission Evaluation in Video-LLMs [86.73977434293973]
ARGUS is a VideoLLM benchmark that measures freeform video captioning performance. By comparing VideoLLM outputs to human ground truth captions, ARGUS quantifies two metrics: hallucination and omission.
arXiv Detail & Related papers (2025-06-09T02:42:13Z)
Mitigating Hallucination in VideoLLMs via Temporal-Aware Activation Engineering [83.63437999696954]
Hallucination in multimodal large language models (MLLMs) persists as a significant and under-addressed challenge in the video domain. We propose a temporal-aware activation engineering framework for VideoLLMs, which adaptively identifies and manipulates hallucination-sensitive modules.
arXiv Detail & Related papers (2025-05-19T08:12:06Z)
Generate, but Verify: Reducing Hallucination in Vision-Language Models with Retrospective Resampling [67.14942827452161]
Vision-Language Models (VLMs) excel at visual understanding but often suffer from visual hallucinations. In this work, we introduce REVERSE, a unified framework that integrates hallucination-aware training with on-the-fly self-verification.
arXiv Detail & Related papers (2025-04-17T17:59:22Z)
Exploring Hallucination of Large Multimodal Models in Video Understanding: Benchmark, Analysis and Mitigation [49.885797244626694]
Hallucination in large multimodal models (LMMs) refers to responses that appear correct but are actually incorrect. This paper aims to study the hallucination problem of LMMs in the video modality, which is dynamic and more challenging than static modalities like images and text.
arXiv Detail & Related papers (2025-03-25T13:12:17Z)
Mitigating Hallucination for Large Vision Language Model by Inter-Modality Correlation Calibration Decoding [66.06337890279839]
Large vision-language models (LVLMs) have shown remarkable capabilities in visual-language understanding for downstream multi-modal tasks. LVLMs still suffer from generating hallucinations in complex generation tasks, leading to inconsistencies between visual inputs and generated content. We propose an Inter-Modality Correlation Calibration Decoding (IMCCD) method to mitigate hallucinations in LVLMs in a training-free manner.
arXiv Detail & Related papers (2025-01-03T17:56:28Z)
VidHal: Benchmarking Temporal Hallucinations in Vision LLMs [9.392258475822915]
We introduce VidHal, a benchmark specially designed to evaluate video-based hallucinations. A defining feature of VidHal is the careful creation of captions which represent varying levels of hallucination associated with each video. We propose a novel caption ordering task requiring VLLMs to rank captions by hallucinatory extent.
arXiv Detail & Related papers (2024-11-25T06:17:23Z)
VaLiD: Mitigating the Hallucination of Large Vision Language Models by Visual Layer Fusion Contrastive Decoding [38.23310445372371]
Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities in multimodal task reasoning. They often generate responses that appear plausible yet do not accurately reflect the visual content, a phenomenon known as hallucination. Recent approaches have introduced training-free methods to mitigate hallucinations by adjusting the decoding strategy during the inference stage. We propose a novel hallucination-mitigation method from the visual encoding perspective: Visual Layer Fusion Contrastive Decoding (VaLiD).
arXiv Detail & Related papers (2024-11-24T13:42:02Z)
MECD: Unlocking Multi-Event Causal Discovery in Video Reasoning [23.928977574352796]
We introduce a new task and dataset, Multi-Event Causal Discovery (MECD). It aims to uncover the causal relationships between events distributed chronologically across long videos. We devise a novel framework inspired by the Granger Causality method, using an efficient mask-based event prediction model.
arXiv Detail & Related papers (2024-09-26T08:51:29Z)
Lower Layer Matters: Alleviating Hallucination via Multi-Layer Fusion Contrastive Decoding with Truthfulness Refocused [44.37155553647802]
Large Language Models (LLMs) have demonstrated exceptional performance across various natural language processing tasks. They occasionally yield content that is factually inaccurate or discordant with the expected output. Recent works have investigated contrastive decoding between the original model and an amateur model with induced hallucination. We introduce a novel contrastive decoding framework termed LOL (LOwer Layer Matters).
arXiv Detail & Related papers (2024-08-16T14:23:59Z)
VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models [59.05674402770661]
This work introduces VideoHallucer, the first comprehensive benchmark for hallucination detection in large video-language models (LVLMs). VideoHallucer categorizes hallucinations into two main types: intrinsic and extrinsic, offering further subcategories for detailed analysis.
arXiv Detail & Related papers (2024-06-24T06:21:59Z)
VANE-Bench: Video Anomaly Evaluation Benchmark for Conversational LMMs [64.60035916955837]
VANE-Bench is a benchmark designed to assess the proficiency of Video-LMMs in detecting anomalies and inconsistencies in videos. Our dataset comprises an array of videos synthetically generated using existing state-of-the-art text-to-video generation models. We evaluate nine existing Video-LMMs, both open- and closed-source, on this benchmarking task and find that most of the models encounter difficulties in effectively identifying the subtle anomalies.
arXiv Detail & Related papers (2024-06-14T17:59:01Z)
FMM-Attack: A Flow-based Multi-modal Adversarial Attack on Video-based LLMs [57.59518049930211]
We propose the first adversarial attack tailored for video-based large language models (LLMs). Our attack can effectively induce video-based LLMs to generate incorrect answers when imperceptible adversarial perturbations are added to videos. Our FMM-Attack can also induce garbling in the model output, prompting video-based LLMs to hallucinate.
arXiv Detail & Related papers (2024-03-20T11:05:07Z)
Temporal Insight Enhancement: Mitigating Temporal Hallucination in Multimodal Large Language Models [20.33971942003996]
This study introduces an innovative method to address event-level hallucinations in MLLMs. We propose a unique mechanism that decomposes on-demand event queries into iconic actions. We employ models like CLIP and BLIP2 to predict specific timestamps for event occurrences.
arXiv Detail & Related papers (2024-01-18T10:18:48Z)
Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding [125.05295513481035]
We introduce Visual Contrastive Decoding (VCD), a simple and training-free method that contrasts output distributions derived from original and distorted visual inputs. The proposed VCD effectively reduces the over-reliance on statistical bias and unimodal priors, two essential causes of object hallucinations. Our experiments show that VCD, without either additional training or the usage of external tools, significantly mitigates the object hallucination issue across different LVLM families.
arXiv Detail & Related papers (2023-11-28T16:26:35Z)
Towards Video Anomaly Retrieval from Video Anomaly Detection: New Benchmarks and Model [70.97446870672069]
Video anomaly detection (VAD) has received increasing attention due to its potential applications. Video Anomaly Retrieval (VAR) aims to pragmatically retrieve relevant anomalous videos across modalities. We present two benchmarks, UCFCrime-AR and XD-Violence, constructed on top of prevalent anomaly datasets.
arXiv Detail & Related papers (2023-07-24T06:22:37Z)
Thinking Hallucination for Video Captioning [0.76146285961466]
In video captioning, there are two kinds of hallucination: object and action hallucination. We identify three main factors: (i) inadequate visual features extracted from pre-trained models, (ii) improper influences of source and target contexts during multi-modal fusion, and (iii) exposure bias in the training strategy. Our method achieves state-of-the-art performance on the MSR-Video to Text (MSR-VTT) and the Microsoft Research Video Description Corpus (MSVD) datasets.
arXiv Detail & Related papers (2022-09-28T06:15:42Z)

This list is automatically generated from the titles and abstracts of the papers on this site.