ARGUS: Hallucination and Omission Evaluation in Video-LLMs
- URL: http://arxiv.org/abs/2506.07371v2
- Date: Tue, 10 Jun 2025 13:33:53 GMT
- Title: ARGUS: Hallucination and Omission Evaluation in Video-LLMs
- Authors: Ruchit Rawal, Reza Shirkavand, Heng Huang, Gowthami Somepalli, Tom Goldstein
- Abstract summary: ARGUS is a VideoLLM benchmark that measures freeform video captioning performance. By comparing VideoLLM outputs to human ground-truth captions, ARGUS quantifies dual metrics.
- Score: 86.73977434293973
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Video large language models have not yet been widely deployed, largely due to their tendency to hallucinate. Typical benchmarks for Video-LLMs rely only on multiple-choice questions. Unfortunately, VideoLLMs hallucinate far more aggressively on freeform text generation tasks like video captioning than they do on multiple-choice verification tasks. To address this weakness, we propose ARGUS, a VideoLLM benchmark that measures freeform video captioning performance. By comparing VideoLLM outputs to human ground-truth captions, ARGUS quantifies dual metrics. First, we measure the rate of hallucinations in the form of incorrect statements about video content or temporal relationships. Second, we measure the rate at which the model omits important descriptive details. Together, these dual metrics form a comprehensive view of video captioning performance.
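The abstract specifies what the two metrics capture but not how they are scored. The Python sketch below is a minimal illustration only, assuming captions have already been decomposed into atomic statements and that `supports` stands in for the LLM- or human-based entailment judge a real evaluation would use; the function and variable names are hypothetical, not the ARGUS implementation.

```python
# Hypothetical sketch of ARGUS-style dual metrics (hallucination + omission).
# The real benchmark compares freeform captions against human ground truth;
# here the entailment judge is mocked with a naive keyword-overlap check.
from typing import Callable, List


def dual_metrics(
    generated: List[str],     # atomic statements parsed from the model caption
    ground_truth: List[str],  # atomic statements parsed from the human caption
    supports: Callable[[str, List[str]], bool],  # entailment judge (assumed)
) -> dict:
    """Return hallucination and omission rates for one video caption."""
    # A generated statement counts as a hallucination if no ground-truth
    # statement supports it (wrong content or wrong temporal relationship).
    hallucinated = [s for s in generated if not supports(s, ground_truth)]
    # A ground-truth statement counts as omitted if nothing in the model
    # caption entails it (a missing descriptive detail).
    omitted = [s for s in ground_truth if not supports(s, generated)]
    return {
        "hallucination_rate": len(hallucinated) / max(len(generated), 1),
        "omission_rate": len(omitted) / max(len(ground_truth), 1),
    }


def naive_judge(statement: str, references: List[str]) -> bool:
    """Toy stand-in for an LLM judge: require three shared words with any reference."""
    words = set(statement.lower().split())
    return any(len(words & set(r.lower().split())) >= 3 for r in references)


print(dual_metrics(
    generated=["a man opens a red door", "the man then drives a car"],
    ground_truth=["a man opens a red door", "the man waves at the camera"],
    supports=naive_judge,
))
```

With the toy judge, the second generated statement is unsupported (a hallucination) and the waving statement is uncovered (an omission), so both rates come out to 0.5; a real evaluation would replace `naive_judge` with a much stronger entailment model.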
Related papers
- OVFact: Measuring and Improving Open-Vocabulary Factuality for Long Caption Models [65.8015696586307]
We introduce OV-Fact, a novel method for measuring the factuality of long captions. Our method improves agreement with human judgments and captures both caption descriptiveness (recall) and factual precision in the same metric. Unlike previous metrics, our reference-free method design enables new applications towards factuality-based data filtering.
arXiv Detail & Related papers (2025-07-25T13:38:06Z)
- FIFA: Unified Faithfulness Evaluation Framework for Text-to-Video and Video-to-Text Generation
VideoMLLMs have achieved remarkable progress in both Video-to-Text and Text-to-Video tasks. However, they often suffer from hallucinations, generating content that contradicts the visual input. Existing evaluation methods are limited to one task and also fail to assess hallucinations in open-ended, free-form responses. We propose FIFA, a unified FaIthFulness evAluation framework that extracts comprehensive descriptive facts. We also introduce Post-Correction, a tool-based correction framework that revises hallucinated content.
arXiv Detail & Related papers (2025-07-09T03:51:27Z)
- SiLVR: A Simple Language-based Video Reasoning Framework [71.77141065418238]
We present SiLVR, a Simple Language-based Video Reasoning framework. In the first stage, SiLVR transforms raw video into language-based representations using multisensory inputs. In the second stage, language descriptions are fed into a powerful reasoning LLM to solve complex video-language understanding tasks.
arXiv Detail & Related papers (2025-05-30T17:59:19Z)
- VideoHallu: Evaluating and Mitigating Multi-modal Hallucinations on Synthetic Video Understanding [54.16233954353802]
We introduce VideoHallu, a benchmark of over 3,000 video QA pairs built from synthetic videos generated by models like Veo2, Sora, and Kling. We evaluate the critical thinking abilities of Multi-modal Large Language Models (MLLMs) on abnormalities that are perceptually obvious to humans but often hallucinated due to language priors. We observe that these models perform well on many real-world benchmarks like MVBench and MovieChat, but still struggle with basic physics-based and commonsense reasoning in synthetic videos.
arXiv Detail & Related papers (2025-05-02T15:58:38Z)
- All-in-one: Understanding and Generation in Multimodal Reasoning with the MAIA Benchmark [74.4821011648997]
MAIA is a benchmark for fine-grained investigation of the reasoning abilities of visual language models on videos. It considers twelve categories that aim to disentangle language and vision relations by highlighting the role of the visual input. MAIA differs from other available video benchmarks in its design, its reasoning categories, the metric it uses, and the language and culture of its videos.
arXiv Detail & Related papers (2025-02-24T09:25:51Z)
- CaReBench: A Fine-Grained Benchmark for Video Captioning and Retrieval
We present CaReBench, a testing benchmark for fine-grained video captioning and retrieval. Uniquely, it provides manually separated spatial annotations and temporal annotations for each video. Based on this design, we introduce two evaluation metrics, ReBias and CapST, specifically tailored for video retrieval and video captioning tasks.
arXiv Detail & Related papers (2024-12-31T15:53:50Z)
- ViBe: A Text-to-Video Benchmark for Evaluating Hallucination in Large Multimodal Models [13.04745908368858]
We introduce ViBe: a large-scale dataset of hallucinated videos from open-source T2V models. Using ten T2V models, we generated and manually annotated 3,782 videos from 837 MS captions. Our proposed benchmark includes a dataset of hallucinated videos and a classification framework using video embeddings.
arXiv Detail & Related papers (2024-11-16T19:23:12Z)
- EventHallusion: Diagnosing Event Hallucinations in Video LLMs [42.66453293963568]
Multimodal Large Language Models (MLLMs) have made significant progress in the video comprehension field. We propose EventHallusion, a novel benchmark that focuses on assessing VideoLLMs' hallucinations about events. We also propose a simple yet effective method, called Temporal Contrastive Decoding (TCD), to tackle the hallucination problems of VideoLLMs.
arXiv Detail & Related papers (2024-09-25T03:49:46Z)
- VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models [59.05674402770661]
This work introduces VideoHallucer, the first comprehensive benchmark for hallucination detection in large video-language models (LVLMs).
VideoHallucer categorizes hallucinations into two main types: intrinsic and extrinsic, offering further subcategories for detailed analysis.
arXiv Detail & Related papers (2024-06-24T06:21:59Z)
- VideoCon: Robust Video-Language Alignment via Contrast Captions [80.08882631838914]
Video-language alignment models are not robust to semantically plausible contrastive changes in video captions.
Our work identifies a broad spectrum of contrast misalignments, such as replaced entities, replaced actions, and flipped event order.
Our model sets a new state of the art in zero-shot performance on temporally extensive video-language tasks.
arXiv Detail & Related papers (2023-11-15T19:51:57Z)
- Fine-grained Audible Video Description [61.81122862375985]
We construct the first fine-grained audible video description benchmark (FAVDBench).
For each video clip, we first provide a one-sentence summary of the video, followed by 4-6 sentences describing the visual details and 1-2 audio-related descriptions at the end.
We demonstrate that employing fine-grained video descriptions can create more intricate videos than using captions.
arXiv Detail & Related papers (2023-03-27T22:03:48Z)
- Thinking Hallucination for Video Captioning [0.76146285961466]
In video captioning, there are two kinds of hallucination: object and action hallucination.
We identify three main factors: (i) inadequate visual features extracted from pre-trained models, (ii) improper influences of source and target contexts during multi-modal fusion, and (iii) exposure bias in the training strategy.
Our method achieves state-of-the-art performance on the MSR-Video to Text (MSR-VTT) and the Microsoft Research Video Description Corpus (MSVD) datasets.
arXiv Detail & Related papers (2022-09-28T06:15:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.