Reefknot: A Comprehensive Benchmark for Relation Hallucination Evaluation, Analysis and Mitigation in Multimodal Large Language Models
- URL: http://arxiv.org/abs/2408.09429v1
- Date: Sun, 18 Aug 2024 10:07:02 GMT
- Title: Reefknot: A Comprehensive Benchmark for Relation Hallucination Evaluation, Analysis and Mitigation in Multimodal Large Language Models
- Authors: Kening Zheng, Junkai Chen, Yibo Yan, Xin Zou, Xuming Hu,
- Abstract summary: Hallucination issues persistently plagued current multimodal large language models (MLLMs)
We introduce Reefknot, a benchmark specifically targeting relation hallucinations, consisting of over 20,000 samples derived from real-world scenarios.
Our comparative evaluation across three distinct tasks revealed a substantial shortcoming in the capabilities of current MLLMs to mitigate relation hallucinations.
- Score: 13.48296910438554
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Hallucination issues persistently plagued current multimodal large language models (MLLMs). While existing research primarily focuses on object-level or attribute-level hallucinations, sidelining the more sophisticated relation hallucinations that necessitate advanced reasoning abilities from MLLMs. Besides, recent benchmarks regarding relation hallucinations lack in-depth evaluation and effective mitigation. Moreover, their datasets are typically derived from a systematic annotation process, which could introduce inherent biases due to the predefined process. To handle the aforementioned challenges, we introduce Reefknot, a comprehensive benchmark specifically targeting relation hallucinations, consisting of over 20,000 samples derived from real-world scenarios. Specifically, we first provide a systematic definition of relation hallucinations, integrating perspectives from perceptive and cognitive domains. Furthermore, we construct the relation-based corpus utilizing the representative scene graph dataset Visual Genome (VG), from which semantic triplets follow real-world distributions. Our comparative evaluation across three distinct tasks revealed a substantial shortcoming in the capabilities of current MLLMs to mitigate relation hallucinations. Finally, we advance a novel confidence-based mitigation strategy tailored to tackle the relation hallucinations problem. Across three datasets, including Reefknot, we observed an average reduction of 9.75% in the hallucination rate. We believe our paper sheds valuable insights into achieving trustworthy multimodal intelligence. Our dataset and code will be released upon paper acceptance.
Related papers
- Unified Triplet-Level Hallucination Evaluation for Large Vision-Language Models [22.996176483599868]
We design a unified framework to measure object and relation hallucination in Large Vision-Language Models (LVLMs) simultaneously.
Based on our framework, we introduce Tri-HE, a novel Triplet-level Hallucination Evaluation benchmark.
arXiv Detail & Related papers (2024-10-30T15:25:06Z) - From Single to Multi: How LLMs Hallucinate in Multi-Document Summarization [6.37435726278524]
We investigate how hallucinations manifest in large language models (LLMs) when summarizing topic-specific information from multiple documents.
On average, up to 75% of the content in LLM-generated summary is hallucinated, with hallucinations more likely to occur towards the end of the summaries.
To understand the characteristics of these hallucinations, we manually evaluate 700+ insights and find that most errors stem from either failing to follow instructions or producing overly generic insights.
arXiv Detail & Related papers (2024-10-17T18:38:53Z) - ANAH-v2: Scaling Analytical Hallucination Annotation of Large Language Models [65.12177400764506]
Large language models (LLMs) exhibit hallucinations in long-form question-answering tasks across various domains and wide applications.
Current hallucination detection and mitigation datasets are limited in domains and sizes.
This paper introduces an iterative self-training framework that simultaneously and progressively scales up the hallucination annotation dataset.
arXiv Detail & Related papers (2024-07-05T17:56:38Z) - Evaluating and Analyzing Relationship Hallucinations in Large Vision-Language Models [69.79709804046325]
We introduce R-Bench, a novel benchmark for evaluating Vision Relationship Hallucination.
R-Bench features image-level questions that focus on the existence of relationships and instance-level questions that assess local visual comprehension.
We identify three types of relationship co-occurrences that lead to hallucinations: relationship-relationship, subject-relationship, and relationship-object.
arXiv Detail & Related papers (2024-06-24T08:42:42Z) - VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models [59.05674402770661]
This work introduces VideoHallucer, the first comprehensive benchmark for hallucination detection in large video-language models (LVLMs)
VideoHallucer categorizes hallucinations into two main types: intrinsic and extrinsic, offering further subcategories for detailed analysis.
arXiv Detail & Related papers (2024-06-24T06:21:59Z) - VALOR-EVAL: Holistic Coverage and Faithfulness Evaluation of Large Vision-Language Models [57.43276586087863]
Large Vision-Language Models (LVLMs) suffer from hallucination issues, wherein the models generate plausible-sounding but factually incorrect outputs.
Existing benchmarks are often limited in scope, focusing mainly on object hallucinations.
We introduce a multi-dimensional benchmark covering objects, attributes, and relations, with challenging images selected based on associative biases.
arXiv Detail & Related papers (2024-04-22T04:49:22Z) - Enhancing Uncertainty-Based Hallucination Detection with Stronger Focus [99.33091772494751]
Large Language Models (LLMs) have gained significant popularity for their impressive performance across diverse fields.
LLMs are prone to hallucinate untruthful or nonsensical outputs that fail to meet user expectations.
We propose a novel reference-free, uncertainty-based method for detecting hallucinations in LLMs.
arXiv Detail & Related papers (2023-11-22T08:39:17Z) - HalluciDoctor: Mitigating Hallucinatory Toxicity in Visual Instruction Data [102.56792377624927]
hallucinations inherent in machine-generated data remain under-explored.
We present a novel hallucination detection and elimination framework, HalluciDoctor, based on the cross-checking paradigm.
Our method successfully mitigates 44.6% hallucinations relatively and maintains competitive performance compared to LLaVA.
arXiv Detail & Related papers (2023-11-22T04:52:58Z) - The Troubling Emergence of Hallucination in Large Language Models -- An
Extensive Definition, Quantification, and Prescriptive Remediations [10.20632187568563]
We offer a discourse on profiling hallucination based on its degree, orientation, and category.
We categorize hallucination into six types: (i) acronym ambiguity, (ii) numeric nuisance, (iii) generated golem, (iv) virtual voice, (v) geographic erratum, and (vi) time wrap.
We propose Hallucination Vulnerability Index (HVI) as a tool for the wider NLP community, with the potential to serve as a rubric in AI-related policy-making.
arXiv Detail & Related papers (2023-10-08T03:31:29Z) - Detecting and Preventing Hallucinations in Large Vision Language Models [4.7264116948935975]
M-HalDetect is the first multi-modal hallucination detection dataset for detailed image descriptions.
We train fine-grained multi-modal reward models from InstructBLIP and evaluate their effectiveness with best-of-n rejection sampling.
We find that our reward model generalizes to other multi-modal models, reducing hallucinations in LLaVA and mPLUG-OWL by 15% and 57% respectively.
arXiv Detail & Related papers (2023-08-11T21:35:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.