Related papers: LongHalQA: Long-Context Hallucination Evaluation for MultiModal Large Language Models

LongHalQA: Long-Context Hallucination Evaluation for MultiModal Large Language Models

URL: http://arxiv.org/abs/2410.09962v2
Date: Tue, 15 Oct 2024 16:10:26 GMT
Title: LongHalQA: Long-Context Hallucination Evaluation for MultiModal Large Language Models
Authors: Han Qiu, Jiaxing Huang, Peng Gao, Qin Qi, Xiaoqin Zhang, Ling Shao, Shijian Lu,
Abstract summary: LongHalQA is an LLM-free hallucination benchmark that comprises 6K long and complex hallucination text. LongHalQA is featured by GPT4V-generated hallucinatory data that are well aligned with real-world scenarios.
Score: 96.64960606650115
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Hallucination, a phenomenon where multimodal large language models~(MLLMs) tend to generate textual responses that are plausible but unaligned with the image, has become one major hurdle in various MLLM-related applications. Several benchmarks have been created to gauge the hallucination levels of MLLMs, by either raising discriminative questions about the existence of objects or introducing LLM evaluators to score the generated text from MLLMs. However, the discriminative data largely involve simple questions that are not aligned with real-world text, while the generative data involve LLM evaluators that are computationally intensive and unstable due to their inherent randomness. We propose LongHalQA, an LLM-free hallucination benchmark that comprises 6K long and complex hallucination text. LongHalQA is featured by GPT4V-generated hallucinatory data that are well aligned with real-world scenarios, including object/image descriptions and multi-round conversations with 14/130 words and 189 words, respectively, on average. It introduces two new tasks, hallucination discrimination and hallucination completion, unifying both discriminative and generative evaluations in a single multiple-choice-question form and leading to more reliable and efficient evaluations without the need for LLM evaluators. Further, we propose an advanced pipeline that greatly facilitates the construction of future hallucination benchmarks with long and complex questions and descriptions. Extensive experiments over multiple recent MLLMs reveal various new challenges when they are handling hallucinations with long and complex textual data. Dataset and evaluation code are available at https://github.com/hanqiu-hq/LongHalQA.

Related papers

Consistency Is the Key: Detecting Hallucinations in LLM Generated Text By Checking Inconsistencies About Key Facts [21.081815261690444]
Large language models (LLMs) often hallucinate and generate text that is factually incorrect and not grounded in real-world knowledge.<n>This poses serious risks in domains like healthcare, finance, and customer support.<n>We introduce CONFACTCHECK, an efficient detection approach that does not leverage any external knowledge base.
arXiv Detail & Related papers (2025-11-15T14:33:02Z)
Mitigating Hallucinations in Multimodal LLMs via Object-aware Preference Optimization [55.543583937522804]
Multimodal Large Language Models (MLLMs) emerge as a unified interface to address a multitude of tasks.<n>Despite showcasing state-of-the-art results in many benchmarks, a long-standing issue is the tendency of MLLMs to hallucinate.<n>In this paper, we address the problem of hallucinations as an alignment problem, seeking to steer the MLLM so that it prefers generating content without hallucinations.
arXiv Detail & Related papers (2025-08-27T18:02:04Z)
UCSC at SemEval-2025 Task 3: Context, Models and Prompt Optimization for Automated Hallucination Detection in LLM Output [7.121378498209948]
SemEval 2025 Task 3, Mu-SHROOM: Shared-task on Hallucinations and Related Observable Overgeneration Mistakes, is a recent effort in this direction.<n>This paper describes the UCSC system submission to the shared Mu-SHROOM task.<n>We introduce a framework that first retrieves relevant context, next identifies false content from the answer, and finally maps them back to spans in the LLM output.
arXiv Detail & Related papers (2025-05-05T21:15:40Z)
HalluVerse25: Fine-grained Multilingual Benchmark Dataset for LLM Hallucinations [2.3732122943029164]
We introduce HalluVerse25, a multilingual dataset that categorizes fine-grained hallucinations in English, Arabic, and Turkish. Our dataset construction pipeline uses an LLM to inject hallucinations into factual biographical sentences, followed by a rigorous human annotation process to ensure data quality.
arXiv Detail & Related papers (2025-03-10T20:24:07Z)
How Much Do LLMs Hallucinate across Languages? On Multilingual Estimation of LLM Hallucination in the Wild [11.82100047858478]
hallucination is the tendency of Large Language Models to generate non-factual or unfaithful responses. We train a multilingual hallucination detection model and conduct a large-scale study across 30 languages. We find that while LLMs generate longer responses with more hallucinated tokens for higher-resource languages, there is no correlation between length-normalized hallucination rates of languages and their digital representation.
arXiv Detail & Related papers (2025-02-18T11:32:43Z)
Combating Multimodal LLM Hallucination via Bottom-Up Holistic Reasoning [151.4060202671114]
multimodal large language models (MLLMs) have shown unprecedented capabilities in advancing vision-language tasks. This paper introduces a novel bottom-up reasoning framework to address hallucinations in MLLMs. Our framework systematically addresses potential issues in both visual and textual inputs by verifying and integrating perception-level information with cognition-level commonsense knowledge.
arXiv Detail & Related papers (2024-12-15T09:10:46Z)
From Single to Multi: How LLMs Hallucinate in Multi-Document Summarization [6.37435726278524]
We investigate how hallucinations manifest in large language models (LLMs) when summarizing topic-specific information from multiple documents. On average, up to 75% of the content in LLM-generated summary is hallucinated, with hallucinations more likely to occur towards the end of the summaries. To understand the characteristics of these hallucinations, we manually evaluate 700+ insights and find that most errors stem from either failing to follow instructions or producing overly generic insights.
arXiv Detail & Related papers (2024-10-17T18:38:53Z)
Hallucination Detection: Robustly Discerning Reliable Answers in Large Language Models [70.19081534515371]
Large Language Models (LLMs) have gained widespread adoption in various natural language processing tasks. They generate unfaithful or inconsistent content that deviates from the input source, leading to severe consequences. We propose a robust discriminator named RelD to effectively detect hallucination in LLMs' generated answers.
arXiv Detail & Related papers (2024-07-04T18:47:42Z)
DiaHalu: A Dialogue-level Hallucination Evaluation Benchmark for Large Language Models [26.289847386286446]
We propose DiaHalu, the first dialogue-level hallucination evaluation benchmark to our knowledge. We integrate the collected topics into system prompts and facilitate a dialogue between two ChatGPT3.5. We manually modify the contents that do not adhere to human language conventions and then have LLMs re-generate, simulating authentic human-machine interaction scenarios.
arXiv Detail & Related papers (2024-03-01T15:38:55Z)
Context-aware Decoding Reduces Hallucination in Query-focused Summarization [2.8554857235549753]
We conduct a large-scale study on one recently proposed decoding method -- Context-aware Decoding (CAD) Experiments with eight different language models show that performance-wise, CAD improves QFS quality by reducing factuality errors/hallucinations. The code implementation based on Huggingface Library is made available.
arXiv Detail & Related papers (2023-12-21T23:42:13Z)
Hallucination Augmented Contrastive Learning for Multimodal Large Language Model [53.65682783591723]
Multi-modal large language models (MLLMs) have been shown to efficiently integrate natural language with visual information to handle multi-modal tasks. However, MLLMs still face a fundamental limitation of hallucinations, where they tend to generate erroneous or fabricated information. In this paper, we address hallucinations in MLLMs from a novel perspective of representation learning.
arXiv Detail & Related papers (2023-12-12T04:05:15Z)
SEED-Bench-2: Benchmarking Multimodal Large Language Models [67.28089415198338]
Multimodal large language models (MLLMs) have recently demonstrated exceptional capabilities in generating not only texts but also images given interleaved multimodal inputs. SEED-Bench-2 comprises 24K multiple-choice questions with accurate human annotations, which spans 27 dimensions. We evaluate the performance of 23 prominent open-source MLLMs and summarize valuable observations.
arXiv Detail & Related papers (2023-11-28T05:53:55Z)
AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation [58.19101663976327]
Multi-modal Large Language Models (MLLMs) encounter the significant challenge of hallucinations. evaluating MLLMs' hallucinations is becoming increasingly important in model improvement and practical application deployment. We propose an LLM-free multi-dimensional benchmark AMBER, which can be used to evaluate both generative task and discriminative task.
arXiv Detail & Related papers (2023-11-13T15:25:42Z)
Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback [127.75419038610455]
Large language models (LLMs) are able to generate human-like, fluent responses for many downstream tasks. This paper proposes a LLM-Augmenter system, which augments a black-box LLM with a set of plug-and-play modules.
arXiv Detail & Related papers (2023-02-24T18:48:43Z)

This list is automatically generated from the titles and abstracts of the papers in this site.