HypoTermQA: Hypothetical Terms Dataset for Benchmarking Hallucination
Tendency of LLMs
- URL: http://arxiv.org/abs/2402.16211v1
- Date: Sun, 25 Feb 2024 22:23:37 GMT
- Title: HypoTermQA: Hypothetical Terms Dataset for Benchmarking Hallucination
Tendency of LLMs
- Authors: Cem Uluoglakci, Tugba Taskaya Temizel (Middle East Technical
University)
- Abstract summary: Hallucinations pose a significant challenge to the reliability and alignment of Large Language Models (LLMs).
This paper introduces an automated scalable framework that combines benchmarking LLMs' hallucination tendencies with efficient hallucination detection.
The framework is domain-agnostic, allowing the use of any language model for benchmark creation or evaluation in any domain.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Hallucinations pose a significant challenge to the reliability and alignment
of Large Language Models (LLMs), limiting their widespread acceptance beyond
chatbot applications. Despite ongoing efforts, hallucinations remain a
prevalent challenge in LLMs. The detection of hallucinations itself is also a
formidable task, frequently requiring manual labeling or constrained
evaluations. This paper introduces an automated scalable framework that
combines benchmarking LLMs' hallucination tendencies with efficient
hallucination detection. We leverage LLMs to generate challenging tasks related
to hypothetical phenomena, subsequently employing them as agents for efficient
hallucination detection. The framework is domain-agnostic, allowing the use of
any language model for benchmark creation or evaluation in any domain. We
introduce the publicly available HypoTermQA Benchmarking Dataset, on which
state-of-the-art models' performance ranged between 3% and 11%, and evaluator
agents demonstrated a 6% error rate in hallucination prediction. The proposed
framework provides opportunities to test and improve LLMs. Additionally, it has
the potential to generate benchmarking datasets tailored to specific domains,
such as law, health, and finance.
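To make the described pipeline concrete, below is a minimal, hypothetical sketch of a HypoTermQA-style probe: a fabricated term is combined with a valid topic to form a challenging question, the target model answers it, and a second LLM acts as the evaluator agent that flags whether the answer treats the nonexistent term as real. The `llm` callables, prompts, and decision rule are illustrative assumptions, not the paper's exact prompts or implementation.

```python
# Illustrative sketch of a HypoTermQA-style hallucination probe (not the authors' code).
# `target_llm` and `evaluator_llm` are assumed text-in/text-out interfaces to any chat model.
from typing import Callable

def build_question(hypothetical_term: str, valid_topic: str) -> str:
    """Combine a fabricated term with a real topic to form a challenging question."""
    return (
        f"How does {hypothetical_term} relate to {valid_topic}? "
        f"Explain the connection in two sentences."
    )

def judge_hallucination(evaluator_llm: Callable[[str], str], question: str, answer: str) -> bool:
    """Use an LLM as an evaluator agent: did the answer treat the fake term as real?"""
    verdict = evaluator_llm(
        "The term mentioned in the question below does not exist.\n"
        f"Question: {question}\nAnswer: {answer}\n"
        "Did the answer describe the term as if it were real? Reply YES or NO."
    )
    return verdict.strip().upper().startswith("YES")

def probe(target_llm: Callable[[str], str],
          evaluator_llm: Callable[[str], str],
          hypothetical_term: str,
          valid_topic: str) -> dict:
    """Run one benchmark item: ask the target model, then score it with the evaluator."""
    question = build_question(hypothetical_term, valid_topic)
    answer = target_llm(question)
    return {
        "question": question,
        "answer": answer,
        "hallucinated": judge_hallucination(evaluator_llm, question, answer),
    }
```

In the paper's framework, many such items are generated and aggregated into a benchmark score; the abstract reports that evaluator agents of this kind showed roughly a 6% error rate in hallucination prediction.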
Related papers
- Evaluating the Quality of Hallucination Benchmarks for Large Vision-Language Models [67.89204055004028]
Large Vision-Language Models (LVLMs) have been plagued by the issue of hallucination.
Previous works have proposed a series of benchmarks featuring different types of tasks and evaluation metrics.
We propose a Hallucination benchmark Quality Measurement framework (HQM) to assess the reliability and validity of existing hallucination benchmarks.
arXiv Detail & Related papers (2024-06-24T20:08:07Z) - Drowzee: Metamorphic Testing for Fact-Conflicting Hallucination Detection in Large Language Models [11.138489774712163]
We propose an innovative approach leveraging logic programming to enhance metamorphic testing for detecting Fact-Conflicting Hallucinations (FCH)
Our method generates test cases and detects hallucinations across six different large language models spanning nine domains, revealing rates ranging from 24.7% to 59.8%.
arXiv Detail & Related papers (2024-05-01T17:24:42Z) - VALOR-EVAL: Holistic Coverage and Faithfulness Evaluation of Large Vision-Language Models [57.43276586087863]
Large Vision-Language Models (LVLMs) suffer from hallucination issues, wherein the models generate plausible-sounding but factually incorrect outputs.
Existing benchmarks are often limited in scope, focusing mainly on object hallucinations.
We introduce a multi-dimensional benchmark covering objects, attributes, and relations, with challenging images selected based on associative biases.
arXiv Detail & Related papers (2024-04-22T04:49:22Z) - HaluEval-Wild: Evaluating Hallucinations of Language Models in the Wild [41.86776426516293]
Hallucinations pose a significant challenge to the reliability of large language models (LLMs) in critical domains.
We introduce HaluEval-Wild, the first benchmark specifically designed to evaluate LLM hallucinations in the wild.
arXiv Detail & Related papers (2024-03-07T08:25:46Z) - DelucionQA: Detecting Hallucinations in Domain-specific Question
Answering [22.23664008053246]
Hallucination is a well-known phenomenon in text generated by large language models (LLMs)
We introduce a dataset, DelucionQA, that captures hallucinations made by retrieval-augmented LLMs for a domain-specific QA task.
We propose a set of hallucination detection methods to serve as baselines for future works from the research community.
arXiv Detail & Related papers (2023-12-08T17:41:06Z) - Enhancing Uncertainty-Based Hallucination Detection with Stronger Focus [99.33091772494751]
Large Language Models (LLMs) have gained significant popularity for their impressive performance across diverse fields.
LLMs are prone to hallucinate untruthful or nonsensical outputs that fail to meet user expectations.
We propose a novel reference-free, uncertainty-based method for detecting hallucinations in LLMs.
arXiv Detail & Related papers (2023-11-22T08:39:17Z) - AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination
Evaluation [58.19101663976327]
Multi-modal Large Language Models (MLLMs) encounter the significant challenge of hallucinations.
evaluating MLLMs' hallucinations is becoming increasingly important in model improvement and practical application deployment.
We propose an LLM-free multi-dimensional benchmark AMBER, which can be used to evaluate both generative task and discriminative task.
arXiv Detail & Related papers (2023-11-13T15:25:42Z) - Chainpoll: A high efficacy method for LLM hallucination detection [0.0]
We introduce ChainPoll, an innovative hallucination detection method that excels compared to its counterparts.
We also unveil RealHall, a refined collection of benchmark datasets to assess hallucination detection metrics from recent studies.
arXiv Detail & Related papers (2023-10-22T14:45:14Z) - ReEval: Automatic Hallucination Evaluation for Retrieval-Augmented Large Language Models via Transferable Adversarial Attacks [91.55895047448249]
This paper presents ReEval, an LLM-based framework using prompt chaining to perturb the original evidence for generating new test cases.
We implement ReEval using ChatGPT and evaluate the resulting variants of two popular open-domain QA datasets.
Our generated data is human-readable and useful to trigger hallucination in large language models.
arXiv Detail & Related papers (2023-10-19T06:37:32Z) - AutoHall: Automated Hallucination Dataset Generation for Large Language Models [56.92068213969036]
This paper introduces a method for automatically constructing model-specific hallucination datasets based on existing fact-checking datasets called AutoHall.
We also propose a zero-resource and black-box hallucination detection method based on self-contradiction.
arXiv Detail & Related papers (2023-09-30T05:20:02Z)