Chainpoll: A high efficacy method for LLM hallucination detection
- URL: http://arxiv.org/abs/2310.18344v1
- Date: Sun, 22 Oct 2023 14:45:14 GMT
- Title: Chainpoll: A high efficacy method for LLM hallucination detection
- Authors: Robert Friel, Atindriyo Sanyal
- Abstract summary: We introduce ChainPoll, an innovative hallucination detection method that excels compared to its counterparts.
We also unveil RealHall, a refined collection of benchmark datasets to assess hallucination detection metrics from recent studies.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have experienced notable advancements in
generating coherent and contextually relevant responses. However,
hallucinations - incorrect or unfounded claims - are still prevalent, prompting
the creation of automated metrics to detect these in LLM outputs. Our
contributions include: introducing ChainPoll, an innovative hallucination
detection method that excels compared to its counterparts, and unveiling
RealHall, a refined collection of benchmark datasets to assess hallucination
detection metrics from recent studies. While creating RealHall, we assessed
tasks and datasets from previous hallucination detection studies and observed
that many are not suitable for the potent LLMs currently in use. Overcoming
this, we opted for four datasets challenging for modern LLMs and pertinent to
real-world scenarios. Using RealHall, we conducted a comprehensive comparison
of ChainPoll with numerous hallucination metrics from recent studies. Our
findings indicate that ChainPoll outperforms all competing metrics across the RealHall benchmarks,
achieving an overall AUROC of 0.781. This surpasses the next best theoretical
method by 11% and exceeds industry standards by over 23%. Additionally,
ChainPoll is cost-effective and offers greater transparency than other metrics.
We introduce two novel metrics to assess LLM hallucinations: Adherence and
Correctness. Adherence is relevant to Retrieval Augmented Generation workflows,
evaluating an LLM's analytical capabilities within given documents and
contexts. In contrast, Correctness identifies logical and reasoning errors.
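The abstract does not spell out ChainPoll's implementation, but its name and description (chain-of-thought prompting plus repeated polling of an LLM judge) suggest a scorer along the following lines. This is a minimal sketch, assuming a generic `ask_judge` callable that wraps whatever LLM API is available; the prompt wording, `VERDICT` parsing, and `n_polls` default are illustrative assumptions, not the paper's exact recipe.

```python
# Hypothetical sketch of a ChainPoll-style hallucination score.
# `ask_judge` is any function that sends a prompt to an LLM (sampled with
# temperature > 0 so polls differ) and returns its text reply.
from typing import Callable

JUDGE_PROMPT = (
    "Does the following answer contain hallucinations, i.e. claims that are "
    "incorrect or unsupported by the provided context?\n"
    "Think step by step, then end with a single line: VERDICT: YES or VERDICT: NO.\n\n"
    "Context:\n{context}\n\nAnswer:\n{answer}\n"
)

def chainpoll_score(
    context: str,
    answer: str,
    ask_judge: Callable[[str], str],
    n_polls: int = 5,
) -> float:
    """Poll a chain-of-thought LLM judge n times; return the fraction of
    polls that flag a hallucination (higher = more likely hallucinated)."""
    prompt = JUDGE_PROMPT.format(context=context, answer=answer)
    flags = 0
    for _ in range(n_polls):
        reply = ask_judge(prompt)
        if "VERDICT: YES" in reply.upper():
            flags += 1
    return flags / n_polls

if __name__ == "__main__":
    # Stub judge for demonstration only; replace with a real LLM call.
    demo_judge = lambda prompt: (
        "The answer cites a figure absent from the context.\nVERDICT: YES"
    )
    print(chainpoll_score("Revenue was $10M in 2022.", "Revenue was $12M.", demo_judge))
```

In an evaluation setup like RealHall, such a continuous score could then be compared against binary hallucination labels via AUROC, which is the summary statistic the paper reports.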
Related papers
- REFIND: Retrieval-Augmented Factuality Hallucination Detection in Large Language Models [15.380441563675243]
Hallucinations in large language model (LLM) outputs severely limit their reliability in knowledge-intensive tasks such as question answering.
We introduce REFIND, a novel framework that detects hallucinated spans within LLM outputs by directly leveraging retrieved documents.
REFIND demonstrated robustness across nine languages, including low-resource settings, and significantly outperformed baseline models.
arXiv Detail & Related papers (2025-02-19T10:59:05Z)
- LLM Hallucination Reasoning with Zero-shot Knowledge Test [10.306443936136425]
We introduce a new task, Hallucination Reasoning, which classifies LLM-generated text into one of three categories: aligned, misaligned, and fabricated.
Our experiments conducted on new datasets demonstrate the effectiveness of our method in hallucination reasoning.
arXiv Detail & Related papers (2024-11-14T18:55:26Z)
- LongHalQA: Long-Context Hallucination Evaluation for MultiModal Large Language Models [96.64960606650115]
LongHalQA is an LLM-free hallucination benchmark that comprises 6K long and complex hallucination texts.
LongHalQA features GPT4V-generated hallucinatory data that are well aligned with real-world scenarios.
arXiv Detail & Related papers (2024-10-13T18:59:58Z)
- Hallucination Detection: Robustly Discerning Reliable Answers in Large Language Models [70.19081534515371]
Large Language Models (LLMs) have gained widespread adoption in various natural language processing tasks.
However, they can generate unfaithful or inconsistent content that deviates from the input source, leading to severe consequences.
We propose a robust discriminator named RelD to effectively detect hallucination in LLMs' generated answers.
arXiv Detail & Related papers (2024-07-04T18:47:42Z)
- Analyzing LLM Behavior in Dialogue Summarization: Unveiling Circumstantial Hallucination Trends [38.86240794422485]
We evaluate the faithfulness of large language models for dialogue summarization.
Our evaluation reveals subtleties as to what constitutes a hallucination.
We introduce two prompt-based approaches for fine-grained error detection that outperform existing metrics.
arXiv Detail & Related papers (2024-06-05T17:49:47Z)
- Fine-Grained Self-Endorsement Improves Factuality and Reasoning [72.83651220132495]
This work studies improving large language model (LLM) generations at inference time by mitigating fact-conflicting hallucinations.
We propose a self-endorsement framework that leverages the fine-grained fact-level comparisons across multiple sampled responses.
arXiv Detail & Related papers (2024-02-23T22:24:40Z)
- Enhancing Uncertainty-Based Hallucination Detection with Stronger Focus [99.33091772494751]
Large Language Models (LLMs) have gained significant popularity for their impressive performance across diverse fields.
However, LLMs are prone to hallucinating untruthful or nonsensical outputs that fail to meet user expectations.
We propose a novel reference-free, uncertainty-based method for detecting hallucinations in LLMs.
arXiv Detail & Related papers (2023-11-22T08:39:17Z)
- FactCHD: Benchmarking Fact-Conflicting Hallucination Detection [64.4610684475899]
FactCHD is a benchmark designed for the detection of fact-conflicting hallucinations from LLMs.
FactCHD features a diverse dataset that spans various factuality patterns, including vanilla, multi-hop, comparison, and set operation.
We introduce Truth-Triangulator, which synthesizes reflective considerations from tool-enhanced ChatGPT and LoRA-tuned Llama2.
arXiv Detail & Related papers (2023-10-18T16:27:49Z)
- A New Benchmark and Reverse Validation Method for Passage-level Hallucination Detection [63.56136319976554]
Large Language Models (LLMs) generate hallucinations, which can cause significant damage when deployed for mission-critical tasks.
We propose a self-check approach based on reverse validation to detect factual errors automatically in a zero-resource fashion.
We empirically evaluate our method and existing zero-resource detection methods on two datasets.
arXiv Detail & Related papers (2023-10-10T10:14:59Z)