Chainpoll: A high efficacy method for LLM hallucination detection
- URL: http://arxiv.org/abs/2310.18344v1
- Date: Sun, 22 Oct 2023 14:45:14 GMT
- Title: Chainpoll: A high efficacy method for LLM hallucination detection
- Authors: Robert Friel, Atindriyo Sanyal
- Abstract summary: We introduce ChainPoll, an innovative hallucination detection method that excels compared to its counterparts.
We also unveil RealHall, a refined collection of benchmark datasets to assess hallucination detection metrics from recent studies.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have experienced notable advancements in
generating coherent and contextually relevant responses. However,
hallucinations - incorrect or unfounded claims - are still prevalent, prompting
the creation of automated metrics to detect these in LLM outputs. Our
contributions include: introducing ChainPoll, an innovative hallucination
detection method that excels compared to its counterparts, and unveiling
RealHall, a refined collection of benchmark datasets to assess hallucination
detection metrics from recent studies. While creating RealHall, we assessed
tasks and datasets from previous hallucination detection studies and observed
that many are not suitable for the potent LLMs currently in use. Overcoming
this, we opted for four datasets challenging for modern LLMs and pertinent to
real-world scenarios. Using RealHall, we conducted a comprehensive comparison
of ChainPoll with numerous hallucination metrics from recent studies. Our
findings indicate that ChainPoll outperforms all competing metrics across the RealHall benchmarks,
achieving an overall AUROC of 0.781. This surpasses the next best theoretical
method by 11% and exceeds industry standards by over 23%. Additionally,
ChainPoll is cost-effective and offers greater transparency than other metrics.
We introduce two novel metrics to assess LLM hallucinations: Adherence and
Correctness. Adherence is relevant to Retrieval Augmented Generation workflows,
evaluating an LLM's analytical capabilities within given documents and
contexts. In contrast, Correctness identifies logical and reasoning errors.
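The abstract does not spell out ChainPoll's implementation, but its name and description (chain-of-thought prompting plus repeated polling of an LLM judge) suggest a scorer along the following lines. This is a minimal sketch, assuming a generic `ask_judge` callable that wraps whatever LLM API is available; the prompt wording, `VERDICT` parsing, and `n_polls` default are illustrative assumptions, not the paper's exact recipe.

```python
# Hypothetical sketch of a ChainPoll-style hallucination score.
# `ask_judge` is any function that sends a prompt to an LLM (sampled with
# temperature > 0 so polls differ) and returns its text reply.
from typing import Callable

JUDGE_PROMPT = (
    "Does the following answer contain hallucinations, i.e. claims that are "
    "incorrect or unsupported by the provided context?\n"
    "Think step by step, then end with a single line: VERDICT: YES or VERDICT: NO.\n\n"
    "Context:\n{context}\n\nAnswer:\n{answer}\n"
)

def chainpoll_score(
    context: str,
    answer: str,
    ask_judge: Callable[[str], str],
    n_polls: int = 5,
) -> float:
    """Poll a chain-of-thought LLM judge n times; return the fraction of
    polls that flag a hallucination (higher = more likely hallucinated)."""
    prompt = JUDGE_PROMPT.format(context=context, answer=answer)
    flags = 0
    for _ in range(n_polls):
        reply = ask_judge(prompt)
        if "VERDICT: YES" in reply.upper():
            flags += 1
    return flags / n_polls

if __name__ == "__main__":
    # Stub judge for demonstration only; replace with a real LLM call.
    demo_judge = lambda prompt: (
        "The answer cites a figure absent from the context.\nVERDICT: YES"
    )
    print(chainpoll_score("Revenue was $10M in 2022.", "Revenue was $12M.", demo_judge))
```

In an evaluation setup like RealHall, such a continuous score could then be compared against binary hallucination labels via AUROC, which is the summary statistic the paper reports.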
Related papers
- REFIND: Retrieval-Augmented Factuality Hallucination Detection in Large Language Models [15.380441563675243]
Hallucinations in large language model (LLM) outputs severely limit their reliability in knowledge-intensive tasks such as question answering.
We introduce REFIND, a novel framework that detects hallucinated spans within LLM outputs by directly leveraging retrieved documents.
REFIND demonstrated robustness across nine languages, including low-resource settings, and significantly outperformed baseline models.
arXiv Detail & Related papers (2025-02-19T10:59:05Z)
- LLM Hallucination Reasoning with Zero-shot Knowledge Test [10.306443936136425]
We introduce a new task, Hallucination Reasoning, which classifies LLM-generated text into one of three categories: aligned, misaligned, and fabricated.
Our experiments conducted on new datasets demonstrate the effectiveness of our method in hallucination reasoning.
arXiv Detail & Related papers (2024-11-14T18:55:26Z)
- LongHalQA: Long-Context Hallucination Evaluation for MultiModal Large Language Models [96.64960606650115]
LongHalQA is an LLM-free hallucination benchmark that comprises 6K long and complex hallucination texts.
LongHalQA features GPT4V-generated hallucinatory data that are well aligned with real-world scenarios.
arXiv Detail & Related papers (2024-10-13T18:59:58Z)
- Hallucination Detection: Robustly Discerning Reliable Answers in Large Language Models [70.19081534515371]
Large Language Models (LLMs) have gained widespread adoption in various natural language processing tasks.
However, they can generate unfaithful or inconsistent content that deviates from the input source, leading to severe consequences.
We propose a robust discriminator named RelD to effectively detect hallucination in LLMs' generated answers.
arXiv Detail & Related papers (2024-07-04T18:47:42Z)
- Analyzing LLM Behavior in Dialogue Summarization: Unveiling Circumstantial Hallucination Trends [38.86240794422485]
We evaluate the faithfulness of large language models for dialogue summarization.
Our evaluation reveals subtleties as to what constitutes a hallucination.
We introduce two prompt-based approaches for fine-grained error detection that outperform existing metrics.
arXiv Detail & Related papers (2024-06-05T17:49:47Z)
- Fine-Grained Self-Endorsement Improves Factuality and Reasoning [72.83651220132495]
This work studies improving large language model (LLM) generations at inference time by mitigating fact-conflicting hallucinations.
We propose a self-endorsement framework that leverages the fine-grained fact-level comparisons across multiple sampled responses.
arXiv Detail & Related papers (2024-02-23T22:24:40Z)
- Enhancing Uncertainty-Based Hallucination Detection with Stronger Focus [99.33091772494751]
Large Language Models (LLMs) have gained significant popularity for their impressive performance across diverse fields.
However, LLMs are prone to hallucinating untruthful or nonsensical outputs that fail to meet user expectations.
We propose a novel reference-free, uncertainty-based method for detecting hallucinations in LLMs.
arXiv Detail & Related papers (2023-11-22T08:39:17Z)
- FactCHD: Benchmarking Fact-Conflicting Hallucination Detection [64.4610684475899]
FactCHD is a benchmark designed for the detection of fact-conflicting hallucinations from LLMs.
FactCHD features a diverse dataset that spans various factuality patterns, including vanilla, multi-hop, comparison, and set operation.
We introduce Truth-Triangulator, which synthesizes reflective considerations from tool-enhanced ChatGPT and LoRA-tuned Llama2.
arXiv Detail & Related papers (2023-10-18T16:27:49Z)
- A New Benchmark and Reverse Validation Method for Passage-level Hallucination Detection [63.56136319976554]
Large Language Models (LLMs) generate hallucinations, which can cause significant damage when deployed for mission-critical tasks.
We propose a self-check approach based on reverse validation to detect factual errors automatically in a zero-resource fashion.
We empirically evaluate our method and existing zero-resource detection methods on two datasets.
arXiv Detail & Related papers (2023-10-10T10:14:59Z)