Language Models Hallucinate, but May Excel at Fact Verification
- URL: http://arxiv.org/abs/2310.14564v2
- Date: Thu, 21 Mar 2024 02:56:22 GMT
- Title: Language Models Hallucinate, but May Excel at Fact Verification
- Authors: Jian Guan, Jesse Dodge, David Wadden, Minlie Huang, Hao Peng
- Abstract summary: Large language models (LLMs) frequently "hallucinate," resulting in non-factual outputs.
Even GPT-3.5 produces factual outputs less than 25% of the time.
This underscores the importance of fact verifiers in order to measure and incentivize progress.
- Score: 89.0833981569957
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent progress in natural language processing (NLP) owes much to remarkable advances in large language models (LLMs). Nevertheless, LLMs frequently "hallucinate," resulting in non-factual outputs. Our carefully designed human evaluation substantiates the serious hallucination issue, revealing that even GPT-3.5 produces factual outputs less than 25% of the time. This underscores the importance of fact verifiers in order to measure and incentivize progress. Our systematic investigation affirms that LLMs can be repurposed as effective fact verifiers with strong correlations with human judgments. Surprisingly, FLAN-T5-11B, the least factual generator in our study, performs the best as a fact verifier, even outperforming more capable LLMs like GPT-3.5 and ChatGPT. Delving deeper, we analyze the reliance of these LLMs on high-quality evidence, as well as their deficiencies in robustness and generalization ability. Our study presents insights for developing trustworthy generation models.
Related papers
- Investigating Factuality in Long-Form Text Generation: The Roles of Self-Known and Self-Unknown [55.91887554462312]
We investigate the factuality of long-form text generation across various large language models (LLMs).
Our analysis reveals that factuality scores tend to decline in later sentences of the generated text, accompanied by a rise in the number of unsupported claims.
We find a correlation between higher Self-Known scores and improved factuality, while higher Self-Unknown scores are associated with lower factuality.
arXiv Detail & Related papers (2024-11-24T22:06:26Z)
- Are Large Language Models Good Fact Checkers: A Preliminary Study [26.023148371263012]
Large Language Models (LLMs) have drawn significant attention due to their outstanding reasoning capabilities and extensive knowledge repository.
This study aims to comprehensively evaluate various LLMs in tackling specific fact-checking subtasks.
arXiv Detail & Related papers (2023-11-29T05:04:52Z)
- Are Large Language Models Reliable Judges? A Study on the Factuality Evaluation Capabilities of LLMs [8.526956860672698]
Large Language Models (LLMs) have gained immense attention due to their notable emergent capabilities.
This study investigates the potential of LLMs as reliable assessors of factual consistency in summaries generated by text-generation models.
arXiv Detail & Related papers (2023-11-01T17:42:45Z)
- Improving Factual Consistency of Text Summarization by Adversarially Decoupling Comprehension and Embellishment Abilities of LLMs [67.56087611675606]
Large language models (LLMs) generate summaries that are factually inconsistent with original articles.
These hallucinations are challenging to detect through traditional methods.
We propose an adversarial DEcoupling method (DECENT) to disentangle the comprehension and embellishment abilities of LLMs.
arXiv Detail & Related papers (2023-10-30T08:40:16Z)
- Large Language Models are biased to overestimate profoundness [0.0]
This study evaluates GPT-4 and various other large language models (LLMs) in judging the profoundness of mundane, motivational, and pseudo-profound statements.
We found a significant statement-to-statement correlation between the LLMs and humans, irrespective of the type of statements and the prompting technique used.
arXiv Detail & Related papers (2023-10-22T21:33:50Z)
- The Perils & Promises of Fact-checking with Large Language Models [55.869584426820715]
Large Language Models (LLMs) are increasingly trusted to write academic papers, lawsuits, and news articles.
We evaluate the use of LLM agents in fact-checking by having them phrase queries, retrieve contextual data, and make decisions.
Our results show the enhanced prowess of LLMs when equipped with contextual information.
While LLMs show promise in fact-checking, caution is essential due to inconsistent accuracy.
arXiv Detail & Related papers (2023-10-20T14:49:47Z)
- FELM: Benchmarking Factuality Evaluation of Large Language Models [40.78878196872095]
We introduce a benchmark for Factuality Evaluation of large Language Models, referred to as FELM.
We collect responses generated from large language models and annotate factuality labels in a fine-grained manner.
Our findings reveal that while retrieval aids factuality evaluation, current LLMs are far from satisfactory at faithfully detecting factual errors.
arXiv Detail & Related papers (2023-10-01T17:37:31Z)
- Are Large Language Models Really Robust to Word-Level Perturbations? [68.60618778027694]
We propose a novel rational evaluation approach that leverages pre-trained reward models as diagnostic tools.
Longer conversations manifest the comprehensive grasp of language models in terms of their proficiency in understanding questions.
Our results demonstrate that LLMs frequently exhibit vulnerability to word-level perturbations that are commonplace in daily language usage.
arXiv Detail & Related papers (2023-09-20T09:23:46Z)
- A Stitch in Time Saves Nine: Detecting and Mitigating Hallucinations of LLMs by Validating Low-Confidence Generation [76.34411067299331]
Large language models often tend to "hallucinate," which critically hampers their reliability.
We propose an approach that actively detects and mitigates hallucinations during the generation process.
We show that the proposed active detection and mitigation approach successfully reduces the hallucinations of the GPT-3.5 model from 47.5% to 14.5% on average.
arXiv Detail & Related papers (2023-07-08T14:25:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.