FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long
Form Text Generation
- URL: http://arxiv.org/abs/2305.14251v2
- Date: Wed, 11 Oct 2023 05:27:50 GMT
- Title: FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long
Form Text Generation
- Authors: Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang
Wei Koh, Mohit Iyyer, Luke Zettlemoyer, Hannaneh Hajishirzi
- Abstract summary: Evaluating the factuality of long-form text generated by large language models (LMs) is non-trivial.
We introduce FACTSCORE, a new evaluation that breaks a generation into a series of atomic facts and computes the percentage of atomic facts supported by a reliable knowledge source.
- Score: 176.56131810249602
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Evaluating the factuality of long-form text generated by large language
models (LMs) is non-trivial because (1) generations often contain a mixture of
supported and unsupported pieces of information, making binary judgments of
quality inadequate, and (2) human evaluation is time-consuming and costly. In
this paper, we introduce FACTSCORE, a new evaluation that breaks a generation
into a series of atomic facts and computes the percentage of atomic facts
supported by a reliable knowledge source. We conduct an extensive human
evaluation to obtain FACTSCOREs of people biographies generated by several
state-of-the-art commercial LMs -- InstructGPT, ChatGPT, and the
retrieval-augmented PerplexityAI -- and report new analysis demonstrating the
need for such a fine-grained score (e.g., ChatGPT only achieves 58%). Since
human evaluation is costly, we also introduce an automated model that estimates
FACTSCORE using retrieval and a strong language model, with less than a 2%
error rate. Finally, we use this automated metric to evaluate 6,500 generations
from a new set of 13 recent LMs that would have cost $26K if evaluated by
humans, with various findings: GPT-4 and ChatGPT are more factual than public
models, and Vicuna and Alpaca are some of the best public models. FACTSCORE is
available for public use via `pip install factscore`.
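
At its core, the metric is a simple ratio: split a generation into atomic facts, judge each fact against the knowledge source, and report the percentage judged as supported. The sketch below captures only that arithmetic; `split_into_atomic_facts` and `is_supported` are hypothetical placeholders for the paper's LM-based fact splitter and retrieval-backed verifier, and this is not the interface of the released `factscore` package.

```python
# A minimal sketch of the FActScore arithmetic: the percentage of atomic facts
# in a generation that are supported by a knowledge source. The two helpers are
# hypothetical stand-ins for the paper's LM-based fact splitter and its
# retrieval-backed support judge; this is not the API of the released
# `factscore` package.
from typing import Callable, List


def fact_score(
    generation: str,
    split_into_atomic_facts: Callable[[str], List[str]],
    is_supported: Callable[[str], bool],
) -> float:
    """Percentage of atomic facts judged as supported by the knowledge source."""
    facts = split_into_atomic_facts(generation)
    if not facts:
        return 0.0  # nothing verifiable to score
    supported = sum(1 for fact in facts if is_supported(fact))
    return 100.0 * supported / len(facts)


# Toy usage with trivially mocked helpers (sentence splitting, keyword check).
if __name__ == "__main__":
    mock_split = lambda text: [s.strip() for s in text.split(".") if s.strip()]
    mock_support = lambda fact: "physicist" in fact  # pretend only this claim verifies
    bio = "Marie Curie was a physicist. She was born in Vienna."
    print(fact_score(bio, mock_split, mock_support))  # -> 50.0
```

The released package (`pip install factscore`) bundles the retrieval and LM-judging steps behind its own interface; the sketch above is only meant to make the score's definition concrete.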
Related papers
- CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation [87.44350003888646]
Eval-Instruct can acquire pointwise grading critiques with pseudo references and revise these critiques via multi-path prompting.
CritiqueLLM is empirically shown to outperform ChatGPT and all the open-source baselines.
arXiv Detail & Related papers (2023-11-30T16:52:42Z)
- LLMs as Narcissistic Evaluators: When Ego Inflates Evaluation Scores [23.568883428947494]
We investigate whether prominent LM-based evaluation metrics demonstrate a favorable bias toward their respective underlying LMs in the context of summarization tasks.
Our findings unveil a latent bias, particularly pronounced when such evaluation metrics are used in a reference-free manner without leveraging gold summaries.
These results underscore that assessments provided by generative evaluation models can be influenced by factors beyond the inherent text quality.
arXiv Detail & Related papers (2023-11-16T10:43:26Z)
- Fine-tuning Language Models for Factuality [96.5203774943198]
Large pre-trained language models (LLMs) are now in widespread use, sometimes even as a replacement for traditional search engines.
Yet language models are prone to making convincing but factually inaccurate claims, often referred to as 'hallucinations'.
In this work, we fine-tune language models to be more factual, without human labeling.
arXiv Detail & Related papers (2023-11-14T18:59:15Z)
- Evaluation Metrics in the Era of GPT-4: Reliably Evaluating Large Language Models on Sequence to Sequence Tasks [9.801767683867125]
We provide a preliminary and hybrid evaluation on three NLP benchmarks using both automatic and human evaluation.
We find that ChatGPT consistently outperforms many other popular models according to human reviewers on the majority of metrics.
We also find that human reviewers rate the gold reference as much worse than the best models' outputs, indicating the poor quality of many popular benchmarks.
arXiv Detail & Related papers (2023-10-20T20:17:09Z)
- Missing Information, Unresponsive Authors, Experimental Flaws: The Impossibility of Assessing the Reproducibility of Previous Human Evaluations in NLP [84.08476873280644]
Just 13% of papers had (i) sufficiently low barriers to reproduction, and (ii) enough obtainable information, to be considered for reproduction.
As a result, we had to change our coordinated study design from a reproduce approach to a standardise-then-reproduce-twice approach.
arXiv Detail & Related papers (2023-05-02T17:46:12Z)
- G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment [64.01972723692587]
We present G-Eval, a framework of using large language models with chain-of-thoughts (CoT) and a form-filling paradigm to assess the quality of NLG outputs.
We show that G-Eval with GPT-4 as the backbone model achieves a Spearman correlation of 0.514 with human judgments on the summarization task, outperforming all previous methods by a large margin.
arXiv Detail & Related papers (2023-03-29T12:46:54Z)
- Cut the CARP: Fishing for zero-shot story evaluation [0.0]
Contrastive Authoring and Reviewing Pairing (CARP) is a scalable, efficient method for performing superior, zero-shot evaluation of stories.
We show a strong correlation between human evaluation of stories and those of CARP.
We also present and analyze the Story-Critique dataset, a new corpus composed of 1.3 million aligned story-critique pairs derived from over 80,000 stories.
arXiv Detail & Related papers (2021-10-06T23:50:46Z)
- The Human Evaluation Datasheet 1.0: A Template for Recording Details of Human Evaluation Experiments in NLP [1.4467794332678539]
The Human Evaluation Datasheet is a template for recording the details of individual human evaluation experiments in Natural Language Processing (NLP).
The Human Evaluation Datasheet is intended to facilitate the recording of properties of human evaluations in sufficient detail.
arXiv Detail & Related papers (2021-03-17T15:08:50Z)
- What Can We Learn from Collective Human Opinions on Natural Language Inference Data? [88.90490998032429]
ChaosNLI is a dataset with a total of 464,500 annotations to study Collective HumAn OpinionS.
This dataset is created by collecting 100 annotations per example for 3,113 examples in SNLI and MNLI and 1,532 examples in Abductive-NLI.
arXiv Detail & Related papers (2020-10-07T17:26:06Z)