FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long
Form Text Generation
- URL: http://arxiv.org/abs/2305.14251v2
- Date: Wed, 11 Oct 2023 05:27:50 GMT
- Title: FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long
Form Text Generation
- Authors: Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang
Wei Koh, Mohit Iyyer, Luke Zettlemoyer, Hannaneh Hajishirzi
- Abstract summary: Evaluating the factuality of long-form text generated by large language models (LMs) is non-trivial.
We introduce FACTSCORE, a new evaluation that breaks a generation into a series of atomic facts and computes the percentage of atomic facts supported by a reliable knowledge source.
- Score: 176.56131810249602
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Evaluating the factuality of long-form text generated by large language
models (LMs) is non-trivial because (1) generations often contain a mixture of
supported and unsupported pieces of information, making binary judgments of
quality inadequate, and (2) human evaluation is time-consuming and costly. In
this paper, we introduce FACTSCORE, a new evaluation that breaks a generation
into a series of atomic facts and computes the percentage of atomic facts
supported by a reliable knowledge source. We conduct an extensive human
evaluation to obtain FACTSCOREs of people biographies generated by several
state-of-the-art commercial LMs -- InstructGPT, ChatGPT, and the
retrieval-augmented PerplexityAI -- and report new analysis demonstrating the
need for such a fine-grained score (e.g., ChatGPT only achieves 58%). Since
human evaluation is costly, we also introduce an automated model that estimates
FACTSCORE using retrieval and a strong language model, with less than a 2%
error rate. Finally, we use this automated metric to evaluate 6,500 generations
from a new set of 13 recent LMs that would have cost $26K if evaluated by
humans, with various findings: GPT-4 and ChatGPT are more factual than public
models, and Vicuna and Alpaca are some of the best public models. FACTSCORE is
available for public use via `pip install factscore`.
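
At its core, the metric is a simple ratio: split a generation into atomic facts, judge each fact against the knowledge source, and report the percentage judged as supported. The sketch below captures only that arithmetic; `split_into_atomic_facts` and `is_supported` are hypothetical placeholders for the paper's LM-based fact splitter and retrieval-backed verifier, and this is not the interface of the released `factscore` package.

```python
# A minimal sketch of the FActScore arithmetic: the percentage of atomic facts
# in a generation that are supported by a knowledge source. The two helpers are
# hypothetical stand-ins for the paper's LM-based fact splitter and its
# retrieval-backed support judge; this is not the API of the released
# `factscore` package.
from typing import Callable, List


def fact_score(
    generation: str,
    split_into_atomic_facts: Callable[[str], List[str]],
    is_supported: Callable[[str], bool],
) -> float:
    """Percentage of atomic facts judged as supported by the knowledge source."""
    facts = split_into_atomic_facts(generation)
    if not facts:
        return 0.0  # nothing verifiable to score
    supported = sum(1 for fact in facts if is_supported(fact))
    return 100.0 * supported / len(facts)


# Toy usage with trivially mocked helpers (sentence splitting, keyword check).
if __name__ == "__main__":
    mock_split = lambda text: [s.strip() for s in text.split(".") if s.strip()]
    mock_support = lambda fact: "physicist" in fact  # pretend only this claim verifies
    bio = "Marie Curie was a physicist. She was born in Vienna."
    print(fact_score(bio, mock_split, mock_support))  # -> 50.0
```

The released package (`pip install factscore`) bundles the retrieval and LM-judging steps behind its own interface; the sketch above is only meant to make the score's definition concrete.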
Related papers
- CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation [87.44350003888646]
Eval-Instruct can acquire pointwise grading critiques with pseudo references and revise these critiques via multi-path prompting.
CritiqueLLM is empirically shown to outperform ChatGPT and all the open-source baselines.
arXiv Detail & Related papers (2023-11-30T16:52:42Z)
- LLMs as Narcissistic Evaluators: When Ego Inflates Evaluation Scores [23.568883428947494]
We investigate whether prominent LM-based evaluation metrics demonstrate a favorable bias toward their respective underlying LMs in the context of summarization tasks.
Our findings unveil a latent bias, particularly pronounced when such evaluation metrics are used in a reference-free manner without leveraging gold summaries.
These results underscore that assessments provided by generative evaluation models can be influenced by factors beyond the inherent text quality.
arXiv Detail & Related papers (2023-11-16T10:43:26Z)
- Fine-tuning Language Models for Factuality [96.5203774943198]
Large pre-trained language models (LLMs) are now in widespread use, sometimes even as a replacement for traditional search engines.
Yet language models are prone to making convincing but factually inaccurate claims, often referred to as 'hallucinations'.
In this work, we fine-tune language models to be more factual, without human labeling.
arXiv Detail & Related papers (2023-11-14T18:59:15Z)
- Evaluation Metrics in the Era of GPT-4: Reliably Evaluating Large Language Models on Sequence to Sequence Tasks [9.801767683867125]
We provide a preliminary and hybrid evaluation on three NLP benchmarks using both automatic and human evaluation.
We find that ChatGPT consistently outperforms many other popular models according to human reviewers on the majority of metrics.
We also find that human reviewers rate the gold reference as much worse than the best models' outputs, indicating the poor quality of many popular benchmarks.
arXiv Detail & Related papers (2023-10-20T20:17:09Z)
- Missing Information, Unresponsive Authors, Experimental Flaws: The Impossibility of Assessing the Reproducibility of Previous Human Evaluations in NLP [84.08476873280644]
Just 13% of papers had (i) sufficiently low barriers to reproduction, and (ii) enough obtainable information, to be considered for reproduction.
As a result, we had to change our coordinated study design from a reproduce approach to a standardise-then-reproduce-twice approach.
arXiv Detail & Related papers (2023-05-02T17:46:12Z)
- G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment [64.01972723692587]
We present G-Eval, a framework of using large language models with chain-of-thoughts (CoT) and a form-filling paradigm to assess the quality of NLG outputs.
We show that G-Eval with GPT-4 as the backbone model achieves a Spearman correlation of 0.514 with human judgments on the summarization task, outperforming all previous methods by a large margin.
arXiv Detail & Related papers (2023-03-29T12:46:54Z)
- Cut the CARP: Fishing for zero-shot story evaluation [0.0]
Contrastive Authoring and Reviewing Pairing (CARP) is a scalable, efficient method for performing superior, zero-shot evaluation of stories.
We show a strong correlation between human evaluation of stories and those of CARP.
We also present and analyze the Story-Critique dataset, a new corpus composed of 1.3 million aligned story-critique pairs derived from over 80,000 stories.
arXiv Detail & Related papers (2021-10-06T23:50:46Z)
- The Human Evaluation Datasheet 1.0: A Template for Recording Details of Human Evaluation Experiments in NLP [1.4467794332678539]
The Human Evaluation Datasheet is a template for recording the details of individual human evaluation experiments in Natural Language Processing (NLP).
The Human Evaluation Datasheet is intended to facilitate the recording of properties of human evaluations in sufficient detail.
arXiv Detail & Related papers (2021-03-17T15:08:50Z)
- What Can We Learn from Collective Human Opinions on Natural Language Inference Data? [88.90490998032429]
ChaosNLI is a dataset with a total of 464,500 annotations to study Collective HumAn OpinionS.
This dataset is created by collecting 100 annotations per example for 3,113 examples in SNLI and MNLI and 1,532 examples in Abductive-NLI.
arXiv Detail & Related papers (2020-10-07T17:26:06Z)