Generating Benchmarks for Factuality Evaluation of Language Models
- URL: http://arxiv.org/abs/2307.06908v2
- Date: Sun, 4 Feb 2024 09:07:54 GMT
- Title: Generating Benchmarks for Factuality Evaluation of Language Models
- Authors: Dor Muhlgay, Ori Ram, Inbal Magar, Yoav Levine, Nir Ratner, Yonatan
Belinkov, Omri Abend, Kevin Leyton-Brown, Amnon Shashua, Yoav Shoham
- Abstract summary: We propose FACTOR: Factual Assessment via Corpus TransfORmation, a scalable approach for evaluating LM factuality.
FACTOR automatically transforms a factual corpus of interest into a benchmark evaluating an LM's propensity to generate true facts from the corpus vs. similar but incorrect statements.
We show that: (i) our benchmark scores increase with model size and improve when the LM is augmented with retrieval; (ii) benchmark score and perplexity do not always agree on model ranking; (iii) when perplexity and benchmark score disagree, the latter better reflects factuality in open-ended generation.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Before deploying a language model (LM) within a given domain, it is important
to measure its tendency to generate factually incorrect information in that
domain. Existing methods for factuality evaluation of LLM generation focus on
facts sampled from the LM itself, and thus do not control the set of evaluated
facts and might under-represent domain specific or rare facts. We propose
FACTOR: Factual Assessment via Corpus TransfORmation, a scalable approach for
evaluating LM factuality. FACTOR automatically transforms a factual corpus of
interest into a benchmark evaluating an LM's propensity to generate true facts
from the corpus vs. similar but incorrect statements. We use our framework to
create three benchmarks: Wiki-FACTOR, News-FACTOR and Expert-FACTOR. We show
that: (i) our benchmark scores increase with model size and improve when the LM
is augmented with retrieval; (ii) benchmark score and perplexity do not always
agree on model ranking; (iii) when perplexity and benchmark score disagree, the
latter better reflects factuality in open-ended generation, as measured by
human annotators. We make our data and code publicly available at
https://github.com/AI21Labs/factor.
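To make the evaluation protocol concrete, here is a minimal sketch of a FACTOR-style scoring loop, assuming each benchmark example supplies a shared prefix, one factual completion, and several similar-but-incorrect completions, as the abstract describes. The model choice (gpt2), the example data, and the helper completion_logprob are illustrative assumptions, not the authors' released implementation (see the linked repository for that).

```python
# Sketch of a FACTOR-style evaluation: the factual completion should receive
# a higher log-likelihood than every similar-but-incorrect alternative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def completion_logprob(prefix: str, completion: str) -> float:
    """Total log-probability the model assigns to `completion` given `prefix`."""
    prefix_ids = tokenizer(prefix, return_tensors="pt").input_ids
    full_ids = tokenizer(prefix + completion, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits, dim=-1)
    total = 0.0
    # Score only the completion tokens: the token at position i is predicted
    # by the logits at position i - 1. (Assumes the prefix tokenizes the same
    # way inside the full sequence, which holds for space-initial completions.)
    for i in range(prefix_ids.shape[1], full_ids.shape[1]):
        token_id = full_ids[0, i]
        total += log_probs[0, i - 1, token_id].item()
    return total

# Hypothetical example in the spirit of Wiki-FACTOR: one true completion
# plus similar but incorrect variants.
example = {
    "prefix": "The Eiffel Tower is located in",
    "true": " Paris, France.",
    "false": [" Rome, Italy.", " Berlin, Germany."],
}

true_score = completion_logprob(example["prefix"], example["true"])
false_scores = [completion_logprob(example["prefix"], f) for f in example["false"]]
# An example counts as correct when the factual completion outscores every
# incorrect variant; averaging correctness over the corpus gives the score.
correct = all(true_score > s for s in false_scores)
print(f"true={true_score:.2f} false={[round(s, 2) for s in false_scores]} correct={correct}")
```

Aggregating the per-example correctness over the whole transformed corpus yields the benchmark score referred to in the abstract.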
Related papers
- Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
Inspired by the principles of subjective question correction, we propose a new evaluation method, SQC-Score.
Results on three information extraction tasks show that SQC-Score is more preferred by human annotators than the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z)
- Self-Evaluation Improves Selective Generation in Large Language Models [54.003992911447696]
We reformulate open-ended generation tasks into token-level prediction tasks.
We instruct an LLM to self-evaluate its answers.
We benchmark a range of scoring methods based on self-evaluation.
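As a rough illustration of the idea summarized above, the sketch below asks a model to judge its own draft answer and reads off the next-token probability of "Yes" as a confidence score for deciding whether to answer or abstain. The prompt wording, model choice (gpt2), and threshold are assumptions for illustration, not the paper's exact protocol.

```python
# Self-evaluation recast as token-level prediction: score P("Yes") vs. P("No")
# after asking the model whether its own draft answer is correct.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def self_eval_confidence(question: str, draft_answer: str) -> float:
    """Relative probability mass on 'Yes' for the next token of a self-evaluation prompt."""
    prompt = (
        f"Question: {question}\n"
        f"Proposed answer: {draft_answer}\n"
        f"Is the proposed answer correct? Answer Yes or No:"
    )
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        next_logits = model(ids).logits[0, -1]
    yes_id = tokenizer(" Yes").input_ids[0]
    no_id = tokenizer(" No").input_ids[0]
    probs = torch.softmax(next_logits[[yes_id, no_id]], dim=-1)
    return probs[0].item()

# Selective generation: emit the answer only when self-assessed confidence is high enough.
conf = self_eval_confidence("What is the capital of France?", "Paris")
print("answer" if conf > 0.5 else "abstain", f"(confidence={conf:.2f})")
```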
arXiv Detail & Related papers (2023-12-14T19:09:22Z)
- CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
We introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark.
In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship.
We propose an innovative evaluation metric, the Self-Evaluation Score (SES), to directly evaluate the natural language output of LLMs.
arXiv Detail & Related papers (2023-11-29T08:29:54Z)
- Factcheck-Bench: Fine-Grained Evaluation Benchmark for Automatic Fact-checkers [121.53749383203792]
We present a holistic end-to-end solution for annotating the factuality of large language models (LLMs)-generated responses.
We construct an open-domain document-level factuality benchmark in three-level granularity: claim, sentence and document.
Preliminary experiments show that FacTool, FactScore and Perplexity struggle to identify false claims.
arXiv Detail & Related papers (2023-11-15T14:41:57Z)
- FELM: Benchmarking Factuality Evaluation of Large Language Models [40.78878196872095]
We introduce a benchmark for Factuality Evaluation of large Language Models, referred to as felm.
We collect responses generated from large language models and annotate factuality labels in a fine-grained manner.
Our findings reveal that while retrieval aids factuality evaluation, current LLMs still fall short of faithfully detecting factual errors.
arXiv Detail & Related papers (2023-10-01T17:37:31Z)
- LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond [135.8013388183257]
We propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called SummEdits.
Most LLMs struggle on SummEdits, with performance close to random chance.
The best-performing model, GPT-4, is still 8% below estimated human performance.
arXiv Detail & Related papers (2023-05-23T21:50:06Z)