FELM: Benchmarking Factuality Evaluation of Large Language Models
- URL: http://arxiv.org/abs/2310.00741v2
- Date: Tue, 28 Nov 2023 08:06:53 GMT
- Title: FELM: Benchmarking Factuality Evaluation of Large Language Models
- Authors: Shiqi Chen, Yiran Zhao, Jinghan Zhang, I-Chun Chern, Siyang Gao,
Pengfei Liu and Junxian He
- Abstract summary: We introduce a benchmark for Factuality Evaluation of large Language Models, referred to as felm.
We collect responses generated from large language models and annotate factuality labels in a fine-grained manner.
Our findings reveal that while retrieval aids factuality evaluation, current LLMs are far from satisfactory to faithfully detect factual errors.
- Score: 40.78878196872095
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Assessing factuality of text generated by large language models (LLMs) is an
emerging yet crucial research area, aimed at alerting users to potential errors
and guiding the development of more reliable LLMs. Nonetheless, the evaluators
assessing factuality necessitate suitable evaluation themselves to gauge
progress and foster advancements. This direction remains under-explored,
resulting in substantial impediments to the progress of factuality evaluators.
To mitigate this issue, we introduce a benchmark for Factuality Evaluation of
large Language Models, referred to as felm. In this benchmark, we collect
responses generated from LLMs and annotate factuality labels in a fine-grained
manner. Contrary to previous studies that primarily concentrate on the
factuality of world knowledge (e.g.~information from Wikipedia), felm focuses
on factuality across diverse domains, spanning from world knowledge to math and
reasoning. Our annotation is based on text segments, which can help pinpoint
specific factual errors. The factuality annotations are further supplemented by
predefined error types and reference links that either support or contradict
the statement. In our experiments, we investigate the performance of several
LLM-based factuality evaluators on felm, including both vanilla LLMs and those
augmented with retrieval mechanisms and chain-of-thought processes. Our
findings reveal that while retrieval aids factuality evaluation, current LLMs
are far from satisfactory to faithfully detect factual errors.
Related papers
- Truth or Mirage? Towards End-to-End Factuality Evaluation with LLM-Oasis [78.07225438556203]
We introduce LLM-Oasis, the largest resource for training end-to-end factuality evaluators.
It is constructed by extracting claims from Wikipedia, falsifying a subset of these claims, and generating pairs of factual and unfactual texts.
We then rely on human annotators to both validate the quality of our dataset and to create a gold standard test set for factuality evaluation systems.
arXiv Detail & Related papers (2024-11-29T12:21:15Z) - Finding Blind Spots in Evaluator LLMs with Interpretable Checklists [23.381287828102995]
We investigate the effectiveness of Large Language Models (LLMs) as evaluators for text generation tasks.
We propose FBI, a novel framework designed to examine the proficiency of Evaluator LLMs in assessing four critical abilities.
arXiv Detail & Related papers (2024-06-19T10:59:48Z) - RUPBench: Benchmarking Reasoning Under Perturbations for Robustness Evaluation in Large Language Models [12.112914393948415]
We present RUPBench, a benchmark designed to evaluate large language models (LLMs) across diverse reasoning tasks.
Our benchmark incorporates 15 reasoning datasets, categorized into commonsense, arithmetic, logical, and knowledge-intensive reasoning.
By examining the performance of state-of-the-art LLMs such as GPT-4o, Llama3, Phi-3, and Gemma on both original and perturbed datasets, we provide a detailed analysis of their robustness and error patterns.
arXiv Detail & Related papers (2024-06-16T17:26:44Z) - DnA-Eval: Enhancing Large Language Model Evaluation through Decomposition and Aggregation [75.81096662788254]
Large Language Models (LLMs) are scalable and economical evaluators.
The question of how reliable these evaluators are has emerged as a crucial research question.
We propose Decompose and Aggregate, which breaks down the evaluation process into different stages based on pedagogical practices.
arXiv Detail & Related papers (2024-05-24T08:12:30Z) - Are Large Language Models Reliable Judges? A Study on the Factuality
Evaluation Capabilities of LLMs [8.526956860672698]
Large Language Models (LLMs) have gained immense attention due to their notable emergent capabilities.
This study investigates the potential of LLMs as reliable assessors of factual consistency in summaries generated by text-generation models.
arXiv Detail & Related papers (2023-11-01T17:42:45Z) - Survey on Factuality in Large Language Models: Knowledge, Retrieval and
Domain-Specificity [61.54815512469125]
This survey addresses the crucial issue of factuality in Large Language Models (LLMs)
As LLMs find applications across diverse domains, the reliability and accuracy of their outputs become vital.
arXiv Detail & Related papers (2023-10-11T14:18:03Z) - Are Large Language Models Really Robust to Word-Level Perturbations? [68.60618778027694]
We propose a novel rational evaluation approach that leverages pre-trained reward models as diagnostic tools.
Longer conversations manifest the comprehensive grasp of language models in terms of their proficiency in understanding questions.
Our results demonstrate that LLMs frequently exhibit vulnerability to word-level perturbations that are commonplace in daily language usage.
arXiv Detail & Related papers (2023-09-20T09:23:46Z) - Generating Benchmarks for Factuality Evaluation of Language Models [61.69950787311278]
We propose FACTOR: Factual Assessment via Corpus TransfORmation, a scalable approach for evaluating LM factuality.
FACTOR automatically transforms a factual corpus of interest into a benchmark evaluating an LM's propensity to generate true facts from the corpus vs. similar but incorrect statements.
We show that: (i) our benchmark scores increase with model size and improve when the LM is augmented with retrieval; (ii) benchmark score and perplexity do not always agree on model ranking; (iii) when perplexity and benchmark score disagree, the latter better reflects factuality in open-ended generation.
arXiv Detail & Related papers (2023-07-13T17:14:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.