FELM: Benchmarking Factuality Evaluation of Large Language Models
- URL: http://arxiv.org/abs/2310.00741v2
- Date: Tue, 28 Nov 2023 08:06:53 GMT
- Title: FELM: Benchmarking Factuality Evaluation of Large Language Models
- Authors: Shiqi Chen, Yiran Zhao, Jinghan Zhang, I-Chun Chern, Siyang Gao,
Pengfei Liu and Junxian He
- Abstract summary: We introduce a benchmark for Factuality Evaluation of large Language Models, referred to as felm.
We collect responses generated from large language models and annotate factuality labels in a fine-grained manner.
Our findings reveal that while retrieval aids factuality evaluation, current LLMs are still far from satisfactory at faithfully detecting factual errors.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Assessing factuality of text generated by large language models (LLMs) is an
emerging yet crucial research area, aimed at alerting users to potential errors
and guiding the development of more reliable LLMs. Nonetheless, the evaluators
assessing factuality necessitate suitable evaluation themselves to gauge
progress and foster advancements. This direction remains under-explored,
resulting in substantial impediments to the progress of factuality evaluators.
To mitigate this issue, we introduce a benchmark for Factuality Evaluation of
large Language Models, referred to as felm. In this benchmark, we collect
responses generated from LLMs and annotate factuality labels in a fine-grained
manner. Contrary to previous studies that primarily concentrate on the
factuality of world knowledge (e.g., information from Wikipedia), felm focuses
on factuality across diverse domains, spanning from world knowledge to math and
reasoning. Our annotation is based on text segments, which can help pinpoint
specific factual errors. The factuality annotations are further supplemented by
predefined error types and reference links that either support or contradict
the statement. In our experiments, we investigate the performance of several
LLM-based factuality evaluators on felm, including both vanilla LLMs and those
augmented with retrieval mechanisms and chain-of-thought processes. Our
findings reveal that while retrieval aids factuality evaluation, current LLMs
are still far from satisfactory at faithfully detecting factual errors.
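To make the setup described in the abstract concrete, below is a minimal sketch of how a segment-level factuality evaluator could be scored against FELM-style annotations. The record fields (segments, labels, error types, reference links) and the metric choices are illustrative assumptions based on the abstract's description, not the released dataset schema or the paper's official evaluation code.

```python
# Minimal sketch (not the official FELM evaluation code) of scoring a
# segment-level factuality evaluator: each LLM response is split into
# segments, and each segment carries a factuality label, an error type,
# and optional reference links, mirroring the annotation scheme described
# in the abstract. Field names are illustrative assumptions.
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FelmExample:
    prompt: str
    segments: List[str]            # fine-grained text segments of one response
    labels: List[bool]             # True = factually correct, False = factual error
    error_types: List[Optional[str]] = field(default_factory=list)
    references: List[List[str]] = field(default_factory=list)  # supporting/contradicting links


def score_error_detection(examples: List[FelmExample],
                          predictions: List[List[bool]]) -> dict:
    """Segment-level precision/recall/F1 for detecting factual errors
    (the negative class), plus balanced accuracy over all segments."""
    tp = fp = fn = tn = 0
    for ex, pred in zip(examples, predictions):
        assert len(pred) == len(ex.labels)
        for gold, p in zip(ex.labels, pred):
            gold_err, pred_err = (not gold), (not p)
            if gold_err and pred_err:
                tp += 1
            elif pred_err:
                fp += 1
            elif gold_err:
                fn += 1
            else:
                tn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "balanced_accuracy": 0.5 * (recall + specificity)}
```

Scoring the error (negative) class at the segment level mirrors the abstract's emphasis on pinpointing specific factual errors rather than only judging whole responses.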
Related papers
- Evaluating Human Alignment and Model Faithfulness of LLM Rationale [66.75309523854476]
We study how well large language models (LLMs) explain their generations through rationales.
We show that prompting-based methods are less "faithful" than attribution-based explanations.
arXiv Detail & Related papers (2024-06-28T20:06:30Z)
- Finding Blind Spots in Evaluator LLMs with Interpretable Checklists [23.381287828102995]
We investigate the effectiveness of Large Language Models (LLMs) as evaluators for text generation tasks.
We propose FBI, a novel framework designed to examine the proficiency of Evaluator LLMs in assessing four critical abilities.
arXiv Detail & Related papers (2024-06-19T10:59:48Z)
- RUPBench: Benchmarking Reasoning Under Perturbations for Robustness Evaluation in Large Language Models [12.112914393948415]
We present RUPBench, a benchmark designed to evaluate large language models (LLMs) across diverse reasoning tasks.
Our benchmark incorporates 15 reasoning datasets, categorized into commonsense, arithmetic, logical, and knowledge-intensive reasoning.
By examining the performance of state-of-the-art LLMs such as GPT-4o, Llama3, Phi-3, and Gemma on both original and perturbed datasets, we provide a detailed analysis of their robustness and error patterns.
arXiv Detail & Related papers (2024-06-16T17:26:44Z)
- FFT: Towards Harmlessness Evaluation and Analysis for LLMs with Factuality, Fairness, Toxicity [21.539026782010573]
The widespread adoption of generative artificial intelligence has heightened concerns about the potential harms posed by AI-generated texts.
Previous researchers have invested much effort in assessing the harmlessness of generative language models.
arXiv Detail & Related papers (2023-11-30T14:18:47Z)
- CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
We introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark.
In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship.
We propose an innovative evaluation metric, the Self-Evaluation Score (SES), to directly evaluate the natural language output of LLMs.
arXiv Detail & Related papers (2023-11-29T08:29:54Z)
- Are Large Language Models Reliable Judges? A Study on the Factuality Evaluation Capabilities of LLMs [8.526956860672698]
Large Language Models (LLMs) have gained immense attention due to their notable emergent capabilities.
This study investigates the potential of LLMs as reliable assessors of factual consistency in summaries generated by text-generation models.
arXiv Detail & Related papers (2023-11-01T17:42:45Z)
- Survey on Factuality in Large Language Models: Knowledge, Retrieval and Domain-Specificity [61.54815512469125]
This survey addresses the crucial issue of factuality in Large Language Models (LLMs).
As LLMs find applications across diverse domains, the reliability and accuracy of their outputs become vital.
arXiv Detail & Related papers (2023-10-11T14:18:03Z)
- Are Large Language Models Really Robust to Word-Level Perturbations? [68.60618778027694]
We propose a novel rational evaluation approach that leverages pre-trained reward models as diagnostic tools.
Longer conversations better reveal a language model's overall grasp of language and its proficiency in understanding questions.
Our results demonstrate that LLMs frequently exhibit vulnerability to word-level perturbations that are commonplace in daily language usage.
arXiv Detail & Related papers (2023-09-20T09:23:46Z)
- Generating Benchmarks for Factuality Evaluation of Language Models [61.69950787311278]
We propose FACTOR: Factual Assessment via Corpus TransfORmation, a scalable approach for evaluating LM factuality.
FACTOR automatically transforms a factual corpus of interest into a benchmark evaluating an LM's propensity to generate true facts from the corpus vs. similar but incorrect statements.
We show that: (i) our benchmark scores increase with model size and improve when the LM is augmented with retrieval; (ii) benchmark score and perplexity do not always agree on model ranking; (iii) when perplexity and benchmark score disagree, the latter better reflects factuality in open-ended generation.
arXiv Detail & Related papers (2023-07-13T17:14:38Z)
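The FACTOR entry above hinges on contrasting a factual completion with similar but incorrect alternatives under the language model's own likelihood. Below is a rough sketch of that contrast scoring; the model name, the item fields ("prefix", "true", "false"), and the exact scoring rule are illustrative assumptions, not FACTOR's released implementation.

```python
# Sketch of FACTOR-style contrast scoring: count the model as correct when it
# assigns the true completion a higher log-likelihood (given the shared prefix)
# than every similar-but-incorrect alternative. Model name and item fields are
# placeholders chosen for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()


@torch.no_grad()
def completion_logprob(prefix: str, completion: str) -> float:
    """Sum of token log-probabilities of `completion` conditioned on `prefix`."""
    prefix_ids = tok(prefix, return_tensors="pt").input_ids
    full_ids = tok(prefix + completion, return_tensors="pt").input_ids
    logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)  # predicts tokens 1..N-1
    targets = full_ids[0, 1:]
    start = prefix_ids.shape[1] - 1  # index of the first completion token in `targets`
    return log_probs[start:].gather(1, targets[start:, None]).sum().item()


def factor_style_accuracy(items) -> float:
    """items: list of dicts with 'prefix', 'true' (str), and 'false' (list of str)."""
    correct = 0
    for it in items:
        true_lp = completion_logprob(it["prefix"], it["true"])
        if all(true_lp > completion_logprob(it["prefix"], f) for f in it["false"]):
            correct += 1
    return correct / len(items)
```

A real benchmark would also control for tokenization boundaries and length differences between alternatives; the sketch ignores those details.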