OpenFactCheck: A Unified Framework for Factuality Evaluation of LLMs
- URL: http://arxiv.org/abs/2405.05583v1
- Date: Thu, 9 May 2024 07:15:19 GMT
- Title: OpenFactCheck: A Unified Framework for Factuality Evaluation of LLMs
- Authors: Yuxia Wang, Minghan Wang, Hasan Iqbal, Georgi Georgiev, Jiahui Geng, Preslav Nakov
- Abstract summary: OpenFactCheck is a unified factuality evaluation framework for large language models.
OpenFactCheck consists of three modules: (i) CUSTCHECKER, (ii) LLMEVAL, and (iii) CHECKEREVAL.
- Score: 27.89053798151106
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The increased use of large language models (LLMs) across a variety of real-world applications calls for mechanisms to verify the factual accuracy of their outputs. The difficulty lies in assessing the factuality of free-form responses in open domains. Moreover, different papers use disparate evaluation benchmarks and measures, which makes them hard to compare and hampers future progress. To mitigate these issues, we propose OpenFactCheck, a unified factuality evaluation framework for LLMs. OpenFactCheck consists of three modules: (i) CUSTCHECKER, which allows users to easily customize an automatic fact-checker and verify the factual correctness of documents and claims; (ii) LLMEVAL, a unified evaluation framework that fairly assesses an LLM's factual ability from various perspectives; and (iii) CHECKEREVAL, an extensible solution for gauging the reliability of automatic fact-checkers' verification results using human-annotated datasets. OpenFactCheck is publicly released at https://github.com/yuxiaw/OpenFactCheck.
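To make the division of labor concrete, here is a minimal sketch of how the three modules could compose. All names here (Claim, CustChecker, llm_eval, checker_eval) are illustrative assumptions, not the released API; see the GitHub repository above for the actual interface.

```python
# A minimal sketch of how the three OpenFactCheck modules could compose.
# All class and function names are illustrative assumptions, not the
# released API; see https://github.com/yuxiaw/OpenFactCheck for the real one.

from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Claim:
    text: str
    label: Optional[bool] = None  # True = supported, False = refuted


class CustChecker:
    """(i) A user-customized fact-checker: claim extraction plus verification."""

    def __init__(self,
                 extract: Callable[[str], List[Claim]],
                 verify: Callable[[Claim], bool]):
        self.extract = extract  # e.g., an LLM prompt that splits text into claims
        self.verify = verify    # e.g., retrieval plus entailment over evidence

    def check_document(self, document: str) -> List[Claim]:
        claims = self.extract(document)
        for claim in claims:
            claim.label = self.verify(claim)
        return claims


def llm_eval(generate: Callable[[str], str],
             prompts: List[str],
             checker: CustChecker) -> float:
    """(ii) Score an LLM's factuality: fraction of its claims judged supported."""
    labels = [c.label for p in prompts
              for c in checker.check_document(generate(p))]
    return sum(labels) / max(len(labels), 1)


def checker_eval(checker: CustChecker,
                 gold: List[Tuple[str, bool]]) -> float:
    """(iii) Gauge the checker itself against human-annotated (claim, label) pairs."""
    hits = sum(checker.verify(Claim(text)) == label for text, label in gold)
    return hits / max(len(gold), 1)
```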
Related papers
- Integrating Expert Knowledge into Logical Programs via LLMs [3.637365301757111]
ExKLoP is a framework designed to evaluate how effectively Large Language Models integrate expert knowledge into logical reasoning systems.
This capability is especially valuable in engineering, where expert knowledge-such as manufacturer-recommended operational ranges-can be directly embedded into automated monitoring systems.
arXiv Detail & Related papers (2025-02-17T19:18:23Z)
- FactLens: Benchmarking Fine-Grained Fact Verification [6.814173254027381]
We advocate for a shift toward fine-grained verification, where complex claims are broken down into smaller sub-claims for individual verification.
We introduce FactLens, a benchmark for evaluating fine-grained fact verification, with metrics and automated evaluators of sub-claim quality.
Our results show alignment between automated FactLens evaluators and human judgments, and we discuss the impact of sub-claim characteristics on the overall verification performance.
arXiv Detail & Related papers (2024-11-08T21:26:57Z)
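As a rough illustration of the fine-grained setup FactLens advocates, the sketch below assumes two hypothetical helpers: decompose, which splits a complex claim into atomic sub-claims (e.g., via an LLM prompt), and verify_one, any claim-level fact-checker. Neither is the paper's actual implementation.

```python
# Sketch of fine-grained verification in the FactLens spirit; decompose and
# verify_one are hypothetical stand-ins, not the paper's implementation.

from typing import Callable, Dict, List


def verify_fine_grained(claim: str,
                        decompose: Callable[[str], List[str]],
                        verify_one: Callable[[str], bool]) -> Dict[str, bool]:
    """Break a complex claim into sub-claims and verify each one individually."""
    return {sub: verify_one(sub) for sub in decompose(claim)}


def overall_verdict(sub_verdicts: Dict[str, bool]) -> bool:
    """The full claim is supported only if every sub-claim is supported."""
    return all(sub_verdicts.values())
```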
arXiv Detail & Related papers (2024-11-08T21:26:57Z) - OpenFactCheck: A Unified Framework for Factuality Evaluation of LLMs [64.25176233153657]
OpenFactCheck is an open-sourced fact-checking framework for large language models.
It allows users to easily customize an automatic fact-checking system.
It also assesses the factuality of all claims in an input document using that system.
arXiv Detail & Related papers (2024-08-06T15:49:58Z)
- Evidence-based Interpretable Open-domain Fact-checking with Large Language Models [26.89527395822654]
We introduce the Open-domain Explainable Fact-checking (OE-Fact) system for claim-checking in real-world scenarios.
The OE-Fact system can leverage the powerful understanding and reasoning capabilities of large language models (LLMs) to validate claims.
Experimental results show that our OE-Fact system outperforms general fact-checking baseline systems in both closed- and open-domain scenarios.
arXiv Detail & Related papers (2023-12-10T09:27:50Z)
- Factcheck-Bench: Fine-Grained Evaluation Benchmark for Automatic Fact-checkers [121.53749383203792]
We present a holistic end-to-end solution for annotating the factuality of LLM-generated responses.
We construct an open-domain document-level factuality benchmark with three levels of granularity: claim, sentence, and document.
Preliminary experiments show that FacTool, FactScore, and Perplexity struggle to identify false claims.
arXiv Detail & Related papers (2023-11-15T14:41:57Z)
- Generating Benchmarks for Factuality Evaluation of Language Models [61.69950787311278]
We propose FACTOR: Factual Assessment via Corpus TransfORmation, a scalable approach for evaluating LM factuality.
FACTOR automatically transforms a factual corpus of interest into a benchmark evaluating an LM's propensity to generate true facts from the corpus vs. similar but incorrect statements.
We show that: (i) our benchmark scores increase with model size and improve when the LM is augmented with retrieval; (ii) benchmark score and perplexity do not always agree on model ranking; (iii) when perplexity and benchmark score disagree, the latter better reflects factuality in open-ended generation.
arXiv Detail & Related papers (2023-07-13T17:14:38Z)
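The core recipe can be illustrated with a likelihood comparison: an item counts as correct when the LM scores the true statement above every perturbed variant. A minimal sketch, assuming a HuggingFace causal LM; this illustrates the idea rather than FACTOR's actual implementation.

```python
# Sketch of a FACTOR-style check: does the LM assign a higher likelihood to
# the true statement than to each similar-but-incorrect variant? The model
# choice and helper names are assumptions for illustration.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()


def total_log_likelihood(text: str) -> float:
    """Total log-probability the LM assigns to the token sequence."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean NLL over predicted tokens
    return -loss.item() * (ids.shape[1] - 1)  # convert mean back to a total


def factor_item_correct(true_stmt: str, false_variants: list[str]) -> bool:
    """One item counts as correct iff the true statement beats every variant."""
    true_ll = total_log_likelihood(true_stmt)
    return all(true_ll > total_log_likelihood(v) for v in false_variants)
```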
- Self-Checker: Plug-and-Play Modules for Fact-Checking with Large Language Models [75.75038268227554]
Self-Checker is a framework comprising a set of plug-and-play modules that facilitate fact-checking.
This framework provides a fast and efficient way to construct fact-checking systems in low-resource environments.
arXiv Detail & Related papers (2023-05-24T01:46:07Z)
- Benchmarking Answer Verification Methods for Question Answering-Based Summarization Evaluation Metrics [74.28810048824519]
Question answering-based summarization evaluation metrics must automatically determine whether the QA model's prediction is correct.
We benchmark the lexical answer verification methods used by current QA-based metrics, as well as two more sophisticated text comparison methods.
arXiv Detail & Related papers (2022-04-21T15:43:45Z)
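For context, the lexical verification methods in question are typically exact match and token-level F1 over normalized answers. A simplified sketch follows; real implementations also strip punctuation and articles during normalization.

```python
# Standard lexical answer-verification methods (exact match and token-level
# F1), as commonly used by QA-based metrics; the normalization here is a
# simplified assumption.

from collections import Counter


def normalize(text: str) -> list:
    """Lowercase and split into tokens; real metrics also strip punctuation/articles."""
    return text.lower().split()


def exact_match(prediction: str, gold: str) -> bool:
    return normalize(prediction) == normalize(gold)


def token_f1(prediction: str, gold: str) -> float:
    pred, ref = Counter(normalize(prediction)), Counter(normalize(gold))
    overlap = sum((pred & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```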
- Generating Fact Checking Explanations [52.879658637466605]
A crucial piece of the puzzle that is still missing is how to automate the most elaborate part of the process: generating justifications for verdicts on claims.
This paper provides the first study of how these explanations can be generated automatically based on available claim context.
Our results indicate that optimising both objectives at the same time, rather than training them separately, improves the performance of a fact checking system.
arXiv Detail & Related papers (2020-04-13T05:23:25Z)
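The joint-training idea from this last entry can be sketched as a shared encoder with two heads whose losses are summed. The encoder and head structure below are assumed scaffolding; only the combined objective reflects the paper's finding.

```python
# Schematic of joint training for fact checking: one shared encoder, a
# veracity-classification head, and an explanation-generation head, optimized
# together rather than separately. Module structure is an assumption.

import torch.nn as nn


class JointFactChecker(nn.Module):
    def __init__(self, encoder: nn.Module, hidden: int, num_labels: int, vocab: int):
        super().__init__()
        self.encoder = encoder                        # shared claim+evidence encoder
        self.veracity_head = nn.Linear(hidden, num_labels)
        self.explain_head = nn.Linear(hidden, vocab)  # per-token generation head

    def forward(self, inputs, labels, explanation_tokens):
        h = self.encoder(inputs)                      # (batch, seq, hidden)
        veracity_loss = nn.functional.cross_entropy(
            self.veracity_head(h[:, 0]), labels)      # classify from first position
        explain_loss = nn.functional.cross_entropy(
            self.explain_head(h).transpose(1, 2),     # (batch, vocab, seq)
            explanation_tokens)
        return veracity_loss + explain_loss           # joint objective
```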