OpenFactCheck: A Unified Framework for Factuality Evaluation of LLMs
- URL: http://arxiv.org/abs/2405.05583v1
- Date: Thu, 9 May 2024 07:15:19 GMT
- Title: OpenFactCheck: A Unified Framework for Factuality Evaluation of LLMs
- Authors: Yuxia Wang, Minghan Wang, Hasan Iqbal, Georgi Georgiev, Jiahui Geng, Preslav Nakov,
- Abstract summary: OpenFactCheck is a unified factuality evaluation framework for large language models.
OpenFactCheck consists of three modules: (i) CUSTCHECKER, (ii) LLMEVAL, and (iii) CHECKEREVAL.
- Score: 27.89053798151106
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The increased use of large language models (LLMs) across a variety of real-world applications calls for mechanisms to verify the factual accuracy of their outputs. Difficulties lie in assessing the factuality of free-form responses in open domains. Also, different papers use disparate evaluation benchmarks and measurements, which renders them hard to compare and hampers future progress. To mitigate these issues, we propose OpenFactCheck, a unified factuality evaluation framework for LLMs. OpenFactCheck consists of three modules: (i) CUSTCHECKER allows users to easily customize an automatic fact-checker and verify the factual correctness of documents and claims, (ii) LLMEVAL, a unified evaluation framework assesses LLM's factuality ability from various perspectives fairly, and (iii) CHECKEREVAL is an extensible solution for gauging the reliability of automatic fact-checkers' verification results using human-annotated datasets. OpenFactCheck is publicly released at https://github.com/yuxiaw/OpenFactCheck.
Related papers
- OpenFactCheck: A Unified Framework for Factuality Evaluation of LLMs [64.25176233153657]
OpenFactCheck is an open-sourced fact-checking framework for large language models.
It allows users to easily customize an automatic fact-checking system.
It also assesses the factuality of all claims in an input document using that system.
arXiv Detail & Related papers (2024-08-06T15:49:58Z) - CheckEmbed: Effective Verification of LLM Solutions to Open-Ended Tasks [15.60762281287532]
Large Language Models (LLMs) are revolutionizing various domains, yet verifying their answers remains a significant challenge.
In this work, we propose CheckEmbed: an accurate, scalable, and simple LLM verification approach.
CheckEmbed is driven by a straightforward yet powerful idea: compare their corresponding answer-level embeddings obtained with a model such as GPT Text Embedding Large.
arXiv Detail & Related papers (2024-06-04T17:42:21Z) - GenAudit: Fixing Factual Errors in Language Model Outputs with Evidence [64.95492752484171]
We present GenAudit -- a tool intended to assist fact-checking LLM responses for document-grounded tasks.
We train models to execute these tasks, and design an interactive interface to present suggested edits and evidence to users.
To ensure that most errors are flagged by the system, we propose a method that can increase the error recall while minimizing impact on precision.
arXiv Detail & Related papers (2024-02-19T21:45:55Z) - Evidence-based Interpretable Open-domain Fact-checking with Large
Language Models [26.89527395822654]
We introduce the Open-domain Explainable Fact-checking (OE-Fact) system for claim-checking in real-world scenarios.
The OE-Fact system can leverage the powerful understanding and reasoning capabilities of large language models (LLMs) to validate claims.
Experimental results show that our OE-Fact system outperforms general fact-checking baseline systems in both closed- and open-domain scenarios.
arXiv Detail & Related papers (2023-12-10T09:27:50Z) - Factcheck-Bench: Fine-Grained Evaluation Benchmark for Automatic Fact-checkers [121.53749383203792]
We present a holistic end-to-end solution for annotating the factuality of large language models (LLMs)-generated responses.
We construct an open-domain document-level factuality benchmark in three-level granularity: claim, sentence and document.
Preliminary experiments show that FacTool, FactScore and Perplexity are struggling to identify false claims.
arXiv Detail & Related papers (2023-11-15T14:41:57Z) - Generating Benchmarks for Factuality Evaluation of Language Models [61.69950787311278]
We propose FACTOR: Factual Assessment via Corpus TransfORmation, a scalable approach for evaluating LM factuality.
FACTOR automatically transforms a factual corpus of interest into a benchmark evaluating an LM's propensity to generate true facts from the corpus vs. similar but incorrect statements.
We show that: (i) our benchmark scores increase with model size and improve when the LM is augmented with retrieval; (ii) benchmark score and perplexity do not always agree on model ranking; (iii) when perplexity and benchmark score disagree, the latter better reflects factuality in open-ended generation.
arXiv Detail & Related papers (2023-07-13T17:14:38Z) - Self-Checker: Plug-and-Play Modules for Fact-Checking with Large Language Models [75.75038268227554]
Self-Checker is a framework comprising a set of plug-and-play modules that facilitate fact-checking.
This framework provides a fast and efficient way to construct fact-checking systems in low-resource environments.
arXiv Detail & Related papers (2023-05-24T01:46:07Z) - SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for
Generative Large Language Models [55.60306377044225]
"SelfCheckGPT" is a simple sampling-based approach to fact-check the responses of black-box models.
We investigate this approach by using GPT-3 to generate passages about individuals from the WikiBio dataset.
arXiv Detail & Related papers (2023-03-15T19:31:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.