OpenFactCheck: A Unified Framework for Factuality Evaluation of LLMs
- URL: http://arxiv.org/abs/2408.11832v2
- Date: Wed, 6 Nov 2024 18:07:03 GMT
- Title: OpenFactCheck: A Unified Framework for Factuality Evaluation of LLMs
- Authors: Hasan Iqbal, Yuxia Wang, Minghan Wang, Georgi Georgiev, Jiahui Geng, Iryna Gurevych, Preslav Nakov,
- Abstract summary: OpenFactCheck is an open-sourced fact-checking framework for large language models.
It allows users to easily customize an automatic fact-checking system.
It also assesses the factuality of all claims in an input document using that system.
- Score: 64.25176233153657
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The increased use of large language models (LLMs) across a variety of real-world applications calls for automatic tools to check the factual accuracy of their outputs, as LLMs often hallucinate. This is difficult as it requires assessing the factuality of free-form open-domain responses. While there has been a lot of research on this topic, different papers use different evaluation benchmarks and measures, which makes them hard to compare and hampers future progress. To mitigate these issues, we developed OpenFactCheck, a unified framework, with three modules: (i) RESPONSEEVAL, which allows users to easily customize an automatic fact-checking system and to assess the factuality of all claims in an input document using that system, (ii) LLMEVAL, which assesses the overall factuality of an LLM, and (iii) CHECKEREVAL, a module to evaluate automatic fact-checking systems. OpenFactCheck is open-sourced (https://github.com/mbzuai-nlp/openfactcheck) and publicly released as a Python library (https://pypi.org/project/openfactcheck/) and also as a web service (http://app.openfactcheck.com). A video describing the system is available at https://youtu.be/-i9VKL0HleI.
Related papers
- HerO at AVeriTeC: The Herd of Open Large Language Models for Verifying Real-World Claims [6.792233590302494]
We introduce a system that only employs publicly available large language models (LLMs) for each step of automated fact-checking.
For evidence retrieval, a language model is used to enhance a query by generating hypothetical fact-checking documents.
HerO achieved 2nd place on the leaderboard with the AVeriTeC score of 0.57, suggesting the potential of open LLMs for verifying real-world claims.
arXiv Detail & Related papers (2024-10-16T08:49:17Z) - Evidence-backed Fact Checking using RAG and Few-Shot In-Context Learning with LLMs [9.785096589765908]
We use the Averitec dataset to assess the performance of our fact-checking system.
In addition to veracity prediction, our system provides supporting evidence, which is extracted from the dataset.
Our system achieves an 'Averitec' score of 0.33, which is a 22% absolute improvement over the baseline.
arXiv Detail & Related papers (2024-08-22T01:42:34Z) - OpenFactCheck: A Unified Framework for Factuality Evaluation of LLMs [27.89053798151106]
OpenFactCheck is a unified factuality evaluation framework for large language models.
OpenFactCheck consists of three modules: (i) CUSTCHECKER, (ii) LLMEVAL, and (iii) CHECKEREVAL.
arXiv Detail & Related papers (2024-05-09T07:15:19Z) - Factcheck-Bench: Fine-Grained Evaluation Benchmark for Automatic Fact-checkers [121.53749383203792]
We present a holistic end-to-end solution for annotating the factuality of large language models (LLMs)-generated responses.
We construct an open-domain document-level factuality benchmark in three-level granularity: claim, sentence and document.
Preliminary experiments show that FacTool, FactScore and Perplexity are struggling to identify false claims.
arXiv Detail & Related papers (2023-11-15T14:41:57Z) - FreshLLMs: Refreshing Large Language Models with Search Engine
Augmentation [92.43001160060376]
We study the factuality of large language models (LLMs) in the context of answering questions that test current world knowledge.
We introduce FreshQA, a novel dynamic QA benchmark encompassing a diverse range of question and answer types.
We benchmark a diverse array of both closed and open-source LLMs under a two-mode evaluation procedure that allows us to measure both correctness and hallucination.
Motivated by these results, we present FreshPrompt, a simple few-shot prompting method that substantially boosts the performance of an LLM on FreshQA.
arXiv Detail & Related papers (2023-10-05T00:04:12Z) - Generating Benchmarks for Factuality Evaluation of Language Models [61.69950787311278]
We propose FACTOR: Factual Assessment via Corpus TransfORmation, a scalable approach for evaluating LM factuality.
FACTOR automatically transforms a factual corpus of interest into a benchmark evaluating an LM's propensity to generate true facts from the corpus vs. similar but incorrect statements.
We show that: (i) our benchmark scores increase with model size and improve when the LM is augmented with retrieval; (ii) benchmark score and perplexity do not always agree on model ranking; (iii) when perplexity and benchmark score disagree, the latter better reflects factuality in open-ended generation.
arXiv Detail & Related papers (2023-07-13T17:14:38Z) - Self-Checker: Plug-and-Play Modules for Fact-Checking with Large Language Models [75.75038268227554]
Self-Checker is a framework comprising a set of plug-and-play modules that facilitate fact-checking.
This framework provides a fast and efficient way to construct fact-checking systems in low-resource environments.
arXiv Detail & Related papers (2023-05-24T01:46:07Z) - SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for
Generative Large Language Models [55.60306377044225]
"SelfCheckGPT" is a simple sampling-based approach to fact-check the responses of black-box models.
We investigate this approach by using GPT-3 to generate passages about individuals from the WikiBio dataset.
arXiv Detail & Related papers (2023-03-15T19:31:21Z) - WikiCheck: An end-to-end open source Automatic Fact-Checking API based
on Wikipedia [1.14219428942199]
We review the State-of-the-Art datasets and solutions for Automatic Fact-checking.
We propose a data filtering method that improves the model's performance and generalization.
We present a new fact-checking system, the textitWikiCheck API that automatically performs a facts validation process based on the Wikipedia knowledge base.
arXiv Detail & Related papers (2021-09-02T10:45:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.