UFO: a Unified and Flexible Framework for Evaluating Factuality of Large Language Models
- URL: http://arxiv.org/abs/2402.14690v1
- Date: Thu, 22 Feb 2024 16:45:32 GMT
- Title: UFO: a Unified and Flexible Framework for Evaluating Factuality of Large Language Models
- Authors: Zhaoheng Huang, Zhicheng Dou, Yutao Zhu, Ji-Rong Wen
- Abstract summary: Large language models (LLMs) may generate text that lacks consistency with human knowledge, leading to factual inaccuracies or \textit{hallucination}.
We propose \texttt{UFO}, an LLM-based unified and flexible evaluation framework to verify facts against plug-and-play fact sources.
- Score: 73.73303148524398
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) may generate text that lacks consistency with
human knowledge, leading to factual inaccuracies or \textit{hallucination}.
Existing research for evaluating the factuality of LLMs involves extracting
fact claims using an LLM and verifying them against a predefined fact source.
However, these evaluation metrics are task-specific and not scalable, and the
substitutability of fact sources across different tasks is under-explored. To
address these challenges, we categorize four available fact sources:
human-written evidence, reference documents, search engine results, and LLM
knowledge, along with five text generation tasks containing six representative
datasets. Then, we propose \texttt{UFO}, an LLM-based unified and flexible
evaluation framework to verify facts against plug-and-play fact sources. We
implement five evaluation scenarios based on this framework. Experimental
results show that for most QA tasks, human-written evidence and reference
documents are crucial, and they can substitute for each other in
retrieval-augmented QA tasks. In news fact generation tasks, search engine
results and LLM knowledge are essential. Our dataset and code are available at
\url{https://github.com/WaldenRUC/UFO}.
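The abstract describes a pipeline in which fact claims are first extracted from model output by an LLM and then verified against an interchangeable fact source. Below is a minimal sketch of that plug-and-play idea, not the released UFO implementation; `FactSource`, `call_llm`, and the prompts are hypothetical placeholders (the actual code is in the linked repository).

```python
# Hypothetical sketch of LLM-based fact verification with a pluggable fact source.
# Not the UFO API; names and prompts are illustrative placeholders.
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Verdict:
    claim: str
    supported: bool


class FactSource(Protocol):
    """A plug-and-play evidence source: human-written evidence, reference
    documents, search engine results, or LLM knowledge."""

    def retrieve(self, claim: str) -> str:
        ...


def call_llm(prompt: str) -> str:
    """Stand-in for an evaluator LLM call (assumed, not part of UFO)."""
    raise NotImplementedError


def extract_claims(answer: str) -> list[str]:
    """Ask the evaluator LLM to split an answer into atomic fact claims."""
    raw = call_llm(f"List the atomic factual claims in the text, one per line:\n{answer}")
    return [line.strip() for line in raw.splitlines() if line.strip()]


def verify(answer: str, source: FactSource) -> list[Verdict]:
    """Check every extracted claim against whichever fact source is plugged in."""
    verdicts = []
    for claim in extract_claims(answer):
        evidence = source.retrieve(claim)
        judgment = call_llm(
            f"Evidence:\n{evidence}\n\nClaim: {claim}\nIs the claim supported? Answer yes or no."
        )
        verdicts.append(Verdict(claim=claim, supported=judgment.strip().lower().startswith("yes")))
    return verdicts
```

Under this framing, swapping the fact source (for example, a search-engine wrapper in place of a reference-document store) changes the evaluation scenario without touching the claim-extraction or judgment steps, which is the substitutability question the paper studies across its five scenarios.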
Related papers
- DOCBENCH: A Benchmark for Evaluating LLM-based Document Reading Systems [99.17123445211115]
We introduce DocBench, a benchmark to evaluate large language model (LLM)-based document reading systems.
Our benchmark is built by recruiting human annotators and generating synthetic questions.
It includes 229 real documents and 1,102 questions, spanning five domains and four major question types.
arXiv Detail & Related papers (2024-07-15T13:17:42Z) - Attribute or Abstain: Large Language Models as Long Document Assistants [58.32043134560244]
We present a benchmark of 6 diverse long document tasks with attribution, and experiment with different approaches to attribution on 4 long documents.
We find that citation, i.e., response generation and evidence extraction in one step, mostly performs best.
We also find that evidence quality can predict response quality on datasets with simple responses, but not so for complex responses.
arXiv Detail & Related papers (2024-07-10T16:16:02Z) - Systematic Task Exploration with LLMs: A Study in Citation Text Generation [63.50597360948099]
Large language models (LLMs) bring unprecedented flexibility in defining and executing complex, creative natural language generation (NLG) tasks.
We propose a three-component research framework that consists of systematic input manipulation, reference data, and output measurement.
We use this framework to explore citation text generation -- a popular scholarly NLP task that lacks consensus on the task definition and evaluation metric.
arXiv Detail & Related papers (2024-07-04T16:41:08Z) - RepLiQA: A Question-Answering Dataset for Benchmarking LLMs on Unseen Reference Content [13.187520657952263]
Large Language Models (LLMs) are trained on vast amounts of data, most of which is automatically scraped from the internet.
Hence, evaluating models on test splits that might have leaked into the training set is prone to misleading conclusions.
We introduce a new test dataset named RepLiQA, suited for question-answering and topic retrieval tasks.
arXiv Detail & Related papers (2024-06-17T17:52:54Z) - Peering into the Mind of Language Models: An Approach for Attribution in Contextual Question Answering [9.86691461253151]
We introduce a novel method for attribution in contextual question answering, leveraging the hidden state representations of large language models (LLMs).
Our approach bypasses the need for extensive model retraining and retrieval model overhead, offering granular attributions and preserving the quality of generated answers.
We present Verifiability-granular, an attribution dataset with token-level annotations for LLM generations in the contextual question answering setup.
arXiv Detail & Related papers (2024-05-28T09:12:44Z) - Benchmarking LLMs on the Semantic Overlap Summarization Task [9.656095701778975]
This paper comprehensively evaluates Large Language Models (LLMs) on the Semantic Overlap Summarization (SOS) task.
We report well-established metrics such as ROUGE, BERTScore, and SEM-F1 on two different datasets of alternative narratives.
arXiv Detail & Related papers (2024-02-26T20:33:50Z) - When LLMs Meet Cunning Texts: A Fallacy Understanding Benchmark for Large Language Models [59.84769254832941]
We propose a FaLlacy Understanding Benchmark (FLUB) containing cunning texts that are easy for humans to understand but difficult for models to grasp.
Specifically, the cunning texts that FLUB focuses on consist mainly of tricky, humorous, and misleading texts collected from the real internet environment.
Based on FLUB, we investigate the performance of multiple representative and advanced LLMs.
arXiv Detail & Related papers (2024-02-16T22:12:53Z) - DIVKNOWQA: Assessing the Reasoning Ability of LLMs via Open-Domain
Question Answering over Knowledge Base and Text [73.68051228972024]
Large Language Models (LLMs) have exhibited impressive generation capabilities, but they suffer from hallucinations when relying on their internal knowledge.
Retrieval-augmented LLMs have emerged as a potential solution to ground LLMs in external knowledge.
arXiv Detail & Related papers (2023-10-31T04:37:57Z) - Enabling Large Language Models to Generate Text with Citations [37.64884969997378]
Large language models (LLMs) have emerged as a widely-used tool for information seeking.
Our aim is to allow LLMs to generate text with citations, improving their factual correctness and verifiability.
We propose ALCE, the first benchmark for Automatic LLMs' Citation Evaluation.
arXiv Detail & Related papers (2023-05-24T01:53:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.