UFO: a Unified and Flexible Framework for Evaluating Factuality of Large Language Models
- URL: http://arxiv.org/abs/2402.14690v1
- Date: Thu, 22 Feb 2024 16:45:32 GMT
- Title: UFO: a Unified and Flexible Framework for Evaluating Factuality of Large Language Models
- Authors: Zhaoheng Huang, Zhicheng Dou, Yutao Zhu, Ji-Rong Wen
- Abstract summary: Large language models (LLMs) may generate text that lacks consistency with human knowledge, leading to factual inaccuracies or hallucination.
We propose UFO, an LLM-based unified and flexible evaluation framework to verify facts against plug-and-play fact sources.
- Score: 73.73303148524398
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) may generate text that lacks consistency with
human knowledge, leading to factual inaccuracies or \textit{hallucination}.
Existing research for evaluating the factuality of LLMs involves extracting
fact claims using an LLM and verifying them against a predefined fact source.
However, these evaluation metrics are task-specific and not scalable, and the
substitutability of fact sources across different tasks is under-explored. To
address these challenges, we categorize four available fact sources:
human-written evidence, reference documents, search engine results, and LLM
knowledge, along with five text generation tasks containing six representative
datasets. Then, we propose \texttt{UFO}, an LLM-based unified and flexible
evaluation framework to verify facts against plug-and-play fact sources. We
implement five evaluation scenarios based on this framework. Experimental
results show that for most QA tasks, human-written evidence and reference
documents are crucial, and they can substitute for each other in
retrieval-augmented QA tasks. In news fact generation tasks, search engine
results and LLM knowledge are essential. Our dataset and code are available at
\url{https://github.com/WaldenRUC/UFO}.
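The core recipe the abstract describes — use an LLM to decompose generated text into atomic fact claims, then verify each claim against a swappable fact source — can be sketched in a few lines. The sketch below is a minimal illustration only: the names `FactSource`, `ReferenceDocuments`, `extract_claims`, `verify`, and the `llm` callable (prompt in, text out) are hypothetical and not taken from the authors' repository; the actual implementation is at https://github.com/WaldenRUC/UFO.

```python
# Hypothetical sketch of a UFO-style factuality evaluation loop.
# Identifiers here are illustrative, not the authors' API.
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Verdict:
    claim: str
    supported: bool
    evidence: str


class FactSource(ABC):
    """Plug-and-play fact source: human evidence, reference docs,
    search results, or LLM knowledge."""

    @abstractmethod
    def retrieve(self, claim: str) -> str:
        """Return evidence text relevant to the claim."""


class ReferenceDocuments(FactSource):
    def __init__(self, docs: list[str]):
        self.docs = docs

    def retrieve(self, claim: str) -> str:
        # Toy lexical-overlap retrieval; a real system would use a
        # stronger retriever.
        words = set(claim.lower().split())
        return max(self.docs, key=lambda d: len(words & set(d.lower().split())))


def extract_claims(text: str, llm) -> list[str]:
    """Ask an LLM to decompose generated text into atomic fact claims."""
    prompt = f"List each atomic factual claim in the text, one per line:\n{text}"
    return [line.strip() for line in llm(prompt).splitlines() if line.strip()]


def verify(claims: list[str], source: FactSource, llm) -> list[Verdict]:
    """Check each claim against evidence drawn from the chosen fact source."""
    verdicts = []
    for claim in claims:
        evidence = source.retrieve(claim)
        prompt = (f"Evidence: {evidence}\nClaim: {claim}\n"
                  "Answer 'yes' if the evidence supports the claim, else 'no'.")
        supported = llm(prompt).strip().lower().startswith("yes")
        verdicts.append(Verdict(claim, supported, evidence))
    return verdicts


def factuality_score(verdicts: list[Verdict]) -> float:
    """Fraction of extracted claims supported by the fact source."""
    return sum(v.supported for v in verdicts) / max(len(verdicts), 1)
```

Under this framing, swapping fact sources is a one-argument change — e.g. replacing `ReferenceDocuments` with a search-engine-backed source — which mirrors the paper's finding that human-written evidence and reference documents can substitute for each other in retrieval-augmented QA.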
Related papers
- FaithEval: Can Your Language Model Stay Faithful to Context, Even If "The Moon is Made of Marshmallows" [74.7488607599921]
FaithEval is a benchmark to evaluate the faithfulness of large language models (LLMs) in contextual scenarios.
FaithEval comprises 4.9K high-quality problems in total, validated through a rigorous four-stage context construction and validation framework.
arXiv Detail & Related papers (2024-09-30T06:27:53Z)
- Scaling Up Summarization: Leveraging Large Language Models for Long Text Extractive Summarization [0.27624021966289597]
This paper introduces EYEGLAXS, a framework that leverages Large Language Models (LLMs) for extractive summarization.
EYEGLAXS focuses on extractive summarization to ensure factual and grammatical integrity.
The system sets new performance benchmarks on well-known datasets like PubMed and ArXiv.
arXiv Detail & Related papers (2024-08-28T13:52:19Z)
- Attribute or Abstain: Large Language Models as Long Document Assistants [58.32043134560244]
LLMs can help humans working with long documents, but are known to hallucinate.
Existing approaches to attribution have only been evaluated in RAG settings, where the initial retrieval confounds LLM performance.
This is crucially different from the long document setting, where retrieval is not needed, but could help.
We present LAB, a benchmark of 6 diverse long document tasks with attribution, and experiments with different approaches to attribution on 5 LLMs of different sizes.
arXiv Detail & Related papers (2024-07-10T16:16:02Z)
- Systematic Task Exploration with LLMs: A Study in Citation Text Generation [63.50597360948099]
Large language models (LLMs) bring unprecedented flexibility in defining and executing complex, creative natural language generation (NLG) tasks.
We propose a three-component research framework that consists of systematic input manipulation, reference data, and output measurement.
We use this framework to explore citation text generation -- a popular scholarly NLP task that lacks consensus on the task definition and evaluation metric.
arXiv Detail & Related papers (2024-07-04T16:41:08Z)
- RepLiQA: A Question-Answering Dataset for Benchmarking LLMs on Unseen Reference Content [13.187520657952263]
Large Language Models (LLMs) are trained on vast amounts of data, most of which is automatically scraped from the internet.
Evaluating models on test splits that might have leaked into the training set can lead to misleading conclusions.
We introduce a new test dataset named RepLiQA, suited for question-answering and topic retrieval tasks.
arXiv Detail & Related papers (2024-06-17T17:52:54Z)
- Peering into the Mind of Language Models: An Approach for Attribution in Contextual Question Answering [9.86691461253151]
We introduce a novel method for attribution in contextual question answering, leveraging the hidden state representations of large language models (LLMs).
Our approach bypasses the need for extensive model retraining and retrieval model overhead, offering granular attributions and preserving the quality of generated answers.
We present Verifiability-granular, an attribution dataset which has token level annotations for LLM generations in the contextual question answering setup.
arXiv Detail & Related papers (2024-05-28T09:12:44Z)
- When LLMs Meet Cunning Texts: A Fallacy Understanding Benchmark for Large Language Models [59.84769254832941]
We propose a FaLlacy Understanding Benchmark (FLUB) containing cunning texts that are easy for humans to understand but difficult for models to grasp.
Specifically, the cunning texts that FLUB focuses on mainly consist of tricky, humorous, and misleading texts collected from real internet environments.
Based on FLUB, we investigate the performance of multiple representative and advanced LLMs.
arXiv Detail & Related papers (2024-02-16T22:12:53Z)
- DIVKNOWQA: Assessing the Reasoning Ability of LLMs via Open-Domain Question Answering over Knowledge Base and Text [73.68051228972024]
Large Language Models (LLMs) have exhibited impressive generation capabilities, but they suffer from hallucinations when relying on their internal knowledge.
Retrieval-augmented LLMs have emerged as a potential solution to ground LLMs in external knowledge.
arXiv Detail & Related papers (2023-10-31T04:37:57Z)
- Enabling Large Language Models to Generate Text with Citations [37.64884969997378]
Large language models (LLMs) have emerged as a widely-used tool for information seeking.
Our aim is to allow LLMs to generate text with citations, improving their factual correctness and verifiability.
We propose ALCE, the first benchmark for Automatic LLMs' Citation Evaluation.
arXiv Detail & Related papers (2023-05-24T01:53:49Z)