UFO: a Unified and Flexible Framework for Evaluating Factuality of Large
Language Models
- URL: http://arxiv.org/abs/2402.14690v1
- Date: Thu, 22 Feb 2024 16:45:32 GMT
- Title: UFO: a Unified and Flexible Framework for Evaluating Factuality of Large
Language Models
- Authors: Zhaoheng Huang, Zhicheng Dou, Yutao Zhu, Ji-rong Wen
- Abstract summary: Large language models (LLMs) may generate text that lacks consistency with human knowledge, leading to factual inaccuracies or textithallucination.
We propose textttUFO, an LLM-based unified and flexible evaluation framework to verify facts against plug-and-play fact sources.
- Score: 73.73303148524398
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) may generate text that lacks consistency with
human knowledge, leading to factual inaccuracies or \textit{hallucination}.
Existing research for evaluating the factuality of LLMs involves extracting
fact claims using an LLM and verifying them against a predefined fact source.
However, these evaluation metrics are task-specific, and not scalable, and the
substitutability of fact sources in different tasks is under-explored. To
address these challenges, we categorize four available fact sources:
human-written evidence, reference documents, search engine results, and LLM
knowledge, along with five text generation tasks containing six representative
datasets. Then, we propose \texttt{UFO}, an LLM-based unified and flexible
evaluation framework to verify facts against plug-and-play fact sources. We
implement five evaluation scenarios based on this framework. Experimental
results show that for most QA tasks, human-written evidence and reference
documents are crucial, and they can substitute for each other in
retrieval-augmented QA tasks. In news fact generation tasks, search engine
results and LLM knowledge are essential. Our dataset and code are available at
\url{https://github.com/WaldenRUC/UFO}.
Related papers
- Unstructured Evidence Attribution for Long Context Query Focused Summarization [46.713307974729844]
Large language models (LLMs) are capable of generating coherent summaries from very long contexts given a user query.
We show how existing systems struggle to generate and properly cite unstructured evidence from their context.
arXiv Detail & Related papers (2025-02-20T09:57:42Z) - Extract Information from Hybrid Long Documents Leveraging LLMs: A Framework and Dataset [52.286323454512996]
Large Language Models (LLMs) can comprehend and analyze hybrid text, containing textual and tabular data.
We propose an Automated Information Extraction framework (AIE) to enable LLMs to process the hybrid long documents (HLDs) and carry out experiments to analyse four important aspects of information extraction from HLDs.
To address the issue of dataset scarcity in HLDs and support future work, we also propose the Financial Reports Numerical Extraction (FINE) dataset.
arXiv Detail & Related papers (2024-12-28T07:54:14Z) - Truth or Mirage? Towards End-to-End Factuality Evaluation with LLM-Oasis [78.07225438556203]
We introduce LLM-Oasis, the largest resource for training end-to-end factuality evaluators.
It is constructed by extracting claims from Wikipedia, falsifying a subset of these claims, and generating pairs of factual and unfactual texts.
We then rely on human annotators to both validate the quality of our dataset and to create a gold standard test set for factuality evaluation systems.
arXiv Detail & Related papers (2024-11-29T12:21:15Z) - FaithEval: Can Your Language Model Stay Faithful to Context, Even If "The Moon is Made of Marshmallows" [74.7488607599921]
FaithEval is a benchmark to evaluate the faithfulness of large language models (LLMs) in contextual scenarios.
FaithEval comprises 4.9K high-quality problems in total, validated through a rigorous four-stage context construction and validation framework.
arXiv Detail & Related papers (2024-09-30T06:27:53Z) - Systematic Task Exploration with LLMs: A Study in Citation Text Generation [63.50597360948099]
Large language models (LLMs) bring unprecedented flexibility in defining and executing complex, creative natural language generation (NLG) tasks.
We propose a three-component research framework that consists of systematic input manipulation, reference data, and output measurement.
We use this framework to explore citation text generation -- a popular scholarly NLP task that lacks consensus on the task definition and evaluation metric.
arXiv Detail & Related papers (2024-07-04T16:41:08Z) - RepLiQA: A Question-Answering Dataset for Benchmarking LLMs on Unseen Reference Content [13.187520657952263]
Large Language Models (LLMs) are trained on vast amounts of data, most of which is automatically scraped from the internet.
evaluating models on test splits that might have leaked into the training set is prone to misleading conclusions.
We introduce a new test dataset named RepLiQA, suited for question-answering and topic retrieval tasks.
arXiv Detail & Related papers (2024-06-17T17:52:54Z) - Towards Reliable Detection of LLM-Generated Texts: A Comprehensive Evaluation Framework with CUDRT [9.682499180341273]
Large language models (LLMs) have significantly advanced text generation, but the human-like quality of their outputs presents major challenges.
We propose CUDRT, a comprehensive evaluation framework and bilingual benchmark in Chinese and English.
This framework supports scalable, reproducible experiments and enables analysis of how operational diversity, multilingual training sets, and LLM architectures influence detection performance.
arXiv Detail & Related papers (2024-06-13T12:43:40Z) - DIVKNOWQA: Assessing the Reasoning Ability of LLMs via Open-Domain
Question Answering over Knowledge Base and Text [73.68051228972024]
Large Language Models (LLMs) have exhibited impressive generation capabilities, but they suffer from hallucinations when relying on their internal knowledge.
Retrieval-augmented LLMs have emerged as a potential solution to ground LLMs in external knowledge.
arXiv Detail & Related papers (2023-10-31T04:37:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.