Factual Consistency Evaluation of Summarisation in the Era of Large
Language Models
- URL: http://arxiv.org/abs/2402.13758v1
- Date: Wed, 21 Feb 2024 12:35:19 GMT
- Title: Factual Consistency Evaluation of Summarisation in the Era of Large
Language Models
- Authors: Zheheng Luo, Qianqian Xie, Sophia Ananiadou
- Abstract summary: Existing factual consistency metrics are constrained by their performance, efficiency, and explainability.
Recent advances in large language models (LLMs) have demonstrated remarkable potential in text evaluation.
- Score: 38.8292168447796
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Factual inconsistency with source documents in automatically generated
summaries can lead to misinformation or pose risks. Existing factual
consistency (FC) metrics are constrained by their performance, efficiency, and
explainability. Recent advances in large language models (LLMs) have
demonstrated remarkable potential in text evaluation but their effectiveness in
assessing FC in summarisation remains underexplored. Prior research has mostly
focused on proprietary LLMs, leaving essential factors that affect their
assessment capabilities unexplored. Additionally, current FC evaluation
benchmarks are restricted to news articles, casting doubt on the generality of
the FC methods tested on them. In this paper, we first address the gap by
introducing TreatFact, a dataset of LLM-generated summaries of clinical texts,
annotated for FC by domain experts. Moreover, we benchmark 11 LLMs for FC
evaluation across news and clinical domains and analyse the impact of model
size, prompts, pre-training and fine-tuning data. Our findings reveal that
proprietary models prevail on the task, while open-source LLMs lag behind.
Nevertheless, there is potential for enhancing the performance of open-source
LLMs through increasing model size, expanding pre-training data, and developing
well-curated fine-tuning data. Experiments on TreatFact suggest that both
previous methods and LLM-based evaluators are unable to capture factual
inconsistencies in clinical summaries, posing a new challenge for FC
evaluation.
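The benchmarking described above evaluates FC by prompting LLMs to judge whether a summary is supported by its source document. As a rough sketch only (not the paper's exact prompts, models, or scoring protocol), the snippet below illustrates a zero-shot yes/no FC judgement with an open-source instruction-tuned model through the Hugging Face pipeline API; the model name, prompt wording, and answer parsing are illustrative assumptions.
```python
# Minimal sketch of prompt-based factual-consistency (FC) judgement.
# Model choice and prompt are assumptions, not the paper's protocol.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.2",  # assumed example open-source LLM
)

PROMPT = (
    "Decide whether the summary is factually consistent with the document.\n"
    "Answer with a single word: Yes or No.\n\n"
    "Document:\n{document}\n\nSummary:\n{summary}\n\nAnswer:"
)

def judge_consistency(document: str, summary: str) -> bool:
    """Return True if the model labels the summary as factually consistent."""
    prompt = PROMPT.format(document=document, summary=summary)
    output = generator(prompt, max_new_tokens=5, do_sample=False)[0]["generated_text"]
    answer = output[len(prompt):].strip().lower()  # keep only the generated continuation
    return answer.startswith("yes")

# Toy usage: the summary alters the patient count, so the expected label is False.
doc = "The trial enrolled 120 patients and reported a 15% reduction in symptoms."
summ = "The trial enrolled 200 patients and reported a 15% reduction in symptoms."
print(judge_consistency(doc, summ))
```
In a benchmarking setting, such judgements would be compared against expert FC annotations (e.g. via accuracy or correlation) to assess how well a given LLM serves as an FC evaluator.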
Related papers
- Towards More Effective Table-to-Text Generation: Assessing In-Context Learning and Self-Evaluation with Open-Source Models [0.0]
This study explores the effectiveness of various in-context learning strategies in language models (LMs) across benchmark datasets.
We employ a large language model (LLM) self-evaluation approach using chain-of-thought reasoning and assess its correlation with human-aligned metrics like BERTScore.
Our findings highlight the significant impact of examples in improving table-to-text generation and suggest that, while LLM self-evaluation has potential, its current alignment with human judgment could be enhanced.
arXiv Detail & Related papers (2024-10-15T09:19:42Z)
- FedEval-LLM: Federated Evaluation of Large Language Models on Downstream Tasks with Collective Wisdom [19.104850413126066]
Federated Learning (FL) has emerged as a promising solution for collaborative training of large language models (LLMs).
Traditional evaluation methods that rely on labeled test sets and similarity-based metrics cover only a subset of the acceptable answers.
We propose FedEval-LLM, which provides reliable performance measurements of LLMs on downstream tasks without relying on labeled test sets or external tools.
arXiv Detail & Related papers (2024-04-18T15:46:26Z)
- Unveiling the Misuse Potential of Base Large Language Models via In-Context Learning [61.2224355547598]
Open-sourcing of large language models (LLMs) accelerates application development, innovation, and scientific progress.
Our investigation exposes a critical oversight in the assumption that base LLMs pose little misuse risk.
By deploying carefully designed demonstrations, our research demonstrates that base LLMs could effectively interpret and execute malicious instructions.
arXiv Detail & Related papers (2024-04-16T13:22:54Z)
- Comprehensive Reassessment of Large-Scale Evaluation Outcomes in LLMs: A Multifaceted Statistical Approach [64.42462708687921]
Evaluations have revealed that factors such as scaling, training types, and architectures profoundly impact the performance of LLMs.
Our study embarks on a thorough re-examination of these LLMs, targeting the inadequacies in current evaluation methods.
This includes the application of ANOVA, Tukey HSD tests, GAMM, and clustering techniques.
arXiv Detail & Related papers (2024-03-22T14:47:35Z)
- KIEval: A Knowledge-grounded Interactive Evaluation Framework for Large Language Models [53.84677081899392]
KIEval is a Knowledge-grounded Interactive Evaluation framework for large language models.
It incorporates an LLM-powered "interactor" role for the first time to accomplish a dynamic contamination-resilient evaluation.
Extensive experiments on seven leading LLMs across five datasets validate KIEval's effectiveness and generalization.
arXiv Detail & Related papers (2024-02-23T01:30:39Z)
- Are Large Language Models Reliable Judges? A Study on the Factuality Evaluation Capabilities of LLMs [8.526956860672698]
Large Language Models (LLMs) have gained immense attention due to their notable emergent capabilities.
This study investigates the potential of LLMs as reliable assessors of factual consistency in summaries generated by text-generation models.
arXiv Detail & Related papers (2023-11-01T17:42:45Z)
- ReEval: Automatic Hallucination Evaluation for Retrieval-Augmented Large Language Models via Transferable Adversarial Attacks [91.55895047448249]
This paper presents ReEval, an LLM-based framework using prompt chaining to perturb the original evidence for generating new test cases.
We implement ReEval using ChatGPT and evaluate the resulting variants of two popular open-domain QA datasets.
Our generated data is human-readable and useful for triggering hallucinations in large language models.
arXiv Detail & Related papers (2023-10-19T06:37:32Z)
- Improving Open Information Extraction with Large Language Models: A Study on Demonstration Uncertainty [52.72790059506241]
The Open Information Extraction (OIE) task aims to extract structured facts from unstructured text.
Despite the potential of large language models (LLMs) like ChatGPT as a general task solver, they lag behind state-of-the-art (supervised) methods in OIE tasks.
arXiv Detail & Related papers (2023-09-07T01:35:24Z)