ChatGPT as a Factual Inconsistency Evaluator for Text Summarization
- URL: http://arxiv.org/abs/2303.15621v2
- Date: Thu, 13 Apr 2023 10:59:39 GMT
- Title: ChatGPT as a Factual Inconsistency Evaluator for Text Summarization
- Authors: Zheheng Luo, Qianqian Xie, Sophia Ananiadou
- Abstract summary: We show that ChatGPT can evaluate factual inconsistency under a zero-shot setting.
It generally outperforms previous evaluation metrics on binary entailment inference, summary ranking, and consistency rating.
However, a closer inspection of ChatGPT's output reveals certain limitations including its preference for more lexically similar candidates, false reasoning, and inadequate understanding of instructions.
- Score: 17.166794984161964
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The performance of text summarization has been greatly boosted by pre-trained
language models. A main concern of existing methods is that most generated
summaries are not factually consistent with their source documents. To
alleviate the problem, many efforts have focused on developing effective
factuality evaluation metrics based on natural language inference, question
answering, and syntactic dependency parsing, among others. However, these approaches are
limited by either their high computational complexity or the uncertainty
introduced by multi-component pipelines, resulting in only partial agreement
with human judgement. Most recently, large language models (LLMs) have shown
excellent performance in not only text generation but also language
comprehension. In this paper, we particularly explore ChatGPT's ability to
evaluate factual inconsistency under a zero-shot setting by examining it on
both coarse-grained and fine-grained evaluation tasks including binary
entailment inference, summary ranking, and consistency rating. Experimental
results indicate that ChatGPT generally outperforms previous evaluation metrics
across the three tasks, demonstrating its great potential for factual
inconsistency evaluation. However, a closer inspection of ChatGPT's output
reveals certain limitations including its preference for more lexically similar
candidates, false reasoning, and inadequate understanding of instructions.
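For concreteness, the binary entailment inference task can be probed with a single zero-shot prompt to a chat LLM. The sketch below is a minimal illustration against the OpenAI chat API; the prompt wording, model name, and answer parsing are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of zero-shot binary entailment inference with a chat LLM.
# Prompt wording, model name, and answer parsing are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def is_summary_consistent(document: str, summary: str, model: str = "gpt-3.5-turbo") -> bool:
    prompt = (
        "Decide if the following summary is consistent with the corresponding article. "
        "Consistency means all information in the summary is supported by the article.\n\n"
        f"Article: {document}\n\nSummary: {summary}\n\n"
        "Answer with exactly one word: yes or no."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep the evaluation output as deterministic as possible
    )
    answer = response.choices[0].message.content.strip().lower()
    return answer.startswith("yes")
```

The summary-ranking and consistency-rating tasks can be probed the same way, with the prompt asking for a preferred candidate or a numeric score instead of a yes/no answer.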
Related papers
- Using Similarity to Evaluate Factual Consistency in Summaries [2.7595794227140056]
Abstractive summarisers generate fluent summaries, but the factuality of the generated text is not guaranteed.
We propose a new zero-shot factuality evaluation metric, Sentence-BERTScore (SBERTScore), which compares sentences between the summary and the source document.
Our experiments indicate that each technique has different strengths, with SBERTScore particularly effective in identifying correct summaries.
arXiv Detail & Related papers (2024-09-23T15:02:38Z)
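A rough sketch of the SBERTScore idea using the sentence-transformers library: each summary sentence is scored by its most similar source sentence under Sentence-BERT cosine similarity. The model name, sentence splitting, and averaging aggregation are assumptions for illustration, not the paper's exact definition.

```python
# Rough sketch of an SBERTScore-style metric: embed sentences with Sentence-BERT
# and score each summary sentence by its best-matching source sentence.
# Model choice and aggregation are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # any Sentence-BERT encoder

def sbert_score(source_sentences: list[str], summary_sentences: list[str]) -> float:
    src_emb = model.encode(source_sentences, convert_to_tensor=True)
    sum_emb = model.encode(summary_sentences, convert_to_tensor=True)
    sims = util.cos_sim(sum_emb, src_emb)       # shape: [num_summary, num_source]
    best_per_sentence = sims.max(dim=1).values  # best source match per summary sentence
    return best_per_sentence.mean().item()      # average over summary sentences
```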
- FENICE: Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction [85.26780391682894]
We propose Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction (FENICE).
FENICE leverages an NLI-based alignment between information in the source document and a set of atomic facts, referred to as claims, extracted from the summary.
Our metric sets a new state of the art on AGGREFACT, the de-facto benchmark for factuality evaluation.
arXiv Detail & Related papers (2024-03-04T17:57:18Z)
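To make the NLI-based alignment idea concrete, here is a hedged sketch that scores each extracted claim against the source with an off-the-shelf NLI model from Hugging Face transformers; the claim-extraction step, model choice, and aggregation are placeholders, not FENICE's actual pipeline.

```python
# Sketch of NLI-based claim verification: compute the entailment probability of
# each summary claim against the source document with an off-the-shelf NLI model.
# Claim extraction, model choice, and aggregation are placeholders.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

def entailment_scores(source: str, claims: list[str]) -> list[float]:
    scores = []
    for claim in claims:
        # Long documents would need chunking in practice; truncation keeps the sketch simple.
        inputs = tokenizer(source, claim, return_tensors="pt", truncation=True)
        with torch.no_grad():
            logits = model(**inputs).logits
        probs = torch.softmax(logits, dim=-1)[0]
        # roberta-large-mnli label order: contradiction, neutral, entailment
        scores.append(probs[2].item())
    return scores
```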
- SOUL: Towards Sentiment and Opinion Understanding of Language [96.74878032417054]
We propose a new task called Sentiment and Opinion Understanding of Language (SOUL).
SOUL aims to evaluate sentiment understanding through two subtasks: Review Comprehension (RC) and Justification Generation (JG).
arXiv Detail & Related papers (2023-10-27T06:48:48Z)
- GUMSum: Multi-Genre Data and Evaluation for English Abstractive Summarization [10.609715843964263]
Automatic summarization with pre-trained language models has led to impressively fluent results, but is prone to 'hallucinations'.
We present GUMSum, a dataset of English summaries in 12 written and spoken genres for evaluation of abstractive summarization.
arXiv Detail & Related papers (2023-06-20T03:21:10Z)
- Extractive Summarization via ChatGPT for Faithful Summary Generation [12.966825834765814]
This paper presents a thorough evaluation of ChatGPT's performance on extractive summarization.
We find that ChatGPT exhibits inferior extractive summarization performance in terms of ROUGE scores compared to existing supervised systems.
Applying an extract-then-generate pipeline with ChatGPT yields significant performance improvements over abstractive baselines in terms of summary faithfulness.
arXiv Detail & Related papers (2023-04-09T08:26:04Z)
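An extract-then-generate pipeline of this kind can be sketched as two chained prompts: first ask the model to copy salient sentences verbatim, then ask it to summarize using only the extracted text. The prompts and model name below are illustrative assumptions, not the paper's exact setup.

```python
# Sketch of an extract-then-generate pipeline: extract source sentences first,
# then generate an abstractive summary grounded only in the extracted text.
# Prompts and model name are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def chat(prompt: str, model: str = "gpt-3.5-turbo") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content

def extract_then_generate(document: str, num_sentences: int = 3) -> str:
    extracted = chat(
        f"Copy the {num_sentences} most important sentences from the article "
        f"below verbatim, one per line, without rewriting them.\n\n"
        f"Article: {document}"
    )
    return chat(
        f"Write a short, fluent summary using only the information in these "
        f"extracted sentences:\n\n{extracted}"
    )
```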
- Exploring the Use of Large Language Models for Reference-Free Text Quality Evaluation: An Empirical Study [63.27346930921658]
ChatGPT is capable of evaluating text quality effectively from various perspectives without reference.
The Explicit Score, which utilizes ChatGPT to generate a numeric score measuring text quality, is the most effective and reliable method among the three exploited approaches.
arXiv Detail & Related papers (2023-04-03T05:29:58Z)
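The Explicit Score approach amounts to prompting the model for a single number and parsing it from the reply. A hedged sketch follows; the prompt wording, 1-10 scale, and regex parsing are assumptions, not the paper's exact protocol.

```python
# Sketch of an "Explicit Score"-style reference-free rating: ask the model for a
# single numeric quality score and parse it from the reply.  The prompt, scale,
# and parsing are illustrative assumptions.
import re
from openai import OpenAI

client = OpenAI()

def explicit_score(source: str, summary: str, model: str = "gpt-3.5-turbo") -> float:
    prompt = (
        "Score the quality of the following summary of the article on a scale "
        "from 1 (worst) to 10 (best). Reply with the number only.\n\n"
        f"Article: {source}\n\nSummary: {summary}"
    )
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    ).choices[0].message.content
    match = re.search(r"\d+(\.\d+)?", reply)
    return float(match.group()) if match else float("nan")
```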
- Large Language Models are Diverse Role-Players for Summarization Evaluation [82.31575622685902]
A document summary's quality can be assessed by human annotators on various criteria, both objective ones like grammar and correctness, and subjective ones like informativeness, succinctness, and appeal.
Most automatic evaluation methods, like BLEU/ROUGE, may not be able to adequately capture these dimensions.
We propose a new LLM-based evaluation framework that compares generated text and reference text from both objective and subjective aspects.
arXiv Detail & Related papers (2023-03-27T10:40:59Z)
- Is ChatGPT a Good NLG Evaluator? A Preliminary Study [121.77986688862302]
We provide a preliminary meta-evaluation on ChatGPT to show its reliability as an NLG metric.
Experimental results show that compared with previous automatic metrics, ChatGPT achieves state-of-the-art or competitive correlation with human judgments.
We hope our preliminary study can prompt the emergence of a general-purpose, reliable NLG metric.
arXiv Detail & Related papers (2023-03-07T16:57:20Z)
- Exploring the Limits of ChatGPT for Query or Aspect-based Text Summarization [28.104696513516117]
Large language models (LLMs) like GPT-3 and ChatGPT have recently attracted significant interest for text summarization tasks.
Recent studies (Goyal et al., 2022; Zhang et al., 2023) have shown that LLM-generated news summaries are already on par with human-written ones.
Our experiments reveal that ChatGPT's performance is comparable to traditional fine-tuning methods in terms of ROUGE scores.
arXiv Detail & Related papers (2023-02-16T04:41:30Z)
- Factual Consistency Evaluation for Text Summarization via Counterfactual Estimation [42.63902468258758]
We propose a novel metric to evaluate the factual consistency in text summarization via counterfactual estimation.
We conduct a series of experiments on three public abstractive text summarization datasets.
arXiv Detail & Related papers (2021-08-30T11:48:41Z)
- TextFlint: Unified Multilingual Robustness Evaluation Toolkit for Natural Language Processing [73.16475763422446]
We propose a multilingual robustness evaluation platform for NLP tasks (TextFlint).
It incorporates universal text transformation, task-specific transformation, adversarial attack, subpopulation, and their combinations to provide comprehensive robustness analysis.
TextFlint generates complete analytical reports as well as targeted augmented data to address the shortcomings of the model's robustness.
arXiv Detail & Related papers (2021-03-21T17:20:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.