Reference Matters: Benchmarking Factual Error Correction for Dialogue
Summarization with Fine-grained Evaluation Framework
- URL: http://arxiv.org/abs/2306.05119v1
- Date: Thu, 8 Jun 2023 11:41:39 GMT
- Authors: Mingqi Gao, Xiaojun Wan, Jia Su, Zhefeng Wang, Baoxing Huai
- Abstract summary: We are the first to manually annotate a FEC dataset for dialogue summarization containing 4000 items.
We propose FERRANTI, a fine-grained evaluation framework that automatically evaluates the performance of FEC models on different error categories.
- Score: 45.80315799254377
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Factuality is important to dialogue summarization. Factual error correction
(FEC) of model-generated summaries is one way to improve factuality. Current
FEC evaluation that relies on factuality metrics is not reliable and detailed
enough. To address this problem, we are the first to manually annotate a FEC
dataset for dialogue summarization containing 4000 items and propose FERRANTI,
a fine-grained evaluation framework based on reference correction that
automatically evaluates the performance of FEC models on different error
categories. Using this evaluation framework, we conduct extensive experiments
with FEC approaches under a variety of settings and find the best training
modes and significant differences in the performance of the existing approaches
on different factual error categories.
Related papers
- Improving the Validity and Practical Usefulness of AI/ML Evaluations Using an Estimands Framework [2.4861619769660637]
We propose an estimands framework adapted from international clinical trials guidelines.
This framework provides a systematic structure for inference and reporting in evaluations.
We demonstrate how the framework can help uncover underlying issues, their causes, and potential solutions.
arXiv Detail & Related papers (2024-06-14T18:47:37Z)
- Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
Inspired by the principles in subjective question correction, we propose a new evaluation method, SQC-Score.
Results on three information extraction tasks show that SQC-Score is more preferred by human annotators than the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z)
- CheckEval: Robust Evaluation Framework using Large Language Model via Checklist [6.713203569074019]
We introduce CheckEval, a novel evaluation framework using Large Language Models.
CheckEval addresses the challenges of ambiguity and inconsistency in current evaluation methods.
arXiv Detail & Related papers (2024-03-27T17:20:39Z)
- Fine-grained and Explainable Factuality Evaluation for Multimodal Summarization [15.438625459637896]
Multimodal summarization aims to generate a concise summary based on the input text and image.
To evaluate the factuality of multimodal summarization models, we propose two fine-grained and explainable evaluation frameworks.
arXiv Detail & Related papers (2024-02-18T01:03:25Z)
- CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation [87.44350003888646]
Eval-Instruct can acquire pointwise grading critiques with pseudo references and revise these critiques via multi-path prompting.
CritiqueLLM is empirically shown to outperform ChatGPT and all the open-source baselines.
arXiv Detail & Related papers (2023-11-30T16:52:42Z)
- Binary Classification with Confidence Difference [100.08818204756093]
This paper delves into a novel weakly supervised binary classification problem called confidence-difference (ConfDiff) classification.
We propose a risk-consistent approach to tackle this problem and show that the estimation error bound achieves the optimal convergence rate.
We also introduce a risk correction approach to mitigate overfitting problems, whose consistency and convergence rate are also proven.
arXiv Detail & Related papers (2023-10-09T11:44:50Z)
- FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets [69.91340332545094]
We introduce FLASK, a fine-grained evaluation protocol for both human-based and model-based evaluation.
We experimentally observe that the fine-graininess of evaluation is crucial for attaining a holistic view of model performance.
arXiv Detail & Related papers (2023-07-20T14:56:35Z)
- Annotating and Detecting Fine-grained Factual Errors for Dialogue Summarization [34.85353544844499]
We present the first dataset with fine-grained factual error annotations named DIASUMFACT.
We define fine-grained factual error detection as a sentence-level multi-label classification problem.
We propose an unsupervised model ENDERANKER via candidate ranking using pretrained encoder-decoder models.
arXiv Detail & Related papers (2023-05-26T00:18:33Z)
- CLEME: Debiasing Multi-reference Evaluation for Grammatical Error Correction [32.44051877804761]
Chunk-LEvel Multi-reference Evaluation (CLEME) is designed to evaluate Grammatical Error Correction (GEC) systems in the multi-reference evaluation setting.
We conduct experiments on six English reference sets based on the CoNLL-2014 shared task.
arXiv Detail & Related papers (2023-05-18T08:57:17Z)
- Understanding Factual Errors in Summarization: Errors, Summarizers, Datasets, Error Detectors [105.12462629663757]
In this work, we aggregate factuality error annotations from nine existing datasets and stratify them according to the underlying summarization model.
We compare performance of state-of-the-art factuality metrics, including recent ChatGPT-based metrics, on this stratified benchmark and show that their performance varies significantly across different types of summarization models.
arXiv Detail & Related papers (2022-05-25T15:26:48Z)