SAFEval: Summarization Asks for Fact-based Evaluation
- URL: http://arxiv.org/abs/2103.12693v1
- Date: Tue, 23 Mar 2021 17:16:09 GMT
- Title: SAFEval: Summarization Asks for Fact-based Evaluation
- Authors: Thomas Scialom, Paul-Alexis Dray, Patrick Gallinari, Sylvain Lamprier,
Benjamin Piwowarski, Jacopo Staiano, Alex Wang
- Abstract summary: We extend previous approaches and propose a unified framework, named SAFEval.
In contrast to established metrics such as ROUGE or BERTScore, SAFEval does not require any ground-truth reference.
We show that SAFEval substantially improves the correlation with human judgments over four evaluation dimensions.
- Score: 40.02686002117778
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Summarization evaluation remains an open research problem: current metrics
such as ROUGE are known to be limited and to correlate poorly with human
judgments. To alleviate this issue, recent work has proposed evaluation metrics
which rely on question answering models to assess whether a summary contains
all the relevant information in its source document. Though promising, the
proposed approaches have so far failed to correlate better than ROUGE with
human judgments.
In this paper, we extend previous approaches and propose a unified framework,
named SAFEval. In contrast to established metrics such as ROUGE or BERTScore,
SAFEval does not require any ground-truth reference. Nonetheless, SAFEval
substantially improves the correlation with human judgments over four
evaluation dimensions (consistency, coherence, fluency, and relevance), as
shown in the extensive experiments we report.
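To make the question-answering recipe in the abstract concrete, below is a minimal, hedged sketch of a reference-free QA-based scorer: questions are answered against both the source document and the summary, and answer agreement is scored with token-level F1. This illustrates the general idea only, not the SAFEval pipeline itself; the generate_questions helper and the example texts are placeholders (real systems use a learned question-generation model), and the Hugging Face QA checkpoint is just one possible choice.

```python
# Sketch of a QA-based, reference-free summary scorer.
# NOT the exact SAFEval pipeline; it only illustrates the recipe the abstract
# describes: ask questions, answer them from the source and from the summary,
# and reward answer agreement.

from collections import Counter
from transformers import pipeline

# Any SQuAD-style extractive QA model would do here.
qa = pipeline("question-answering",
              model="distilbert-base-cased-distilled-squad")


def generate_questions(text: str) -> list[str]:
    """Hypothetical question-generation step.

    SAFEval-style metrics use a learned question-generation model; this
    placeholder keeps the sketch self-contained.
    """
    return ["When was the tower completed?", "Where is the tower located?"]


def token_f1(pred: str, gold: str) -> float:
    """Token-level F1 between two answer strings (SQuAD-style)."""
    pred_toks, gold_toks = pred.lower().split(), gold.lower().split()
    common = Counter(pred_toks) & Counter(gold_toks)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)


def qa_based_score(source: str, summary: str) -> float:
    """Reference-free score: do source and summary yield the same answers?"""
    scores = []
    for question in generate_questions(source):
        ans_src = qa(question=question, context=source)["answer"]
        ans_sum = qa(question=question, context=summary)["answer"]
        scores.append(token_f1(ans_sum, ans_src))
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    source = ("The Eiffel Tower, completed in 1889, is a wrought-iron "
              "lattice tower in Paris and a cultural icon of France.")
    summary = "The Eiffel Tower in Paris was completed in 1889."
    print(f"QA-based consistency score: {qa_based_score(source, summary):.3f}")
```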
Related papers
- Challenges and Considerations in the Evaluation of Bayesian Causal Discovery [49.0053848090947]
Representing uncertainty in causal discovery is a crucial component for experimental design, and more broadly, for safe and reliable causal decision making.
Unlike non-Bayesian causal discovery, which relies on a single estimated causal graph and model parameters for assessment, Bayesian causal discovery is harder to evaluate because the estimated quantity is a distribution over graphs and parameters.
There is no consensus on the most suitable evaluation metric.
arXiv Detail & Related papers (2024-06-05T12:45:23Z)
- FENICE: Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction [85.26780391682894]
We propose Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction (FENICE).
FENICE leverages an NLI-based alignment between information in the source document and a set of atomic facts, referred to as claims, extracted from the summary.
Our metric sets a new state of the art on AGGREFACT, the de facto benchmark for factuality evaluation.
arXiv Detail & Related papers (2024-03-04T17:57:18Z)
- One Prompt To Rule Them All: LLMs for Opinion Summary Evaluation [30.674896082482476]
We show that Op-I-Prompt emerges as a good alternative for evaluating opinion summaries, achieving an average Spearman correlation of 0.70 with human judgments.
To the best of our knowledge, we are the first to investigate LLMs as evaluators on both closed-source and open-source models in the opinion summarization domain.
arXiv Detail & Related papers (2024-02-18T19:13:52Z)
- SocREval: Large Language Models with the Socratic Method for Reference-Free Reasoning Evaluation [78.23119125463964]
We develop SocREval, a novel approach for prompt design in reference-free reasoning evaluation.
SocREval significantly improves GPT-4's performance, surpassing existing reference-free and reference-based reasoning evaluation metrics.
arXiv Detail & Related papers (2023-09-29T18:25:46Z)
- DocAsRef: An Empirical Study on Repurposing Reference-Based Summary Quality Metrics Reference-Freely [29.4981129248937]
We propose that some reference-based metrics can be effectively adapted to assess a system summary against its source document, used in place of a reference.
After being repurposed reference-freely, the zero-shot BERTScore consistently outperforms its original reference-based version.
It also excels in comparison to most existing reference-free metrics and closely competes with zero-shot summary evaluators based on GPT-3.5; a minimal sketch of this document-as-reference repurposing appears after this list.
arXiv Detail & Related papers (2022-12-20T06:01:13Z)
- Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation [136.16507050034755]
Existing human evaluation studies for summarization either exhibit low inter-annotator agreement or are insufficient in scale.
We propose a modified summarization salience protocol, Atomic Content Units (ACUs), which is based on fine-grained semantic units.
We curate the Robust Summarization Evaluation (RoSE) benchmark, a large human evaluation dataset consisting of 22,000 summary-level annotations over 28 top-performing systems.
arXiv Detail & Related papers (2022-12-15T17:26:05Z)
- WIDAR -- Weighted Input Document Augmented ROUGE [26.123086537577155]
The proposed metric WIDAR is designed to adapt the evaluation score according to the quality of the reference summary.
WIDAR correlates better with human judgment scores than ROUGE by 26%, 76%, 82%, and 15% in coherence, consistency, fluency, and relevance, respectively.
arXiv Detail & Related papers (2022-01-23T14:40:42Z)
- A Training-free and Reference-free Summarization Evaluation Metric via Centrality-weighted Relevance and Self-referenced Redundancy [60.419107377879925]
We propose a training-free and reference-free summarization evaluation metric.
Our metric consists of a centrality-weighted relevance score and a self-referenced redundancy score.
Our methods can significantly outperform existing methods on both multi-document and single-document summarization evaluation.
arXiv Detail & Related papers (2021-06-26T05:11:27Z)
- Estimation of Summary-to-Text Inconsistency by Mismatched Embeddings [0.0]
We propose a new reference-free summary quality evaluation measure, with an emphasis on faithfulness.
The proposed ESTIME, Estimator of Summary-to-Text Inconsistency by Mismatched Embeddings, correlates more strongly with expert scores on the summary-level SummEval dataset than other common evaluation measures.
arXiv Detail & Related papers (2021-04-12T01:58:21Z)
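Following up on the DocAsRef entry above, the sketch below illustrates one way to repurpose a reference-based metric reference-freely: the system summary is scored with the open-source bert-score package against its source document instead of a human-written reference. The example texts are toy inputs and the default model settings are assumptions; the paper's exact configuration may differ.

```python
# Sketch of reference-free repurposing of a reference-based metric, in the
# spirit of DocAsRef: score each system summary against its SOURCE DOCUMENT
# rather than a human-written reference summary.
# Requires the open-source bert-score package (pip install bert-score).

from bert_score import score

source_documents = [
    "The city council approved the new transit budget on Tuesday, "
    "allocating funds for two additional bus lines and station repairs.",
]
system_summaries = [
    "The council approved a transit budget adding two bus lines.",
]

# Reference-free usage: the source document plays the role of the "reference".
P, R, F1 = score(system_summaries, source_documents, lang="en", verbose=False)
print(f"Document-as-reference BERTScore F1: {F1.mean().item():.3f}")
```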