FFCI: A Framework for Interpretable Automatic Evaluation of
Summarization
- URL: http://arxiv.org/abs/2011.13662v3
- Date: Mon, 28 Feb 2022 02:03:28 GMT
- Title: FFCI: A Framework for Interpretable Automatic Evaluation of
Summarization
- Authors: Fajri Koto and Timothy Baldwin and Jey Han Lau
- Abstract summary: We propose FFCI, a framework for fine-grained summarization evaluation.
We construct a novel dataset for focus, coverage, and inter-sentential coherence.
We apply the developed metrics in evaluating a broad range of summarization models across two datasets.
- Score: 43.375797352517765
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose FFCI, a framework for fine-grained summarization
evaluation that comprises four elements: faithfulness (degree of factual
consistency with the source), focus (precision of summary content relative to
the reference), coverage (recall of summary content relative to the reference),
and inter-sentential coherence (document fluency between adjacent sentences).
We construct a novel dataset for focus, coverage, and inter-sentential
coherence, and develop automatic methods for evaluating each of the four
dimensions of FFCI based on cross-comparison of evaluation metrics and
model-based evaluation methods, including question answering (QA) approaches,
semantic textual similarity (STS), next-sentence prediction (NSP), and scores
derived from 19 pre-trained language models. We then apply the developed
metrics in evaluating a broad range of summarization models across two
datasets, with some surprising findings.
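Among the four dimensions, inter-sentential coherence is evaluated with next-sentence prediction (NSP). Below is a minimal sketch of NSP-based coherence scoring with a pre-trained BERT model; the specific checkpoint ("bert-base-uncased") and the mean aggregation over adjacent sentence pairs are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch: scoring inter-sentential coherence with BERT's
# next-sentence prediction (NSP) head. The checkpoint and the mean
# aggregation over adjacent pairs are illustrative assumptions.
import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")
model.eval()

def nsp_coherence(summary_sentences):
    """Average P(is-next) over adjacent sentence pairs of a summary."""
    probs = []
    for sent_a, sent_b in zip(summary_sentences, summary_sentences[1:]):
        enc = tokenizer(sent_a, sent_b, return_tensors="pt", truncation=True)
        with torch.no_grad():
            logits = model(**enc).logits  # shape (1, 2); index 0 = "is next"
        probs.append(torch.softmax(logits, dim=-1)[0, 0].item())
    return sum(probs) / len(probs) if probs else 1.0

summary = [
    "The framework evaluates summaries along four dimensions.",
    "Each dimension is scored with a separate automatic method.",
]
print(f"NSP coherence: {nsp_coherence(summary):.3f}")
```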
Related papers
- SWING: Balancing Coverage and Faithfulness for Dialogue Summarization [67.76393867114923]
We propose to utilize natural language inference (NLI) models to improve coverage while avoiding factual inconsistencies.
We use NLI to compute fine-grained training signals that encourage the model to generate content from the reference summaries that has not yet been covered.
Experiments on the DialogSum and SAMSum datasets confirm the effectiveness of the proposed approach.
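A minimal sketch of the kind of off-the-shelf NLI scoring described above; the model name ("roberta-large-mnli") and the sentence-level scoring loop are illustrative assumptions, and SWING's actual training procedure is not shown.

```python
# Minimal sketch: entailment scoring between a generated summary and
# reference sentences with an off-the-shelf NLI model. The model choice
# and sentence-level loop are assumptions; SWING uses NLI signals during
# training, which is not reproduced here.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL = "roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
model.eval()

# Look up the entailment class index from the model config instead of
# hard-coding it.
entail_idx = {v.lower(): k for k, v in model.config.id2label.items()}["entailment"]

def entailment_prob(premise, hypothesis):
    """P(premise entails hypothesis) under the NLI model."""
    enc = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**enc).logits
    return torch.softmax(logits, dim=-1)[0, entail_idx].item()

generated = "The two sides agreed to meet again next week."
reference_sents = [
    "The negotiators will reconvene next week.",
    "The budget dispute remains unresolved.",
]
# Reference sentences with low entailment probability are candidates for
# content that the summary has not yet covered.
for ref in reference_sents:
    print(f"{entailment_prob(generated, ref):.3f}  {ref}")
```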
arXiv Detail & Related papers (2023-01-25T09:33:11Z)
- Towards Interpretable Summary Evaluation via Allocation of Contextual Embeddings to Reference Text Topics [1.5749416770494706]
The multifaceted interpretable summary evaluation method (MISEM) is based on allocation of a summary's contextual token embeddings to semantic topics identified in the reference text.
MISEM achieves a promising 0.404 Pearson correlation with human judgment on the TAC'08 dataset.
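A minimal sketch of the "allocate summary embeddings to reference topics" idea; the model choice, the KMeans clustering used as a stand-in for topic identification, and the coverage statistic are illustrative assumptions, not MISEM's actual procedure.

```python
# Minimal sketch: assign a summary's contextual token embeddings to
# "topics" identified in the reference text. All concrete choices here
# (model, KMeans topics, coverage count) are illustrative assumptions.
import torch
from sklearn.cluster import KMeans
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def token_embeddings(text):
    enc = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]  # (seq_len, hidden)
    return hidden.numpy()

reference = ("The council approved the new housing plan. "
             "Construction is expected to start in the spring. "
             "Local residents raised concerns about traffic.")
summary = "The housing plan was approved and building starts in spring."

# "Topics" = KMeans clusters over the reference token embeddings.
ref_emb = token_embeddings(reference)
n_topics = 3
topics = KMeans(n_clusters=n_topics, n_init=10, random_state=0).fit(ref_emb)

# Allocate each summary token to its nearest reference topic centroid.
assignments = topics.predict(token_embeddings(summary))
covered = len(set(assignments))
print(f"Summary tokens touch {covered}/{n_topics} reference topics")
```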
arXiv Detail & Related papers (2022-10-25T17:09:08Z)
- How to Find Strong Summary Coherence Measures? A Toolbox and a Comparative Study for Summary Coherence Measure Evaluation [3.434197496862117]
We conduct a large-scale investigation of various methods for summary coherence modelling on a level playing field.
We introduce two novel analysis measures, intra-system correlation and bias matrices, that help identify biases in coherence measures and provide robustness against system-level confounders.
None of the currently available automatic coherence measures can assign reliable coherence scores to system summaries across all evaluation metrics; however, large-scale language models show promising results, provided fine-tuning accounts for the need to generalize across different summary lengths.
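A minimal sketch of what an intra-system correlation looks like: the automatic coherence measure is correlated with human judgments separately within each system's summaries, so that system-level quality differences cannot inflate the score. The data layout and the averaging over systems are illustrative assumptions.

```python
# Minimal sketch: "intra-system" correlation, i.e. correlating an automatic
# coherence measure with human ratings separately within each system's
# summaries. The toy data and mean aggregation are illustrative assumptions.
from collections import defaultdict
from statistics import mean
from scipy.stats import pearsonr

# (system_id, automatic_coherence_score, human_coherence_rating)
scores = [
    ("sys_A", 0.71, 4.0), ("sys_A", 0.55, 3.0), ("sys_A", 0.80, 4.5),
    ("sys_B", 0.40, 2.5), ("sys_B", 0.62, 3.5), ("sys_B", 0.35, 2.0),
]

by_system = defaultdict(list)
for system, auto, human in scores:
    by_system[system].append((auto, human))

per_system_r = []
for system, pairs in by_system.items():
    auto, human = zip(*pairs)
    r, _ = pearsonr(auto, human)
    per_system_r.append(r)
    print(f"{system}: r = {r:.3f}")

print(f"intra-system correlation (mean over systems): {mean(per_system_r):.3f}")
```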
arXiv Detail & Related papers (2022-09-14T09:42:19Z)
- A Training-free and Reference-free Summarization Evaluation Metric via Centrality-weighted Relevance and Self-referenced Redundancy [60.419107377879925]
We propose a training-free and reference-free summarization evaluation metric.
Our metric consists of a centrality-weighted relevance score and a self-referenced redundancy score.
Our metric significantly outperforms existing methods on both multi-document and single-document summarization evaluation.
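A minimal sketch of the two components; TF-IDF sentence vectors stand in for the representations used in the paper, and the exact weighting and aggregation are illustrative assumptions.

```python
# Minimal sketch: a centrality-weighted relevance score and a
# self-referenced redundancy score. TF-IDF vectors and the aggregation
# choices are stand-ins, not the paper's exact method.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

source_sents = [
    "The city council voted to expand the bus network.",
    "New routes will connect the suburbs to the centre.",
    "The vote followed months of public consultation.",
]
summary_sents = [
    "The council voted to expand bus routes to the suburbs.",
    "The bus network will be expanded.",
]

vec = TfidfVectorizer().fit(source_sents + summary_sents)
S = vec.transform(source_sents)
Y = vec.transform(summary_sents)

# Centrality of each source sentence: mean similarity to the other
# source sentences (how "central" it is to the document).
src_sim = cosine_similarity(S)
centrality = (src_sim.sum(axis=1) - 1.0) / (len(source_sents) - 1)

# Relevance: best summary-to-source similarity per source sentence,
# weighted by that sentence's centrality.
rel = cosine_similarity(Y, S)            # (n_summary, n_source)
relevance = float((rel.max(axis=0) * centrality).sum() / centrality.sum())

# Self-referenced redundancy: average pairwise similarity between
# summary sentences (higher = more redundant).
sum_sim = cosine_similarity(Y)
n = len(summary_sents)
redundancy = float((sum_sim.sum() - n) / (n * (n - 1))) if n > 1 else 0.0

print(f"centrality-weighted relevance: {relevance:.3f}")
print(f"self-referenced redundancy:    {redundancy:.3f}")
print(f"combined score (relevance - redundancy): {relevance - redundancy:.3f}")
```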
arXiv Detail & Related papers (2021-06-26T05:11:27Z)
- Re-evaluating Evaluation in Text Summarization [77.4601291738445]
We re-evaluate the evaluation method for text summarization using top-scoring system outputs.
We find that conclusions about evaluation metrics on older datasets do not necessarily hold on modern datasets and systems.
arXiv Detail & Related papers (2020-10-14T13:58:53Z)
- Towards Question-Answering as an Automatic Metric for Evaluating the Content Quality of a Summary [65.37544133256499]
We propose a metric to evaluate the content quality of a summary using question answering (QA).
We demonstrate the experimental benefits of QA-based metrics through an analysis of our proposed metric, QAEval.
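A minimal sketch of a QA-based content-quality check in the spirit of QAEval; the hand-written question-answer pairs and the token-level F1 comparison are illustrative assumptions (QAEval generates its questions and answers automatically from the reference).

```python
# Minimal sketch: answer reference-derived questions against a candidate
# summary and compare answers. The hand-written questions and token-F1
# scoring are illustrative assumptions, not QAEval's pipeline.
from collections import Counter
from transformers import pipeline

qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

candidate_summary = ("The council approved a plan to expand the bus network, "
                     "with new routes opening in the spring.")

# Question/answer pairs derived from the reference summary (here hand-written).
reference_qas = [
    ("What did the council approve?", "a plan to expand the bus network"),
    ("When will the new routes open?", "in the spring"),
]

def token_f1(pred, gold):
    pred_toks, gold_toks = pred.lower().split(), gold.lower().split()
    common = Counter(pred_toks) & Counter(gold_toks)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

scores = []
for question, gold_answer in reference_qas:
    pred = qa(question=question, context=candidate_summary)["answer"]
    scores.append(token_f1(pred, gold_answer))
    print(f"Q: {question}\n   predicted: {pred!r}  F1: {scores[-1]:.2f}")

print(f"QA-based content score: {sum(scores) / len(scores):.2f}")
```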
arXiv Detail & Related papers (2020-10-01T15:33:09Z)
- SummEval: Re-evaluating Summarization Evaluation [169.622515287256]
We re-evaluate 14 automatic evaluation metrics in a comprehensive and consistent fashion.
We benchmark 23 recent summarization models using the aforementioned automatic evaluation metrics.
We assemble the largest collection of summaries generated by models trained on the CNN/DailyMail news dataset.
arXiv Detail & Related papers (2020-07-24T16:25:19Z)