Towards Interpretable Summary Evaluation via Allocation of Contextual Embeddings to Reference Text Topics
- URL: http://arxiv.org/abs/2210.14174v1
- Date: Tue, 25 Oct 2022 17:09:08 GMT
- Title: Towards Interpretable Summary Evaluation via Allocation of Contextual Embeddings to Reference Text Topics
- Authors: Ben Schaper, Christopher Lohse, Marcell Streile, Andrea Giovannini,
Richard Osuala
- Abstract summary: The multifaceted interpretable summary evaluation method (MISEM) is based on allocation of a summary's contextual token embeddings to semantic topics identified in the reference text.
MISEM achieves a promising .404 Pearson correlation with human judgment on the TAC'08 dataset.
- Score: 1.5749416770494706
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite extensive recent advances in summary generation models, evaluation of
auto-generated summaries still widely relies on single-score systems
insufficient for transparent assessment and in-depth qualitative analysis.
Towards bridging this gap, we propose the multifaceted interpretable summary
evaluation method (MISEM), which is based on allocation of a summary's
contextual token embeddings to semantic topics identified in the reference
text. We further contribute an interpretability toolbox for automated summary
evaluation and interactive visual analysis of summary scoring, topic
identification, and token-topic allocation. MISEM achieves a promising .404
Pearson correlation with human judgment on the TAC'08 dataset.
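
The abstract above describes the core mechanism: contextual token embeddings of the summary are allocated to semantic topics identified in the reference text, yielding per-topic scores rather than a single opaque number. The Python sketch below illustrates that general pipeline with off-the-shelf components; the encoder (bert-base-uncased), the k-means topic identification, the cosine-similarity allocation rule, and the final aggregation are illustrative assumptions, not the authors' MISEM implementation or its TAC'08 configuration.

```python
# Minimal sketch of the general idea (not the authors' MISEM code):
# 1) embed reference and summary tokens with a contextual encoder,
# 2) identify "topics" in the reference by clustering its token embeddings,
# 3) allocate each summary token embedding to its closest topic,
# 4) report per-topic coverage plus a simple aggregate score.
import numpy as np
import torch
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-uncased"  # assumption; the paper's encoder may differ
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME)

def token_embeddings(text: str) -> np.ndarray:
    """Contextual embeddings for each token, with [CLS]/[SEP] dropped."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state[0]  # (seq_len, dim)
    return hidden[1:-1].numpy()

def topic_allocation_score(reference: str, summary: str, n_topics: int = 4):
    ref_emb = token_embeddings(reference)
    sum_emb = token_embeddings(summary)
    # Identify semantic topics in the reference by clustering its token embeddings.
    topics = KMeans(n_clusters=n_topics, n_init=10, random_state=0).fit(ref_emb)
    # Allocate each summary token to its most similar topic centroid.
    sims = cosine_similarity(sum_emb, topics.cluster_centers_)  # (n_summary_tokens, n_topics)
    assigned = sims.argmax(axis=1)
    # Per-topic coverage: mean similarity of the summary tokens allocated to each topic.
    coverage = {t: float(sims[assigned == t, t].mean()) if (assigned == t).any() else 0.0
                for t in range(n_topics)}
    return coverage, float(np.mean(list(coverage.values())))

reference = ("The report reviews renewable energy adoption across Europe, covering "
             "solar capacity growth, wind farm investment, grid integration "
             "challenges, and national policy incentives.")
summary = "European renewables grew through solar, wind, and supportive policy."
coverage, score = topic_allocation_score(reference, summary, n_topics=4)
print(coverage, score)
```

In a MISEM-style report, the per-topic coverage values are the interpretable part: they show which reference topics the summary covers well and which it misses, while the aggregate score remains available for comparison against single-score metrics.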
Related papers
- SummIt: Iterative Text Summarization via ChatGPT [12.966825834765814]
We propose SummIt, an iterative text summarization framework based on large language models like ChatGPT.
Our framework enables the model to refine the generated summary iteratively through self-evaluation and feedback.
We also conduct a human evaluation to validate the effectiveness of the iterative refinements and identify a potential issue of over-correction.
arXiv Detail & Related papers (2023-05-24T07:40:06Z)
- A Training-free and Reference-free Summarization Evaluation Metric via Centrality-weighted Relevance and Self-referenced Redundancy [60.419107377879925]
We propose a training-free and reference-free summarization evaluation metric.
Our metric consists of a centrality-weighted relevance score and a self-referenced redundancy score; a rough illustrative sketch of this idea appears after the list below.
Our methods can significantly outperform existing methods on both multi-document and single-document summarization evaluation.
arXiv Detail & Related papers (2021-06-26T05:11:27Z)
- FFCI: A Framework for Interpretable Automatic Evaluation of Summarization [43.375797352517765]
We propose FFCI, a framework for fine-grained summarization evaluation.
We construct a novel dataset for focus, coverage, and inter-sentential coherence.
We apply the developed metrics in evaluating a broad range of summarization models across two datasets.
arXiv Detail & Related papers (2020-11-27T10:57:18Z)
- Understanding the Extent to which Summarization Evaluation Metrics Measure the Information Quality of Summaries [74.28810048824519]
We analyze the token alignments used by ROUGE and BERTScore to compare summaries.
We argue that their scores largely cannot be interpreted as measuring information overlap.
arXiv Detail & Related papers (2020-10-23T15:55:15Z)
- Re-evaluating Evaluation in Text Summarization [77.4601291738445]
We re-evaluate the evaluation method for text summarization using top-scoring system outputs.
We find that conclusions about evaluation metrics on older datasets do not necessarily hold on modern datasets and systems.
arXiv Detail & Related papers (2020-10-14T13:58:53Z)
- Unsupervised Reference-Free Summary Quality Evaluation via Contrastive Learning [66.30909748400023]
We propose to evaluate the summary qualities without reference summaries by unsupervised contrastive learning.
Specifically, we design a new metric which covers both linguistic qualities and semantic informativeness based on BERT.
Experiments on Newsroom and CNN/Daily Mail demonstrate that our new evaluation method outperforms other metrics even without reference summaries.
arXiv Detail & Related papers (2020-10-05T05:04:14Z)
- Towards Question-Answering as an Automatic Metric for Evaluating the Content Quality of a Summary [65.37544133256499]
We propose a metric to evaluate the content quality of a summary using question-answering (QA).
We demonstrate the experimental benefits of QA-based metrics through an analysis of our proposed metric, QAEval.
arXiv Detail & Related papers (2020-10-01T15:33:09Z)
- SueNes: A Weakly Supervised Approach to Evaluating Single-Document Summarization via Negative Sampling [25.299937353444854]
We present a proof-of-concept study of a weakly supervised summary evaluation approach that does not require reference summaries.
Massive data in existing summarization datasets are transformed for training by pairing documents with corrupted reference summaries.
arXiv Detail & Related papers (2020-05-13T15:40:13Z)
- A Revised Generative Evaluation of Visual Dialogue [80.17353102854405]
We propose a revised evaluation scheme for the VisDial dataset.
We measure consensus between answers generated by the model and a set of relevant answers.
We release these sets and code for the revised evaluation scheme as DenseVisDial.
arXiv Detail & Related papers (2020-04-20T13:26:45Z)
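
As noted in the related-papers entry on the training-free, reference-free metric above, that approach combines a centrality-weighted relevance score with a self-referenced redundancy score. The sketch below only illustrates that general recipe with an off-the-shelf sentence encoder; the encoder choice (all-MiniLM-L6-v2), the centrality weighting, and the way the two scores are combined are assumptions made for illustration, not the cited paper's exact formulation.

```python
# Rough, illustrative sketch of a reference-free metric in the spirit of
# "centrality-weighted relevance + self-referenced redundancy".
# Encoder, weighting, and score combination are assumptions, not the cited paper's method.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumption: any sentence encoder works here

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def reference_free_score(source_sentences, summary_sentences, alpha=0.5):
    src = encoder.encode(source_sentences)    # (n_src, dim)
    summ = encoder.encode(summary_sentences)  # (n_sum, dim)
    # Centrality of each source sentence: mean similarity to the other source sentences.
    if len(src) > 1:
        centrality = np.array([np.mean([cosine(s, t) for j, t in enumerate(src) if j != i])
                               for i, s in enumerate(src)])
    else:
        centrality = np.ones(1)
    weights = np.clip(centrality, 1e-6, None)
    weights = weights / weights.sum()
    # Relevance: centrality-weighted similarity between source sentences and the summary.
    summary_vec = summ.mean(axis=0)
    relevance = float(np.sum(weights * np.array([cosine(s, summary_vec) for s in src])))
    # Self-referenced redundancy: average pairwise similarity among summary sentences.
    pairs = [(i, j) for i in range(len(summ)) for j in range(i + 1, len(summ))]
    redundancy = float(np.mean([cosine(summ[i], summ[j]) for i, j in pairs])) if pairs else 0.0
    return relevance - alpha * redundancy  # combination rule is an assumption

score = reference_free_score(
    ["Solar capacity grew rapidly in 2021.",
     "Wind farm investment also rose sharply.",
     "Grid integration costs remain a concern."],
    ["Solar and wind grew, but grid costs are still a concern."])
print(score)
```

Because nothing here depends on a gold reference summary, such a score can be computed for any document-summary pair; how well a proxy like this tracks human judgment is exactly what the Pearson correlations reported in these papers measure.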