Towards Interpretable and Efficient Automatic Reference-Based
Summarization Evaluation
- URL: http://arxiv.org/abs/2303.03608v2
- Date: Thu, 16 Nov 2023 06:13:28 GMT
- Title: Towards Interpretable and Efficient Automatic Reference-Based
Summarization Evaluation
- Authors: Yixin Liu, Alexander R. Fabbri, Yilun Zhao, Pengfei Liu, Shafiq Joty,
Chien-Sheng Wu, Caiming Xiong, Dragomir Radev
- Abstract summary: Interpretability and efficiency are two important considerations for the adoption of neural automatic metrics.
We develop strong-performing automatic metrics for reference-based summarization evaluation.
- Score: 160.07938471250048
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Interpretability and efficiency are two important considerations for the
adoption of neural automatic metrics. In this work, we develop
strong-performing automatic metrics for reference-based summarization
evaluation, based on a two-stage evaluation pipeline that first extracts basic
information units from one text sequence and then checks the extracted units in
another sequence. The metrics we developed include two-stage metrics that can
provide high interpretability at both the fine-grained unit level and summary
level, and one-stage metrics that achieve a balance between efficiency and
interpretability. We make the developed tools publicly available at
https://github.com/Yale-LILY/AutoACU.
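To make the two-stage pipeline concrete, here is a minimal, self-contained sketch. It is not the official AutoACU implementation (which relies on learned extraction and matching models); extract_units and unit_is_supported are hypothetical placeholders that only illustrate how per-unit checks aggregate into an interpretable summary-level score.
```python
# Hedged sketch of a two-stage, unit-based evaluation (not the official AutoACU API).
# Stage 1: extract basic information units from the reference summary.
# Stage 2: check each unit against the candidate summary; the score is the
#          fraction of supported units, so every unit-level decision is inspectable.

import re
from typing import List


def extract_units(reference: str) -> List[str]:
    # Placeholder for a learned extractor: naively split on clause boundaries.
    return [u.strip() for u in re.split(r"[.;]\s*", reference) if u.strip()]


def unit_is_supported(unit: str, candidate: str) -> bool:
    # Placeholder check: a real system would use an entailment/matching model.
    return all(token.lower() in candidate.lower() for token in unit.split())


def two_stage_score(reference: str, candidate: str) -> float:
    units = extract_units(reference)                                  # stage 1
    if not units:
        return 0.0
    supported = sum(unit_is_supported(u, candidate) for u in units)   # stage 2
    return supported / len(units)


if __name__ == "__main__":
    ref = "The bill passed the senate. It cuts taxes for small businesses."
    cand = "The senate passed a bill that cuts taxes for small businesses."
    print(f"unit-match score: {two_stage_score(ref, cand):.2f}")
```
A one-stage variant would instead score the (reference, candidate) pair with a single model call, trading some unit-level interpretability for efficiency, which is the balance the abstract describes.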
Related papers
- Unlocking Structure Measuring: Introducing PDD, an Automatic Metric for Positional Discourse Coherence [39.065349875944634]
We present a novel metric designed to quantify the discourse divergence between two long-form articles.
Our metric aligns more closely with human preferences and GPT-4 coherence evaluation, outperforming existing evaluation methods.
arXiv Detail & Related papers (2024-02-15T18:23:39Z)
- On the Limitations of Reference-Free Evaluations of Generated Text [64.81682222169113]
We show that reference-free metrics are inherently biased and limited in their ability to evaluate generated text.
We argue that they should not be used to measure progress on tasks like machine translation or summarization.
arXiv Detail & Related papers (2022-10-22T22:12:06Z)
- How to Find Strong Summary Coherence Measures? A Toolbox and a Comparative Study for Summary Coherence Measure Evaluation [3.434197496862117]
We conduct a large-scale investigation of various methods for summary coherence modelling on an even playing field.
We introduce two novel analysis measures, intra-system correlation and bias matrices, that help identify biases in coherence measures and provide robustness against system-level confounders.
While none of the currently available automatic coherence measures are able to assign reliable coherence scores to system summaries across all evaluation metrics, large-scale language models show promising results, as long as fine-tuning takes into account that they need to generalize across different summary lengths.
arXiv Detail & Related papers (2022-09-14T09:42:19Z)
- Re-Examining System-Level Correlations of Automatic Summarization Evaluation Metrics [64.81682222169113]
System-level correlations quantify how reliably an automatic summarization evaluation metric replicates human judgments of summary quality (a minimal sketch of this computation appears after this list).
We identify two ways in which the definition of the system-level correlation is inconsistent with how metrics are used to evaluate systems in practice.
arXiv Detail & Related papers (2022-04-21T15:52:14Z)
- Finding a Balanced Degree of Automation for Summary Evaluation [83.08810773093882]
We propose flexible summary evaluation metrics that range from semi-automatic to fully automatic.
The semi-automatic Lite2Pyramid retains reusable, human-labeled Summary Content Units (SCUs) for the reference(s).
The fully automatic Lite3Pyramid further substitutes SCUs with automatically extracted Semantic Triplet Units (STUs).
arXiv Detail & Related papers (2021-09-23T17:12:35Z)
- Re-evaluating Evaluation in Text Summarization [77.4601291738445]
We re-evaluate the evaluation method for text summarization using top-scoring system outputs.
We find that conclusions about evaluation metrics on older datasets do not necessarily hold on modern datasets and systems.
arXiv Detail & Related papers (2020-10-14T13:58:53Z)
- Towards Question-Answering as an Automatic Metric for Evaluating the Content Quality of a Summary [65.37544133256499]
We propose a metric that evaluates the content quality of a summary using question answering (QA).
We demonstrate the experimental benefits of QA-based metrics through an analysis of our proposed metric, QAEval.
arXiv Detail & Related papers (2020-10-01T15:33:09Z)
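As a supplement to the "Re-Examining System-Level Correlations" entry above: a system-level correlation is conventionally computed by averaging metric scores and human scores per system and then correlating the two vectors of per-system averages. The sketch below uses made-up numbers and Kendall's tau (Pearson or Spearman are equally common), and assumes SciPy is available.
```python
# Illustrative sketch (not from the paper): a conventional system-level correlation.
# Metric and human scores are first averaged per system, then the per-system
# averages are correlated.

from collections import defaultdict
from scipy.stats import kendalltau


def system_level_correlation(records):
    """records: iterable of (system_id, metric_score, human_score), one per summary."""
    metric_by_sys, human_by_sys = defaultdict(list), defaultdict(list)
    for sys_id, metric_score, human_score in records:
        metric_by_sys[sys_id].append(metric_score)
        human_by_sys[sys_id].append(human_score)
    systems = sorted(metric_by_sys)
    metric_avg = [sum(metric_by_sys[s]) / len(metric_by_sys[s]) for s in systems]
    human_avg = [sum(human_by_sys[s]) / len(human_by_sys[s]) for s in systems]
    tau, _ = kendalltau(metric_avg, human_avg)   # correlation over per-system averages
    return tau


# Toy example with made-up scores for three systems on two documents each.
records = [
    ("sys_A", 0.41, 3.0), ("sys_A", 0.39, 2.5),
    ("sys_B", 0.45, 3.5), ("sys_B", 0.47, 3.8),
    ("sys_C", 0.35, 2.0), ("sys_C", 0.33, 2.2),
]
print(f"system-level Kendall tau: {system_level_correlation(records):.2f}")
```
The inconsistencies identified in that paper concern how this conventional definition differs from the way metrics are actually used to compare systems in practice.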
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.