OpinSummEval: Revisiting Automated Evaluation for Opinion Summarization
- URL: http://arxiv.org/abs/2310.18122v2
- Date: Sat, 11 Nov 2023 20:52:04 GMT
- Title: OpinSummEval: Revisiting Automated Evaluation for Opinion Summarization
- Authors: Yuchen Shen, Xiaojun Wan
- Abstract summary: We present OpinSummEval, a dataset comprising human judgments and outputs from 14 opinion summarization models.
Our findings indicate that metrics based on neural networks generally outperform non-neural ones.
- Score: 52.720711541731205
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Opinion summarization sets itself apart from other types of summarization
tasks due to its distinctive focus on aspects and sentiments. Although certain
automated evaluation methods like ROUGE have gained popularity, we have found
them to be unreliable measures for assessing the quality of opinion summaries.
In this paper, we present OpinSummEval, a dataset comprising human judgments
and outputs from 14 opinion summarization models. We further explore the
correlation between 24 automatic metrics and human ratings across four
dimensions. Our findings indicate that metrics based on neural networks
generally outperform non-neural ones. However, even metrics built on powerful
backbones, such as BART and GPT-3/3.5, do not consistently correlate well
across all dimensions, highlighting the need for advancements in automated
evaluation methods for opinion summarization. The code and data are publicly
available at https://github.com/A-Chicharito-S/OpinSummEval/tree/main.
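To make the core analysis concrete, below is a minimal sketch of the kind of metric-to-human correlation computation the abstract describes, assuming per-summary human ratings and automatic metric scores are already available. The numbers, the two metrics shown, and the dimension name are hypothetical illustrations, not values taken from OpinSummEval.

```python
# Hedged sketch: rank correlation between automatic metric scores and human ratings.
# All data below is made up for illustration; it is not from the OpinSummEval dataset.
from scipy.stats import spearmanr, kendalltau

# Hypothetical human ratings (1-5) for a few system summaries on one dimension
# (e.g., "aspect relevance"), alongside hypothetical automatic metric scores.
human_ratings = [4.0, 2.5, 3.0, 5.0, 1.5]
rouge_l_f1 = [0.31, 0.28, 0.30, 0.35, 0.22]      # illustrative ROUGE-L F1 values
bartscore = [-2.1, -3.4, -2.9, -1.8, -4.0]        # illustrative BARTScore log-likelihoods

for name, scores in [("ROUGE-L", rouge_l_f1), ("BARTScore", bartscore)]:
    rho, _ = spearmanr(human_ratings, scores)     # Spearman rank correlation
    tau, _ = kendalltau(human_ratings, scores)    # Kendall's tau
    print(f"{name}: Spearman rho={rho:.3f}, Kendall tau={tau:.3f}")
```

The paper reports such correlations for 24 metrics across four evaluation dimensions; this sketch shows only two illustrative metrics on a single hypothetical dimension.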
Related papers
- Cobra Effect in Reference-Free Image Captioning Metrics [58.438648377314436]
A proliferation of reference-free methods, leveraging visual-language pre-trained models (VLMs), has emerged.
In this paper, we study if there are any deficiencies in reference-free metrics.
We employ GPT-4V as an evaluative tool to assess generated sentences, and the results reveal that our approach achieves state-of-the-art (SOTA) performance.
arXiv Detail & Related papers (2024-02-18T12:36:23Z)
- Is Summary Useful or Not? An Extrinsic Human Evaluation of Text Summaries on Downstream Tasks [45.550554287918885]
This paper focuses on evaluating the usefulness of text summaries with extrinsic methods.
We design three different downstream tasks for extrinsic human evaluation of summaries, i.e., question answering, text classification and text similarity assessment.
We find summaries are particularly useful in tasks that rely on an overall judgment of the text, while being less effective for question answering tasks.
arXiv Detail & Related papers (2023-05-24T11:34:39Z)
- Towards a Unified Multi-Dimensional Evaluator for Text Generation [101.47008809623202]
We propose UniEval, a unified multi-dimensional evaluator for Natural Language Generation (NLG).
We re-frame NLG evaluation as a Boolean Question Answering (QA) task, and by guiding the model with different questions, we can use one evaluator to evaluate from multiple dimensions.
Experiments on three typical NLG tasks show that UniEval correlates substantially better with human judgments than existing metrics.
arXiv Detail & Related papers (2022-10-13T17:17:03Z)
- How to Find Strong Summary Coherence Measures? A Toolbox and a Comparative Study for Summary Coherence Measure Evaluation [3.434197496862117]
We conduct a large-scale investigation of various methods for summary coherence modelling on an even playing field.
We introduce two novel analysis measures, intra-system correlation and bias matrices, that help identify biases in coherence measures and provide robustness against system-level confounders.
None of the currently available automatic coherence measures can assign reliable coherence scores to system summaries across all evaluation metrics. However, large-scale language models show promising results, provided that fine-tuning accounts for the need to generalize across different summary lengths.
arXiv Detail & Related papers (2022-09-14T09:42:19Z)
- OpenMEVA: A Benchmark for Evaluating Open-ended Story Generation Metrics [53.779709191191685]
We propose OpenMEVA, a benchmark for evaluating open-ended story generation metrics.
OpenMEVA provides a comprehensive test suite to assess the capabilities of metrics.
We observe that existing metrics have poor correlation with human judgments, fail to recognize discourse-level incoherence, and lack inferential knowledge.
arXiv Detail & Related papers (2021-05-19T04:45:07Z)
- FFCI: A Framework for Interpretable Automatic Evaluation of Summarization [43.375797352517765]
We propose FFCI, a framework for fine-grained summarization evaluation.
We construct a novel dataset for focus, coverage, and inter-sentential coherence.
We apply the developed metrics in evaluating a broad range of summarization models across two datasets.
arXiv Detail & Related papers (2020-11-27T10:57:18Z)
- Metrics also Disagree in the Low Scoring Range: Revisiting Summarization Evaluation Metrics [20.105119107290488]
One exemplar work concludes that automatic metrics strongly disagree when ranking high-scoring summaries.
We find that their observations stem from the fact that metrics disagree in ranking summaries from any narrow scoring range.
Apart from the width of the scoring range of summaries, we analyze three other properties that impact inter-metric agreement.
arXiv Detail & Related papers (2020-11-08T22:26:06Z)
- Re-evaluating Evaluation in Text Summarization [77.4601291738445]
We re-evaluate the evaluation method for text summarization using top-scoring system outputs.
We find that conclusions about evaluation metrics on older datasets do not necessarily hold on modern datasets and systems.
arXiv Detail & Related papers (2020-10-14T13:58:53Z)
- SummEval: Re-evaluating Summarization Evaluation [169.622515287256]
We re-evaluate 14 automatic evaluation metrics in a comprehensive and consistent fashion.
We benchmark 23 recent summarization models using the aforementioned automatic evaluation metrics.
We assemble the largest collection of summaries generated by models trained on the CNN/DailyMail news dataset.
arXiv Detail & Related papers (2020-07-24T16:25:19Z)