Human-like Summarization Evaluation with ChatGPT
- URL: http://arxiv.org/abs/2304.02554v1
- Date: Wed, 5 Apr 2023 16:17:32 GMT
- Title: Human-like Summarization Evaluation with ChatGPT
- Authors: Mingqi Gao, Jie Ruan, Renliang Sun, Xunjian Yin, Shiping Yang, Xiaojun Wan
- Abstract summary: ChatGPT was able to complete annotations relatively smoothly using Likert scale scoring, pairwise comparison, Pyramid, and binary factuality evaluation.
It outperformed commonly used automatic evaluation metrics on some datasets.
- Score: 38.39767193442397
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Evaluating text summarization is a challenging problem, and existing
evaluation metrics are far from satisfactory. In this study, we explored
ChatGPT's ability to perform human-like summarization evaluation using four
human evaluation methods on five datasets. We found that ChatGPT was able to
complete annotations relatively smoothly using Likert scale scoring, pairwise
comparison, Pyramid, and binary factuality evaluation. Additionally, it
outperformed commonly used automatic evaluation metrics on some datasets.
Furthermore, we discussed the impact of different prompts, compared its
performance with that of human evaluation, and analyzed the generated
explanations and invalid responses.
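As a rough illustration of how such Likert-scale scoring might be issued programmatically, the sketch below sends a single 1-5 rating request to a chat model and parses the reply. The model name, prompt wording, and the `likert_score` helper are illustrative assumptions, not the authors' exact prompts or setup.

```python
# Hypothetical sketch of Likert-scale summary scoring with a chat model.
# The prompt text and model name are assumptions, not the paper's exact setup.
import re

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def likert_score(article: str, summary: str, dimension: str = "coherence") -> int | None:
    """Ask the model to rate one summary on a 1-5 Likert scale for one dimension."""
    prompt = (
        f"Evaluate the {dimension} of the following summary on a scale from 1 (worst) "
        f"to 5 (best). Respond with a single integer.\n\n"
        f"Article:\n{article}\n\nSummary:\n{summary}\n\nScore:"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumption: any chat-completion model works here
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic scoring
    )
    reply = response.choices[0].message.content or ""
    match = re.search(r"[1-5]", reply)
    return int(match.group()) if match else None  # None marks an invalid response
```

Returning None for unparseable replies mirrors the paper's observation that some responses are invalid and must be handled separately.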
Related papers
- OpinSummEval: Revisiting Automated Evaluation for Opinion Summarization [52.720711541731205]
We present OpinSummEval, a dataset comprising human judgments and outputs from 14 opinion summarization models.
Our findings indicate that metrics based on neural networks generally outperform non-neural ones.
arXiv Detail & Related papers (2023-10-27T13:09:54Z)
- Is Summary Useful or Not? An Extrinsic Human Evaluation of Text Summaries on Downstream Tasks [45.550554287918885]
This paper focuses on evaluating the usefulness of text summaries with extrinsic methods.
We design three different downstream tasks for extrinsic human evaluation of summaries, i.e., question answering, text classification and text similarity assessment.
We find summaries are particularly useful in tasks that rely on an overall judgment of the text, while being less effective for question answering tasks.
arXiv Detail & Related papers (2023-05-24T11:34:39Z)
- INSTRUCTSCORE: Explainable Text Generation Evaluation with Finegrained Feedback [80.57617091714448]
We present InstructScore, an explainable evaluation metric for text generation.
We fine-tune a text evaluation metric based on LLaMA, producing a score for generated text and a human readable diagnostic report.
arXiv Detail & Related papers (2023-05-23T17:27:22Z)
- Is ChatGPT a Good NLG Evaluator? A Preliminary Study [121.77986688862302]
We provide a preliminary meta-evaluation on ChatGPT to show its reliability as an NLG metric.
Experimental results show that compared with previous automatic metrics, ChatGPT achieves state-of-the-art or competitive correlation with human judgments.
We hope our preliminary study could prompt the emergence of a general-purpose, reliable NLG metric.
arXiv Detail & Related papers (2023-03-07T16:57:20Z)
- Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation [136.16507050034755]
Existing human evaluation studies for summarization either exhibit a low inter-annotator agreement or have insufficient scale.
We propose a modified summarization salience protocol, Atomic Content Units (ACUs), which is based on fine-grained semantic units.
We curate the Robust Summarization Evaluation (RoSE) benchmark, a large human evaluation dataset consisting of 22,000 summary-level annotations over 28 top-performing systems.
arXiv Detail & Related papers (2022-12-15T17:26:05Z)
- SummScore: A Comprehensive Evaluation Metric for Summary Quality Based on Cross-Encoder [12.913447457411317]
SummScore is a comprehensive metric for summary quality evaluation based on Cross-Encoder.
To improve comprehensiveness and interpretability, SummScore consists of four fine-grained submodels.
Extensive experiments show that SummScore significantly outperforms existing evaluation metrics across these four dimensions in correlation with human scoring (see the correlation sketch after this list).
arXiv Detail & Related papers (2022-07-11T06:47:29Z)
- SummEval: Re-evaluating Summarization Evaluation [169.622515287256]
We re-evaluate 14 automatic evaluation metrics in a comprehensive and consistent fashion.
We benchmark 23 recent summarization models using the aforementioned automatic evaluation metrics.
We assemble the largest collection of summaries generated by models trained on the CNN/DailyMail news dataset.
arXiv Detail & Related papers (2020-07-24T16:25:19Z)
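Several of the papers above, like the main study, judge an automatic metric by how well its scores correlate with human judgments. The sketch below shows that meta-evaluation step with SciPy; the score lists are made-up illustrative numbers, not data from any of these papers.

```python
# Minimal sketch of correlation-based meta-evaluation of a summarization metric.
# The numbers are invented for illustration only.
from scipy.stats import kendalltau, spearmanr

human_scores = [4, 3, 5, 2, 4, 1]                     # e.g., Likert ratings from annotators
metric_scores = [0.71, 0.55, 0.90, 0.40, 0.68, 0.35]  # e.g., scores from ChatGPT or ROUGE

rho, rho_p = spearmanr(human_scores, metric_scores)
tau, tau_p = kendalltau(human_scores, metric_scores)

print(f"Spearman rho = {rho:.3f} (p = {rho_p:.3f})")
print(f"Kendall tau  = {tau:.3f} (p = {tau_p:.3f})")
```

Higher correlation with human scores is what claims such as "outperformed commonly used automatic evaluation metrics" and "achieves state-of-the-art or competitive correlation with human judgments" refer to.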
This list is automatically generated from the titles and abstracts of the papers on this site.
The accuracy of this automatically generated information is not guaranteed, and the site is not responsible for any consequences of its use.