CoAScore: Chain-of-Aspects Prompting for NLG Evaluation
- URL: http://arxiv.org/abs/2312.10355v1
- Date: Sat, 16 Dec 2023 06:57:20 GMT
- Title: CoAScore: Chain-of-Aspects Prompting for NLG Evaluation
- Authors: Peiyuan Gong and Jiaxin Mao
- Abstract summary: Natural language generation (NLG) evaluation has shifted from a single-aspect to a multi-aspect paradigm.
We propose an NLG evaluation metric called CoAScore, powered by large language models (LLMs).
Our experimental findings highlight that, in comparison to individual aspect evaluation, CoAScore exhibits a higher correlation with human judgments.
- Score: 15.040372431669093
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, natural language generation (NLG) evaluation has shifted from a
single-aspect to a multi-aspect paradigm, allowing for a more accurate
assessment. Large language models (LLMs) achieve superior performance on
various NLG evaluation tasks. However, current work often employs the LLM to
independently evaluate different aspects, which largely ignores the rich
correlation between various aspects. To fill this research gap, in this work,
we propose an NLG evaluation metric called CoAScore. Powered by LLMs,
CoAScore utilizes multi-aspect knowledge through a CoA (Chain-of-Aspects)
prompting framework when
assessing the quality of a certain aspect. Specifically, for a given aspect to
evaluate, we first prompt the LLM to generate a chain of aspects that are
relevant to the target aspect and could be useful for the evaluation. We then
collect evaluation scores for each generated aspect, and finally, leverage the
knowledge of these aspects to improve the evaluation of the target aspect. We
evaluate CoAScore across five NLG evaluation tasks (e.g., summarization and
dialog response generation) and nine aspects (e.g., overall quality, relevance,
and coherence). Our experimental findings highlight that, in comparison to
individual aspect evaluation, CoAScore exhibits a higher correlation with human
judgments. CoAScore also significantly outperforms existing unsupervised
evaluation metrics, whether assessing overall quality or other aspects. We
further conduct extensive ablation studies to validate the effectiveness of the
three stages within the CoAScore framework, as well as case studies showing
how the LLM performs in these stages. Our code and scripts are available.
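To make the three stages concrete, here is a minimal Python sketch of the Chain-of-Aspects idea as described above. It assumes a generic `llm(prompt) -> str` callable and a 1-5 rating scale; the actual prompts, score parsing, and aggregation used in the paper may differ.

```python
from typing import Callable, Dict, List

def coa_score(
    llm: Callable[[str], str],   # assumed generic prompt-to-text interface
    task: str,
    context: str,
    generated_text: str,
    target_aspect: str,
    k: int = 3,
) -> float:
    """Score `generated_text` on `target_aspect` via a chain of related aspects."""
    # Stage 1: ask the LLM for k aspects relevant to the target aspect.
    aspects_reply = llm(
        f"For evaluating the {target_aspect} of a {task} output, "
        f"list {k} related quality aspects, one per line."
    )
    aspects: List[str] = [
        line.strip("-* ").strip() for line in aspects_reply.splitlines() if line.strip()
    ][:k]

    # Stage 2: assess the text on each related aspect (1-5 scale assumed here).
    aspect_assessments: Dict[str, str] = {}
    for aspect in aspects:
        aspect_assessments[aspect] = llm(
            f"Context:\n{context}\n\nGenerated text:\n{generated_text}\n\n"
            f"Rate the {aspect} of the generated text from 1 to 5 and briefly explain."
        )

    # Stage 3: feed the related-aspect assessments back as evidence when scoring
    # the target aspect, then parse the first number in the reply.
    evidence = "\n".join(f"- {a}: {s}" for a, s in aspect_assessments.items())
    final_reply = llm(
        f"Context:\n{context}\n\nGenerated text:\n{generated_text}\n\n"
        f"Assessments of related aspects:\n{evidence}\n\n"
        f"Considering the assessments above, rate the {target_aspect} of the "
        f"generated text from 1 to 5. Reply with a single number."
    )
    numbers = [tok.strip(".") for tok in final_reply.split() if tok.strip(".").isdigit()]
    return float(numbers[0]) if numbers else float("nan")
```

With a summarization task and `target_aspect="coherence"`, for instance, Stage 1 might propose related aspects such as logical flow and topic consistency; their Stage 2 assessments are then supplied as context for the final coherence rating.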
Related papers
- From Text to Insight: Leveraging Large Language Models for Performance Evaluation in Management [6.70908766695241]
This study explores the potential of Large Language Models (LLMs), specifically GPT-4, to enhance objectivity in organizational task performance evaluations.
Our results suggest that GPT ratings are comparable to human ratings but exhibit higher consistency and reliability.
Our research suggests that while LLMs are capable of extracting meaningful constructs from text-based data, their scope is currently limited to specific forms of performance evaluation.
arXiv Detail & Related papers (2024-08-09T20:35:10Z)
- Large Language Models as Evaluators for Recommendation Explanations [23.938202791437337]
We investigate whether LLMs can serve as evaluators of recommendation explanations.
We design and apply a 3-level meta evaluation strategy to measure the correlation between evaluator labels and the ground truth provided by users.
Our study verifies that utilizing LLMs as evaluators can be an accurate, reproducible and cost-effective solution for evaluating recommendation explanation texts.
arXiv Detail & Related papers (2024-06-05T13:23:23Z)
- Decompose and Aggregate: A Step-by-Step Interpretable Evaluation Framework [75.81096662788254]
Large Language Models (LLMs) are scalable and economical evaluators.
How reliable these evaluators are has emerged as a crucial research question.
We propose Decompose and Aggregate, which breaks down the evaluation process into different stages based on pedagogical practices.
arXiv Detail & Related papers (2024-05-24T08:12:30Z)
- The Generative AI Paradox on Evaluation: What It Can Solve, It May Not Evaluate [17.77014177096838]
This paper explores the assumption that Large Language Models (LLMs) skilled in generation tasks are equally adept as evaluators.
We assess the performance of three LLMs and one open-source LM in Question-Answering (QA) and evaluation tasks using the TriviaQA dataset.
arXiv Detail & Related papers (2024-02-09T06:16:08Z)
- F-Eval: Assessing Fundamental Abilities with Refined Evaluation Methods [102.98899881389211]
We propose F-Eval, a bilingual evaluation benchmark for fundamental abilities, including expression, commonsense, and logic.
For reference-free subjective tasks, we devise new evaluation methods, serving as alternatives to scoring by API models.
arXiv Detail & Related papers (2024-01-26T13:55:32Z)
- Leveraging Large Language Models for NLG Evaluation: Advances and Challenges [57.88520765782177]
Large Language Models (LLMs) have opened new avenues for assessing generated content quality, e.g., coherence, creativity, and context relevance.
We propose a coherent taxonomy for organizing existing LLM-based evaluation metrics, offering a structured framework to understand and compare these methods.
By discussing unresolved challenges, including bias, robustness, domain-specificity, and unified evaluation, this paper seeks to offer insights to researchers and advocate for fairer and more advanced NLG evaluation techniques.
arXiv Detail & Related papers (2024-01-13T15:59:09Z)
- X-Eval: Generalizable Multi-aspect Text Evaluation via Augmented Instruction Tuning with Auxiliary Evaluation Aspects [32.50977115108103]
We introduce X-Eval, a two-stage instruction tuning framework to evaluate the text in both seen and unseen aspects customized by end users.
X-Eval consists of two learning stages: the vanilla instruction tuning stage that improves the model's ability to follow evaluation instructions, and an enhanced instruction tuning stage that exploits the connections between fine-grained evaluation aspects to better assess text quality.
arXiv Detail & Related papers (2023-11-15T09:01:55Z)
- Collaborative Evaluation: Exploring the Synergy of Large Language Models and Humans for Open-ended Generation Evaluation [71.76872586182981]
Large language models (LLMs) have emerged as a scalable and cost-effective alternative to human evaluations.
We propose a Collaborative Evaluation pipeline CoEval, involving the design of a checklist of task-specific criteria and the detailed evaluation of texts.
arXiv Detail & Related papers (2023-10-30T17:04:35Z)
- A Survey on Evaluation of Large Language Models [87.60417393701331]
Large language models (LLMs) are gaining increasing popularity in both academia and industry.
This paper focuses on three key dimensions: what to evaluate, where to evaluate, and how to evaluate.
arXiv Detail & Related papers (2023-07-06T16:28:35Z)
- Towards a Unified Multi-Dimensional Evaluator for Text Generation [101.47008809623202]
We propose UniEval, a unified multi-dimensional evaluator for Natural Language Generation (NLG).
We re-frame NLG evaluation as a Boolean Question Answering (QA) task, and by guiding the model with different questions, we can use one evaluator to evaluate from multiple dimensions; a minimal sketch of this reframing appears after this list.
Experiments on three typical NLG tasks show that UniEval correlates substantially better with human judgments than existing metrics.
arXiv Detail & Related papers (2022-10-13T17:17:03Z)
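As a hedged illustration of the Boolean-QA reframing mentioned for UniEval above: each evaluation dimension is phrased as a yes/no question, and a single evaluator answers them all. The `yes_probability` callable and the example questions here are assumptions for illustration, not UniEval's actual interface or prompts.

```python
from typing import Callable, Dict

# Hypothetical yes/no questions, one per evaluation dimension (illustrative only).
DIMENSION_QUESTIONS = {
    "coherence": "Is this a coherent summary of the document?",
    "consistency": "Is this summary factually consistent with the document?",
    "fluency": "Is this summary written in fluent language?",
    "relevance": "Does this summary capture the key points of the document?",
}

def multi_dimension_scores(
    yes_probability: Callable[[str, str], float],  # (question, text) -> P("yes"), assumed interface
    document: str,
    summary: str,
) -> Dict[str, float]:
    # One evaluator, many dimensions: each dimension's score is the model's
    # probability of answering "yes" to that dimension's question.
    text = f"Document: {document}\nSummary: {summary}"
    return {dim: yes_probability(question, text) for dim, question in DIMENSION_QUESTIONS.items()}
```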
The list of related papers above is automatically generated from the titles and abstracts of the papers on this site.