Towards a Unified Multi-Dimensional Evaluator for Text Generation
- URL: http://arxiv.org/abs/2210.07197v1
- Date: Thu, 13 Oct 2022 17:17:03 GMT
- Title: Towards a Unified Multi-Dimensional Evaluator for Text Generation
- Authors: Ming Zhong, Yang Liu, Da Yin, Yuning Mao, Yizhu Jiao, Pengfei Liu,
Chenguang Zhu, Heng Ji and Jiawei Han
- Abstract summary: We propose a unified multi-dimensional evaluator, UniEval, for Natural Language Generation (NLG).
We re-frame NLG evaluation as a Boolean Question Answering (QA) task, and by guiding the model with different questions, we can use one evaluator to evaluate from multiple dimensions.
Experiments on three typical NLG tasks show that UniEval correlates substantially better with human judgments than existing metrics.
- Score: 101.47008809623202
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-dimensional evaluation is the dominant paradigm for human evaluation in
Natural Language Generation (NLG), i.e., evaluating the generated text from
multiple explainable dimensions, such as coherence and fluency. However,
automatic evaluation in NLG is still dominated by similarity-based metrics, and
we lack a reliable framework for a more comprehensive evaluation of advanced
models. In this paper, we propose a unified multi-dimensional evaluator UniEval
for NLG. We re-frame NLG evaluation as a Boolean Question Answering (QA) task,
and by guiding the model with different questions, we can use one evaluator to
evaluate from multiple dimensions. Furthermore, thanks to the unified Boolean
QA format, we are able to introduce an intermediate learning phase that enables
UniEval to incorporate external knowledge from multiple related tasks and gain
further improvement. Experiments on three typical NLG tasks show that UniEval
correlates substantially better with human judgments than existing metrics.
Specifically, compared to the top-performing unified evaluators, UniEval
achieves a 23% higher correlation on text summarization, and over 43% on
dialogue response generation. Also, UniEval demonstrates a strong zero-shot
learning ability for unseen evaluation dimensions and tasks. Source code, data
and all pre-trained evaluators are available on our GitHub repository
(https://github.com/maszhongming/UniEval).
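As an illustration of the Boolean QA framing described above, the sketch below scores a text along a dimension by phrasing that dimension as a yes/no question and reading off the probability the model assigns to "Yes" versus "No" as its first answer token. The prompt templates and the generic FLAN-T5 checkpoint are stand-in assumptions for illustration; the pre-trained evaluators released in the GitHub repository above are the ones actually trained for this setup.

```python
# Minimal sketch of Boolean-QA-style, multi-dimensional evaluation:
# a dimension becomes a yes/no question, and the score is
# P(Yes) / (P(Yes) + P(No)) over the first answer token.
# The prompt templates and checkpoint below are illustrative assumptions,
# not the exact ones used by the released UniEval evaluators.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_NAME = "google/flan-t5-base"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)
model.eval()

def boolean_qa_score(prompt: str) -> float:
    """Return the probability of "Yes" normalized against "No"."""
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    yes_id = tokenizer("Yes", add_special_tokens=False).input_ids[0]
    no_id = tokenizer("No", add_special_tokens=False).input_ids[0]
    start = torch.tensor([[model.config.decoder_start_token_id]])
    with torch.no_grad():
        logits = model(**inputs, decoder_input_ids=start).logits[0, -1]
    probs = torch.softmax(logits[[yes_id, no_id]], dim=-1)
    return probs[0].item()

# One evaluator, several dimensions: only the question changes.
document = "A cat was sitting on a mat in the living room. It later fell asleep."
summary = "The cat sat on a mat and then fell asleep."
questions = {
    "coherence": f"question: Is this a coherent summary of the document? summary: {summary} document: {document}",
    "fluency": f"question: Is this a fluent and grammatical sentence? sentence: {summary}",
}
for dim, q in questions.items():
    print(dim, round(boolean_qa_score(q), 3))
```

Because only the question changes across dimensions, a single evaluator can cover coherence, fluency, consistency, and other aspects, and extending to an unseen dimension amounts to writing a new question.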
Related papers
- MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models [71.36392373876505]
We introduce MMIE, a large-scale benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs).
MMIE comprises 20K meticulously curated multimodal queries, spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and arts.
It supports both interleaved inputs and outputs, offering a mix of multiple-choice and open-ended question formats to evaluate diverse competencies.
arXiv Detail & Related papers (2024-10-14T04:15:00Z)
- UniSumEval: Towards Unified, Fine-Grained, Multi-Dimensional Summarization Evaluation for LLMs [19.097842830790405]
Existing benchmarks for summarization quality evaluation often lack diverse input scenarios and focus on narrowly defined dimensions.
We create the UniSumEval benchmark, which extends the range of input contexts and provides fine-grained, multi-dimensional annotations.
arXiv Detail & Related papers (2024-09-30T02:56:35Z)
- Fennec: Fine-grained Language Model Evaluation and Correction Extended through Branching and Bridging [25.078498180620425]
We present a step-by-step evaluation framework, Fennec, capable of Fine-grained EvaluatioN Extended through branChing and bridging.
We employ the fine-grained correction capabilities induced by the evaluation model to refine multiple model responses, leading to an improvement of 1-2 points on MT-Bench.
arXiv Detail & Related papers (2024-05-20T16:47:22Z)
- X-Eval: Generalizable Multi-aspect Text Evaluation via Augmented Instruction Tuning with Auxiliary Evaluation Aspects [32.50977115108103]
We introduce X-Eval, a two-stage instruction tuning framework to evaluate text along both seen and unseen aspects customized by end users.
X-Eval consists of two learning stages: the vanilla instruction tuning stage that improves the model's ability to follow evaluation instructions, and an enhanced instruction tuning stage that exploits the connections between fine-grained evaluation aspects to better assess text quality.
arXiv Detail & Related papers (2023-11-15T09:01:55Z)
- DecompEval: Evaluating Generated Texts as Unsupervised Decomposed Question Answering [95.89707479748161]
Existing evaluation metrics for natural language generation (NLG) tasks face challenges in generalization ability and interpretability.
We propose a metric called DecompEval that formulates NLG evaluation as an instruction-style question answering task.
We decompose this instruction-style question about the quality of the generated text into subquestions that measure the quality of each sentence.
The subquestions with their answers generated by PLMs are then recomposed as evidence to obtain the evaluation result.
arXiv Detail & Related papers (2023-07-13T16:16:51Z)
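The following is a hedged sketch of the decompose-then-recompose idea summarized in the DecompEval entry above: the overall quality question is split into one yes/no subquestion per sentence of the generated text, each subquestion is answered by an instruction-tuned PLM, and the answers are aggregated into a score. The prompt wording, the flan-t5-base checkpoint, and the simple averaging step are assumptions for illustration rather than the paper's exact recipe.

```python
# Hedged sketch of decomposed, instruction-style evaluation:
# one yes/no subquestion per generated sentence, answers recomposed
# into a score. Prompts, checkpoint, and averaging are assumptions.
import re
from transformers import pipeline

qa = pipeline("text2text-generation", model="google/flan-t5-base")

def decomposed_consistency_score(generated_text: str, document: str) -> float:
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", generated_text.strip()) if s]
    votes = []
    for sentence in sentences:
        subquestion = (
            "Answer yes or no. Is the following sentence consistent with the document?\n"
            f"Sentence: {sentence}\nDocument: {document}"
        )
        answer = qa(subquestion, max_new_tokens=3)[0]["generated_text"].strip().lower()
        votes.append(1.0 if answer.startswith("yes") else 0.0)
    # Recompose: here simply the fraction of sentences judged consistent.
    return sum(votes) / max(len(votes), 1)

document = "A cat was sitting on a mat in the living room. It later fell asleep."
summary = "The cat sat on a mat. The cat then drove to work."
print(decomposed_consistency_score(summary, document))  # second sentence should be flagged
```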
- Multi-Dimensional Evaluation of Text Summarization with In-Context Learning [79.02280189976562]
In this paper, we study the efficacy of large language models as multi-dimensional evaluators using in-context learning.
Our experiments show that in-context learning-based evaluators are competitive with learned evaluation frameworks for the task of text summarization.
We then analyze the effects of factors such as the selection and number of in-context examples on performance.
arXiv Detail & Related papers (2023-06-01T23:27:49Z)
- G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment [64.01972723692587]
We present G-Eval, a framework that uses large language models with chain-of-thought (CoT) prompting and a form-filling paradigm to assess the quality of NLG outputs.
We show that G-Eval with GPT-4 as the backbone model achieves a Spearman correlation of 0.514 with human judgments on the summarization task, outperforming all previous methods by a large margin.
arXiv Detail & Related papers (2023-03-29T12:46:54Z)
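Below is a hedged sketch of the chain-of-thought, form-filling style of LLM scoring that the G-Eval entry above describes: the model is asked to reason briefly and then fill in a numeric rating for a single dimension, which is parsed out of the reply. The prompt wording, the 1-5 scale, the regex parsing, and the gpt-4 model name via the openai client are illustrative assumptions rather than G-Eval's exact prompts or scoring procedure.

```python
# Hedged sketch of LLM-based evaluation with brief reasoning and a
# form-filling rating. Prompt wording, scale, and model name are assumptions.
import re
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def llm_rate(document: str, summary: str, dimension: str = "coherence") -> float:
    prompt = (
        f"Rate the {dimension} of the summary on a scale of 1 to 5.\n"
        "Give one or two sentences of reasoning first, then end with a line\n"
        "of the form 'Score: <number>'.\n\n"
        f"Document:\n{document}\n\nSummary:\n{summary}"
    )
    response = client.chat.completions.create(
        model="gpt-4",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    reply = response.choices[0].message.content
    match = re.search(r"Score:\s*([1-5](?:\.\d+)?)", reply)
    return float(match.group(1)) if match else float("nan")

document = "A cat was sitting on a mat in the living room. It later fell asleep."
summary = "The cat sat on a mat and then fell asleep."
print(llm_rate(document, summary, dimension="coherence"))
```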
- Large Language Models are Diverse Role-Players for Summarization Evaluation [82.31575622685902]
A document summary's quality can be assessed by human annotators on various criteria, both objective ones like grammar and correctness, and subjective ones like informativeness, succinctness, and appeal.
Most automatic evaluation methods, such as BLEU and ROUGE, may not be able to adequately capture these dimensions.
We propose a new LLM-based evaluation framework that compares generated text and reference text from both objective and subjective aspects.
arXiv Detail & Related papers (2023-03-27T10:40:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.