Related papers: Towards a Unified Multi-Dimensional Evaluator for Text Generation

Towards a Unified Multi-Dimensional Evaluator for Text Generation

URL: http://arxiv.org/abs/2210.07197v1
Date: Thu, 13 Oct 2022 17:17:03 GMT
Title: Towards a Unified Multi-Dimensional Evaluator for Text Generation
Authors: Ming Zhong, Yang Liu, Da Yin, Yuning Mao, Yizhu Jiao, Pengfei Liu, Chenguang Zhu, Heng Ji and Jiawei Han
Abstract summary: We propose a unified multi-dimensional evaluator UniEval for Natural Language Generation (NLG) We re-frame NLG evaluation as a Boolean Question Answering (QA) task, and by guiding the model with different questions, we can use one evaluator to evaluate from multiple dimensions. Experiments on three typical NLG tasks show that UniEval correlates substantially better with human judgments than existing metrics.
Score: 101.47008809623202
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multi-dimensional evaluation is the dominant paradigm for human evaluation in Natural Language Generation (NLG), i.e., evaluating the generated text from multiple explainable dimensions, such as coherence and fluency. However, automatic evaluation in NLG is still dominated by similarity-based metrics, and we lack a reliable framework for a more comprehensive evaluation of advanced models. In this paper, we propose a unified multi-dimensional evaluator UniEval for NLG. We re-frame NLG evaluation as a Boolean Question Answering (QA) task, and by guiding the model with different questions, we can use one evaluator to evaluate from multiple dimensions. Furthermore, thanks to the unified Boolean QA format, we are able to introduce an intermediate learning phase that enables UniEval to incorporate external knowledge from multiple related tasks and gain further improvement. Experiments on three typical NLG tasks show that UniEval correlates substantially better with human judgments than existing metrics. Specifically, compared to the top-performing unified evaluators, UniEval achieves a 23% higher correlation on text summarization, and over 43% on dialogue response generation. Also, UniEval demonstrates a strong zero-shot learning ability for unseen evaluation dimensions and tasks. Source code, data and all pre-trained evaluators are available on our GitHub repository (https://github.com/maszhongming/UniEval).

Related papers

SedarEval: Automated Evaluation using Self-Adaptive Rubrics [4.97150240417381]
We propose a new evaluation paradigm based on self-adaptive rubrics. SedarEval consists of 1,000 meticulously crafted questions, each with its own self-adaptive rubric. We train a specialized evaluator language model (evaluator LM) to supplant human graders.
arXiv Detail & Related papers (2025-01-26T16:45:09Z)
OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain [62.89809156574998]
We introduce an omnidirectional and automatic RAG benchmark, OmniEval, in the financial domain. Our benchmark is characterized by its multi-dimensional evaluation framework. Our experiments demonstrate the comprehensiveness of OmniEval, which includes extensive test datasets.
arXiv Detail & Related papers (2024-12-17T15:38:42Z)
MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models [71.36392373876505]
We introduce MMIE, a large-scale benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs) MMIE comprises 20K meticulously curated multimodal queries, spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and arts. It supports both interleaved inputs and outputs, offering a mix of multiple-choice and open-ended question formats to evaluate diverse competencies.
arXiv Detail & Related papers (2024-10-14T04:15:00Z)
UniSumEval: Towards Unified, Fine-Grained, Multi-Dimensional Summarization Evaluation for LLMs [19.097842830790405]
Existing benchmarks for summarization quality evaluation often lack diverse input scenarios and focus on narrowly defined dimensions. We create UniSumEval benchmark, which extends the range of input context and provides fine-grained, multi-dimensional annotations.
arXiv Detail & Related papers (2024-09-30T02:56:35Z)
Fennec: Fine-grained Language Model Evaluation and Correction Extended through Branching and Bridging [25.078498180620425]
We present a step-by-step evaluation framework, textbfFennec, capable of textbfFine-grained textbfEvaluatiotextbfN textbfExtended through brantextbfChing and bridging. We employ the fine-grained correction capabilities induced by the evaluation model to refine multiple model responses, leading to an improvement of 1-2 points on the MT-Bench.
arXiv Detail & Related papers (2024-05-20T16:47:22Z)
X-Eval: Generalizable Multi-aspect Text Evaluation via Augmented Instruction Tuning with Auxiliary Evaluation Aspects [32.50977115108103]
We introduce X-Eval, a two-stage instruction tuning framework to evaluate the text in both seen and unseen aspects customized by end users. X-Eval consists of two learning stages: the vanilla instruction tuning stage that improves the model's ability to follow evaluation instructions, and an enhanced instruction tuning stage that exploits the connections between fine-grained evaluation aspects to better assess text quality.
arXiv Detail & Related papers (2023-11-15T09:01:55Z)
DecompEval: Evaluating Generated Texts as Unsupervised Decomposed Question Answering [95.89707479748161]
Existing evaluation metrics for natural language generation (NLG) tasks face the challenges on generalization ability and interpretability. We propose a metric called DecompEval that formulates NLG evaluation as an instruction-style question answering task. We decompose our devised instruction-style question about the quality of generated texts into the subquestions that measure the quality of each sentence. The subquestions with their answers generated by PLMs are then recomposed as evidence to obtain the evaluation result.
arXiv Detail & Related papers (2023-07-13T16:16:51Z)
Multi-Dimensional Evaluation of Text Summarization with In-Context Learning [79.02280189976562]
In this paper, we study the efficacy of large language models as multi-dimensional evaluators using in-context learning. Our experiments show that in-context learning-based evaluators are competitive with learned evaluation frameworks for the task of text summarization. We then analyze the effects of factors such as the selection and number of in-context examples on performance.
arXiv Detail & Related papers (2023-06-01T23:27:49Z)
G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment [64.01972723692587]
We present G-Eval, a framework of using large language models with chain-of-thoughts (CoT) and a form-filling paradigm to assess the quality of NLG outputs. We show that G-Eval with GPT-4 as the backbone model achieves a Spearman correlation of 0.514 with human on summarization task, outperforming all previous methods by a large margin.
arXiv Detail & Related papers (2023-03-29T12:46:54Z)
Large Language Models are Diverse Role-Players for Summarization Evaluation [82.31575622685902]
A document summary's quality can be assessed by human annotators on various criteria, both objective ones like grammar and correctness, and subjective ones like informativeness, succinctness, and appeal. Most of the automatic evaluation methods like BLUE/ROUGE may be not able to adequately capture the above dimensions. We propose a new evaluation framework based on LLMs, which provides a comprehensive evaluation framework by comparing generated text and reference text from both objective and subjective aspects.
arXiv Detail & Related papers (2023-03-27T10:40:59Z)

This list is automatically generated from the titles and abstracts of the papers in this site.