G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment
- URL: http://arxiv.org/abs/2303.16634v3
- Date: Tue, 23 May 2023 22:12:16 GMT
- Title: G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment
- Authors: Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu and
Chenguang Zhu
- Abstract summary: We present G-Eval, a framework for using large language models with chain-of-thought (CoT) and a form-filling paradigm to assess the quality of NLG outputs.
We show that G-Eval with GPT-4 as the backbone model achieves a Spearman correlation of 0.514 with human judgments on the summarization task, outperforming all previous methods by a large margin.
- Score: 64.01972723692587
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The quality of texts generated by natural language generation (NLG) systems
is hard to measure automatically. Conventional reference-based metrics, such as
BLEU and ROUGE, have been shown to have relatively low correlation with human
judgments, especially for tasks that require creativity and diversity. Recent
studies suggest using large language models (LLMs) as reference-free metrics
for NLG evaluation, which have the benefit of being applicable to new tasks
that lack human references. However, these LLM-based evaluators still have
lower human correspondence than medium-size neural evaluators. In this work, we
present G-Eval, a framework for using large language models with
chain-of-thought (CoT) and a form-filling paradigm to assess the quality of
NLG outputs. We experiment with two generation tasks, text summarization and
dialogue generation. We show that G-Eval with GPT-4 as the backbone model
achieves a Spearman correlation of 0.514 with human judgments on the
summarization task, outperforming all previous methods by a large margin. We
also present a preliminary analysis of the behavior of LLM-based evaluators and
highlight the potential issue of LLM-based evaluators being biased towards
LLM-generated texts. The code is at https://github.com/nlpyang/geval
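To make the form-filling setup concrete, the following is a minimal sketch of a G-Eval-style coherence evaluator for summaries. It is not the authors' released implementation (that is in the repository linked above): the prompt wording, the hard-coded evaluation steps (which G-Eval generates automatically via chain-of-thought), the backbone model name, and the sample-and-average scoring are illustrative assumptions, and the snippet assumes the OpenAI Python SDK (v1.x) with an API key in the environment.

```python
# Illustrative G-Eval-style evaluator (a sketch, not the paper's exact prompts or code).
import re
from statistics import mean

from openai import OpenAI  # assumes the v1.x OpenAI Python SDK and OPENAI_API_KEY

client = OpenAI()

# Task introduction + evaluation criterion + evaluation steps + form-filling request.
# In G-Eval the "Evaluation Steps" are generated automatically by the LLM via CoT;
# they are hard-coded here for brevity.
COHERENCE_PROMPT = """You will be given one summary written for a news article.
Your task is to rate the summary on one metric.

Evaluation Criteria:
Coherence (1-5) - the collective quality of all sentences in the summary.

Evaluation Steps:
1. Read the news article carefully and identify its main topic and key points.
2. Read the summary and check whether it presents them in a clear, logical order.
3. Assign a coherence score from 1 to 5.

Source Article:
{document}

Summary:
{summary}

Evaluation Form (scores ONLY):
- Coherence:"""


def g_eval_coherence(document: str, summary: str, n_samples: int = 20) -> float:
    """Sample the rating n_samples times and average, approximating the paper's
    probability-weighted score when token probabilities are unavailable."""
    prompt = COHERENCE_PROMPT.format(document=document, summary=summary)
    response = client.chat.completions.create(
        model="gpt-4",                   # backbone model; the name here is an assumption
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,                 # sampling diversity for the averaging step
        max_tokens=5,
        n=n_samples,
    )
    scores = []
    for choice in response.choices:
        match = re.search(r"\d+", choice.message.content or "")
        if match:
            scores.append(min(max(int(match.group()), 1), 5))  # clamp to the 1-5 scale
    return mean(scores) if scores else float("nan")
```

When the backbone model exposes output-token probabilities, the paper instead weights each candidate score by its probability; sampling 20 times and averaging, as above, is the approximation it describes for GPT-4. Scores produced this way can then be compared against human ratings with a rank correlation such as scipy.stats.spearmanr, which is the kind of comparison behind the 0.514 figure quoted above.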
Related papers
- Themis: A Reference-free NLG Evaluation Language Model with Flexibility and Interpretability [39.12792986841385]
In this paper, we construct NLG-Eval, a large-scale NLG evaluation corpus with annotations from both humans and GPT-4.
We also propose Themis, an LLM dedicated to NLG evaluation, trained with our multi-perspective consistency verification and rating-oriented preference alignment methods.
Themis exhibits superior evaluation performance on various NLG tasks, simultaneously generalizing well to unseen tasks and surpassing other evaluation models, including GPT-4.
arXiv Detail & Related papers (2024-06-26T14:04:29Z)
- LLM-based NLG Evaluation: Current Status and Challenges [41.69249290537395]
Evaluating natural language generation (NLG) is a vital but challenging problem in artificial intelligence.
Large language models (LLMs) have demonstrated great potential in NLG evaluation in recent years.
Various automatic evaluation methods based on LLMs have been proposed.
arXiv Detail & Related papers (2024-02-02T13:06:35Z)
- Unsupervised Approach to Evaluate Sentence-Level Fluency: Do We Really Need Reference? [3.2528685897001455]
This paper adapts an existing unsupervised technique for measuring text fluency without the need for any reference.
Our approach leverages various word embeddings and trains language models using Recurrent Neural Network (RNN) architectures.
To assess the performance of the models, we conduct a comparative analysis across 10 Indic languages.
arXiv Detail & Related papers (2023-12-03T20:09:23Z)
- CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation [87.44350003888646]
Eval-Instruct can acquire pointwise grading critiques with pseudo references and revise these critiques via multi-path prompting.
CritiqueLLM is empirically shown to outperform ChatGPT and all the open-source baselines.
arXiv Detail & Related papers (2023-11-30T16:52:42Z)
- LLMs as Narcissistic Evaluators: When Ego Inflates Evaluation Scores [23.568883428947494]
We investigate whether prominent LM-based evaluation metrics demonstrate a favorable bias toward their respective underlying LMs in the context of summarization tasks.
Our findings unveil a latent bias, particularly pronounced when such evaluation metrics are used in a reference-free manner without leveraging gold summaries.
These results underscore that assessments provided by generative evaluation models can be influenced by factors beyond the inherent text quality.
arXiv Detail & Related papers (2023-11-16T10:43:26Z)
- CoAnnotating: Uncertainty-Guided Work Allocation between Human and Large Language Models for Data Annotation [94.59630161324013]
We propose CoAnnotating, a novel paradigm for Human-LLM co-annotation of unstructured texts at scale.
Our empirical study on different datasets shows CoAnnotating to be an effective means of allocating work, with up to a 21% performance improvement over a random baseline.
arXiv Detail & Related papers (2023-10-24T08:56:49Z)
- Evaluation Metrics in the Era of GPT-4: Reliably Evaluating Large Language Models on Sequence to Sequence Tasks [9.801767683867125]
We provide a preliminary and hybrid evaluation on three NLP benchmarks using both automatic and human evaluation.
We find that ChatGPT consistently outperforms many other popular models according to human reviewers on the majority of metrics.
We also find that human reviewers rate the gold reference as much worse than the best models' outputs, indicating the poor quality of many popular benchmarks.
arXiv Detail & Related papers (2023-10-20T20:17:09Z)
- Not All Metrics Are Guilty: Improving NLG Evaluation by Diversifying References [123.39034752499076]
Div-Ref is a method to enhance evaluation benchmarks by enriching the number of references.
We conduct experiments to empirically demonstrate that diversifying the expression of the references can significantly enhance the correlation between automatic evaluation and human evaluation (an illustrative code sketch of this idea appears after this list).
arXiv Detail & Related papers (2023-05-24T11:53:29Z)
- Large Language Models are Not Yet Human-Level Evaluators for Abstractive Summarization [66.08074487429477]
We investigate the stability and reliability of large language models (LLMs) as automatic evaluators for abstractive summarization.
We find that while ChatGPT and GPT-4 outperform the commonly used automatic metrics, they are not ready as human replacements.
arXiv Detail & Related papers (2023-05-22T14:58:13Z)
- Benchmarking Large Language Models for News Summarization [79.37850439866938]
Large language models (LLMs) have shown promise for automatic summarization, but the reasons behind their success are poorly understood.
We find that instruction tuning, not model size, is the key to LLMs' zero-shot summarization capability.
arXiv Detail & Related papers (2023-01-31T18:46:19Z)
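For the Div-Ref entry above ("Not All Metrics Are Guilty"), here is an illustrative sketch of the diversified-references idea: paraphrase the single gold reference with an LLM, score a candidate against the best-matching reference, and check whether the metric's correlation with human judgments improves. The paraphrasing prompt, the model name, sentence-level BLEU via sacrebleu, and taking the maximum score over references are assumptions for illustration, not the authors' exact recipe.

```python
# Sketch of Div-Ref-style reference diversification (illustrative, not the authors' code).
import sacrebleu                    # pip install sacrebleu
from openai import OpenAI           # assumes the v1.x OpenAI Python SDK and OPENAI_API_KEY
from scipy.stats import spearmanr   # pip install scipy

client = OpenAI()


def diversify_reference(reference: str, k: int = 5) -> list[str]:
    """Ask an LLM for k meaning-preserving paraphrases of a single reference."""
    response = client.chat.completions.create(
        model="gpt-4",  # model choice is an assumption
        messages=[{
            "role": "user",
            "content": f"Rewrite the following sentence in {k} different ways, "
                       f"one per line, preserving its meaning:\n{reference}",
        }],
        temperature=1.0,
    )
    text = response.choices[0].message.content or ""
    paraphrases = [line.strip() for line in text.splitlines() if line.strip()]
    return [reference] + paraphrases[:k]


def multi_reference_bleu(candidate: str, references: list[str]) -> float:
    """Score the candidate against each reference and keep the best match."""
    return max(sacrebleu.sentence_bleu(candidate, [ref]).score for ref in references)


def correlation_gain(candidates, single_references, human_scores):
    """Spearman correlation with humans: single reference vs. diversified references."""
    single = [sacrebleu.sentence_bleu(c, [r]).score
              for c, r in zip(candidates, single_references)]
    enriched = [multi_reference_bleu(c, diversify_reference(r))
                for c, r in zip(candidates, single_references)]
    return spearmanr(single, human_scores)[0], spearmanr(enriched, human_scores)[0]
```

Taking the maximum over references reflects the intuition that a good candidate only needs to match one valid phrasing of the reference; any other reference-based metric could be substituted for BLEU here.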
This list is automatically generated from the titles and abstracts of the papers on this site.