Related papers: A Closer Look into Automatic Evaluation Using Large Language Models

URL: http://arxiv.org/abs/2310.05657v1
Date: Mon, 9 Oct 2023 12:12:55 GMT
Title: A Closer Look into Automatic Evaluation Using Large Language Models
Authors: Cheng-Han Chiang and Hung-yi Lee
Abstract summary: We discuss how details in the evaluation process change how well the ratings given by LLMs correlate with human ratings. We find that the auto Chain-of-Thought (CoT) used in G-Eval does not always make G-Eval more aligned with human ratings. We also show that forcing the LLM to output only a numeric rating, as in G-Eval, is suboptimal.
Score: 75.49360351036773
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Using large language models (LLMs) to evaluate text quality has recently gained popularity. Some prior works explore the idea of using LLMs for evaluation, while they differ in some details of the evaluation process. In this paper, we analyze LLM evaluation (Chiang and Lee, 2023) and G-Eval (Liu et al., 2023), and we discuss how those details in the evaluation process change how well the ratings given by LLMs correlate with human ratings. We find that the auto Chain-of-Thought (CoT) used in G-Eval does not always make G-Eval more aligned with human ratings. We also show that forcing the LLM to output only a numeric rating, as in G-Eval, is suboptimal. Last, we reveal that asking the LLM to explain its own ratings consistently improves the correlation between the ChatGPT and human ratings and pushes state-of-the-art (SoTA) correlations on two meta-evaluation datasets.
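The abstract measures evaluator quality by how well LLM ratings correlate with human ratings. As a minimal sketch of that kind of meta-evaluation (not the authors' code), the snippet below computes a Spearman rank correlation between two rating lists using only the standard library. The function names and the sample ratings are hypothetical and chosen purely for illustration; the paper's actual correlation procedure and data are not reproduced here.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (neither may be constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical ratings for five texts (illustrative values, not from the paper).
human_ratings = [1, 2, 3, 4, 5]
llm_ratings = [2, 1, 4, 3, 5]
print(f"Spearman correlation: {spearman(human_ratings, llm_ratings):.3f}")
```

A rank-based measure is used here because rating scales are ordinal; a higher value means the LLM orders the texts more like the human annotators do.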

Related papers

On Evaluating LLM Alignment by Evaluating LLMs as Judges [68.15541137648721]
Evaluating large language models' (LLMs) alignment requires them to be helpful, honest, safe, and to precisely follow human instructions. We examine the relationship between LLMs' generation and evaluation capabilities in aligning with human preferences. We propose a benchmark that assesses alignment without directly evaluating model outputs.
arXiv Detail & Related papers (2025-11-25T18:33:24Z)
Knockout LLM Assessment: Using Large Language Models for Evaluations through Iterative Pairwise Comparisons [13.187011661009459]
Large Language Models (LLMs) have been shown to be effective evaluators across various domains. We present Knockout Assessment, an LLM-as-a-Judge method using a knockout tournament system with iterative pairwise comparisons.
arXiv Detail & Related papers (2025-06-04T09:46:43Z)
GLIDER: Grading LLM Interactions and Decisions using Explainable Ranking [0.9614204956530676]
We introduce GLIDER, a powerful 3B evaluator LLM that can score any text input and associated context on arbitrary user-defined criteria. GLIDER shows higher Pearson's correlation than GPT-4o on FLASK and greatly outperforms prior evaluation models. It supports fine-grained scoring, multilingual reasoning, and span highlighting, and was trained on 685 domains and 183 criteria.
arXiv Detail & Related papers (2024-12-18T18:41:12Z)
Decompose and Aggregate: A Step-by-Step Interpretable Evaluation Framework [75.81096662788254]
Large Language Models (LLMs) are scalable and economical evaluators. The question of how reliable these evaluators are has emerged as a crucial research question. We propose Decompose and Aggregate, which breaks down the evaluation process into different stages based on pedagogical practices.
arXiv Detail & Related papers (2024-05-24T08:12:30Z)
METAL: Towards Multilingual Meta-Evaluation [12.852595634767901]
This study proposes a framework for an end-to-end assessment of Large Language Models (LLMs) as evaluators in multilingual scenarios. We create a dataset covering 10 languages containing native speaker judgments for the task of summarization. We compare the performance of LLM-based evaluators created using GPT-3.5-Turbo, GPT-4, and PaLM2.
arXiv Detail & Related papers (2024-04-02T06:14:54Z)
Can Large Language Models be Trusted for Evaluation? Scalable Meta-Evaluation of LLMs as Evaluators via Agent Debate [74.06294042304415]
We propose ScaleEval, an agent-debate-assisted meta-evaluation framework. We release the code for our framework, which is publicly available on GitHub.
arXiv Detail & Related papers (2024-01-30T07:03:32Z)
LLMEval: A Preliminary Study on How to Evaluate Large Language Models [47.12588320134504]
We analyze evaluation methods by comparing various criteria with both manual and automatic evaluation, utilizing onsite, crowd-sourcing, public annotators and GPT-4. A total of 2,186 individuals participated, leading to the generation of 243,337 manual annotations and 57,511 automatic evaluation results.
arXiv Detail & Related papers (2023-12-12T16:14:43Z)
Evaluating Large Language Models at Evaluating Instruction Following [54.49567482594617]
We introduce a challenging meta-evaluation benchmark, LLMBar, designed to test the ability of an LLM evaluator in discerning instruction-following outputs. We discover that different evaluators exhibit distinct performance on LLMBar and even the highest-scoring ones have substantial room for improvement.
arXiv Detail & Related papers (2023-10-11T16:38:11Z)
PRD: Peer Rank and Discussion Improve Large Language Model based Evaluations [10.709365940160685]
Modern large language models (LLMs) are hard to evaluate and compare automatically. We propose a peer rank (PR) algorithm that takes into account each peer LLM's pairwise preferences of all answer pairs. We find that our approaches achieve higher accuracy and align better with human judgments.
arXiv Detail & Related papers (2023-07-06T04:05:44Z)
G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment [64.01972723692587]
We present G-Eval, a framework that uses large language models with chain-of-thoughts (CoT) and a form-filling paradigm to assess the quality of NLG outputs. We show that G-Eval with GPT-4 as the backbone model achieves a Spearman correlation of 0.514 with human judgments on the summarization task, outperforming all previous methods by a large margin.
arXiv Detail & Related papers (2023-03-29T12:46:54Z)
Benchmarking Large Language Models for News Summarization [79.37850439866938]
Large language models (LLMs) have shown promise for automatic summarization but the reasons behind their successes are poorly understood. We find instruction tuning, and not model size, is the key to the LLM's zero-shot summarization capability.
arXiv Detail & Related papers (2023-01-31T18:46:19Z)

This list is automatically generated from the titles and abstracts of the papers on this site.