LLM Comparative Assessment: Zero-shot NLG Evaluation through Pairwise Comparisons using Large Language Models
- URL: http://arxiv.org/abs/2307.07889v3
- Date: Tue, 6 Feb 2024 17:05:58 GMT
- Title: LLM Comparative Assessment: Zero-shot NLG Evaluation through Pairwise Comparisons using Large Language Models
- Authors: Adian Liusie, Potsawee Manakul, Mark J. F. Gales
- Abstract summary: Large language models (LLMs) have enabled impressive zero-shot capabilities across various natural language tasks.
This paper explores two options for exploiting the emergent abilities of LLMs for zero-shot NLG assessment.
For moderate-sized open-source LLMs, such as FlanT5 and Llama2-chat, comparative assessment is superior to prompt scoring.
- Score: 55.60306377044225
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current developments in large language models (LLMs) have enabled impressive
zero-shot capabilities across various natural language tasks. An interesting
application of these systems is in the automated assessment of natural language
generation (NLG), a highly challenging area with great practical benefit. In
this paper, we explore two options for exploiting the emergent abilities of
LLMs for zero-shot NLG assessment: absolute score prediction, and comparative
assessment which uses relative comparisons between pairs of candidates. Though
comparative assessment has not been extensively studied in NLG assessment, we
note that humans often find it more intuitive to compare two options rather
than scoring each one independently. This work examines comparative assessment
from multiple perspectives: performance compared to absolute grading;
positional biases in the prompt; and efficient ranking in terms of the number
of comparisons. We illustrate that LLM comparative assessment is a simple,
general and effective approach for NLG assessment. For moderate-sized
open-source LLMs, such as FlanT5 and Llama2-chat, comparative assessment is
superior to prompt scoring, and in many cases can achieve performance
competitive with state-of-the-art methods. Additionally, we demonstrate that
LLMs often exhibit strong positional biases when making pairwise comparisons,
and we propose debiasing methods that can further improve performance.
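As a rough illustration of the comparative-assessment setup described above, the sketch below ranks candidates by pairwise comparisons and averages each judgement over both presentation orders to reduce positional bias. It is a minimal sketch, not the paper's implementation: `compare_once` is a hypothetical placeholder (a trivial length heuristic here) standing in for an actual LLM prompt such as "Which response is better, A or B?" against a model like FlanT5 or Llama2-chat.

```python
# Minimal sketch of pairwise comparative assessment with positional-bias
# debiasing. The LLM judgement is replaced by a placeholder so the example
# runs end to end; it is NOT the paper's code.

from itertools import combinations


def compare_once(text_a: str, text_b: str) -> float:
    """Placeholder for an LLM pairwise judgement.

    Returns the probability that text_a is judged better than text_b when
    shown in the order (A=text_a, B=text_b). A length heuristic stands in
    for the LLM call here.
    """
    len_a, len_b = len(text_a), len(text_b)
    return len_a / (len_a + len_b) if (len_a + len_b) else 0.5


def debiased_win_prob(text_a: str, text_b: str) -> float:
    """Average the judgement over both presentation orders.

    LLM judges often prefer whichever candidate appears in a particular
    position; averaging P(A wins | text_a first) with
    1 - P(B wins | text_b first) cancels a symmetric position preference.
    """
    p_ab = compare_once(text_a, text_b)   # text_a in position A
    p_ba = compare_once(text_b, text_a)   # text_b in position A
    return 0.5 * (p_ab + (1.0 - p_ba))


def rank_candidates(candidates: list[str]) -> list[int]:
    """Rank candidates by summed debiased win probability over all pairs."""
    scores = [0.0] * len(candidates)
    for i, j in combinations(range(len(candidates)), 2):
        p = debiased_win_prob(candidates[i], candidates[j])
        scores[i] += p
        scores[j] += 1.0 - p
    return sorted(range(len(candidates)), key=lambda k: -scores[k])


if __name__ == "__main__":
    summaries = [
        "Short summary.",
        "A somewhat longer candidate summary.",
        "Mid-length summary text.",
    ]
    print(rank_candidates(summaries))  # indices from best to worst under the stand-in judge
```

Swapping the placeholder for a real LLM call (e.g. reading the model's token probabilities for "A" versus "B") would recover something close to the zero-shot comparative-assessment setup the abstract describes.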
Related papers
- Finetuning LLMs for Comparative Assessment Tasks [9.05771474043499]
We propose a framework for finetuning large language models for comparative assessment.
By training on soft probabilities, our approach improves state-of-the-art performance.
arXiv Detail & Related papers (2024-09-24T11:21:43Z)
- The Comparative Trap: Pairwise Comparisons Amplifies Biased Preferences of LLM Evaluators [31.520403357740317]
Large language models (LLMs) are increasingly used as evaluators for natural language generation tasks.
LLMs display biased preferences, such as favoring verbosity and authoritative tones.
We introduce PRePair, which integrates pointwise reasoning within a pairwise framework.
arXiv Detail & Related papers (2024-06-18T06:43:04Z)
- Efficient LLM Comparative Assessment: a Product of Experts Framework for Pairwise Comparisons [10.94304714004328]
This paper introduces a Product of Experts (PoE) framework for efficient Comparative Assessment.
Individual comparisons are considered experts that provide information on a pair's score difference.
The PoE framework combines the information from these experts to yield an expression that can be maximized with respect to the underlying set of candidates.
arXiv Detail & Related papers (2024-05-09T16:45:27Z)
- Aligning with Human Judgement: The Role of Pairwise Preference in Large Language Model Evaluators [48.54465599914978]
Large Language Models (LLMs) have demonstrated promising capabilities in assessing the quality of generated natural language.
LLMs still exhibit biases in evaluation and often struggle to generate coherent evaluations that align with human assessments.
We introduce Pairwise-preference Search (PairS), an uncertainty-guided search method that employs LLMs to conduct pairwise comparisons and efficiently ranks candidate texts.
arXiv Detail & Related papers (2024-03-25T17:11:28Z)
- Don't Make Your LLM an Evaluation Benchmark Cheater [142.24553056600627]
Large language models (LLMs) have greatly advanced the frontiers of artificial intelligence, attaining remarkable improvements in model capacity.
To assess the model performance, a typical approach is to construct evaluation benchmarks for measuring the ability level of LLMs.
We discuss the potential risk and impact of inappropriately using evaluation benchmarks and misleadingly interpreting the evaluation results.
arXiv Detail & Related papers (2023-11-03T14:59:54Z)
- PRD: Peer Rank and Discussion Improve Large Language Model based Evaluations [10.709365940160685]
Modern large language models (LLMs) are hard to evaluate and compare automatically.
We propose a peer rank (PR) algorithm that takes into account each peer LLM's pairwise preferences of all answer pairs.
We find that our approaches achieve higher accuracy and align better with human judgments.
arXiv Detail & Related papers (2023-07-06T04:05:44Z)
- Large Language Models are Not Yet Human-Level Evaluators for Abstractive Summarization [66.08074487429477]
We investigate the stability and reliability of large language models (LLMs) as automatic evaluators for abstractive summarization.
We find that while ChatGPT and GPT-4 outperform the commonly used automatic metrics, they are not ready as human replacements.
arXiv Detail & Related papers (2023-05-22T14:58:13Z)
- Evaluating the Performance of Large Language Models on GAOKAO Benchmark [53.663757126289795]
This paper introduces GAOKAO-Bench, an intuitive benchmark that employs questions from the Chinese GAOKAO examination as test samples.
With human evaluation, we obtain converted total scores for LLMs, including GPT-4, ChatGPT and ERNIE-Bot.
We also use LLMs to grade the subjective questions, and find that model scores achieve a moderate level of consistency with human scores.
arXiv Detail & Related papers (2023-05-21T14:39:28Z)
- Benchmarking Large Language Models for News Summarization [79.37850439866938]
Large language models (LLMs) have shown promise for automatic summarization but the reasons behind their successes are poorly understood.
We find instruction tuning, and not model size, is the key to the LLM's zero-shot summarization capability.
arXiv Detail & Related papers (2023-01-31T18:46:19Z)