A Better LLM Evaluator for Text Generation: The Impact of Prompt Output Sequencing and Optimization
- URL: http://arxiv.org/abs/2406.09972v1
- Date: Fri, 14 Jun 2024 12:31:44 GMT
- Title: A Better LLM Evaluator for Text Generation: The Impact of Prompt Output Sequencing and Optimization
- Authors: KuanChao Chu, Yi-Pei Chen, Hideki Nakayama
- Abstract summary: This research investigates prompt designs for evaluating generated texts with large language models (LLMs).
We found that the order of presenting reasons and scores significantly influences LLMs' scoring.
An additional optimization may enhance scoring alignment if sufficient data is available.
- Score: 17.38671584773247
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: This research investigates prompt designs for evaluating generated texts with large language models (LLMs). While LLMs are increasingly used to score a variety of inputs, creating effective prompts for open-ended text evaluation remains challenging due to model sensitivity and the subjectivity inherent in evaluating text generation. Our study experimented with different prompt structures, altering the sequence of output instructions and including explanatory reasons. We found that the order in which reasons and scores are presented significantly influences LLMs' scoring, with the effect depending on the level of rule understanding conveyed in the prompt. Additional optimization may further improve scoring alignment if sufficient data is available. These insights are crucial for improving the accuracy and consistency of LLM-based evaluations.
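The core manipulation the abstract describes is the sequencing of output instructions: asking for the reason before the score versus the score before the reason. A minimal sketch of such paired templates is shown below; the wording and format markers are illustrative assumptions, not the paper's actual prompts.

```python
# Sketch: two evaluation prompt templates that differ only in output order.
# In the "reason-first" template the score is generated after (and thus
# conditioned on) the model's own reasoning; "score-first" reverses this.
# All wording here is illustrative, not taken from the paper.

REASON_FIRST = (
    "Evaluate the response below on a 1-5 scale.\n"
    "First write a short reason, then give the score.\n"
    "Response: {response}\n"
    "Output format:\nReason: <reason>\nScore: <score>"
)

SCORE_FIRST = (
    "Evaluate the response below on a 1-5 scale.\n"
    "First give the score, then write a short reason.\n"
    "Response: {response}\n"
    "Output format:\nScore: <score>\nReason: <reason>"
)

def build_prompt(response: str, reason_first: bool = True) -> str:
    """Fill the chosen template with the text to be evaluated."""
    template = REASON_FIRST if reason_first else SCORE_FIRST
    return template.format(response=response)
```

Comparing scores from the two variants against human judgments is one way to measure the ordering effect the abstract reports.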
Related papers
- HPSS: Heuristic Prompting Strategy Search for LLM Evaluators [81.09765876000208]
We propose a novel automatic prompting strategy optimization method called Heuristic Prompting Strategy Search (HPSS)
Inspired by the genetic algorithm, HPSS conducts an iterative search to find well-behaved prompting strategies for evaluators.
Extensive experiments across four evaluation tasks demonstrate the effectiveness of HPSS.
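The iterative, genetic-algorithm-inspired search that the HPSS summary describes might be sketched as follows. The strategy encoding, mutation operator, and fitness function here are placeholder assumptions standing in for HPSS's actual components.

```python
# Sketch of a genetic-style search over prompting strategies, in the spirit
# of the HPSS summary above. A strategy is a dict of prompt "knobs"; fitness
# stands in for agreement with human judgments. All details are assumptions.
import random

def fitness(strategy: dict) -> float:
    """Placeholder for evaluator quality under this strategy
    (in practice: correlation of LLM scores with human labels)."""
    return (strategy["reason_first"] * 0.6
            + {"1-5": 0.2, "1-10": 0.1}[strategy["scale"]])

def mutate(strategy: dict, rng: random.Random) -> dict:
    """Flip one randomly chosen knob to produce a child strategy."""
    child = dict(strategy)
    if rng.choice(["reason_first", "scale"]) == "reason_first":
        child["reason_first"] = 1 - child["reason_first"]
    else:
        child["scale"] = rng.choice(["1-5", "1-10"])
    return child

def search(generations: int = 20, pop_size: int = 6, seed: int = 0) -> dict:
    """Keep the best half each generation, refill with mutants."""
    rng = random.Random(seed)
    pop = [{"reason_first": rng.randint(0, 1),
            "scale": rng.choice(["1-5", "1-10"])} for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]
        pop = survivors + [mutate(rng.choice(survivors), rng)
                           for _ in range(pop_size - len(survivors))]
    return max(pop, key=fitness)
```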
arXiv Detail & Related papers (2025-02-18T16:46:47Z)
- Benchmarking Prompt Sensitivity in Large Language Models [13.986971540998258]
Large Language Models (LLMs) are highly sensitive to variations in prompt formulation.
This paper introduces a new task, Prompt Sensitivity Prediction, and a dataset designed to investigate the effects of slight prompt variations on LLM performance.
arXiv Detail & Related papers (2025-02-09T23:01:03Z)
- DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph [70.79413606968814]
We introduce Dynamic Evaluation of LLMs via Adaptive Reasoning Graph Evolvement (DARG) to dynamically extend current benchmarks with controlled complexity and diversity.
Specifically, we first extract the reasoning graphs of data points in current benchmarks and then perturb the reasoning graphs to generate novel testing data.
Such newly generated test samples can have different levels of complexity while maintaining linguistic diversity similar to the original benchmarks.
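The extract-then-perturb pipeline described above can be illustrated with a toy reasoning graph. The graph encoding and the "add one inference step" perturbation below are illustrative assumptions, not DARG's actual procedure.

```python
# Sketch of the idea above: represent a benchmark item as a small reasoning
# graph (node -> premises it is derived from) and perturb it by adding an
# inference step, increasing reasoning depth in a controlled way.

def perturb(graph: dict, new_node: str, depends_on: list) -> dict:
    """Return a copy of the reasoning graph extended with one node whose
    premises are existing nodes."""
    assert all(d in graph for d in depends_on)
    extended = {k: list(v) for k, v in graph.items()}
    extended[new_node] = list(depends_on)
    return extended

# Toy arithmetic-word-problem graph.
base = {"a=2": [], "b=3": [], "c=a+b": ["a=2", "b=3"]}
harder = perturb(base, "d=c*2", ["c=a+b"])
```

A generator would then verbalize the perturbed graph back into a natural-language test item, preserving the style of the original benchmark.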
arXiv Detail & Related papers (2024-06-25T04:27:53Z)
- LLM as a Scorer: The Impact of Output Order on Dialogue Evaluation [17.38671584773247]
This research investigates the effect of prompt design on dialogue evaluation using large language models (LLMs).
We found that the order of presenting reasons and scores significantly influences LLMs' scoring, with a "reason-first" approach yielding more comprehensive evaluations.
arXiv Detail & Related papers (2024-06-05T02:25:10Z)
- DnA-Eval: Enhancing Large Language Model Evaluation through Decomposition and Aggregation [75.81096662788254]
Large Language Models (LLMs) are scalable and economical evaluators.
The question of how reliable these evaluators are has emerged as a crucial research question.
We propose Decompose and Aggregate, which breaks down the evaluation process into different stages based on pedagogical practices.
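Breaking evaluation into stages and then combining them, as the summary above describes, might look like the following. The criteria, the per-criterion scoring stub, and the weights are illustrative assumptions, not the paper's actual pipeline.

```python
# Sketch of a decompose-then-aggregate evaluator: score a few sub-criteria
# separately (each would be its own LLM call in practice), then combine the
# results with weights. Criteria and weights here are assumptions.

CRITERIA = {"relevance": 0.4, "fluency": 0.3, "factuality": 0.3}

def score_criterion(text: str, criterion: str) -> float:
    """Stub standing in for one LLM judgment per criterion, on a 0-1 scale
    (here a placeholder heuristic based on length)."""
    return min(1.0, len(text) / 100)

def evaluate(text: str) -> float:
    """Aggregate per-criterion scores into a single weighted score."""
    parts = {c: score_criterion(text, c) for c in CRITERIA}
    return sum(CRITERIA[c] * parts[c] for c in CRITERIA)
```

Keeping the per-criterion scores around, rather than only the aggregate, is what makes the evaluation auditable stage by stage.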
arXiv Detail & Related papers (2024-05-24T08:12:30Z)
- RepEval: Effective Text Evaluation with LLM Representation [55.26340302485898]
RepEval is a metric that leverages the projection of Large Language Models (LLMs) representations for evaluation.
Our work underscores the richness of information regarding text quality embedded within LLM representations, offering insights for the development of new metrics.
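A representation-based metric in the spirit of the RepEval summary can be reduced to projecting a hidden-state vector onto a learned quality direction. The vectors and the direction below are made up for illustration; RepEval's actual projection is learned from LLM representations.

```python
# Sketch: score text quality by projecting a (mock) LLM hidden state onto a
# quality direction. Both vectors here are fabricated for illustration.

def project(hidden: list, direction: list) -> float:
    """Dot product of a hidden-state vector with a quality direction,
    standing in for a learned projection over real LLM representations."""
    return sum(h * d for h, d in zip(hidden, direction))

quality_direction = [0.5, -0.2, 0.7]           # assumed, learned offline
good_repr = [1.0, 0.0, 1.0]                    # mock state of a good text
bad_repr = [0.0, 1.0, 0.0]                     # mock state of a bad text
```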
arXiv Detail & Related papers (2024-04-30T13:50:55Z)
- Attribute Structuring Improves LLM-Based Evaluation of Clinical Text Summaries [56.31117605097345]
Large language models (LLMs) have shown the potential to generate accurate clinical text summaries, but still struggle with issues regarding grounding and evaluation.
Here, we explore a general mitigation framework using Attribute Structuring (AS), which structures the summary evaluation process.
AS consistently improves the correspondence between human annotations and automated metrics in clinical text summarization.
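Structuring the evaluation around attributes, as the AS summary describes, means checking each clinically relevant field separately rather than issuing one holistic judgment. The attribute names and the crude matching rule below are assumptions standing in for per-attribute LLM checks.

```python
# Sketch of attribute-structured evaluation: check each attribute of a
# clinical summary separately and return a structured scorecard.
# Attribute names and the matching rule are illustrative assumptions.

ATTRIBUTES = ["diagnosis", "medications", "follow-up"]

def structured_eval(summary: str, reference: dict) -> dict:
    """Mark each attribute correct if the reference value appears verbatim
    in the summary (a crude stand-in for an LLM attribute check)."""
    return {attr: reference[attr].lower() in summary.lower()
            for attr in ATTRIBUTES}

ref = {"diagnosis": "pneumonia", "medications": "amoxicillin",
       "follow-up": "two weeks"}
card = structured_eval("Pneumonia treated with amoxicillin.", ref)
```

The attribute-level scorecard is what improves correspondence with human annotations: disagreements can be localized to a single field instead of an opaque overall score.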
arXiv Detail & Related papers (2024-03-01T21:59:03Z)
- Evaluating Large Language Models at Evaluating Instruction Following [54.49567482594617]
We introduce a challenging meta-evaluation benchmark, LLMBar, designed to test the ability of an LLM evaluator in discerning instruction-following outputs.
We discover that different evaluators exhibit distinct performance on LLMBar and even the highest-scoring ones have substantial room for improvement.
arXiv Detail & Related papers (2023-10-11T16:38:11Z)
- ALLURE: Auditing and Improving LLM-based Evaluation of Text using Iterative In-Context-Learning [7.457517083017178]
Large language models (LLMs) are used for evaluation of text generated by humans and AI alike.
Despite their utility, LLMs exhibit distinct failure modes, necessitating a thorough audit and improvement of their text evaluation capabilities.
Here we introduce ALLURE, a systematic approach to Auditing Large Language Models Understanding and Reasoning Errors.
arXiv Detail & Related papers (2023-09-24T17:15:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.