HPSS: Heuristic Prompting Strategy Search for LLM Evaluators
- URL: http://arxiv.org/abs/2502.13031v1
- Date: Tue, 18 Feb 2025 16:46:47 GMT
- Title: HPSS: Heuristic Prompting Strategy Search for LLM Evaluators
- Authors: Bosi Wen, Pei Ke, Yufei Sun, Cunxiang Wang, Xiaotao Gu, Jinfeng Zhou, Jie Tang, Hongning Wang, Minlie Huang
- Abstract summary: We propose a novel automatic prompting strategy optimization method called Heuristic Prompting Strategy Search (HPSS).
Inspired by the genetic algorithm, HPSS conducts an iterative search to find well-behaved prompting strategies for evaluators.
Extensive experiments across four evaluation tasks demonstrate the effectiveness of HPSS.
- Score: 81.09765876000208
- Abstract: Since the adoption of large language models (LLMs) for text evaluation has become increasingly prevalent in the field of natural language processing (NLP), a series of existing works attempt to optimize the prompts for LLM evaluators to improve their alignment with human judgment. However, their efforts are limited to optimizing individual factors of evaluation prompts, such as evaluation criteria or output formats, neglecting the combinatorial impact of multiple factors, which leads to insufficient optimization of the evaluation pipeline. Nevertheless, identifying well-behaved prompting strategies for adjusting multiple factors requires extensive enumeration. To this end, we comprehensively integrate 8 key factors for evaluation prompts and propose a novel automatic prompting strategy optimization method called Heuristic Prompting Strategy Search (HPSS). Inspired by the genetic algorithm, HPSS conducts an iterative search to find well-behaved prompting strategies for LLM evaluators. A heuristic function is employed to guide the search process, enhancing the performance of our algorithm. Extensive experiments across four evaluation tasks demonstrate the effectiveness of HPSS, consistently outperforming both human-designed evaluation prompts and existing automatic prompt optimization methods.
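The abstract describes the search only at a high level, so the following is a minimal sketch of a genetic-style strategy search guided by a heuristic, not the authors' implementation. Everything named here is an assumption made for illustration: the four factors and their values (the paper integrates 8 factors, which this page does not list), the `toy_score` function standing in for agreement with human judgment on a validation set, and the heuristic itself (the mean score observed so far for each factor value).

```python
import random
from collections import defaultdict

# Hypothetical factors and values, for illustration only.
FACTORS = {
    "criteria": ["none", "brief", "detailed"],
    "output_format": ["score_only", "explanation_then_score", "score_then_explanation"],
    "scale": ["1-5", "1-10", "0-100"],
    "reference": ["absent", "present"],
}


def toy_score(strategy):
    """Stand-in for measured agreement with human judgment (assumption)."""
    bonus = {"detailed": 0.3, "explanation_then_score": 0.25, "1-10": 0.1, "present": 0.2}
    score = sum(bonus.get(value, 0.0) for value in strategy.values())
    # Interaction term: some factor values only pay off in combination.
    if strategy["criteria"] == "detailed" and strategy["reference"] == "present":
        score += 0.15
    return score


def heuristic_search(score_fn, factors, population_size=4, steps=30, explore=0.2, seed=0):
    """Iteratively mutate strategies, preferring values that scored well so far."""
    rng = random.Random(seed)
    value_scores = defaultdict(list)  # (factor, value) -> scores of strategies using it
    cache = {}  # each distinct strategy is scored once

    def evaluate(strategy):
        key = tuple(sorted(strategy.items()))
        if key not in cache:
            cache[key] = score_fn(strategy)
            for item in strategy.items():
                value_scores[item].append(cache[key])
        return cache[key]

    def heuristic(factor, value):
        seen = value_scores[(factor, value)]
        # Unseen values get infinite priority so that each is tried at least once.
        return sum(seen) / len(seen) if seen else float("inf")

    def mutate(parent):
        child = dict(parent)
        factor = rng.choice(list(factors))
        # Every factor has at least two values, so options is never empty.
        options = [v for v in factors[factor] if v != parent[factor]]
        if rng.random() < explore:
            child[factor] = rng.choice(options)  # random exploration
        else:
            child[factor] = max(options, key=lambda v: heuristic(factor, v))
        return child

    population = [{f: rng.choice(vs) for f, vs in factors.items()}
                  for _ in range(population_size)]
    for strategy in population:
        evaluate(strategy)

    for _ in range(steps):
        children = [mutate(parent) for parent in population]
        # Deduplicate, then keep the best strategies (parents survive if still best).
        pool = {tuple(sorted(s.items())): s for s in population + children}
        population = sorted(pool.values(), key=evaluate, reverse=True)[:population_size]

    best = population[0]
    return best, evaluate(best)


if __name__ == "__main__":
    best_strategy, best_score = heuristic_search(toy_score, FACTORS)
    print(best_strategy, round(best_score, 3))
```

In a real setting, `score_fn` would build an evaluation prompt from the chosen factor values, run the LLM evaluator on a validation set, and return a correlation with human ratings. Caching matters there because each evaluation costs LLM calls.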
Related papers
- An Automatic and Cost-Efficient Peer-Review Framework for Language Generation Evaluation [29.81362106367831]
Existing evaluation methods often suffer from high costs, limited test formats, the need for human references, and systematic evaluation biases.
In contrast to previous studies that rely on human annotations, Auto-PRE selects evaluators automatically based on their inherent traits.
Experimental results indicate our Auto-PRE achieves state-of-the-art performance at a lower cost.
arXiv Detail & Related papers (2024-10-16T06:06:06Z)
- AIME: AI System Optimization via Multiple LLM Evaluators [79.03422337674664]
AIME is an evaluation protocol that utilizes multiple LLMs that each independently generate an evaluation on separate criteria and then combine them via concatenation.
We show AIME outperforming baseline methods in code generation tasks, with up to 62% higher error detection rate and up to 16% higher success rate than a single LLM evaluation protocol on LeetCodeHard and HumanEval datasets.
arXiv Detail & Related papers (2024-10-04T04:03:24Z)
- On Speeding Up Language Model Evaluation [48.51924035873411]
Development of prompt-based methods with Large Language Models (LLMs) requires making numerous decisions.
We propose a novel method to address this challenge.
We show that it can identify the top-performing method using only 5-15% of the typically needed resources.
arXiv Detail & Related papers (2024-07-08T17:48:42Z)
- A Better LLM Evaluator for Text Generation: The Impact of Prompt Output Sequencing and Optimization [17.38671584773247]
This research investigates prompt designs for evaluating generated texts using large language models (LLMs).
We found that the order of presenting reasons and scores significantly influences LLMs' scoring.
An additional optimization may enhance scoring alignment if sufficient data is available.
arXiv Detail & Related papers (2024-06-14T12:31:44Z)
- Discovering Preference Optimization Algorithms with and for Large Language Models [50.843710797024805]
Offline preference optimization is a key method for enhancing and controlling the quality of Large Language Model (LLM) outputs.
We perform objective discovery to automatically discover new state-of-the-art preference optimization algorithms without (expert) human intervention.
Experiments demonstrate the state-of-the-art performance of DiscoPOP, a novel algorithm that adaptively blends logistic and exponential losses.
arXiv Detail & Related papers (2024-06-12T16:58:41Z)
- Semantic-guided Prompt Organization for Universal Goal Hijacking against LLMs [30.56428628397079]
We propose a universal goal hijacking method called POUGH that incorporates semantic-guided prompt processing strategies.
The method starts with a sampling strategy to select representative prompts from a candidate pool, followed by a ranking strategy that prioritizes the prompts.
Experiments conducted on four popular Large Language Models and ten types of target responses verified the effectiveness of our method.
arXiv Detail & Related papers (2024-05-23T05:31:41Z)
- Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences [11.23629471911503]
EvalGen provides automated assistance to users in generating evaluation criteria and implementing assertions.
A qualitative study finds overall support for EvalGen but underscores the subjectivity and iterative process of alignment.
We identify a phenomenon we dub criteria drift: users need criteria to grade outputs, but grading outputs helps users define criteria.
arXiv Detail & Related papers (2024-04-18T15:45:27Z)
- Query-Dependent Prompt Evaluation and Optimization with Offline Inverse RL [62.824464372594576]
We aim to enhance arithmetic reasoning ability of Large Language Models (LLMs) through zero-shot prompt optimization.
We identify a previously overlooked objective of query dependency in such optimization.
We introduce Prompt-OIRL, which harnesses offline inverse reinforcement learning to draw insights from offline prompting demonstration data.
arXiv Detail & Related papers (2023-09-13T01:12:52Z)
- Preference Ranking Optimization for Human Alignment [90.6952059194946]
Large language models (LLMs) often contain misleading content, emphasizing the need to align them with human values.
Reinforcement learning from human feedback (RLHF) has been employed to achieve this alignment.
We propose Preference Ranking Optimization (PRO) as an efficient SFT algorithm to fine-tune LLMs for human alignment.
arXiv Detail & Related papers (2023-06-30T09:07:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.