PiCO: Peer Review in LLMs based on the Consistency Optimization
- URL: http://arxiv.org/abs/2402.01830v2
- Date: Sat, 20 Apr 2024 09:09:38 GMT
- Title: PiCO: Peer Review in LLMs based on the Consistency Optimization
- Authors: Kun-Peng Ning, Shuo Yang, Yu-Yang Liu, Jia-Yu Yao, Zhen-Hui Liu, Yu Wang, Ming Pang, Li Yuan,
- Abstract summary: We use peer-review mechanisms to measure large language models (LLMs) automatically.
We formalize it as a constrained optimization problem, intending to maximize the consistency of each LLM's capabilities and scores.
We propose three metrics called PEN, CIN, and LIS to evaluate the gap in aligning human rankings.
- Score: 19.130941716491716
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing large language models (LLMs) evaluation methods typically focus on testing the performance on some closed-environment and domain-specific benchmarks with human annotations. In this paper, we explore a novel unsupervised evaluation direction, utilizing peer-review mechanisms to measure LLMs automatically. In this setting, both open-source and closed-source LLMs lie in the same environment, capable of answering unlabeled questions and evaluating each other, where each LLM's response score is jointly determined by other anonymous ones. To obtain the ability hierarchy among these models, we assign each LLM a learnable capability parameter to adjust the final ranking. We formalize it as a constrained optimization problem, intending to maximize the consistency of each LLM's capabilities and scores. The key assumption behind is that high-level LLM can evaluate others' answers more accurately than low-level ones, while higher-level LLM can also achieve higher response scores. Moreover, we propose three metrics called PEN, CIN, and LIS to evaluate the gap in aligning human rankings. We perform experiments on multiple datasets with these metrics, validating the effectiveness of the proposed approach.
Related papers
- Decompose and Aggregate: A Step-by-Step Interpretable Evaluation Framework [75.81096662788254]
Large Language Models (LLMs) are scalable and economical evaluators.
The question of how reliable these evaluators are has emerged as a crucial research question.
We propose Decompose and Aggregate, which breaks down the evaluation process into different stages based on pedagogical practices.
arXiv Detail & Related papers (2024-05-24T08:12:30Z) - Can Large Language Models be Trusted for Evaluation? Scalable
Meta-Evaluation of LLMs as Evaluators via Agent Debate [74.06294042304415]
We propose ScaleEval, an agent-debate-assisted meta-evaluation framework.
We release the code for our framework, which is publicly available on GitHub.
arXiv Detail & Related papers (2024-01-30T07:03:32Z) - PRE: A Peer Review Based Large Language Model Evaluator [14.585292530642603]
Existing paradigms rely on either human annotators or model-based evaluators to evaluate the performance of LLMs.
We propose a novel framework that can automatically evaluate LLMs through a peer-review process.
arXiv Detail & Related papers (2024-01-28T12:33:14Z) - Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization [132.25202059478065]
We benchmark large language models (LLMs) on instruction controllable text summarization.
Our study reveals that instruction controllable text summarization remains a challenging task for LLMs.
arXiv Detail & Related papers (2023-11-15T18:25:26Z) - Evaluating Large Language Models at Evaluating Instruction Following [54.49567482594617]
We introduce a challenging meta-evaluation benchmark, LLMBar, designed to test the ability of an LLM evaluator in discerning instruction-following outputs.
We discover that different evaluators exhibit distinct performance on LLMBar and even the highest-scoring ones have substantial room for improvement.
arXiv Detail & Related papers (2023-10-11T16:38:11Z) - Beyond Static Datasets: A Deep Interaction Approach to LLM Evaluation [16.73300162869746]
Large Language Models (LLMs) have made progress in various real-world tasks.
Existing evaluation methods are mainly supervised signal-based.
We propose a novel Deep Interaction-based LLM-evaluation framework.
arXiv Detail & Related papers (2023-09-08T15:00:41Z) - LLMRec: Benchmarking Large Language Models on Recommendation Task [54.48899723591296]
The application of Large Language Models (LLMs) in the recommendation domain has not been thoroughly investigated.
We benchmark several popular off-the-shelf LLMs on five recommendation tasks, including rating prediction, sequential recommendation, direct recommendation, explanation generation, and review summarization.
The benchmark results indicate that LLMs displayed only moderate proficiency in accuracy-based tasks such as sequential and direct recommendation.
arXiv Detail & Related papers (2023-08-23T16:32:54Z) - On Learning to Summarize with Large Language Models as References [101.79795027550959]
Large language models (LLMs) are favored by human annotators over the original reference summaries in commonly used summarization datasets.
We study an LLM-as-reference learning setting for smaller text summarization models to investigate whether their performance can be substantially improved.
arXiv Detail & Related papers (2023-05-23T16:56:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.