From Replication to Redesign: Exploring Pairwise Comparisons for LLM-Based Peer Review
- URL: http://arxiv.org/abs/2506.11343v1
- Date: Thu, 12 Jun 2025 22:27:20 GMT
- Title: From Replication to Redesign: Exploring Pairwise Comparisons for LLM-Based Peer Review
- Authors: Yaohui Zhang, Haijing Zhang, Wenlong Ji, Tianyu Hua, Nick Haber, Hancheng Cao, Weixin Liang
- Abstract summary: We introduce and explore a novel mechanism that employs LLM agents to perform pairwise comparisons among manuscripts. Our experiments demonstrate that this comparative approach significantly outperforms traditional rating-based methods in identifying high-impact papers.
- Score: 11.761671590108406
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The advent of large language models (LLMs) offers unprecedented opportunities to reimagine peer review beyond the constraints of traditional workflows. Despite these opportunities, prior efforts have largely focused on replicating traditional review workflows with LLMs serving as direct substitutes for human reviewers, while limited attention has been given to exploring new paradigms that fundamentally rethink how LLMs can participate in the academic review process. In this paper, we introduce and explore a novel mechanism that employs LLM agents to perform pairwise comparisons among manuscripts instead of individual scoring. By aggregating outcomes from a large number of pairwise evaluations, this approach enables a more accurate and robust measure of relative manuscript quality. Our experiments demonstrate that this comparative approach significantly outperforms traditional rating-based methods in identifying high-impact papers. However, our analysis also reveals emergent biases in the selection process, notably reduced novelty in research topics and increased institutional imbalance. These findings highlight both the transformative potential of rethinking peer review with LLMs and the critical challenges that future systems must address to ensure equity and diversity.
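Aggregating many pairwise outcomes into a single quality ranking is the step that makes such a mechanism work. As a rough illustration (the paper's exact aggregation procedure is not given here), the Python sketch below fits a Bradley-Terry model to simulated pairwise judgments; the function name, the gradient-ascent fitting loop, and the toy data are all illustrative assumptions, not the authors' method.

```python
import math
import random

def bradley_terry(num_items, comparisons, iters=200, lr=0.1):
    """Fit Bradley-Terry quality scores from pairwise outcomes.

    comparisons: list of (winner, loser) index pairs, e.g. the
    manuscript an LLM agent judged stronger in each comparison.
    Returns one latent score per item (higher = better).
    Illustrative sketch only, not the paper's published procedure.
    """
    theta = [0.0] * num_items
    for _ in range(iters):
        grad = [0.0] * num_items
        for w, l in comparisons:
            # P(w beats l) under the current scores
            p = 1.0 / (1.0 + math.exp(theta[l] - theta[w]))
            grad[w] += 1.0 - p  # push the winner's score up
            grad[l] -= 1.0 - p  # and the loser's score down
        theta = [t + lr * g for t, g in zip(theta, grad)]
        mean = sum(theta) / num_items
        theta = [t - mean for t in theta]  # scores are identifiable only up to a shift
    return theta

# Toy usage: 4 "manuscripts" with known latent quality, judged noisily.
random.seed(0)
true_quality = [2.0, 1.0, 0.0, -1.0]
pairs = []
for _ in range(300):
    i, j = random.sample(range(4), 2)
    p_i = 1.0 / (1.0 + math.exp(true_quality[j] - true_quality[i]))
    pairs.append((i, j) if random.random() < p_i else (j, i))

scores = bradley_terry(4, pairs)
print(sorted(range(4), key=lambda k: -scores[k]))  # typically [0, 1, 2, 3]
```

When comparisons arrive incrementally rather than in a batch, an Elo-style online update or a simple win-rate count are common lighter-weight alternatives.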
Related papers
- Exploring the Potential of LLMs for Serendipity Evaluation in Recommender Systems [10.227007419503297]
Large language models (LLMs) are increasingly revolutionizing evaluation methodologies across various human annotation tasks.
We conduct a meta-evaluation on two datasets derived from real user studies in the e-commerce and movie domains.
Our findings indicate that even the simplest zero-shot LLMs achieve parity with, or surpass, the performance of conventional metrics.
arXiv Detail & Related papers (2025-07-23T07:51:56Z)
- Model Utility Law: Evaluating LLMs beyond Performance through Mechanism Interpretable Metric [99.56567010306807]
Large Language Models (LLMs) have become indispensable across academia, industry, and daily applications.
One core challenge of evaluation in the LLM era is the generalization issue.
We propose the Model Utilization Index (MUI), a mechanism-interpretability-enhanced metric that complements traditional performance scores.
arXiv Detail & Related papers (2025-04-10T04:09:47Z)
- ReMA: Learning to Meta-think for LLMs with Multi-Agent Reinforcement Learning [53.817538122688944]
We introduce Reinforced Meta-thinking Agents (ReMA) to elicit meta-thinking behaviors from Large Language Models (LLMs).
ReMA decouples the reasoning process into two hierarchical agents: a high-level meta-thinking agent responsible for generating strategic oversight and plans, and a low-level reasoning agent for detailed execution.
Empirical results from single-turn experiments demonstrate that ReMA outperforms single-agent RL baselines on complex reasoning tasks.
arXiv Detail & Related papers (2025-03-12T16:05:31Z)
- Enhancing LLM Evaluations: The Garbling Trick [0.0]
We propose a method to transform existing large language model (LLM) evaluations into a series of progressively more difficult tasks.
These enhanced evaluations emphasize reasoning capabilities and can reveal relative performance differences that are not apparent in the original assessments.
Our results offer insights into the comparative abilities of these models, particularly highlighting the differences between base LLMs and more recent "reasoning" models.
arXiv Detail & Related papers (2024-11-03T11:39:50Z)
- RevisEval: Improving LLM-as-a-Judge via Response-Adapted References [95.29800580588592]
RevisEval is a novel text generation evaluation paradigm via response-adapted references.
RevisEval is driven by the key observation that an ideal reference should maintain the necessary relevance to the response to be evaluated.
arXiv Detail & Related papers (2024-10-07T16:50:47Z)
- DnA-Eval: Enhancing Large Language Model Evaluation through Decomposition and Aggregation [75.81096662788254]
Large Language Models (LLMs) are scalable and economical evaluators.
How reliable these evaluators are has emerged as a crucial research question.
We propose Decompose and Aggregate, which breaks down the evaluation process into different stages based on pedagogical practices.
arXiv Detail & Related papers (2024-05-24T08:12:30Z)
- Aligning with Human Judgement: The Role of Pairwise Preference in Large Language Model Evaluators [48.54465599914978]
Large Language Models (LLMs) have demonstrated promising capabilities as automatic evaluators in assessing the quality of generated natural language.
LLMs still exhibit biases in evaluation and often struggle to generate coherent evaluations that align with human assessments.
We introduce Pairwise-preference Search (PAIRS), an uncertainty-guided search-based rank aggregation method that employs LLMs to conduct pairwise comparisons locally and efficiently ranks candidate texts globally (a simplified sketch of this idea follows the list below).
arXiv Detail & Related papers (2024-03-25T17:11:28Z)
- PRE: A Peer Review Based Large Language Model Evaluator [14.585292530642603]
Existing paradigms rely on either human annotators or model-based evaluators to evaluate the performance of LLMs.
We propose a novel framework that can automatically evaluate LLMs through a peer-review process.
arXiv Detail & Related papers (2024-01-28T12:33:14Z)
- MR-GSM8K: A Meta-Reasoning Benchmark for Large Language Model Evaluation [60.65820977963331]
We introduce a novel evaluation paradigm for Large Language Models (LLMs).
This paradigm shifts the emphasis from result-oriented assessments, which often neglect the reasoning process, to a more comprehensive evaluation.
By applying this paradigm to the GSM8K dataset, we have developed the MR-GSM8K benchmark.
arXiv Detail & Related papers (2023-12-28T15:49:43Z)
- Don't Make Your LLM an Evaluation Benchmark Cheater [142.24553056600627]
Large language models (LLMs) have greatly advanced the frontiers of artificial intelligence, attaining remarkable improvements in model capacity.
To assess model performance, a typical approach is to construct evaluation benchmarks for measuring the ability level of LLMs.
We discuss the potential risk and impact of inappropriately using evaluation benchmarks and misleadingly interpreting the evaluation results.
arXiv Detail & Related papers (2023-11-03T14:59:54Z)
- On Learning to Summarize with Large Language Models as References [101.79795027550959]
Summaries generated by large language models (LLMs) are favored by human annotators over the original reference summaries in commonly used summarization datasets.
We study an LLM-as-reference learning setting for smaller text summarization models to investigate whether their performance can be substantially improved.
arXiv Detail & Related papers (2023-05-23T16:56:04Z)
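For the PAIRS entry above, the core idea of uncertainty-guided pairwise ranking can also be sketched briefly. This is a deliberate simplification under stated assumptions, not the published PAIRS algorithm: it greedily spends each comparison on the pair whose outcome is most ambiguous under the current scores, then applies a Bradley-Terry-style update. The `judge` stand-in and all names are hypothetical.

```python
import math
import random

def uncertainty_guided_rank(items, judge, budget=60, lr=0.3):
    """Rank items with a limited budget of pairwise judgments,
    always querying the pair whose win probability is closest
    to 0.5 under the current scores. Illustrative only; not
    the actual PAIRS search procedure."""
    n = len(items)
    theta = [0.0] * n
    for _ in range(budget):
        # pick the currently most ambiguous pair
        best, best_gap = None, float("inf")
        for i in range(n):
            for j in range(i + 1, n):
                p = 1.0 / (1.0 + math.exp(theta[j] - theta[i]))
                if abs(p - 0.5) < best_gap:
                    best, best_gap = (i, j), abs(p - 0.5)
        i, j = best
        w, l = (i, j) if judge(items[i], items[j]) else (j, i)
        p = 1.0 / (1.0 + math.exp(theta[l] - theta[w]))
        theta[w] += lr * (1.0 - p)  # strengthen the winner
        theta[l] -= lr * (1.0 - p)  # weaken the loser
    return sorted(range(n), key=lambda k: -theta[k])

# Stand-in judge: the longer text "wins", with 10% label noise; a real
# system would prompt an LLM with both candidate texts here.
random.seed(1)
texts = ["a" * k for k in (5, 9, 2, 7)]
judge = lambda x, y: (len(x) > len(y)) ^ (random.random() < 0.1)
print(uncertainty_guided_rank(texts, judge))  # likely [1, 3, 0, 2]
```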