Related papers: Scoring, Reasoning, and Selecting the Best! Ensembling Large Language Models via a Peer-Review Process

URL: http://arxiv.org/abs/2512.23213v1
Date: Mon, 29 Dec 2025 05:25:49 GMT
Title: Scoring, Reasoning, and Selecting the Best! Ensembling Large Language Models via a Peer-Review Process
Authors: Zhijun Chen, Zeyu Ji, Qianren Mao, Junhang Cheng, Bangjie Qin, Hao Wu, Zhuoran Li, Jingzheng Li, Kai Sun, Zizhe Wang, Yikun Ban, Zhu Sun, Xiangyang Ji, Hailong Sun
Abstract summary: LLM-PeerReview is built on a novel, peer-review-inspired framework. It operates in three stages. For scoring, we use the emerging LLM-as-a-Judge technique. For reasoning, we can apply a graphical model-based truth inference algorithm. Finally, the highest-scoring response is selected as the best ensemble output.
Score: 58.265053900416895
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We propose LLM-PeerReview, an unsupervised LLM Ensemble method that selects the most ideal response from multiple LLM-generated candidates for each query, harnessing the collective wisdom of multiple models with diverse strengths. LLM-PeerReview is built on a novel, peer-review-inspired framework that offers a clear and interpretable mechanism, while remaining fully unsupervised for flexible adaptability and generalization. Specifically, it operates in three stages: for scoring, we use the emerging LLM-as-a-Judge technique to evaluate each response by reusing multiple LLMs at hand; for reasoning, we can apply a principled graphical model-based truth inference algorithm or a straightforward averaging strategy to aggregate multiple scores to produce a final score for each response; finally, the highest-scoring response is selected as the best ensemble output. LLM-PeerReview is conceptually simple and empirically powerful. The two variants of the proposed approach obtain strong results across four datasets, including outperforming the recent advanced model Smoothie-Global by 6.9 and 7.3 percentage points, respectively.
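The three stages described in the abstract can be illustrated with a minimal Python sketch of the simpler averaging variant only. This is not the authors' implementation: the function name `peer_review_select`, the `Judge` callable signature, and the two toy judges are hypothetical stand-ins for real LLM-as-a-Judge calls, and the graphical model-based truth inference variant is not shown.

```python
from statistics import mean
from typing import Callable, List, Sequence, Tuple

# Hypothetical judge signature (an assumption, not from the paper):
# a judge takes (query, response) and returns a numeric quality score.
Judge = Callable[[str, str], float]


def peer_review_select(
    query: str,
    responses: Sequence[str],
    judges: Sequence[Judge],
) -> Tuple[int, List[float]]:
    """Return the index of the selected response and all final scores."""
    if not responses or not judges:
        raise ValueError("need at least one response and one judge")

    # Stage 1 (scoring): every judge scores every candidate response.
    score_matrix = [[judge(query, r) for judge in judges] for r in responses]

    # Stage 2 (reasoning): aggregate each response's scores by plain averaging.
    final_scores = [mean(row) for row in score_matrix]

    # Stage 3 (selection): pick the response with the highest final score.
    best_index = max(range(len(responses)), key=final_scores.__getitem__)
    return best_index, final_scores


# Toy judges for illustration only; real judges would prompt LLMs.
def toy_keyword_judge(query: str, response: str) -> float:
    return 5.0 if "Paris" in response else 1.0


def toy_punctuation_judge(query: str, response: str) -> float:
    return 4.0 if response.endswith(".") else 2.0


if __name__ == "__main__":
    candidates = ["Paris.", "Lyon", "Paris"]
    idx, scores = peer_review_select(
        "What is the capital of France?",
        candidates,
        [toy_keyword_judge, toy_punctuation_judge],
    )
    print(candidates[idx], scores)
```

In this sketch, ties are broken in favor of the earliest candidate, which is a design choice of the example rather than something the abstract specifies.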

Related papers

Wisdom and Delusion of LLM Ensembles for Code Generation and Repair [45.969630994412846]
We compare ten individual Large Language Models with three ensembles of these LLMs across three software engineering benchmarks. We find that the theoretical upper bound for an ensemble's performance can be 83% above the best single model. A diversity-based strategy realizes up to 95% of this theoretical potential, and proves effective even in small two-model ensembles.
arXiv Detail & Related papers (2025-10-24T14:39:23Z)
Beyond Majority Voting: LLM Aggregation by Leveraging Higher-Order Information [57.397381631496906]
We develop two new aggregation algorithms called Optimal Weight (OW) and Inverse Surprising Popularity (ISP). Our theoretical analysis shows these methods provably mitigate inherent limitations of majority voting under mild assumptions. We empirically validate our algorithms on synthetic datasets, popular LLM fine-tuning benchmarks such as UltraFeedback and MMLU, and a real-world healthcare setting ARMMAN.
arXiv Detail & Related papers (2025-10-01T22:21:50Z)
Uncertainty-Aware Answer Selection for Improved Reasoning in Multi-LLM Systems [55.6590601898194]
Large Language Models (LLMs) have demonstrated exceptional capabilities, yet selecting the most reliable response from multiple LLMs remains a challenge. Existing approaches often depend on costly external verifiers, human evaluators, or self-consistency techniques that require multiple samples from a single model. We propose a principled, novel and computationally efficient method to select the best response from multiple different LLMs using a calibrated log-likelihood score.
arXiv Detail & Related papers (2025-09-30T01:25:19Z)
Self-ensemble: Mitigating Confidence Mis-calibration for Large Language Models [67.62810111789338]
Large Language Models exhibit a confidence distortion problem on multi-choice question-answering. We propose Self-ensemble to solve this problem. Experimental results on three LLMs and datasets demonstrate that Self-ensemble comprehensively addresses the confidence distortion problem.
arXiv Detail & Related papers (2025-06-02T17:59:29Z)
Leveraging LLMs as Meta-Judges: A Multi-Agent Framework for Evaluating LLM Judgments [6.270885758858811]
Large language models (LLMs) are being widely applied across various fields, but as tasks become more complex, evaluating their responses is increasingly challenging. We propose a three-stage meta-judge selection pipeline: 1) developing a comprehensive rubric with GPT-4 and human experts, 2) using three advanced LLM agents to score judgments, and 3) applying a threshold to filter out low-scoring judgments. Experimental results on the JudgeBench dataset show about 15.55% improvement compared to raw judgments and about 8.37% improvement over the single-agent baseline.
arXiv Detail & Related papers (2025-04-23T20:32:12Z)
SpecFuse: Ensembling Large Language Models via Next-Segment Prediction [42.28242821924789]
SpecFuse is an ensemble framework that outputs a fused result by iteratively producing the next segment through collaboration among LLMs. The top-ranked segment is then broadcast to all LLMs, encouraging them to generate higher-quality segments in the next round. To conserve computational resources, we propose a model exit mechanism that dynamically excludes models exhibiting poor performance in previous rounds.
arXiv Detail & Related papers (2024-12-10T10:27:41Z)
Think Twice Before Trusting: Self-Detection for Large Language Models through Comprehensive Answer Reflection [90.71323430635593]
We propose a novel self-detection paradigm that considers the comprehensive answer space beyond LLM-generated answers. Building upon this paradigm, we introduce a two-step framework, which first instructs the LLM to reflect and provide justifications for each candidate answer. This framework can be seamlessly integrated with existing approaches for superior self-detection.
arXiv Detail & Related papers (2024-03-15T02:38:26Z)
Identifying Factual Inconsistencies in Summaries: Grounding LLM Inference via Task Taxonomy [48.29181662640212]
Factual inconsistencies pose a significant hurdle to faithful summarization by generative models. We consolidate key error types of inconsistent facts in summaries, and incorporate them to facilitate both the zero-shot and supervised paradigms of LLMs.
arXiv Detail & Related papers (2024-02-20T08:41:23Z)
PiCO: Peer Review in LLMs based on the Consistency Optimization [48.48819141999387]
We use peer-review mechanisms to measure large language models (LLMs) automatically. We formalize it as a constrained optimization problem, intending to maximize the consistency of each LLM's capabilities and scores. We propose three metrics called PEN, CIN, and LIS to evaluate the gap in aligning human rankings.
arXiv Detail & Related papers (2024-02-02T18:49:26Z)

This list is automatically generated from the titles and abstracts of the papers in this site.