Uncertainty-Aware Answer Selection for Improved Reasoning in Multi-LLM Systems
- URL: http://arxiv.org/abs/2510.02377v1
- Date: Tue, 30 Sep 2025 01:25:19 GMT
- Title: Uncertainty-Aware Answer Selection for Improved Reasoning in Multi-LLM Systems
- Authors: Aakriti Agrawal, Rohith Aralikatti, Anirudh Satheesh, Souradip Chakraborty, Amrit Singh Bedi, Furong Huang,
- Abstract summary: Large Language Models (LLMs) have demonstrated exceptional capabilities, yet selecting the most reliable response from multiple LLMs remains a challenge. Existing approaches often depend on costly external verifiers, human evaluators, or self-consistency techniques that require multiple samples from a single model. We propose a principled, novel, and computationally efficient method to select the best response from multiple different LLMs using a calibrated log-likelihood score.
- Score: 55.6590601898194
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) have demonstrated exceptional capabilities, yet selecting the most reliable response from multiple LLMs remains a challenge, particularly in resource-constrained settings. Existing approaches often depend on costly external verifiers, human evaluators, or self-consistency techniques that require multiple samples from a single model. While multi-LLM systems produce more diverse responses than single models and thus have greater potential, they often underperform compared to single-LLM self-consistency. We propose a principled, novel, and computationally efficient method to select the best response from multiple different LLMs using a calibrated log-likelihood score, implicitly leveraging the inherent knowledge and confidence of these models. Our method demonstrates improvements of approx. 4%, 3%, and 5% across both debate (multi-round LLM discussions) and non-debate (Best-of-N with multiple LLMs) settings on the GSM8K, MMLU (6 subsets), and ARC datasets respectively.
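The abstract describes ranking candidate responses from different LLMs by a calibrated log-likelihood score. As an illustrative sketch only (the paper's exact calibration is not given here), one simple calibration is length normalization, which keeps long and short answers comparable; the candidate texts and per-token log-probabilities below are hypothetical:

```python
def length_normalized_score(token_logprobs):
    """Average per-token log-likelihood of a response under its model.
    Length normalization is one simple calibration; the paper's exact
    scoring may differ -- this is an assumption for illustration."""
    if not token_logprobs:
        return float("-inf")
    return sum(token_logprobs) / len(token_logprobs)

def select_best_response(candidates):
    """candidates: list of (response_text, token_logprobs) pairs, one per
    LLM. Returns the response with the highest calibrated score."""
    return max(candidates, key=lambda c: length_normalized_score(c[1]))[0]

# Toy example with hypothetical per-token log-probs from three models.
candidates = [
    ("answer A", [-0.9, -1.2, -0.8]),
    ("answer B", [-0.2, -0.3, -0.4, -0.25]),
    ("answer C", [-1.5, -2.0]),
]
print(select_best_response(candidates))  # -> answer B
```

Here "answer B" wins because its average per-token log-likelihood (about -0.29) exceeds those of the other candidates, regardless of response length.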
Related papers
- Scoring, Reasoning, and Selecting the Best! Ensembling Large Language Models via a Peer-Review Process [58.265053900416895]
LLM-PeerReview is built on a novel, peer-review-inspired framework. It operates in three stages: for scoring, we use the emerging LLM-as-a-Judge technique. For reasoning, we can apply a graphical model-based truth inference algorithm. Finally, the highest-scoring response is selected as the best ensemble output.
arXiv Detail & Related papers (2025-12-29T05:25:49Z) - Beyond Majority Voting: LLM Aggregation by Leveraging Higher-Order Information [57.397381631496906]
We develop two new aggregation algorithms called Optimal Weight (OW) and Inverse Surprising Popularity (ISP). Our theoretical analysis shows these methods provably mitigate inherent limitations of majority voting under mild assumptions. We empirically validate our algorithms on synthetic datasets, popular LLM fine-tuning benchmarks such as UltraFeedback and MMLU, and a real-world healthcare setting, ARMMAN.
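For reference, the baseline this entry aims to improve on is plain majority voting over model answers, which can be sketched in a few lines (the OW/ISP algorithms themselves are not reproduced here):

```python
from collections import Counter

def majority_vote(answers):
    """Plain majority voting over a list of model answers -- the baseline
    that OW and ISP improve on by using higher-order information.
    Ties resolve to the answer that appears first."""
    winner, _ = Counter(answers).most_common(1)[0]
    return winner

print(majority_vote(["B", "A", "B", "C", "B"]))  # -> B
```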
arXiv Detail & Related papers (2025-10-01T22:21:50Z) - M3PO: Multimodal-Model-Guided Preference Optimization for Visual Instruction Following [4.119014132092875]
Large Vision-Language Models (LVLMs) hold immense potential for complex multimodal instruction following. M3PO is a novel and data-efficient method designed to enhance LVLMs' capabilities in visual instruction following. M3PO intelligently selects the most "learning-valuable" preference sample pairs from a diverse pool of LVLM-generated candidates.
arXiv Detail & Related papers (2025-08-17T18:07:55Z) - Training-free LLM Verification via Recycling Few-shot Examples [11.796208194946141]
ReFeri evaluates the generated outputs by combining two different scores motivated by Bayes' rule, and subsequently selects the candidate. Experiments with three different LLMs across seven diverse tasks demonstrate that our framework significantly improves the accuracy of LLMs.
arXiv Detail & Related papers (2025-06-08T10:02:07Z) - Self-ensemble: Mitigating Confidence Distortion for Large Language Models [89.03110940871765]
Large Language Models exhibit a confidence distortion problem on multiple-choice question answering. We propose Self-ensemble to solve this problem. Experimental results on three LLMs and datasets demonstrate that Self-ensemble comprehensively addresses the confidence distortion problem.
arXiv Detail & Related papers (2025-06-02T17:59:29Z) - Optimizing Model Selection for Compound AI Systems [76.69936664916061]
We propose an efficient framework for model selection in compound systems. It iteratively selects one module and allocates to it the model with the highest module-wise performance. It confers 5%-70% accuracy gains compared to using the same LLM for all modules.
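The per-module allocation this entry describes can be sketched as a greedy assignment: each module gets the model with the highest module-wise score. This is a simplified reading of the iterative procedure in the abstract; the module names, model names, and `module_score` callable below are hypothetical:

```python
def greedy_module_allocation(modules, models, module_score):
    """Assign each module the model with the highest module-wise score.
    A simplified sketch of the iterative allocation in the abstract;
    `module_score(module, model)` is a hypothetical validation metric."""
    return {m: max(models, key=lambda llm: module_score(m, llm)) for m in modules}

# Toy scores for a two-module pipeline and two hypothetical models.
scores = {
    ("retrieve", "model-small"): 0.70, ("retrieve", "model-large"): 0.72,
    ("answer", "model-small"): 0.55, ("answer", "model-large"): 0.81,
}
allocation = greedy_module_allocation(
    ["retrieve", "answer"],
    ["model-small", "model-large"],
    lambda m, llm: scores[(m, llm)],
)
print(allocation)  # -> {'retrieve': 'model-large', 'answer': 'model-large'}
```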
arXiv Detail & Related papers (2025-02-20T18:36:25Z) - LLM Bandit: Cost-Efficient LLM Generation via Preference-Conditioned Dynamic Routing [3.090041654375235]
We present a novel framework that formulates the LLM selection process as a multi-armed bandit problem. Our approach incorporates a preference-conditioned dynamic routing mechanism, allowing users to specify their preferences at inference time. Our method achieves significant improvements in both accuracy and cost-effectiveness across various LLM platforms.
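To make the bandit formulation concrete, here is a minimal epsilon-greedy router over LLM arms. This is a generic bandit baseline, not the paper's preference-conditioned mechanism; the arm names and reward statistics are hypothetical:

```python
import random

def epsilon_greedy_route(stats, epsilon=0.1, rng=random):
    """Pick an LLM arm: exploit the best observed mean reward with
    probability 1 - epsilon, otherwise explore uniformly at random.
    `stats` maps arm name -> (total_reward, pulls)."""
    if rng.random() < epsilon:
        return rng.choice(list(stats))
    return max(stats, key=lambda a: stats[a][0] / max(stats[a][1], 1))

def update(stats, arm, reward):
    """Record an observed reward for the chosen arm."""
    total, pulls = stats.get(arm, (0.0, 0))
    stats[arm] = (total + reward, pulls + 1)

stats = {"llm-a": (8.0, 10), "llm-b": (3.0, 10)}
print(epsilon_greedy_route(stats, epsilon=0.0))  # always exploits -> llm-a
```

With `epsilon=0.0` the router always exploits, so the arm with the higher mean reward (0.8 vs. 0.3) is chosen deterministically.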
arXiv Detail & Related papers (2025-02-04T22:09:43Z) - SelectLLM: Query-Aware Efficient Selection Algorithm for Large Language Models [8.558834738072363]
Large language models (LLMs) have been widely adopted due to their remarkable performance across various applications. These individual LLMs show limitations in generalization and performance on complex tasks due to inherent training biases, model size constraints, and the quality or diversity of pre-training datasets. We introduce SelectLLM, which efficiently directs input queries to the most suitable subset of LLMs from a large pool.
arXiv Detail & Related papers (2024-08-16T06:11:21Z) - UBench: Benchmarking Uncertainty in Large Language Models with Multiple Choice Questions [10.28688988951815]
We introduce UBench, a new benchmark for evaluating the uncertainty of large language models (LLMs). Unlike other benchmarks, UBench is based on confidence intervals. It encompasses 11,978 multiple-choice questions spanning knowledge, language, understanding, and reasoning capabilities. Our analysis reveals several crucial insights: 1) our confidence interval-based methods are highly effective for uncertainty quantification; 2) regarding uncertainty, outstanding open-source models show competitive performance versus closed-source models; 3) CoT and RP prompts present potential ways to improve model reliability, while the influence of temperature changes follows no universal rule.
arXiv Detail & Related papers (2024-06-18T16:50:38Z) - Self-prompted Chain-of-Thought on Large Language Models for Open-domain Multi-hop Reasoning [70.74928578278957]
In open-domain question-answering (ODQA), most existing questions require single-hop reasoning on commonsense.
Large language models (LLMs) have found significant utility in facilitating ODQA without external corpus.
We propose Self-prompted Chain-of-Thought (SP-CoT), an automated framework to mass-produce high quality CoTs.
arXiv Detail & Related papers (2023-10-20T14:51:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.