Self-ensemble: Mitigating Confidence Distortion for Large Language Models
- URL: http://arxiv.org/abs/2506.01951v1
- Date: Mon, 02 Jun 2025 17:59:29 GMT
- Title: Self-ensemble: Mitigating Confidence Distortion for Large Language Models
- Authors: Zicheng Xu, Guanchu Wang, Guangyao Zheng, Yu-Neng Chuang, Alexander Szalay, Xia Hu, Vladimir Braverman
- Abstract summary: Large Language Models exhibit a confidence distortion problem on multi-choice question-answering. We propose Self-ensemble to solve this problem. Experimental results on three LLMs and datasets demonstrate that Self-ensemble comprehensively addresses the confidence distortion problem.
- Score: 89.03110940871765
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although Large Language Models (LLMs) perform well in general fields, they exhibit a confidence distortion problem on multi-choice question-answering (MCQA), particularly as the number of answer choices increases. Specifically, on MCQA with many choices, LLMs suffer from under-confidence in correct predictions and over-confidence in incorrect ones, leading to substantially degraded performance. To solve this problem, we propose Self-ensemble in this work. Our method splits the choices into several groups and ensembles LLM predictions across these groups to reach a final decision. The advantage of Self-ensemble is its plug-and-play nature: it can be integrated into existing LLM architectures through a designed attention mask and positional encoding, without requiring labeled datasets for parameter tuning. Experimental results on three LLMs and datasets demonstrate that Self-ensemble comprehensively addresses the confidence distortion problem of LLMs, outperforming standard inference as well as baseline methods.
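As a rough illustration of the grouping idea in the abstract, the sketch below approximates Self-ensemble with multiple scoring passes: the choices are repeatedly split into small groups, each group is scored independently, and each choice's within-group probability is averaged to reach a final decision. The paper itself performs this in a single forward pass via a designed attention mask and positional encoding; here `score_choices` (any LLM routine returning per-choice probabilities), `group_size`, and `n_rounds` are assumptions made for illustration only.

```python
import random
from collections import defaultdict

def self_ensemble_answer(question, choices, score_choices,
                         group_size=4, n_rounds=5, seed=0):
    """Multi-pass approximation of Self-ensemble (a sketch, not the paper's
    single-pass masked-attention implementation).

    `score_choices(question, subset) -> list[float]` is a hypothetical helper
    returning per-choice probabilities (e.g. normalized option-token
    probabilities) from any LLM backend.
    """
    rng = random.Random(seed)
    totals = defaultdict(float)
    counts = defaultdict(int)
    for _ in range(n_rounds):
        shuffled = choices[:]
        rng.shuffle(shuffled)
        # Split the shuffled choices into small groups and score each group.
        for i in range(0, len(shuffled), group_size):
            group = shuffled[i:i + group_size]
            probs = score_choices(question, group)
            z = sum(probs) or 1.0  # renormalize within the group
            for choice, p in zip(group, probs):
                totals[choice] += p / z
                counts[choice] += 1
    # Final decision: the choice with the highest average within-group probability.
    return max(choices, key=lambda c: totals[c] / counts[c])
```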
Related papers
- CodeJudgeBench: Benchmarking LLM-as-a-Judge for Coding Tasks [63.562924932512765]
Large Language Models (LLMs) have advanced the state-of-the-art in various coding tasks. LLMs can also serve as judges, assessing and comparing the quality of responses generated by other models.
arXiv Detail & Related papers (2025-07-14T17:56:29Z)
- Set-LLM: A Permutation-Invariant LLM [2.9665130256021]
This paper is motivated by a specific vulnerability: the order sensitivity of large language models (LLMs). We introduce Set-LLM, a novel architectural adaptation for pretrained LLMs that enables the processing of mixed set-text inputs with permutation invariance guarantees.
arXiv Detail & Related papers (2025-05-21T12:14:26Z)
- The Self-Improvement Paradox: Can Language Models Bootstrap Reasoning Capabilities without External Scaffolding? [39.602857110637736]
This paper presents Crescent -- a framework for generating high-quality synthetic question-answer data in a fully autonomous manner. We show that Crescent sheds light on the potential of true self-improvement with zero external supervision signals for math reasoning.
arXiv Detail & Related papers (2025-02-19T05:37:08Z)
- LLM-Lasso: A Robust Framework for Domain-Informed Feature Selection and Regularization [59.75242204923353]
We introduce LLM-Lasso, a framework that leverages large language models (LLMs) to guide feature selection in Lasso regression. LLMs generate penalty factors for each feature, which are converted into weights for the Lasso penalty using a simple, tunable model. Features identified as more relevant by the LLM receive lower penalties, increasing their likelihood of being retained in the final model.
arXiv Detail & Related papers (2025-02-15T02:55:22Z)
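A minimal sketch of how LLM-derived penalty factors could enter a per-feature weighted Lasso, using the standard column-rescaling reparameterization rather than the paper's exact conversion model; `penalty_factors` and the example values are hypothetical.

```python
import numpy as np
from sklearn.linear_model import Lasso

def weighted_lasso(X, y, penalty_factors, alpha=0.1):
    """Weighted Lasso via column rescaling (a sketch, assuming LLM-derived
    penalties are already available as positive per-feature factors).

    Dividing column j by w_j, fitting an ordinary Lasso, and dividing the
    fitted coefficient by w_j is equivalent to the weighted L1 penalty
    alpha * sum_j w_j * |beta_j|, so features with lower w_j are shrunk less.
    """
    X = np.asarray(X, dtype=float)
    w = np.asarray(penalty_factors, dtype=float)
    X_scaled = X / w                      # broadcast: scale each column by 1 / w_j
    model = Lasso(alpha=alpha, max_iter=10_000).fit(X_scaled, y)
    beta = model.coef_ / w                # map coefficients back to the original scale
    return beta, model.intercept_

# Hypothetical penalties: an LLM judged features 0 and 2 as most relevant.
# beta, b0 = weighted_lasso(X, y, penalty_factors=[0.2, 1.0, 0.3, 1.5])
```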
- Preference Leakage: A Contamination Problem in LLM-as-a-judge [69.96778498636071]
Large Language Models (LLMs) as judges and LLM-based data synthesis have emerged as two fundamental LLM-driven data annotation methods. In this work, we expose preference leakage, a contamination problem in LLM-as-a-judge caused by the relatedness between the synthetic data generators and LLM-based evaluators.
arXiv Detail & Related papers (2025-02-03T17:13:03Z)
- SelectLLM: Query-Aware Efficient Selection Algorithm for Large Language Models [8.558834738072363]
Large language models (LLMs) have been widely adopted due to their remarkable performance across various applications. However, individual LLMs show limitations in generalization and performance on complex tasks due to inherent training biases, model size constraints, and the quality or diversity of pre-training datasets. We introduce SelectLLM, which efficiently directs input queries to the most suitable subset of LLMs from a large pool.
arXiv Detail & Related papers (2024-08-16T06:11:21Z)
- Cycles of Thought: Measuring LLM Confidence through Stable Explanations [53.15438489398938]
Large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, but their overconfidence in incorrect responses is still a well-documented failure mode.
We propose a framework for measuring an LLM's uncertainty with respect to the distribution of generated explanations for an answer.
arXiv Detail & Related papers (2024-06-05T16:35:30Z)
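A hedged sketch of the underlying idea: confidence in an answer is estimated as its probability averaged over sampled explanations. `sample_explanation` and `answer_prob` are assumed LLM helpers, and this simplification omits the paper's stability analysis of the explanations themselves.

```python
def explanation_based_confidence(question, candidate_answers,
                                 sample_explanation, answer_prob, k=10):
    """Explanation-averaged confidence (a sketch under stated assumptions).

    `sample_explanation(question)` samples one chain-of-thought explanation;
    `answer_prob(question, explanation, a)` returns p(a | question, explanation).
    Both are hypothetical wrappers around any LLM API.
    """
    explanations = [sample_explanation(question) for _ in range(k)]
    conf = {}
    for a in candidate_answers:
        # Average the answer's probability over the sampled explanations.
        conf[a] = sum(answer_prob(question, e, a) for e in explanations) / k
    best = max(conf, key=conf.get)
    return best, conf
```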
- REQUAL-LM: Reliability and Equity through Aggregation in Large Language Models [10.684722193666607]
We introduce REQUAL-LM, a novel method for finding reliable and equitable large language model (LLM) outputs through aggregation.
Specifically, we develop a Monte Carlo method based on repeated sampling to find a reliable output close to the mean of the underlying distribution of possible outputs.
We formally define terms such as reliability and bias, and design an equity-aware aggregation that minimizes harmful bias while finding a highly reliable output.
arXiv Detail & Related papers (2024-04-17T22:12:41Z)
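A minimal sketch of the repeated-sampling aggregation, assuming generic `sample_output` and `embed` helpers: the output whose embedding lies closest to the centroid stands in for the mean of the output distribution. The paper's equity-aware weighting is omitted here.

```python
import numpy as np

def monte_carlo_aggregate(prompt, sample_output, embed, n=20):
    """Monte Carlo aggregation sketch: sample n outputs, embed them, and
    return the sample closest to the embedding centroid.

    `sample_output(prompt) -> str` and `embed(text) -> np.ndarray` are
    assumed helpers (any LLM plus any sentence-embedding model would do).
    """
    outputs = [sample_output(prompt) for _ in range(n)]
    E = np.stack([embed(o) for o in outputs])        # (n, d) embedding matrix
    centroid = E.mean(axis=0)                        # mean of the sampled outputs
    dists = np.linalg.norm(E - centroid, axis=1)     # distance of each sample to the mean
    return outputs[int(dists.argmin())]              # most "central" output
```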
- Self-Evaluation Improves Selective Generation in Large Language Models [54.003992911447696]
We reformulate open-ended generation tasks into token-level prediction tasks.
We instruct an LLM to self-evaluate its answers.
We benchmark a range of scoring methods based on self-evaluation.
arXiv Detail & Related papers (2023-12-14T19:09:22Z)
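A small sketch of the token-level self-evaluation idea described above: the model is asked whether its own answer is correct, and the probability of a single option token is read off as the selection score. `token_prob` is an assumed helper exposing next-token probabilities, and the prompt wording is illustrative rather than the paper's.

```python
def self_eval_score(question, proposed_answer, token_prob):
    """P(True)-style self-evaluation score (a sketch under stated assumptions).

    `token_prob(prompt, token) -> float` is a hypothetical helper returning
    the model's next-token probability for `token` given `prompt`.
    """
    prompt = (
        f"Question: {question}\n"
        f"Proposed answer: {proposed_answer}\n"
        "Is the proposed answer correct? Answer A for yes or B for no.\n"
        "Answer:"
    )
    p_yes = token_prob(prompt, " A")
    p_no = token_prob(prompt, " B")
    return p_yes / max(p_yes + p_no, 1e-9)  # normalized self-evaluation score

# Selective generation: only emit answers whose score clears a chosen threshold,
# e.g. answer if self_eval_score(q, ans, token_prob) >= 0.8, otherwise abstain.
```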
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.