Related papers: DebUnc: Improving Large Language Model Agent Communication With Uncertainty Metrics

Related papers

Balancing Faithfulness and Performance in Reasoning via Multi-Listener Soft Execution [79.98699884805636]
Reasoning Execution by Multiple Listeners (REMUL) is a multi-party reinforcement learning approach.<n>REMUL builds on the hypothesis that reasoning traces which other parties can follow will be more faithful.<n>Speakers are rewarded for producing reasoning that is clear to listeners.
arXiv Detail & Related papers (2026-02-18T02:55:55Z)
Recurrent Confidence Chain: Temporal-Aware Uncertainty Quantification in Large Language Models [0.0]
Uncertainty of answers can help prevent misleading or serious hallucinations for users.<n>Current methods analyze long reasoning sequences by filtering unrelated tokens and examining potential connections between nearby tokens or sentences.<n>We propose a novel method that incorporates inter-step attention to analyze semantic correlations across steps.
arXiv Detail & Related papers (2026-01-19T20:04:34Z)
Demystifying Multi-Agent Debate: The Role of Confidence and Diversity [31.236476720977294]
Multi-agent debate (MAD) is widely used to improve large language model (LLM) performance through test-time scaling.<n>Recent work shows that vanilla MAD often underperforms simple majority vote despite higher computational cost.<n>We identify two key mechanisms missing from vanilla MAD: (i) diversity of initial viewpoints and (ii) explicit, calibrated confidence communication.
arXiv Detail & Related papers (2026-01-09T02:38:30Z)
Confidence Estimation for LLMs in Multi-turn Interactions [48.081802290688394]
This work presents the first systematic study of confidence estimation in multi-turn interactions.<n>We establish a formal evaluation framework grounded in two key desideratas: per-turn calibration and monotonicity of confidence.<n>Our work provides a foundational methodology for developing more reliable and trustworthy conversational agents.
arXiv Detail & Related papers (2026-01-05T14:58:04Z)
Towards Scalable Oversight with Collaborative Multi-Agent Debate in Error Detection [81.52796950244705]
Self-diagnosis is unreliable on complex tasks unless aided by reliable external feedback.<n>We introduce a new collaborative MAD protocol, termed ColMAD, that reframes MAD as a non-zero sum game.<n>We show that ColMAD significantly outperforms previous competitive MAD by 19%.
arXiv Detail & Related papers (2025-10-23T19:46:00Z)
Internalizing Self-Consistency in Language Models: Multi-Agent Consensus Alignment [22.305033366660187]
Language Models (LMs) are inconsistent reasoners, often generating contradictory responses to identical prompts.<n>We formalize self-consistency as an intrinsic property of well-aligned reasoning models and introduce Multi-Agent Consensus Alignment (MACA)<n>MACA enables agents to teach themselves to be more decisive and concise, and better leverage peer insights in multi-agent settings without external supervision.
arXiv Detail & Related papers (2025-09-18T17:27:28Z)
Evaluating Language Model Reasoning about Confidential Information [95.64687778185703]
We study whether language models exhibit contextual robustness, or the capability to adhere to context-dependent safety specifications.<n>We develop a benchmark (PasswordEval) that measures whether language models can correctly determine when a user request is authorized.<n>We find that current open- and closed-source models struggle with this seemingly simple task, and that, perhaps surprisingly, reasoning capabilities do not generally improve performance.
arXiv Detail & Related papers (2025-08-27T15:39:46Z)
Verbalized Confidence Triggers Self-Verification: Emergent Behavior Without Explicit Reasoning Supervision [12.287123198288079]
Uncertainty calibration is essential for the safe deployment of large language models (LLMs)<n>We find that supervised fine-tuning with scalar confidence labels alone suffices to elicit self-verification behavior of language models.<n>We propose a simple rethinking method that boosts performance via test-time scaling based on calibrated uncertainty.
arXiv Detail & Related papers (2025-06-04T08:56:24Z)
Debate Only When Necessary: Adaptive Multiagent Collaboration for Efficient LLM Reasoning [8.800516398660069]
Multiagent collaboration has emerged as a promising framework for enhancing the reasoning capabilities of large language models (LLMs)<n>We propose Debate Only When Necessary (DOWN), an adaptive multiagent debate framework that selectively activates debate based on the confidence score of the agent's initial response.<n>Down improves efficiency by up to six times while preserving or even outperforming the performance of existing methods.
arXiv Detail & Related papers (2025-04-07T13:17:52Z)
ReAgent: Reversible Multi-Agent Reasoning for Knowledge-Enhanced Multi-Hop QA [13.386562087058596]
ReAgent is a reversible multi-Agent collaborative framework augmented with explicit backtracking mechanisms. Our system can detect and correct errors mid-reasoning, leading to more robust and interpretable QA outcomes.
arXiv Detail & Related papers (2025-03-10T05:56:46Z)
DiverseAgentEntropy: Quantifying Black-Box LLM Uncertainty through Diverse Perspectives and Multi-Agent Interaction [53.803276766404494]
Existing methods, which gauge a model's uncertainty through evaluating self-consistency in responses to the original query, do not always capture true uncertainty. We propose a novel method, DiverseAgentEntropy, for evaluating a model's uncertainty using multi-agent interaction. Our method offers a more accurate prediction of the model's reliability and further detects hallucinations, outperforming other self-consistency-based methods.
arXiv Detail & Related papers (2024-12-12T18:52:40Z)
Cycles of Thought: Measuring LLM Confidence through Stable Explanations [53.15438489398938]
Large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, but their overconfidence in incorrect responses is still a well-documented failure mode. We propose a framework for measuring an LLM's uncertainty with respect to the distribution of generated explanations for an answer.
arXiv Detail & Related papers (2024-06-05T16:35:30Z)
Semantic Density: Uncertainty Quantification for Large Language Models through Confidence Measurement in Semantic Space [14.715989394285238]
Existing Large Language Models (LLMs) do not have an inherent functionality to provide the users with an uncertainty/confidence metric for each response it generates. A new framework is proposed in this paper to address these issues. Semantic density extracts uncertainty/confidence information for each response from a probability distribution perspective in semantic space.
arXiv Detail & Related papers (2024-05-22T17:13:49Z)
CLAMBER: A Benchmark of Identifying and Clarifying Ambiguous Information Needs in Large Language Models [60.59638232596912]
We introduce CLAMBER, a benchmark for evaluating large language models (LLMs) Building upon the taxonomy, we construct 12K high-quality data to assess the strengths, weaknesses, and potential risks of various off-the-shelf LLMs. Our findings indicate the limited practical utility of current LLMs in identifying and clarifying ambiguous user queries.
arXiv Detail & Related papers (2024-05-20T14:34:01Z)
Uncertainty in Language Models: Assessment through Rank-Calibration [65.10149293133846]
Language Models (LMs) have shown promising performance in natural language generation. It is crucial to correctly quantify their uncertainty in responding to given inputs. We develop a novel and practical framework, termed $Rank$-$Calibration$, to assess uncertainty and confidence measures for LMs.
arXiv Detail & Related papers (2024-04-04T02:31:05Z)
Multi-Perspective Consistency Enhances Confidence Estimation in Large Language Models [27.63938857490995]
This work focuses on improving the confidence estimation of large language models. Considering the fragility of self-awareness in language models, we introduce a Multi-Perspective Consistency (MPC) method. The experimental results on eight publicly available datasets show that our MPC achieves state-of-the-art performance.
arXiv Detail & Related papers (2024-02-17T13:37:39Z)
Towards Uncertainty-Aware Language Agent [10.227089771963943]
We present the Uncertainty-Aware Language Agent (UALA), a framework that orchestrates the interaction between the agent and the external world using uncertainty quantification. Our experiments demonstrate that UALA brings a significant improvement of performance, while having a substantially lower reliance on the external world.
arXiv Detail & Related papers (2024-01-25T08:48:21Z)
Should we be going MAD? A Look at Multi-Agent Debate Strategies for LLMs [7.7433783185451075]
We benchmark a range of debating and prompting strategies to explore the trade-offs between cost, time, and accuracy. We find that multi-agent debating systems, in their current form, do not reliably outperform other proposed prompting strategies. We build on these results to offer insights into improving debating strategies, such as adjusting agent agreement levels.
arXiv Detail & Related papers (2023-11-29T05:54:41Z)
Ask Again, Then Fail: Large Language Models' Vacillations in Judgment [28.74246375289661]
We observe that current conversational language models often waver in their judgments when faced with follow-up questions. We introduce a textscFollow-up Questioning Mechanism along with two metrics to quantify this inconsistency. We develop a training-based framework textscUnwavering-FQ that teaches language models to maintain their originally correct judgments.
arXiv Detail & Related papers (2023-10-03T16:08:41Z)
Are Large Language Models Really Robust to Word-Level Perturbations? [68.60618778027694]
We propose a novel rational evaluation approach that leverages pre-trained reward models as diagnostic tools. Longer conversations manifest the comprehensive grasp of language models in terms of their proficiency in understanding questions. Our results demonstrate that LLMs frequently exhibit vulnerability to word-level perturbations that are commonplace in daily language usage.
arXiv Detail & Related papers (2023-09-20T09:23:46Z)
AgentBench: Evaluating LLMs as Agents [88.45506148281379]
Large Language Models (LLMs) are becoming increasingly smart and autonomous, targeting real-world pragmatic missions beyond traditional NLP tasks. We present AgentBench, a benchmark that currently consists of 8 distinct environments to assess LLM-as-Agent's reasoning and decision-making abilities.
arXiv Detail & Related papers (2023-08-07T16:08:11Z)
Generating with Confidence: Uncertainty Quantification for Black-box Large Language Models [37.63939774027709]
Large language models (LLMs) specializing in natural language generation (NLG) have recently started exhibiting promising capabilities. We propose and compare several confidence/uncertainty measures, applying them to *selective NLG* where unreliable results could either be ignored or yielded for further assessment. Results reveal that a simple measure for the semantic dispersion can be a reliable predictor of the quality of LLM responses.
arXiv Detail & Related papers (2023-05-30T16:31:26Z)
Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate [85.3444184685235]
We propose a Multi-Agent Debate (MAD) framework, in which multiple agents express their arguments in the state of "tit for tat" and a judge manages the debate process to obtain a final solution. Our framework encourages divergent thinking in LLMs which would be helpful for tasks that require deep levels of contemplation.
arXiv Detail & Related papers (2023-05-30T15:25:45Z)
Consistency Analysis of ChatGPT [65.268245109828]
This paper investigates the trustworthiness of ChatGPT and GPT-4 regarding logically consistent behaviour. Our findings suggest that while both models appear to show an enhanced language understanding and reasoning ability, they still frequently fall short of generating logically consistent predictions.
arXiv Detail & Related papers (2023-03-11T01:19:01Z)
REAM$\sharp$: An Enhancement Approach to Reference-based Evaluation Metrics for Open-domain Dialog Generation [63.46331073232526]
We present an enhancement approach to Reference-based EvAluation Metrics for open-domain dialogue systems. A prediction model is designed to estimate the reliability of the given reference set. We show how its predicted results can be helpful to augment the reference set, and thus improve the reliability of the metric.
arXiv Detail & Related papers (2021-05-30T10:04:13Z)
Learning to Communicate and Correct Pose Errors [75.03747122616605]
We study the setting proposed in V2VNet, where nearby self-driving vehicles jointly perform object detection and motion forecasting in a cooperative manner. We propose a novel neural reasoning framework that learns to communicate, to estimate potential errors, and to reach a consensus about those errors.
arXiv Detail & Related papers (2020-11-10T18:19:40Z)

This list is automatically generated from the titles and abstracts of the papers in this site.