Related papers: The Confidence Dichotomy: Analyzing and Mitigating Miscalibration in Tool-Use Agents

The Confidence Dichotomy: Analyzing and Mitigating Miscalibration in Tool-Use Agents

URL: http://arxiv.org/abs/2601.07264v1
Date: Mon, 12 Jan 2026 07:10:35 GMT
Title: The Confidence Dichotomy: Analyzing and Mitigating Miscalibration in Tool-Use Agents
Authors: Weihao Xuan, Qingcheng Zeng, Heli Qi, Yunze Xiao, Junjue Wang, Naoto Yokoya,
Abstract summary: Large language models (LLMs) are rapidly evolving to handle multi-turn tasks.<n> Ensuring their trustworthiness remains a critical challenge.<n> calibration refers to an agent's ability to express confidence that reliably reflects its actual performance.
Score: 24.482362292984817
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Autonomous agents based on large language models (LLMs) are rapidly evolving to handle multi-turn tasks, but ensuring their trustworthiness remains a critical challenge. A fundamental pillar of this trustworthiness is calibration, which refers to an agent's ability to express confidence that reliably reflects its actual performance. While calibration is well-established for static models, its dynamics in tool-integrated agentic workflows remain underexplored. In this work, we systematically investigate verbalized calibration in tool-use agents, revealing a fundamental confidence dichotomy driven by tool type. Specifically, our pilot study identifies that evidence tools (e.g., web search) systematically induce severe overconfidence due to inherent noise in retrieved information, while verification tools (e.g., code interpreters) can ground reasoning through deterministic feedback and mitigate miscalibration. To robustly improve calibration across tool types, we propose a reinforcement learning (RL) fine-tuning framework that jointly optimizes task accuracy and calibration, supported by a holistic benchmark of reward designs. We demonstrate that our trained agents not only achieve superior calibration but also exhibit robust generalization from local training environments to noisy web settings and to distinct domains such as mathematical reasoning. Our results highlight the necessity of domain-specific calibration strategies for tool-use agents. More broadly, this work establishes a foundation for building self-aware agents that can reliably communicate uncertainty in high-stakes, real-world deployments.

Related papers

On Calibration of Large Language Models: From Response To Capability [66.59139960234326]
Large language models (LLMs) are widely deployed as general-purpose problem solvers.<n>We introduce capability calibration, which targets the model's expected accuracy on a query.<n>Our results demonstrate that capability-calibrated confidence improves pass@$k$ prediction and inference budget allocation.
arXiv Detail & Related papers (2026-02-14T01:07:45Z)
Agentic Confidence Calibration [67.50096917021521]
Holistic Trajectory (HTC) is a novel diagnostic framework for AI agents.<n>HTC consistently surpasses strong baselines in both calibration and discrimination.<n>HTC provides interpretability by revealing the signals behind failure.
arXiv Detail & Related papers (2026-01-22T09:08:25Z)
From Passive Metric to Active Signal: The Evolving Role of Uncertainty Quantification in Large Language Models [77.04403907729738]
This survey charts the evolution of uncertainty from a passive diagnostic metric to an active control signal guiding real-time model behavior.<n>We demonstrate how uncertainty is leveraged as an active control signal across three frontiers.<n>This survey argues that mastering the new trend of uncertainty is essential for building the next generation of scalable, reliable, and trustworthy AI.
arXiv Detail & Related papers (2026-01-22T06:21:31Z)
ET-Agent: Incentivizing Effective Tool-Integrated Reasoning Agent via Behavior Calibration [68.89572566071575]
ETAgent is a training framework for calibrating agent's tool-use behavior.<n>It is designed to progressively calibrate erroneous behavioral patterns to optimal behaviors.
arXiv Detail & Related papers (2026-01-11T11:05:26Z)
Structured Uncertainty guided Clarification for LLM Agents [126.26213027785813]
LLM agents extend large language models with tool-calling capabilities, but ambiguous user instructions often lead to incorrect invocations and task failures.<n>We introduce a principled formulation of structured uncertainty over tool-call parameters, modeling joint tool-argument clarification as a POMDP with Expected Value of Perfect Information (EVPI) objective for optimal question selection and aspect-based cost modeling to prevent redundancy.<n>Our SAGE-Agent leverages this structured uncertainty to achieve superior efficiency: increasing coverage on ambiguous tasks by 7-39% while reducing clarification questions by 1.5-2.7$times$ compared to strong prompting and uncertainty-based baselines.
arXiv Detail & Related papers (2025-11-11T21:50:44Z)
Improving Large Language Models Function Calling and Interpretability via Guided-Structured Templates [56.73907811047611]
Large language models (LLMs) have demonstrated strong reasoning and tool-use capabilities.<n>LLMs often fail in real-world tool-interactions due to incorrect parameterization, poor tool selection, or misinterpretation of user intent.<n>We introduce a curriculum-inspired framework that leverages structured reasoning templates to guide LLMs through more deliberate step-by-step instructions for generating function callings.
arXiv Detail & Related papers (2025-09-22T17:55:14Z)
Feedback-Driven Tool-Use Improvements in Large Language Models via Automated Build Environments [70.42705564227548]
We propose an automated environment construction pipeline for large language models (LLMs)<n>This enables the creation of high-quality training environments that provide detailed and measurable feedback without relying on external tools.<n>We also introduce a verifiable reward mechanism that evaluates both the precision of tool use and the completeness of task execution.
arXiv Detail & Related papers (2025-08-12T09:45:19Z)
Confidence Calibration in Vision-Language-Action Models [2.230383995796716]
Trustworthy robot behavior requires high levels of task success and reliably quantify how likely it is to succeed.<n>We present the first systematic study of confidence calibration in vision-language-action (VLA) foundation models.
arXiv Detail & Related papers (2025-07-23T10:26:10Z)
More Vulnerable than You Think: On the Stability of Tool-Integrated LLM Agents [24.84276066855418]
This study investigates whether agents are vulnerable to errors throughout the entire tool invocation process.<n>We observe that agents are highly susceptible to errors at each stage and agents based on open-source models are more vulnerable than those based on proprietary models.
arXiv Detail & Related papers (2025-06-27T07:13:29Z)
Verbalized Confidence Triggers Self-Verification: Emergent Behavior Without Explicit Reasoning Supervision [12.287123198288079]
Uncertainty calibration is essential for the safe deployment of large language models (LLMs)<n>We find that supervised fine-tuning with scalar confidence labels alone suffices to elicit self-verification behavior of language models.<n>We propose a simple rethinking method that boosts performance via test-time scaling based on calibrated uncertainty.
arXiv Detail & Related papers (2025-06-04T08:56:24Z)
Aurora: Are Android Malware Classifiers Reliable and Stable under Distribution Shift? [51.12297424766236]
AURORA is a framework to evaluate malware classifiers based on their confidence quality and operational resilience.<n>AURORA is complemented by a set of metrics designed to go beyond point-in-time performance.<n>The fragility in SOTA frameworks across datasets of varying drift suggests the need for a return to the whiteboard.
arXiv Detail & Related papers (2025-05-28T20:22:43Z)
Object-Level Verbalized Confidence Calibration in Vision-Language Models via Semantic Perturbation [26.580361841501514]
Vision-language models (VLMs) excel in various multimodal tasks but frequently suffer from poor calibration.<n>This miscalibration undermines user trust, especially when models confidently provide incorrect or fabricated information.<n>We propose a novel Confidence through Semantic Perturbation (CSP) framework to improve the calibration of verbalized confidence for object-centric queries.
arXiv Detail & Related papers (2025-04-21T04:01:22Z)
Trust, but Verify: Using Self-Supervised Probing to Improve Trustworthiness [29.320691367586004]
We introduce a new approach of self-supervised probing, which enables us to check and mitigate the overconfidence issue for a trained model. We provide a simple yet effective framework, which can be flexibly applied to existing trustworthiness-related methods in a plug-and-play manner.
arXiv Detail & Related papers (2023-02-06T08:57:20Z)

This list is automatically generated from the titles and abstracts of the papers in this site.