Related papers: Agentic Confidence Calibration

URL: http://arxiv.org/abs/2601.15778v1
Date: Thu, 22 Jan 2026 09:08:25 GMT
Title: Agentic Confidence Calibration
Authors: Jiaxin Zhang, Caiming Xiong, Chien-Sheng Wu
Abstract summary: Holistic Trajectory Calibration (HTC) is a novel diagnostic framework for AI agents. HTC consistently surpasses strong baselines in both calibration and discrimination. HTC provides interpretability by revealing the signals behind failure.
Score: 67.50096917021521
License: http://creativecommons.org/licenses/by/4.0/
Abstract: AI agents are rapidly advancing from passive language models to autonomous systems executing complex, multi-step tasks. Yet their overconfidence in failure remains a fundamental barrier to deployment in high-stakes settings. Existing calibration methods, built for static single-turn outputs, cannot address the unique challenges of agentic systems, such as compounding errors along trajectories, uncertainty from external tools, and opaque failure modes. To address these challenges, we introduce, for the first time, the problem of Agentic Confidence Calibration and propose Holistic Trajectory Calibration (HTC), a novel diagnostic framework that extracts rich process-level features ranging from macro dynamics to micro stability across an agent's entire trajectory. Powered by a simple, interpretable model, HTC consistently surpasses strong baselines in both calibration and discrimination, across eight benchmarks, multiple LLMs, and diverse agent frameworks. Beyond performance, HTC delivers three essential advances: it provides interpretability by revealing the signals behind failure, enables transferability by applying across domains without retraining, and achieves generalization through a General Agent Calibrator (GAC) that achieves the best calibration (lowest ECE) on the out-of-domain GAIA benchmark. Together, these contributions establish a new process-centric paradigm for confidence calibration, providing a framework for diagnosing and enhancing the reliability of AI agents.
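
The abstract reports calibration quality as ECE (expected calibration error). As a rough illustration of that metric only, here is a minimal Python sketch of the common equal-width binned ECE. It is not the paper's HTC or GAC implementation; the function name, the choice of 10 bins, and the sample values are assumptions made for this example.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    outcomes: Sequence[int],
    n_bins: int = 10,  # assumed default; the listing does not state a bin count
) -> float:
    """Equal-width binned ECE: weighted mean of |accuracy - confidence| per bin."""
    if len(confidences) != len(outcomes):
        raise ValueError("confidences and outcomes must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("at least one prediction is required")

    # Group (confidence, outcome) pairs into equal-width bins over [0, 1].
    bins = [[] for _ in range(n_bins)]
    for conf, outcome in zip(confidences, outcomes):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, outcome))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(o for _, o in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


# Hypothetical trajectories: the agent reports 0.9 confidence on four tasks
# and succeeds on three of them.
example_ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0])
```

A lower value means stated confidence tracks the observed success rate more closely; an agent that is overconfident in failed trajectories would score higher.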

Related papers

Agentic Uncertainty Quantification [76.94013626702183]
We propose a unified Dual-Process Agentic UQ (AUQ) framework that transforms verbalized uncertainty into active, bi-directional control signals. Our architecture comprises two complementary mechanisms: System 1 (Uncertainty-Aware Memory, UAM), which implicitly propagates verbalized confidence and semantic explanations to prevent blind decision-making; and System 2 (Uncertainty-Aware Reflection, UAR), which utilizes these explanations as rational cues to trigger targeted inference-time resolution only when necessary.
arXiv Detail & Related papers (2026-01-22T07:16:26Z)
AI-NativeBench: An Open-Source White-Box Agentic Benchmark Suite for AI-Native Systems [52.65695508605237]
We introduce AI-NativeBench, the first application-centric and white-box AI-Native benchmark suite grounded in Model Context Protocol (MCP) and Agent-to-Agent (A2A) standards. By treating agentic spans as first-class citizens within distributed traces, our methodology enables granular analysis of engineering characteristics beyond simple capabilities. This work provides the first systematic evidence to guide the transition from measuring model capability to engineering reliable AI-Native systems.
arXiv Detail & Related papers (2026-01-14T11:32:07Z)
The Confidence Dichotomy: Analyzing and Mitigating Miscalibration in Tool-Use Agents [24.482362292984817]
Large language models (LLMs) are rapidly evolving to handle multi-turn tasks. Ensuring their trustworthiness remains a critical challenge. Calibration refers to an agent's ability to express confidence that reliably reflects its actual performance.
arXiv Detail & Related papers (2026-01-12T07:10:35Z)
AgentDevel: Reframing Self-Evolving LLM Agents as Release Engineering [8.201374511929538]
AgentDevel is a release engineering pipeline that iteratively runs the current agent. It produces implementation-blind, symptom-level quality signals from execution traces. It aggregates dominant symptom patterns and produces auditable engineering specifications.
arXiv Detail & Related papers (2026-01-08T05:49:01Z)
Reliable LLM-Based Edge-Cloud-Expert Cascades for Telecom Knowledge Systems [54.916243942641444]
Large language models (LLMs) are emerging as key enablers of automation in domains such as telecommunications. We study an edge-cloud-expert cascaded LLM-based knowledge system that supports decision-making through a question-and-answer pipeline.
arXiv Detail & Related papers (2025-12-23T03:10:09Z)
Refine and Align: Confidence Calibration through Multi-Agent Interaction in VQA [8.879640496857661]
We introduce AlignVQA, a debate-based multi-agent framework for Visual Question Answering. We find that more calibrated specialized agents produce better aligned confidences. We also introduce a novel differentiable calibration-aware loss function called aligncal to fine-tune the specialized agents.
arXiv Detail & Related papers (2025-11-14T11:08:21Z)
An Agentic Framework with LLMs for Solving Complex Vehicle Routing Problems [66.60904891478687]
We propose an Agentic Framework with LLMs (AFL) for solving complex vehicle routing problems. AFL directly extracts knowledge from raw inputs and enables self-contained code generation. We show that AFL substantially outperforms existing LLM-based baselines in both code reliability and solution feasibility.
arXiv Detail & Related papers (2025-10-19T03:59:25Z)
Metacognitive Self-Correction for Multi-Agent System via Prototype-Guided Next-Execution Reconstruction [58.51530390018909]
Large Language Model based multi-agent systems excel at collaborative problem solving but remain brittle to cascading errors. We present MASC, a metacognitive framework that endows MAS with real-time, unsupervised, step-level error detection and self-correction.
arXiv Detail & Related papers (2025-10-16T05:35:37Z)
Automatic Failure Attribution and Critical Step Prediction Method for Multi-Agent Systems Based on Causal Inference [8.823529310904162]
Multi-agent systems (MAS) are critical for automating complex tasks, yet their practical deployment is hampered by the challenge of failure attribution. We introduce the first failure attribution framework for MAS grounded in multi-granularity causal inference.
arXiv Detail & Related papers (2025-09-10T15:22:00Z)
Conformal Calibration: Ensuring the Reliability of Black-Box AI in Wireless Systems [36.407171992845456]
The paper reviews conformal calibration, a general framework that moves beyond the state of the art by adopting computationally lightweight, advanced statistical tools. By weaving conformal calibration into the AI model lifecycle, network operators can establish confidence in black-box AI models as a dependable enabling technology for wireless systems.
arXiv Detail & Related papers (2025-04-12T19:05:00Z)