Refine and Align: Confidence Calibration through Multi-Agent Interaction in VQA
- URL: http://arxiv.org/abs/2511.11169v1
- Date: Fri, 14 Nov 2025 11:08:21 GMT
- Title: Refine and Align: Confidence Calibration through Multi-Agent Interaction in VQA
- Authors: Ayush Pandey, Jai Bardhan, Ishita Jain, Ramya S Hebbalaguppe, Rohan Raju Dhanakshirur, Lovekesh Vig,
- Abstract summary: We introduce AlignVQA, a debate-based multi-agent framework for Visual Question Answering.<n>We find that more calibrated specialized agents produce better aligned confidences.<n>We also introduce a novel differentiable calibration-aware loss function called aligncal to fine-tune the specialized agents.
- Score: 8.879640496857661
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In the context of Visual Question Answering (VQA) and Agentic AI, calibration refers to how closely an AI system's confidence in its answers reflects their actual correctness. This aspect becomes especially important when such systems operate autonomously and must make decisions under visual uncertainty. While modern VQA systems, powered by advanced vision-language models (VLMs), are increasingly used in high-stakes domains like medical diagnostics and autonomous navigation due to their improved accuracy, the reliability of their confidence estimates remains under-examined. Particularly, these systems often produce overconfident responses. To address this, we introduce AlignVQA, a debate-based multi-agent framework, in which diverse specialized VLM -- each following distinct prompting strategies -- generate candidate answers and then engage in two-stage interaction: generalist agents critique, refine and aggregate these proposals. This debate process yields confidence estimates that more accurately reflect the model's true predictive performance. We find that more calibrated specialized agents produce better aligned confidences. Furthermore, we introduce a novel differentiable calibration-aware loss function called aligncal designed to fine-tune the specialized agents by minimizing an upper bound on the calibration error. This objective explicitly improves the fidelity of each agent's confidence estimates. Empirical results across multiple benchmark VQA datasets substantiate the efficacy of our approach, demonstrating substantial reductions in calibration discrepancies. Furthermore, we propose a novel differentiable calibration-aware loss to fine-tune the specialized agents and improve the quality of their individual confidence estimates based on minimising upper bound calibration error.
Related papers
- On Calibration of Large Language Models: From Response To Capability [66.59139960234326]
Large language models (LLMs) are widely deployed as general-purpose problem solvers.<n>We introduce capability calibration, which targets the model's expected accuracy on a query.<n>Our results demonstrate that capability-calibrated confidence improves pass@$k$ prediction and inference budget allocation.
arXiv Detail & Related papers (2026-02-14T01:07:45Z) - Agentic Confidence Calibration [67.50096917021521]
Holistic Trajectory (HTC) is a novel diagnostic framework for AI agents.<n>HTC consistently surpasses strong baselines in both calibration and discrimination.<n>HTC provides interpretability by revealing the signals behind failure.
arXiv Detail & Related papers (2026-01-22T09:08:25Z) - Confidence Estimation for LLMs in Multi-turn Interactions [48.081802290688394]
This work presents the first systematic study of confidence estimation in multi-turn interactions.<n>We establish a formal evaluation framework grounded in two key desideratas: per-turn calibration and monotonicity of confidence.<n>Our work provides a foundational methodology for developing more reliable and trustworthy conversational agents.
arXiv Detail & Related papers (2026-01-05T14:58:04Z) - BrowseConf: Confidence-Guided Test-Time Scaling for Web Agents [58.05949210993854]
We investigate whether search agents have the ability to communicate their own confidence through verbalized confidence scores after long sequences of actions.<n>We propose Test-Time Scaling (TTS) methods that use confidence scores to determine answer quality, encourage the model to try again until reaching a satisfactory confidence level.
arXiv Detail & Related papers (2025-10-27T15:58:51Z) - Confidence Calibration in Vision-Language-Action Models [2.230383995796716]
Trustworthy robot behavior requires high levels of task success and reliably quantify how likely it is to succeed.<n>We present the first systematic study of confidence calibration in vision-language-action (VLA) foundation models.
arXiv Detail & Related papers (2025-07-23T10:26:10Z) - Prompt4Trust: A Reinforcement Learning Prompt Augmentation Framework for Clinically-Aligned Confidence Calibration in Multimodal Large Language Models [4.035767214877404]
We introduce Prompt4Trust, the first reinforcement learning framework for prompt augmentation targeting confidence calibration in MLLMs.<n>Unlike conventional calibration techniques, Prompt4Trust specifically prioritizes aspects of calibration most critical for safe and trustworthy clinical decision-making.<n>Our framework showed promising zero-shot generalizations to larger MLLMs in our experiments.
arXiv Detail & Related papers (2025-07-12T13:21:10Z) - Calibrating Uncertainty Quantification of Multi-Modal LLMs using Grounding [48.92310906093414]
We introduce a novel approach for calibrating uncertainty quantification (UQ) tailored for multi-modal large language models (LLMs)<n>We leverage cross-modal consistency in addition to self-consistency to improve the calibration of the multi-modal models.<n>We evaluate the proposed approach across multiple multi-modal tasks, such as medical question answering (Slake) and visual question answering (VQAv2), considering multi-modal models such as LLaVA-Med and LLaVA.
arXiv Detail & Related papers (2025-04-30T19:19:21Z) - Object-Level Verbalized Confidence Calibration in Vision-Language Models via Semantic Perturbation [26.580361841501514]
Vision-language models (VLMs) excel in various multimodal tasks but frequently suffer from poor calibration.<n>This miscalibration undermines user trust, especially when models confidently provide incorrect or fabricated information.<n>We propose a novel Confidence through Semantic Perturbation (CSP) framework to improve the calibration of verbalized confidence for object-centric queries.
arXiv Detail & Related papers (2025-04-21T04:01:22Z) - Mind the Confidence Gap: Overconfidence, Calibration, and Distractor Effects in Large Language Models [0.6091702876917281]
Large Language Models (LLMs) show remarkable proficiency in natural language tasks.<n>Overconfidence-misalignment between predicted confidence and true correctness poses significant risks in critical decision-making applications.<n>We present a comprehensive analysis on calibration in LLMs across nine LLMs and three factual Question-Answering datasets.
arXiv Detail & Related papers (2025-02-16T07:46:09Z) - Dual Ensembled Multiagent Q-Learning with Hypernet Regularizer [62.01554688056335]
Overestimation in the multiagent setting has received comparatively little attention.<n>We propose a novel hypernet regularizer on hypernetwork weights and biases to constrain the optimization of online global Q-network to prevent overestimation accumulation.
arXiv Detail & Related papers (2025-02-04T05:14:58Z) - Enhancing Healthcare LLM Trust with Atypical Presentations Recalibration [20.049443396032423]
Black-box large language models (LLMs) are increasingly deployed in various environments.
LLMs often exhibit overconfidence, leading to potential risks and misjudgments.
We propose a novel method, textitAtypical presentations Recalibration, which leverages atypical presentations to adjust the model's confidence estimates.
arXiv Detail & Related papers (2024-09-05T03:45:35Z) - When to Trust LLMs: Aligning Confidence with Response Quality [49.371218210305656]
We propose CONfidence-Quality-ORDer-preserving alignment approach (CONQORD)
It integrates quality reward and order-preserving alignment reward functions.
Experiments demonstrate that CONQORD significantly improves the alignment performance between confidence and response accuracy.
arXiv Detail & Related papers (2024-04-26T09:42:46Z) - Beyond calibration: estimating the grouping loss of modern neural
networks [68.8204255655161]
Proper scoring rule theory shows that given the calibration loss, the missing piece to characterize individual errors is the grouping loss.
We show that modern neural network architectures in vision and NLP exhibit grouping loss, notably in distribution shifts settings.
arXiv Detail & Related papers (2022-10-28T07:04:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.