Black-Box Reliability Certification for AI Agents via Self-Consistency Sampling and Conformal Calibration
- URL: http://arxiv.org/abs/2602.21368v1
- Date: Tue, 24 Feb 2026 21:03:50 GMT
- Title: Black-Box Reliability Certification for AI Agents via Self-Consistency Sampling and Conformal Calibration
- Authors: Charafeddine Mouzouni
- Abstract summary: We answer with a reliability level -- a single number per system-task pair. Self-consistency sampling reduces uncertainty exponentially; conformal calibration guarantees correctness within 1/(n+1) of the target level.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Given a black-box AI system and a task, at what confidence level can a practitioner trust the system's output? We answer with a reliability level -- a single number per system-task pair, derived from self-consistency sampling and conformal calibration, that serves as a black-box deployment gate with exact, finite-sample, distribution-free guarantees. Self-consistency sampling reduces uncertainty exponentially; conformal calibration guarantees correctness within 1/(n+1) of the target level, regardless of the system's errors, which are made transparently visible through larger answer sets for harder questions. Weaker models earn lower reliability levels (not accuracy -- see Definition 2.4): GPT-4.1 earns 94.6% on GSM8K and 96.8% on TruthfulQA, while GPT-4.1-nano earns 89.8% on GSM8K and 66.5% on MMLU. We validate across five benchmarks, five models from three families, and both synthetic and real data. Conditional coverage on solvable items exceeds 0.93 across all configurations; sequential stopping reduces API costs by around 50%.
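To make the pipeline concrete, the following is a minimal sketch, assuming a majority-vote self-consistency score, standard split conformal calibration, and a simplified sequential stopping rule; the names (`nonconformity`, `calibrate`, `prediction_set`, `sample_until_stable`, `ask`) are illustrative and not the authors' API.

```python
# Minimal sketch (not the authors' code): self-consistency sampling scores each
# candidate answer, and split conformal calibration turns those scores into a
# deployment gate whose coverage is exact up to 1/(n+1).
import math
from collections import Counter

def nonconformity(samples, answer):
    """1 minus the fraction of self-consistency samples agreeing with `answer`;
    strong agreement means a low score and low uncertainty."""
    return 1.0 - Counter(samples).get(answer, 0) / len(samples)

def calibrate(cal_items, alpha=0.05):
    """Split conformal calibration over (samples, gold_answer) pairs.
    With n calibration items, marginal coverage of the resulting gate lies in
    [1 - alpha, 1 - alpha + 1/(n + 1)] under exchangeability."""
    scores = sorted(nonconformity(s, g) for s, g in cal_items)
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    return scores[k - 1] if k <= n else float("inf")

def prediction_set(samples, threshold):
    """Deployment gate: keep every candidate whose nonconformity is within the
    calibrated threshold; harder questions (more disagreement) yield larger sets."""
    return {a for a in set(samples) if nonconformity(samples, a) <= threshold}

def sample_until_stable(ask, max_samples=20, margin=5):
    """Toy sequential stopping: stop once the leading answer is `margin` votes
    ahead, saving API calls on easy items."""
    samples = []
    for _ in range(max_samples):
        samples.append(ask())  # `ask` draws one sampled answer from the system
        counts = Counter(samples).most_common(2)
        lead = counts[0][1] - (counts[1][1] if len(counts) > 1 else 0)
        if lead >= margin:
            break
    return samples
```

With only a handful of calibration items, the rank ceil((n+1)(1-alpha)) exceeds n and the gate degenerates to the trivial all-answer set, which is the correct finite-sample behaviour; a larger calibration set tightens the threshold.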
Related papers
- When Shallow Wins: Silent Failures and the Depth-Accuracy Paradox in Latent Reasoning [16.505918019260964]
We demonstrate that state-of-the-art models (Qwen2.5-Math-7B) achieve 61% accuracy through a mixture of reliable and unreliable predictions. We show that 18.4% of correct predictions employ stable, faithful reasoning while 81.6% emerge through computationally inconsistent pathways.
arXiv Detail & Related papers (2026-03-03T19:43:36Z) - MultiVer: Zero-Shot Multi-Agent Vulnerability Detection [0.0]
MultiVer is a zero-shot multi-agent system for vulnerability detection that achieves state-of-the-art recall without fine-tuning. A four-agent ensemble with union voting achieves 82.7% recall on PyVul, exceeding fine-tuned GPT-3.5 (81.3%) by 1.4 percentage points.
arXiv Detail & Related papers (2026-02-19T22:20:17Z) - CoRefine: Confidence-Guided Self-Refinement for Adaptive Test-Time Compute [10.548368675645403]
CoRefine is a confidence-guided self-refinement method that achieves competitive accuracy using a fraction of the tokens. The controller consumes full-trace confidence to decide whether to halt, re-examine, or try a different approach. We extend this to CoRefine-Tree, a hybrid sequential-parallel variant that adaptively balances exploration and exploitation.
arXiv Detail & Related papers (2026-02-09T17:44:41Z) - Agentic Confidence Calibration [67.50096917021521]
Holistic Trajectory Calibration (HTC) is a novel diagnostic framework for AI agents. HTC consistently surpasses strong baselines in both calibration and discrimination. HTC provides interpretability by revealing the signals behind failure.
arXiv Detail & Related papers (2026-01-22T09:08:25Z) - Causal Judge Evaluation: Calibrated Surrogate Metrics for LLM Systems [0.29465623430708904]
Uncalibrated scores can invert preferences, naive confidence intervals on uncalibrated scores achieve near-0% coverage, and importance-weighted estimators collapse under limited overlap. We introduce Causal Judge Evaluation, a framework that fixes all three failures.
arXiv Detail & Related papers (2025-12-11T22:16:24Z) - Annotation-Efficient Universal Honesty Alignment [70.05453324928955]
Existing methods either rely on training-free confidence estimation or training-based calibration with correctness annotations. Elicitation-Then-Calibration (EliCal) is a two-stage framework that first elicits internal confidence using inexpensive self-consistency supervision, then calibrates this confidence with a small set of correctness annotations. EliCal achieves near-optimal alignment with only 1k correctness annotations (0.18% of full supervision) and better alignment performance on unseen MMLU tasks than the calibration-only baseline.
arXiv Detail & Related papers (2025-10-20T13:05:22Z) - Unsupervised Conformal Inference: Bootstrapping and Alignment to Control LLM Uncertainty [49.19257648205146]
We propose an unsupervised conformal inference framework for generation. Our gates achieve close-to-nominal coverage and provide tighter, more stable thresholds than split UCP. The result is a label-free, API-compatible gate for test-time filtering.
arXiv Detail & Related papers (2025-09-26T23:40:47Z) - A Confidence-Diversity Framework for Calibrating AI Judgement in Accessible Qualitative Coding Tasks [0.0]
Confidence-diversity calibration is a quality assessment framework for accessible coding tasks. Analysing 5,680 coding decisions from eight state-of-the-art LLMs, we find that mean self-confidence tracks inter-model agreement closely.
arXiv Detail & Related papers (2025-08-04T03:47:10Z) - The Confidence Paradox: Can LLM Know When It's Wrong [5.445980143646736]
Document Visual Question Answering (DocVQA) models often produce overconfident or ethically misaligned responses. We propose HonestVQA, a model-agnostic, self-supervised framework that aligns model confidence with correctness using weighted loss and contrastive learning.
arXiv Detail & Related papers (2025-06-30T02:06:54Z) - Robust Conformal Prediction with a Single Binary Certificate [58.450154976190795]
Conformal prediction (CP) converts any model's output to prediction sets with a guarantee to cover the true label with (adjustable) high probability. We propose a robust conformal prediction method that produces smaller sets even with significantly fewer MC samples.
arXiv Detail & Related papers (2025-03-07T08:41:53Z) - ConU: Conformal Uncertainty in Large Language Models with Correctness Coverage Guarantees [68.33498595506941]
We introduce a novel uncertainty measure based on self-consistency theory.
We then develop a conformal uncertainty criterion by integrating the uncertainty condition aligned with correctness into the CP algorithm.
Empirical evaluations indicate that our uncertainty measure outperforms prior state-of-the-art methods.
arXiv Detail & Related papers (2024-06-29T17:33:07Z) - Llamas Know What GPTs Don't Show: Surrogate Models for Confidence Estimation [70.27452774899189]
Large language models (LLMs) should signal low confidence on examples where they are incorrect, instead of misleading the user.
As of November 2023, state-of-the-art LLMs do not provide access to their internal probabilities.
Our best method composing linguistic confidences and surrogate model probabilities gives state-of-the-art confidence estimates on all 12 datasets.
arXiv Detail & Related papers (2023-11-15T11:27:44Z) - Beyond calibration: estimating the grouping loss of modern neural networks [68.8204255655161]
Proper scoring rule theory shows that given the calibration loss, the missing piece to characterize individual errors is the grouping loss.
We show that modern neural network architectures in vision and NLP exhibit grouping loss, notably in distribution shift settings.
arXiv Detail & Related papers (2022-10-28T07:04:20Z)