Statistical Confidence in Functional Correctness: An Approach for AI Product Functional Correctness Evaluation
- URL: http://arxiv.org/abs/2602.18357v1
- Date: Fri, 20 Feb 2026 17:06:38 GMT
- Title: Statistical Confidence in Functional Correctness: An Approach for AI Product Functional Correctness Evaluation
- Authors: Wallace Albertini, Marina Condé Araújo, Júlia Condé Araújo, Antonio Pedro Santos Alves, Marcos Kalinowski
- Abstract summary: This paper proposes and evaluates the Statistical Confidence in Functional Correctness (SCFC) approach. The approach consists of four steps: defining quantitative specification limits, performing stratified and probabilistic sampling, applying bootstrapping to estimate a confidence interval for the performance metric, and calculating a capability index as a final indicator. We conclude that the proposed approach is a feasible and valuable way to operationalize the assessment of functional correctness, moving the evaluation from a point estimate to a statement of statistical confidence.
- Score: 1.4521584395164622
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The quality assessment of Artificial Intelligence (AI) systems is a fundamental challenge due to their inherently probabilistic nature. Standards such as ISO/IEC 25059 provide a quality model, but they lack practical and statistically robust methods for assessing functional correctness. This paper proposes and evaluates the Statistical Confidence in Functional Correctness (SCFC) approach, which seeks to fill this gap by connecting business requirements to a measure of statistical confidence that considers both the model's average performance and its variability. The approach consists of four steps: defining quantitative specification limits, performing stratified and probabilistic sampling, applying bootstrapping to estimate a confidence interval for the performance metric, and calculating a capability index as a final indicator. The approach was evaluated through a case study of two real-world AI systems in industry, involving interviews with AI experts. Valuable insights were collected from the experts regarding the utility, ease of use, and intention to adopt the methodology in practical scenarios. We conclude that the proposed approach is a feasible and valuable way to operationalize the assessment of functional correctness, moving the evaluation from a point estimate to a statement of statistical confidence.
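The four steps lend themselves to a compact sketch. Below is a minimal Python illustration of the pipeline, assuming accuracy as the performance metric, a single lower specification limit, proportional allocation across strata, a percentile bootstrap, and a one-sided Cpk-style capability index computed on the bootstrap distribution; the function names, limit value, index formula, and synthetic data are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(42)

# Step 1: define a quantitative specification limit for the chosen metric.
# Here we assume a lower specification limit (LSL) only: accuracy >= 0.90.
LSL = 0.90

# Step 2: stratified, probabilistic sampling of evaluation items.
# `population` maps stratum name -> pool of graded outputs (1 = correct).
def stratified_sample(population: dict, n_total: int) -> np.ndarray:
    total = sum(len(pool) for pool in population.values())
    parts = []
    for pool in population.values():
        # Proportional allocation (rounding may shift the total by a unit).
        n_s = round(n_total * len(pool) / total)
        parts.append(rng.choice(pool, size=n_s, replace=False))
    return np.concatenate(parts)

# Step 3: bootstrap a confidence interval for the performance metric.
def bootstrap_ci(outcomes: np.ndarray, n_boot: int = 10_000, alpha: float = 0.05):
    stats = np.array([rng.choice(outcomes, size=len(outcomes), replace=True).mean()
                      for _ in range(n_boot)])
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return stats, (lo, hi)

# Step 4: a one-sided, Cpk-style capability index against the LSL.
# Applying it to the bootstrap distribution is an assumption of this sketch;
# the paper may define its index differently.
def capability_index(stats: np.ndarray, lsl: float) -> float:
    return (stats.mean() - lsl) / (3 * stats.std(ddof=1))

# Synthetic usage: two strata of differing difficulty.
population = {
    "easy": rng.binomial(1, 0.97, size=5_000),
    "hard": rng.binomial(1, 0.91, size=2_000),
}
outcomes = stratified_sample(population, n_total=400)
stats, (lo, hi) = bootstrap_ci(outcomes)
print(f"95% CI for accuracy: [{lo:.3f}, {hi:.3f}]")
print(f"capability index vs LSL={LSL}: {capability_index(stats, LSL):.2f}")
```

Under these assumptions, an index at or above 1 suggests the metric clears the specification limit with margin, turning a point estimate into the kind of statement of statistical confidence the abstract describes.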
Related papers
- What Does Your Benchmark Really Measure? A Framework for Robust Inference of AI Capabilities [0.773472615056109]
Evaluations of generative models on benchmark data are now ubiquitous. Yet growing skepticism surrounds their reliability. How can we know that a reported accuracy genuinely reflects a model's true performance? We make this step explicit by proposing a principled framework for evaluation as inference.
arXiv Detail & Related papers (2025-09-23T21:29:04Z)
- Get Global Guarantees: On the Probabilistic Nature of Perturbation Robustness [10.738378139028976]
In safety-critical deep learning applications, robustness measures the ability of neural models to handle imperceptible perturbations in input data. Existing pre-deployment robustness assessment methods typically suffer from significant trade-offs between computational cost and measurement precision. We propose tower robustness, a novel, practical metric based on hypothesis testing, to evaluate robustness.
arXiv Detail & Related papers (2025-08-26T16:41:04Z)
- Advancing Embodied Agent Security: From Safety Benchmarks to Input Moderation [52.83870601473094]
Embodied agents exhibit immense potential across a multitude of domains. Existing research predominantly concentrates on the security of general large language models. This paper introduces a novel input moderation framework, meticulously designed to safeguard embodied agents.
arXiv Detail & Related papers (2025-04-22T08:34:35Z)
- MCQA-Eval: Efficient Confidence Evaluation in NLG with Gold-Standard Correctness Labels [16.300463494913593]
Large Language Models (LLMs) require robust confidence estimation. MCQA-Eval is an evaluation framework for assessing confidence measures in Natural Language Generation.
arXiv Detail & Related papers (2025-02-20T05:09:29Z)
- Know Where You're Uncertain When Planning with Multimodal Foundation Models: A Formal Framework [54.40508478482667]
We present a comprehensive framework to disentangle, quantify, and mitigate uncertainty in perception and plan generation. We propose methods tailored to the unique properties of perception and decision-making. We show that our uncertainty disentanglement framework reduces variability by up to 40% and enhances task success rates by 5% compared to baselines.
arXiv Detail & Related papers (2024-11-03T17:32:00Z)
- A Probabilistic Perspective on Unlearning and Alignment for Large Language Models [48.96686419141881]
We introduce the first formal probabilistic evaluation framework for Large Language Models (LLMs). Namely, we propose novel metrics with high-probability guarantees concerning the output distribution of a model. Our metrics are application-independent and allow practitioners to make more reliable estimates about model capabilities before deployment.
arXiv Detail & Related papers (2024-10-04T15:44:23Z)
- "A Good Bot Always Knows Its Limitations": Assessing Autonomous System Decision-making Competencies through Factorized Machine Self-confidence [5.167803438665586]
This paper presents the Factorized Machine Self-confidence (FaMSeC) framework, which holistically considers several major factors driving competency in algorithmic decision-making. In FaMSeC, self-confidence indicators are derived via 'problem-solving statistics' embedded in Markov decision process solvers. We include detailed descriptions and examples for Markov decision process agents, and show how outcome assessment and solver quality factors can be found for a range of tasking contexts.
arXiv Detail & Related papers (2024-07-29T01:22:04Z)
- Stratified Prediction-Powered Inference for Hybrid Language Model Evaluation [62.2436697657307]
Prediction-powered inference (PPI) is a method that improves statistical estimates based on limited human-labeled data. We propose a method called Stratified Prediction-Powered Inference (StratPPI). We show that the basic PPI estimates can be considerably improved by employing simple data stratification strategies. (A minimal sketch of the basic PPI idea appears after this list.)
arXiv Detail & Related papers (2024-06-06T17:37:39Z)
- Functional trustworthiness of AI systems by statistically valid testing [7.717286312400472]
The authors are concerned about the safety, health, and rights of European citizens due to inadequate measures and procedures required by the current draft of the EU Artificial Intelligence (AI) Act.
We observe that not only the current draft of the EU AI Act, but also the accompanying standardization efforts in CEN/CENELEC, have resorted to the position that real functional guarantees of AI systems would supposedly be unrealistic and too complex anyway.
arXiv Detail & Related papers (2023-10-04T11:07:52Z)
- Position: AI Evaluation Should Learn from How We Test Humans [65.36614996495983]
We argue that psychometrics, a theory of human assessment that originated in the 20th century, could be a powerful solution to the challenges in today's AI evaluations.
arXiv Detail & Related papers (2023-06-18T09:54:33Z)
- CoinDICE: Off-Policy Confidence Interval Estimation [107.86876722777535]
We study high-confidence behavior-agnostic off-policy evaluation in reinforcement learning.
We show in a variety of benchmarks that the confidence interval estimates are tighter and more accurate than existing methods.
arXiv Detail & Related papers (2020-10-22T12:39:11Z)
- Interpretable Off-Policy Evaluation in Reinforcement Learning by Highlighting Influential Transitions [48.91284724066349]
Off-policy evaluation in reinforcement learning offers the chance of using observational data to improve future outcomes in domains such as healthcare and education.
Traditional measures such as confidence intervals may be insufficient due to noise, limited data, and confounding.
We develop a method that could serve as a hybrid human-AI system, enabling human experts to analyze the validity of policy evaluation estimates.
arXiv Detail & Related papers (2020-02-10T00:26:43Z)
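For the StratPPI entry above, here is a minimal, hypothetical sketch of the basic prediction-powered inference (PPI) mean estimator and the simple proportional stratification idea the abstract alludes to; the function names, weighting scheme, and synthetic data are assumptions for illustration, not the authors' exact estimator.

```python
import numpy as np

rng = np.random.default_rng(0)

def ppi_mean(preds_unlabeled, preds_labeled, labels):
    # Basic PPI mean: many cheap model predictions, debiased by the
    # residual (label - prediction) measured on a small human-labeled set.
    return preds_unlabeled.mean() + (labels - preds_labeled).mean()

def stratified_ppi_mean(strata):
    # strata: iterable of (weight, preds_unlabeled, preds_labeled, labels),
    # with weights summing to 1 (population share of each stratum).
    return sum(w * ppi_mean(pu, pl, y) for w, pu, pl, y in strata)

# Synthetic usage: a proxy metric biased upward relative to human labels.
truth = rng.binomial(1, 0.8, size=10_000).astype(float)
preds = np.clip(truth * 0.9 + 0.15, 0.0, 1.0)          # imperfect model scores
labeled = rng.choice(10_000, size=200, replace=False)   # small human-labeled set
est = ppi_mean(preds, preds[labeled], truth[labeled])
print(f"PPI estimate: {est:.3f} (naive proxy mean: {preds.mean():.3f})")
```

The debiasing term is what lets a small labeled set correct a large pool of model judgments; stratifying simply applies the same correction within more homogeneous groups before recombining.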