Variance-Bounded Evaluation of Entity-Centric AI Systems Without Ground Truth: Theory and Measurement
- URL: http://arxiv.org/abs/2509.22751v2
- Date: Mon, 03 Nov 2025 20:40:52 GMT
- Title: Variance-Bounded Evaluation of Entity-Centric AI Systems Without Ground Truth: Theory and Measurement
- Authors: Kaihua Ding,
- Abstract summary: We introduce VB-Score, a variance-bounded evaluation framework for entity-centric AI systems.<n> VB-Score enumerates plausible interpretations through constraint relaxation and Monte Carlo sampling.<n>It then evaluates system outputs by their expected success across interpretations, penalized by variance to assess robustness of the system.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reliable evaluation of AI systems remains a fundamental challenge when ground truth labels are unavailable, particularly for systems generating natural language outputs like AI chat and agent systems. Many of these AI agents and systems focus on entity-centric tasks. In enterprise contexts, organizations deploy AI systems for entity linking, data integration, and information retrieval where verification against gold standards is often infeasible due to proprietary data constraints. Academic deployments face similar challenges when evaluating AI systems on specialized datasets with ambiguous criteria. Conventional evaluation frameworks, rooted in supervised learning paradigms, fail in such scenarios where single correct answers cannot be defined. We introduce VB-Score, a variance-bounded evaluation framework for entity-centric AI systems that operates without ground truth by jointly measuring effectiveness and robustness. Given system inputs, VB-Score enumerates plausible interpretations through constraint relaxation and Monte Carlo sampling, assigning probabilities that reflect their likelihood. It then evaluates system outputs by their expected success across interpretations, penalized by variance to assess robustness of the system. We provide formal theoretical analysis establishing key properties including range, monotonicity, and stability along with concentration bounds for Monte Carlo estimation. Through case studies on AI systems with ambiguous inputs, we demonstrate that VB-Score reveals robustness differences hidden by conventional evaluation frameworks, offering a principled measurement framework for assessing AI system reliability in label-scarce domains.
Related papers
- CIRCLE: A Framework for Evaluating AI from a Real-World Lens [10.028017198571833]
CIRCLE aims to bridge the gap between model-centric performance metrics and AI's materialized outcomes in deployment.<n>CIRCLE provides a structured, prospective protocol for linking context-sensitive qualitative insights to scalable quantitative metrics.
arXiv Detail & Related papers (2026-02-27T14:43:23Z) - The Necessity of a Unified Framework for LLM-Based Agent Evaluation [46.631678638677386]
General-purpose agents have seen fundamental advancements.<n> evaluating these agents presents unique challenges that distinguish them from static QA benchmarks.<n>We propose that a unified evaluation framework is essential for the rigorous advancement of agent evaluation.
arXiv Detail & Related papers (2026-02-03T08:18:37Z) - Robust Verification of Controllers under State Uncertainty via Hamilton-Jacobi Reachability Analysis [49.31947916567367]
Hamilton-Jacobi (J) reachability analysis is a popular formal verification tool for general nonlinear systems that can compute optimal reachable under worst-case uncertainties.<n>This work is the first HJ-based reachability-based system verification framework for the Robust Verification Controllers via HJ rover.<n>Within Ro-CoRe, we propose novel methods for safety verification and controller design.
arXiv Detail & Related papers (2025-11-18T18:55:20Z) - TrustJudge: Inconsistencies of LLM-as-a-Judge and How to Alleviate Them [58.04324690859212]
Large Language Models (LLMs) as automated evaluators (LLM-as-a-judge) has revealed critical inconsistencies in current evaluation frameworks.<n>We identify two fundamental types of inconsistencies: Score-Comparison Inconsistency and Pairwise Transitivity Inconsistency.<n>We propose TrustJudge, a probabilistic framework that addresses these limitations through two key innovations.
arXiv Detail & Related papers (2025-09-25T13:04:29Z) - Safe and Certifiable AI Systems: Concepts, Challenges, and Lessons Learned [45.44933002008943]
This white paper presents the T"UV AUSTRIA Trusted AI framework.<n>It is an end-to-end audit catalog and methodology for assessing and certifying machine learning systems.<n>Building on three pillars - Secure Software Development, Functional Requirements, and Ethics & Data Privacy - it translates the high-level obligations of the EU AI Act into specific, testable criteria.
arXiv Detail & Related papers (2025-09-08T17:52:08Z) - Ethical AI: Towards Defining a Collective Evaluation Framework [0.3413711585591077]
Artificial Intelligence (AI) is transforming sectors such as healthcare, finance, and autonomous systems.<n>Yet its rapid integration raises urgent ethical concerns related to data ownership, privacy, and systemic bias.<n>This article proposes a modular ethical assessment framework built on ontological blocks of meaning-discrete, interpretable units.
arXiv Detail & Related papers (2025-05-30T21:10:47Z) - Position: Bayesian Statistics Facilitates Stakeholder Participation in Evaluation of Generative AI [0.0]
The evaluation of Generative AI (GenAI) systems plays a critical role in public policy and decision-making.<n>Existing methods are often limited by reliance on benchmark-driven, point-estimate comparisons.<n>This paper argues for the use of Bayesian statistics as a principled framework to address these challenges.
arXiv Detail & Related papers (2025-04-21T16:31:15Z) - SEOE: A Scalable and Reliable Semantic Evaluation Framework for Open Domain Event Detection [70.23196257213829]
We propose a scalable and reliable Semantic-level Evaluation framework for Open domain Event detection.<n>Our proposed framework first constructs a scalable evaluation benchmark that currently includes 564 event types covering 7 major domains.<n>We then leverage large language models (LLMs) as automatic evaluation agents to compute a semantic F1-score, incorporating fine-grained definitions of semantically similar labels.
arXiv Detail & Related papers (2025-03-05T09:37:05Z) - AILuminate: Introducing v1.0 of the AI Risk and Reliability Benchmark from MLCommons [62.374792825813394]
This paper introduces AILuminate v1.0, the first comprehensive industry-standard benchmark for assessing AI-product risk and reliability.<n>The benchmark evaluates an AI system's resistance to prompts designed to elicit dangerous, illegal, or undesirable behavior in 12 hazard categories.
arXiv Detail & Related papers (2025-02-19T05:58:52Z) - Demographic Benchmarking: Bridging Socio-Technical Gaps in Bias Detection [0.0]
This paper describes how the ITACA AI auditing platform tackles demographic benchmarking when auditing AI recommender systems.<n>The framework serves us as auditors as it allows us to not just measure but establish acceptability ranges for specific performance indicators.<n>Our approach integrates socio-demographic insights directly into AI systems, reducing bias and improving overall performance.
arXiv Detail & Related papers (2025-01-27T12:14:49Z) - ValUES: A Framework for Systematic Validation of Uncertainty Estimation in Semantic Segmentation [2.1517210693540005]
Uncertainty estimation is an essential and heavily-studied component for semantic segmentation methods.
Can data-related and model-related uncertainty really be separated in practice?
Which components of an uncertainty method are essential for real-world performance?
arXiv Detail & Related papers (2024-01-16T17:02:21Z) - Evaluating General-Purpose AI with Psychometrics [43.85432514910491]
We discuss the need for a comprehensive and accurate evaluation of general-purpose AI systems such as large language models.
Current evaluation methodology, mostly based on benchmarks of specific tasks, falls short of adequately assessing these versatile AI systems.
To tackle these challenges, we suggest transitioning from task-oriented evaluation to construct-oriented evaluation.
arXiv Detail & Related papers (2023-10-25T05:38:38Z) - Goodhart's Law Applies to NLP's Explanation Benchmarks [57.26445915212884]
We critically examine two sets of metrics: the ERASER metrics (comprehensiveness and sufficiency) and the EVAL-X metrics.
We show that we can inflate a model's comprehensiveness and sufficiency scores dramatically without altering its predictions or explanations on in-distribution test inputs.
Our results raise doubts about the ability of current metrics to guide explainability research, underscoring the need for a broader reassessment of what precisely these metrics are intended to capture.
arXiv Detail & Related papers (2023-08-28T03:03:03Z) - Evaluating AI systems under uncertain ground truth: a case study in dermatology [43.8328264420381]
We show that ignoring uncertainty leads to overly optimistic estimates of model performance.<n>In skin condition classification, we find that a large portion of the dataset exhibits significant ground truth uncertainty.
arXiv Detail & Related papers (2023-07-05T10:33:45Z) - Towards a multi-stakeholder value-based assessment framework for
algorithmic systems [76.79703106646967]
We develop a value-based assessment framework that visualizes closeness and tensions between values.
We give guidelines on how to operationalize them, while opening up the evaluation and deliberation process to a wide range of stakeholders.
arXiv Detail & Related papers (2022-05-09T19:28:32Z) - Fairness Score and Process Standardization: Framework for Fairness
Certification in Artificial Intelligence Systems [0.4297070083645048]
We propose a novel Fairness Score to measure the fairness of a data-driven AI system.
It will also provide a framework to operationalise the concept of fairness and facilitate the commercial deployment of such systems.
arXiv Detail & Related papers (2022-01-10T15:45:12Z) - Approaching Neural Network Uncertainty Realism [53.308409014122816]
Quantifying or at least upper-bounding uncertainties is vital for safety-critical systems such as autonomous vehicles.
We evaluate uncertainty realism -- a strict quality criterion -- with a Mahalanobis distance-based statistical test.
We adopt it to the automotive domain and show that it significantly improves uncertainty realism compared to a plain encoder-decoder model.
arXiv Detail & Related papers (2021-01-08T11:56:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.