Fugu-MT Paper Translation (Abstract): Variance-Bounded Evaluation of Entity-Centric AI Systems Without Ground Truth: Theory and Measurement

Paper abstract: Variance-Bounded Evaluation of Entity-Centric AI Systems Without Ground Truth: Theory and Measurement

arxiv url: http://arxiv.org/abs/2509.22751v2
Date: Mon, 03 Nov 2025 20:40:52 GMT
Status: Translation complete
Last updated in system: 2025-11-05 16:37:26.452002
Title: Variance-Bounded Evaluation of Entity-Centric AI Systems Without Ground Truth: Theory and Measurement
Authors: Kaihua Ding
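Abstract summary: This paper introduces VB-Score, a variance-bounded evaluation framework for entity-centric AI systems. VB-Score enumerates plausible interpretations through constraint relaxation and Monte Carlo sampling. It then scores system outputs by their expected success across interpretations, penalized by variance, to assess the robustness of the system.
Reference score (attention score computed by this site): 0.0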
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reliable evaluation of AI systems remains a fundamental challenge when ground truth labels are unavailable, particularly for systems generating natural language outputs like AI chat and agent systems. Many of these AI agents and systems focus on entity-centric tasks. In enterprise contexts, organizations deploy AI systems for entity linking, data integration, and information retrieval where verification against gold standards is often infeasible due to proprietary data constraints. Academic deployments face similar challenges when evaluating AI systems on specialized datasets with ambiguous criteria. Conventional evaluation frameworks, rooted in supervised learning paradigms, fail in such scenarios where single correct answers cannot be defined. We introduce VB-Score, a variance-bounded evaluation framework for entity-centric AI systems that operates without ground truth by jointly measuring effectiveness and robustness. Given system inputs, VB-Score enumerates plausible interpretations through constraint relaxation and Monte Carlo sampling, assigning probabilities that reflect their likelihood. It then evaluates system outputs by their expected success across interpretations, penalized by variance to assess robustness of the system. We provide formal theoretical analysis establishing key properties including range, monotonicity, and stability along with concentration bounds for Monte Carlo estimation. Through case studies on AI systems with ambiguous inputs, we demonstrate that VB-Score reveals robustness differences hidden by conventional evaluation frameworks, offering a principled measurement framework for assessing AI system reliability in label-scarce domains.
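A minimal sketch of the scoring idea described in the abstract, under stated assumptions. The abstract does not give the VB-Score formula, so the form used here (expected success minus `lam` times the variance), the function names `penalized_score`, `monte_carlo_score`, and `hoeffding_samples`, and the parameter `lam` are all illustrative and should not be read as the paper's definitions. Hoeffding's inequality stands in for the concentration bounds the abstract mentions; the paper may use a different bound.

```python
import math
import random
from typing import Sequence


def penalized_score(probs: Sequence[float], successes: Sequence[float],
                    lam: float = 1.0) -> float:
    """Expected success across interpretations minus lam * variance.

    ASSUMPTION: this mean-minus-variance form is illustrative only.
    probs[i] is the weight of interpretation i; successes[i] is the
    success of the system output under interpretation i, in [0, 1].
    """
    if len(probs) != len(successes):
        raise ValueError("probs and successes must have the same length")
    total = sum(probs)
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    weights = [p / total for p in probs]  # normalize so weights sum to 1
    mean = sum(w * s for w, s in zip(weights, successes))
    var = sum(w * (s - mean) ** 2 for w, s in zip(weights, successes))
    return mean - lam * var


def monte_carlo_score(probs: Sequence[float], successes: Sequence[float],
                      n_samples: int = 10_000, lam: float = 1.0,
                      seed: int = 0) -> float:
    """Estimate the same score by sampling interpretations by weight."""
    rng = random.Random(seed)
    draws = rng.choices(successes, weights=probs, k=n_samples)
    mean = sum(draws) / n_samples
    var = sum((d - mean) ** 2 for d in draws) / n_samples
    return mean - lam * var


def hoeffding_samples(eps: float, delta: float) -> int:
    """Samples needed so the Monte Carlo mean of a [0, 1] quantity is
    within eps of its expectation with probability at least 1 - delta
    (Hoeffding's inequality; an assumed stand-in for the paper's bound)."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))


# Two hypothetical systems with the same expected success (0.5)
# over three interpretations weighted 0.25 / 0.5 / 0.25.
probs = [0.25, 0.5, 0.25]
steady = penalized_score(probs, [0.5, 0.5, 0.5])   # no spread across interpretations
brittle = penalized_score(probs, [1.0, 0.5, 0.0])  # succeeds on one reading, fails on another
print(steady, brittle)
```

In this toy case both systems have the same mean success, so a mean-only metric cannot separate them, while the variance penalty ranks the steady system higher. For success values in [0, 1] and `lam` between 0 and 1, this particular form stays within [0, 1], because the variance of such a quantity is at most mean × (1 − mean).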

Related papers