Fugu-MT Paper Translation (Abstract): Litmus: Zero-Label, Code-Driven Metric Specification for Evaluating AI Systems

Paper summary: Litmus: Zero-Label, Code-Driven Metric Specification for Evaluating AI Systems

arxiv url: http://arxiv.org/abs/2606.23403v1
Date: Mon, 22 Jun 2026 14:26:48 GMT
Status: Translation complete
Last updated in system: 2026-06-24 19:18:52.608909
Title: Litmus: Zero-Label, Code-Driven Metric Specification for Evaluating AI Systems
Title (reference translation): Litmus: Zero-Label, Code-Driven Metric Specification for Evaluating AI Systems
Authors: Prajjwal Gupta, Prasang Gupta, Vishal Bhutani, Apoorva Sharma, Sumanth Chundru, Waqar Sarguroh, Kevin Paul
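Abstract summary: We introduce Litmus, a zero-label system that designs evaluation and monitoring metrics for AI pipelines. Instead of assuming that the evaluation target is already known, Litmus first identifies what must be measured and why. We evaluate Litmus on three real, code-defined AI pipelines.
Reference score (independently computed attention score): 1.5377897575579675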
License: http://creativecommons.org/licenses/by/4.0/
Abstract: As agentic LLM systems move from prototypes to deployment across increasingly diverse domains, evaluating them has become both more important and more difficult. The challenge is not only that individual metrics may be unreliable, but that evaluation goals are often left implicit. Without a clear account of what a system is expected to do, how it can fail, and which failures matter, metric choices become difficult to justify, interpret, or validate. We present Litmus, a zero-label system that designs evaluation and monitoring metrics for AI pipelines by eliciting evaluation intent from source code and targeted interrogation. Instead of assuming that the evaluation target is already known, Litmus first identifies what must be measured and why, then converts those answers into constraints for constructing a justified, per-stage metric portfolio. We evaluate Litmus on three real, code-defined AI pipelines - financial account grouping, scientific QA, and inherent risk assessment - against AutoMetrics and three DynamicRubric baselines. Litmus achieves the broadest or tied-broadest concern coverage, spans more pipeline stages, produces a near-zero-redundancy portfolio, and ranks first in validity against per-row quality labels on all three pipelines - decisively on scientific QA (Spearman $ρ=0.72$ vs. less than $0.47$ for every baseline), and within overlapping confidence intervals in relation to two components of the audit framework despite using no labels during metric design. Our results support a shift from automatic metric implementation to automatic metric specification: before asking which metric to compute, evaluation systems should ask what must be measured and why.
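Abstract (reference translation): As agentic LLM systems move from prototypes to deployment across increasingly diverse domains, evaluating them is becoming both more important and more difficult. The challenge is not only that individual metrics can be unreliable, but also that evaluation goals are often left implicit. Without a clear account of what a system is expected to do, how it can fail, and which failures matter, metric choices become hard to justify, interpret, or validate. We propose Litmus, a zero-label system that designs evaluation and monitoring metrics for AI pipelines by eliciting evaluation intent from source code and targeted interrogation. Rather than assuming the evaluation target is already known, Litmus first identifies what must be measured and why, then converts those answers into constraints for building a justified, per-stage metric portfolio. We evaluate Litmus on three real, code-defined AI pipelines (financial account grouping, scientific QA, and inherent risk assessment) against AutoMetrics and three DynamicRubric baselines. Litmus achieves the broadest or tied-broadest concern coverage, spans more pipeline stages, produces a near-zero-redundancy portfolio, and ranks first in validity against per-row quality labels on all three pipelines: decisively on scientific QA (Spearman $ρ=0.72$ vs. less than $0.47$ for every baseline), and within overlapping confidence intervals in relation to two components of the audit framework, despite using no labels during metric design. These results support a shift from automatic metric implementation to automatic metric specification: before asking which metric to compute, evaluation systems should ask what must be measured and why.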
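The abstract reports validity as a Spearman correlation between metric scores and per-row quality labels. As a minimal sketch of how such a figure could be computed (not the authors' evaluation code), the snippet below ranks both sequences, averaging tied ranks, and takes the Pearson correlation of the ranks. The function names and the sample data are assumptions made up for illustration.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman's rho: Pearson correlation of the rank-transformed inputs."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one input is constant")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical per-row data, invented purely for illustration.
metric_scores = [0.91, 0.40, 0.75, 0.10, 0.66]
quality_labels = [5, 2, 4, 1, 3]
rho = spearman_rho(metric_scores, quality_labels)
print(f"Spearman rho = {rho:.2f}")
```

Because only ranks enter the calculation, a metric is rewarded for ordering rows the same way the labels do, regardless of its scale. In this sketch the labels are used only to check a finished metric, in line with the abstract's statement that no labels are used during metric design.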


Related papers