Fugu-MT Paper Translation (Abstract): Towards Apples to Apples for AI Evaluations: From Real-World Use Cases to Evaluation Scenarios

Paper overview: Towards Apples to Apples for AI Evaluations: From Real-World Use Cases to Evaluation Scenarios

arxiv url: http://arxiv.org/abs/2605.07986v1
Date: Fri, 08 May 2026 16:44:55 GMT
Status: Translation complete
Last updated in system: 2026-05-11 19:43:39.212827
Title: Towards Apples to Apples for AI Evaluations: From Real-World Use Cases to Evaluation Scenarios
Title (reference translation): Toward Apples-to-Apples AI Evaluations: From Real-World Use Cases to Evaluation Scenarios
Authors: Yee-Yin Choong, Kristen Greene, Alice Qian, Meryem Marasli, Ziqi Yang, Sophia Chen, Laura Dabbish, Anand Rao, Hong Shen
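Abstract summary: This work advocates for methodological transparency in AI evaluations. It proposes a process for transforming high-level use cases into detailed scenarios. It reports on high-level AI use cases identified by subject matter experts (SMEs) in the financial services sector.
Reference score (independently calculated attention score): 9.038002730735608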
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: AI measurement science has a wide variety of methodologies and measurements for comparing AI systems, resulting in what often appear to be "apples-to-oranges" comparisons across AI evaluations. To move toward "apples-to-apples" comparisons in real-world AI evaluations, this work advocates for methodological transparency in evaluation scenarios, operational grounding, and human-centered design (HCD) principles. We propose a repeatable process for transforming high-level use cases to detailed scenarios by eliciting use cases from subject matter experts (SMEs) via a structured AI Use Case Worksheet with six key elements: use case, sector, user (direct and indirect), intended outcomes, expected impacts (positive and negative), and KPIs and metrics. We demonstrate utility of the worksheet and process in the U.S. financial services sector. This paper reports on example high-level AI use cases identified by financial services sector SMEs: cyber defense enablement, developer productivity, financial crime aggregation, suspicious activity report (SAR) filing, credit memo generation, and internal call center support. These AI use cases provided are illustrative of the process and not exhaustive. Central to our work is a three-stage expansion pipeline combining LLM prompting with human reviews to generate 107 scenarios from those use cases elicited from SMEs. This process integrates iterative human reviews at every juncture to ensure operational grounding: for scenario titles and descriptions; for core scenario elements like users, benefits and risks, and metrics; and for scenario narratives and evaluation objectives. Human checkpoints ensure scenarios remain reflective of real-world usage and human needs. We describe a validation rubric to assess scenario quality. By defining key scenario components, this work supports a more consistent and meaningful paradigm for human-centered AI evaluations.
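Abstract (reference translation): AI measurement science has a wide variety of methodologies and measurements for comparing AI systems, which results in what often appear to be "apples-to-oranges" comparisons across AI evaluations. To move toward "apples-to-apples" comparisons in real-world AI evaluations, this work advocates for methodological transparency in evaluation scenarios, operational grounding, and human-centered design (HCD) principles. We propose a repeatable process for transforming high-level use cases into detailed scenarios by eliciting use cases from subject matter experts (SMEs) via a structured AI Use Case Worksheet with six key elements: use case, sector, user (direct and indirect), intended outcomes, expected impacts (positive and negative), and KPIs and metrics. We demonstrate the utility of the worksheet and process in the U.S. financial services sector. This paper reports on high-level AI use cases identified by financial services sector SMEs: cyber defense enablement, developer productivity, financial crime aggregation, suspicious activity report (SAR) filing, credit memo generation, and internal call center support. These AI use cases are illustrative of the process, not exhaustive. Central to our work is a three-stage expansion pipeline that combines LLM prompting with human reviews to generate 107 scenarios from the use cases elicited from SMEs. Human reviews cover scenario titles and descriptions; core scenario elements such as users, benefits and risks, and metrics; and scenario narratives and evaluation objectives. Human checkpoints ensure that scenarios continue to reflect real-world usage and human needs. We describe a validation rubric for assessing scenario quality. By defining key scenario components, this work supports a more consistent and meaningful paradigm for human-centered AI evaluations.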
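
The six worksheet elements listed in the abstract can be pictured as a simple record. The following is a minimal sketch only: the abstract does not give the worksheet's actual layout, so the class name `AIUseCaseWorksheet`, its field names, and the `missing_elements` completeness check are all hypothetical, and the sample user value is illustrative rather than taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class AIUseCaseWorksheet:
    """Hypothetical container for the six worksheet elements named in the abstract."""

    use_case: str
    sector: str
    direct_users: list[str] = field(default_factory=list)
    indirect_users: list[str] = field(default_factory=list)
    intended_outcomes: list[str] = field(default_factory=list)
    positive_impacts: list[str] = field(default_factory=list)
    negative_impacts: list[str] = field(default_factory=list)
    kpis_and_metrics: list[str] = field(default_factory=list)

    def missing_elements(self) -> list[str]:
        """Return the names of the six elements that are still incomplete.

        Assumption: "user" counts as filled if any direct or indirect user is
        listed, and "expected impacts" needs both a positive and a negative entry.
        """
        elements = {
            "use case": bool(self.use_case.strip()),
            "sector": bool(self.sector.strip()),
            "user": bool(self.direct_users or self.indirect_users),
            "intended outcomes": bool(self.intended_outcomes),
            "expected impacts": bool(self.positive_impacts and self.negative_impacts),
            "KPIs and metrics": bool(self.kpis_and_metrics),
        }
        return [name for name, filled in elements.items() if not filled]


worksheet = AIUseCaseWorksheet(
    use_case="suspicious activity report (SAR) filing",
    sector="U.S. financial services",
    direct_users=["compliance analyst"],  # illustrative value, not from the paper
)
print(worksheet.missing_elements())
```

A check like this could flag incomplete worksheets before they enter scenario generation; whether the authors apply any such automated check is not stated in the abstract.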

Related papers