Fugu-MT Paper Translation (Abstract): Design and Report Benchmarks for Knowledge Work

Paper summary: Design and Report Benchmarks for Knowledge Work

arxiv url: http://arxiv.org/abs/2605.23262v1
Date: Fri, 22 May 2026 06:03:01 GMT
Status: Translation complete
Last updated in system: 2026-05-25 17:29:20.218755
Title: Design and Report Benchmarks for Knowledge Work
Title (reference translation): Designing and Reporting Benchmarks for Knowledge Work
Authors: Yining Hua, Hongbin Na, Cyrus Ayubcha, Levi Lian,
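Abstract summary: This paper proposes a three-step approach for making explicit how benchmarked tasks represent the work claims attached to their evaluation results. To distinguish the work activity under evaluation from common benchmark tasks, it derives an inventory of 18 work activities from the O*NET occupational task database.
Reference score (attention level, computed independently by this site): 5.13982016225783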
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The development of LLM agents has led to a growing body of work on knowledge-work AI, including coding, research, and healthcare. However, current knowledge-work evaluation and benchmark design still largely follow the logic of traditional NLP tasks. As a result, higher benchmark performance does not reliably show that a system can carry out knowledge work in real-world deployment settings. This paper contributes a three-step approach for making explicit how benchmarked tasks represent the work claims attached to their scores: defining the work activity under evaluation, specifying the tested setting, and scoring the appropriate work product. We review work studies showing that knowledge work is organized through roles and responsibilities, local materials and tools, and artifacts that must remain usable in downstream workflows. We then translate these concerns into benchmark design and reporting guidance, covering how tasks should be mapped to work activities, how tested settings should specify materials, tools, roles, and constraints, and how scoring should focus on the work product left by the system. To name the work activity being evaluated and distinguish it from common benchmark tasks, we derive an inventory of 18 work activities from the O*NET occupational task database. We demonstrate the approach through three benchmark case analyses: GDPval, a non-code occupational deliverable benchmark; OfficeQA Pro, a grounded document-analysis benchmark scored by final answers; and APEX-SWE, a software-engineering benchmark with executable scored products. These cases show how benchmark design choices shape the strongest work claim a score can support, and where gaps arise between the benchmarked task, tested setting, scored product, and broader work claim.
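Abstract (reference translation): The development of LLM agents has spurred growing activity around knowledge-work AI, including coding, research, and healthcare. However, current knowledge-work evaluation and benchmark design still largely follow the logic of traditional NLP tasks. As a result, higher benchmark performance does not reliably show that a system can carry out knowledge work in real-world deployment settings. This paper proposes a three-step approach for making explicit how benchmarked tasks represent the work claims attached to their scores: defining the work activity under evaluation, specifying the tested setting, and scoring the appropriate work product. We review work studies showing that knowledge work is organized through roles and responsibilities, local materials and tools, and artifacts that must remain usable in downstream workflows. We then translate these concerns into benchmark design and reporting guidance, explaining how tasks should be mapped to work activities, how tested settings should specify materials, tools, roles, and constraints, and how scoring should focus on the work product left by the system. To name the work activity being evaluated and distinguish it from common benchmark tasks, we derive an inventory of 18 work activities from the O*NET occupational task database. We demonstrate the approach through three benchmark case analyses: GDPval, a non-code occupational deliverable benchmark; OfficeQA Pro, a grounded document-analysis benchmark scored by final answers; and APEX-SWE, a software-engineering benchmark with executable scored products. These cases show how benchmark design choices shape the strongest work claim a score can support, and where gaps arise between the benchmarked task, the tested setting, the scored product, and the broader work claim.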
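As an illustration only, the three reporting steps named in the abstract (work activity, tested setting, scored work product) could be captured in a small record type. The sketch below is not taken from the paper: the class names `SettingSpec` and `BenchmarkReport`, the field names, the `missing_fields` helper, and the `ExampleBench` values are all hypothetical, chosen to mirror the wording of the abstract.

```python
from dataclasses import dataclass, field


@dataclass
class SettingSpec:
    """Hypothetical description of the tested setting (step 2)."""
    materials: list[str] = field(default_factory=list)
    tools: list[str] = field(default_factory=list)
    role: str = ""
    constraints: list[str] = field(default_factory=list)


@dataclass
class BenchmarkReport:
    """Hypothetical report entry covering the three steps in the abstract."""
    benchmark: str
    work_activity: str    # step 1: the work activity under evaluation
    setting: SettingSpec  # step 2: materials, tools, role, constraints
    scored_product: str   # step 3: the work product that is scored

    def missing_fields(self) -> list[str]:
        """Return the names of reporting fields that were left empty."""
        checks = [
            ("work_activity", self.work_activity),
            ("setting.materials", self.setting.materials),
            ("setting.tools", self.setting.tools),
            ("setting.role", self.setting.role),
            ("setting.constraints", self.setting.constraints),
            ("scored_product", self.scored_product),
        ]
        # Empty strings and empty lists are both falsy, so one test covers both.
        return [name for name, value in checks if not value]


# Made-up example: a report that names the role and the scored product
# but leaves the work activity and most of the setting unspecified.
report = BenchmarkReport(
    benchmark="ExampleBench",
    work_activity="",
    setting=SettingSpec(role="analyst"),
    scored_product="final answer",
)
print(report.missing_fields())
```

A check of this kind only flags what a report omits; it says nothing about whether the stated work activity, setting, and scored product actually support the broader work claim, which is the question the paper's case analyses address.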

Related papers