Fugu-MT Paper Translation (Summary): KWBench: Measuring Unprompted Problem Recognition in Knowledge Work

Paper summary: KWBench: Measuring Unprompted Problem Recognition in Knowledge Work

arxiv url: http://arxiv.org/abs/2604.15760v1
Date: Fri, 17 Apr 2026 07:04:54 GMT
Status: translation complete
Last updated in system: 2026-04-20 22:00:19.77792
Title: KWBench: Measuring Unprompted Problem Recognition in Knowledge Work
Authors: Ankit Maloo
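Abstract summary: KWBench is a benchmark for unprompted problem recognition in large language models: whether a model can identify a professional scenario before attempting to solve it. It contains 223 tasks spanning acquisitions, contract negotiations, clinical pharmacy, organizational politics, fraud analysis, and incentive design.
Reference score (independently computed attention metric): 0.0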
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We introduce the first version of KWBench (Knowledge Work Bench), a benchmark for unprompted problem recognition in large language models: can an LLM identify a professional scenario before attempting to solve it? Existing frontier benchmarks have saturated, and most knowledge-work evaluations to date reduce to extraction or task completion against a specification. KWBench targets the step before that: recognizing the governing structure of the situation from raw inputs alone. The benchmark contains 223 tasks sourced from practitioners across acquisitions, contract negotiations, clinical pharmacy, organizational politics, fraud analysis, and incentive design. Each task encodes a formal game-theoretic pattern (principal-agent conflict, signaling, mechanism design failure, strategic omission, coalitional dynamics, strategic interdependence) and carries structured ground truth recording the expert reading of the situation and the anticipated failure modes. Models receive raw data and a task prompt with no indication of problem type. Scoring is a three-tier rubric gated by a mandatory conjunctive check. Mandatory criteria encode the predicted wrong paths. We evaluate 16 models. The best model passes on 27.9% of tasks. The top two models agree on only 31.7% of their passes. Among the top 8, 44 tasks are solved by exactly one model; routing across the top 8 covers 50.7% of the benchmark, nearly double the best single model. Conditional on passing, quality scores converge (approximately 83% across models); unconditional scores do not. The same models articulate the relevant game-theoretic concept correctly when asked, then fail to apply it unprompted. We release KWBench to shift how frontier models are evaluated on knowledge work, scoring them on whether they recognize the right problem from the situation alone, not only on how well they execute once the problem has been framed for them.
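The scoring and routing figures in the abstract can be illustrated with a minimal Python sketch. It is not the benchmark's implementation: the tier weights, the `TaskResult` layout, and the use of Jaccard overlap for "agreement on passes" are all assumptions made here for illustration, since the abstract only states that a three-tier rubric is gated by a mandatory conjunctive check.

```python
from dataclasses import dataclass, field

# Hypothetical tier weights: the abstract does not say how the three tiers are weighted.
TIER_WEIGHTS = {1: 3.0, 2: 2.0, 3: 1.0}


@dataclass
class TaskResult:
    # One flag per mandatory criterion (True means the criterion was satisfied).
    mandatory: list
    # Tier number -> flags for that tier's rubric criteria (hypothetical layout).
    tiers: dict = field(default_factory=dict)


def passes_gate(result):
    """Conjunctive check: every mandatory criterion must be satisfied."""
    return all(result.mandatory)


def quality_score(result):
    """Weighted fraction of tier criteria met, or 0.0 when the gate fails."""
    if not passes_gate(result):
        return 0.0
    earned = 0.0
    total = 0.0
    for tier, flags in result.tiers.items():
        weight = TIER_WEIGHTS[tier]
        earned += weight * sum(flags)
        total += weight * len(flags)
    return earned / total if total else 0.0


def routing_coverage(passes_by_model, n_tasks):
    """Fraction of tasks passed by at least one model (an oracle router)."""
    solved = set().union(*passes_by_model.values())
    return len(solved) / n_tasks


def pass_agreement(passes_a, passes_b):
    """Jaccard overlap of two models' passed-task sets (assumed definition)."""
    union = passes_a | passes_b
    return len(passes_a & passes_b) / len(union) if union else 0.0
```

Under these assumptions, a response that misses any single mandatory criterion scores zero regardless of its tier marks. That is one way conditional and unconditional quality scores can diverge as the abstract describes.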

Related papers