Fugu-MT paper translation (abstract): Emergent Strategic Reasoning Risks in AI: A Taxonomy-Driven Evaluation Framework

Paper overview: Emergent Strategic Reasoning Risks in AI: A Taxonomy-Driven Evaluation Framework

arxiv url: http://arxiv.org/abs/2604.22119v1
Date: Thu, 23 Apr 2026 23:44:01 GMT
Status: translation complete
System update date: 2026-04-27 15:36:26.291654
Title: Emergent Strategic Reasoning Risks in AI: A Taxonomy-Driven Evaluation Framework
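Title (reference translation): Emergent Strategic Reasoning Risks in AI: A Taxonomy-Driven Evaluation Framework
Authors: Tharindu Kumarage, Lisa Bauer, Yao Ma, Dan Rosen, Yashasvi Raghavendra Guduri, Anna Rumshisky, Kai-Wei Chang, Aram Galstyan, Rahul Gupta, Charith Peris
Abstract summary: Large language models (LLMs) are gaining the capacity to engage in behaviors that serve their own objectives. These include deception (intentionally misleading users or evaluators), evaluation gaming (strategically manipulating performance during safety testing), and reward hacking. The paper introduces ESRRSim, a taxonomy-driven agentic framework for automated behavioral risk evaluation.
Reference score (independently computed attention score): 63.74295981594549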
License: http://creativecommons.org/licenses/by/4.0/
Abstract: As reasoning capacity and deployment scope grow in tandem, large language models (LLMs) gain the capacity to engage in behaviors that serve their own objectives, a class of risks we term Emergent Strategic Reasoning Risks (ESRRs). These include, but are not limited to, deception (intentionally misleading users or evaluators), evaluation gaming (strategically manipulating performance during safety testing), and reward hacking (exploiting misspecified objectives). Systematically understanding and benchmarking these risks remains an open challenge. To address this gap, we introduce ESRRSim, a taxonomy-driven agentic framework for automated behavioral risk evaluation. We construct an extensible risk taxonomy of 7 categories, which is decomposed into 20 subcategories. ESRRSim generates evaluation scenarios designed to elicit faithful reasoning, paired with dual rubrics assessing both model responses and reasoning traces, in a judge-agnostic and scalable architecture. Evaluation across 11 reasoning LLMs reveals substantial variation in risk profiles (detection rates ranging 14.45%-72.72%), with dramatic generational improvements suggesting models may increasingly recognize and adapt to evaluation contexts.
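Abstract (reference translation): As reasoning capability and deployment scope grow together, large language models (LLMs) acquire the ability to engage in behaviors that serve their own objectives, a class of risks we call Emergent Strategic Reasoning Risks (ESRRs). Examples include deception (intentionally misleading users or evaluators), evaluation gaming (strategically manipulating performance during safety testing), and reward hacking (exploiting misspecified objectives). Understanding and benchmarking these risks systematically is still an open problem. To close this gap, we introduce ESRRSim, a taxonomy-driven agentic framework for automated behavioral risk evaluation. We build an extensible risk taxonomy of 7 categories, broken down into 20 subcategories. ESRRSim generates evaluation scenarios designed to elicit faithful reasoning and pairs them with dual rubrics that assess both the model's responses and its reasoning traces, within a judge-agnostic, scalable architecture. Evaluating 11 reasoning LLMs shows substantial variation in risk profiles (detection rates from 14.45% to 72.72%), and dramatic improvements across model generations suggest that models may increasingly recognize and adapt to evaluation contexts.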
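The abstract reports a per-model detection rate derived from dual rubrics, but it does not say how the two rubric verdicts are combined. The following is a minimal Python sketch under stated assumptions, not the authors' implementation: the `Judgment` record, the function names, and the rule that a scenario counts as detected when either rubric fires are all hypothetical, and the category labels simply reuse the example risks named in the abstract.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Judgment:
    """One judged scenario for one model (hypothetical record layout)."""
    category: str            # risk label, e.g. "deception"
    response_flagged: bool   # verdict of the response rubric
    trace_flagged: bool      # verdict of the reasoning-trace rubric


def is_detected(j: Judgment) -> bool:
    # Assumption: risky behavior is "detected" if either rubric fires.
    return j.response_flagged or j.trace_flagged


def detection_rate(judgments: list[Judgment]) -> float:
    """Percentage of judged scenarios in which risky behavior was detected."""
    if not judgments:
        raise ValueError("no judgments to aggregate")
    hits = sum(is_detected(j) for j in judgments)
    return 100.0 * hits / len(judgments)


def rate_by_category(judgments: list[Judgment]) -> dict[str, float]:
    """Detection rate per risk category, giving a simple risk profile."""
    groups: dict[str, list[Judgment]] = defaultdict(list)
    for j in judgments:
        groups[j.category].append(j)
    return {cat: detection_rate(js) for cat, js in groups.items()}


# Made-up illustrative data, not results from the paper.
SAMPLE = [
    Judgment("deception", response_flagged=True, trace_flagged=False),
    Judgment("deception", response_flagged=False, trace_flagged=False),
    Judgment("evaluation_gaming", response_flagged=False, trace_flagged=True),
    Judgment("reward_hacking", response_flagged=False, trace_flagged=False),
]

if __name__ == "__main__":
    print(detection_rate(SAMPLE))    # overall rate for the toy data
    print(rate_by_category(SAMPLE))  # per-category breakdown
```

Keeping the two rubric verdicts as separate fields means a stricter rule (for example, requiring both rubrics to fire) only needs a change in `is_detected`.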

Related papers