Fugu-MT Paper Translation (Abstract): ManagerBench: Evaluating the Safety-Pragmatism Trade-off in Autonomous LLMs

Paper overview: ManagerBench: Evaluating the Safety-Pragmatism Trade-off in Autonomous LLMs

arxiv url: http://arxiv.org/abs/2510.00857v1
Date: Wed, 01 Oct 2025 13:08:33 GMT
Status: Translation complete
Last updated in system: 2025-10-03 16:59:20.569839
Title: ManagerBench: Evaluating the Safety-Pragmatism Trade-off in Autonomous LLMs
Title (reference translation): ManagerBench: Evaluating the Safety-Pragmatism Trade-off in Autonomous LLMs
Authors: Adi Simhi, Jonathan Herzig, Martin Tutek, Itay Itzhak, Idan Szpektor, Yonatan Belinkov
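Abstract summary: As large language models (LLMs) evolve, evaluating the safety of their actions becomes critical. The paper introduces ManagerBench, a benchmark that evaluates LLM decision-making in realistic, human-validated managerial scenarios. A parallel control set, in which potential harm is directed only at inanimate objects, measures a model's pragmatism and identifies its tendency to be overly safe.
Reference score (independently computed attention score): 48.50397204177239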
License: http://creativecommons.org/licenses/by/4.0/
Abstract: As large language models (LLMs) evolve from conversational assistants into autonomous agents, evaluating the safety of their actions becomes critical. Prior safety benchmarks have primarily focused on preventing generation of harmful content, such as toxic text. However, they overlook the challenge of agents taking harmful actions when the most effective path to an operational goal conflicts with human safety. To address this gap, we introduce ManagerBench, a benchmark that evaluates LLM decision-making in realistic, human-validated managerial scenarios. Each scenario forces a choice between a pragmatic but harmful action that achieves an operational goal, and a safe action that leads to worse operational performance. A parallel control set, where potential harm is directed only at inanimate objects, measures a model's pragmatism and identifies its tendency to be overly safe. Our findings indicate that the frontier LLMs perform poorly when navigating this safety-pragmatism trade-off. Many consistently choose harmful options to advance their operational goals, while others avoid harm only to become overly safe and ineffective. Critically, we find this misalignment does not stem from an inability to perceive harm, as models' harm assessments align with human judgments, but from flawed prioritization. ManagerBench is a challenging benchmark for a core component of agentic behavior: making safe choices when operational goals and alignment values incentivize conflicting actions. Benchmark & code available at https://github.com/technion-cs-nlp/ManagerBench.
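Abstract (reference translation): As large language models (LLMs) evolve from conversational assistants into autonomous agents, evaluating the safety of their actions becomes critical. Previous safety benchmarks have focused mainly on preventing the generation of harmful content, such as toxic text. However, they overlook the problem of agents taking harmful actions when the most effective path to an operational goal conflicts with human safety. To address this gap, we introduce ManagerBench, a benchmark that evaluates LLM decision-making in realistic, human-validated managerial scenarios. Each scenario forces a choice between a pragmatic but harmful action that achieves an operational goal and a safe action that leads to worse operational performance. A parallel control set, in which potential harm is directed only at inanimate objects, measures a model's pragmatism and identifies its tendency to be overly safe. The results indicate that frontier LLMs perform poorly when navigating this safety-pragmatism trade-off. Many consistently choose harmful options that advance their operational goals, while others avoid harm only to become overly safe and ineffective. Critically, this misalignment stems not from an inability to perceive harm, since the models' harm assessments align with human judgments, but from flawed prioritization. ManagerBench is a challenging benchmark for a core component of agentic behavior: making safe choices when operational goals and alignment values incentivize conflicting actions. Benchmark & code available at https://github.com/technion-cs-nlp/ManagerBench.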

Related papers