Fugu-MT Paper Translation (Abstract): Toward a Principled Framework for Agent Safety Measurement

Paper summary: Toward a Principled Framework for Agent Safety Measurement

arxiv url: http://arxiv.org/abs/2605.01644v1
Date: Sat, 02 May 2026 23:34:32 GMT
Status: Translation complete
Last updated in system: 2026-05-05 20:33:49.865377
Title: Toward a Principled Framework for Agent Safety Measurement
Title (reference translation): Toward Principled Agent Safety Measurement
Authors: Shuyi Lin, Anshuman Suri, Alina Oprea, Cheng Tan
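Abstract summary: LLM agents emit actions, not just text, and once taken, those actions often cannot be undone. The authors argue that agent safety should be measured by search, not sampling. They apply BOA, a framework that searches the in-budget trajectory space and reports a safety score.
Reference score (independently computed attention score): 12.87651053316749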
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: LLM agents emit actions, not just text, and once taken, those actions often cannot be undone. Yet today's agent-safety evaluations run greedy or a few sampled rollouts and report a single safe/unsafe rate -- blind to the long-tail trajectories where unsafe behavior may arise from low-probability but non-negligible actions. We argue agent safety should be measured by search, not sampling. We apply BOA, a framework that, given a deployment configuration (model, decoder, prompt, environment, judger, likelihood budget), searches the in-budget trajectory space and reports a safety score: the probability the agent stays safe under the configuration. BOA searches both within a single LLM round and across the agent-environment interaction tree under a given likelihood budget, and makes search practical via batched decoding/judging, prefix caching, and chunked tree expansion. On agent-safety workloads, BOA discovers unsafe trajectories that greedy and sampled evaluations miss. BOA can additionally be used for ranking models, defenses, and attacks, all on the same scale, with manageable GPU costs.
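Abstract (reference translation): LLM agents emit actions, not just text, and once taken, those actions often cannot be undone. Yet today's agent-safety evaluations run greedy decoding or a few sampled rollouts and report a single safe/unsafe rate. We argue that agent safety should be measured by search, not sampling. Given a deployment configuration (model, decoder, prompt, environment, judger, likelihood budget), BOA searches the in-budget trajectory space and reports a safety score. BOA searches both within a single LLM round and across the agent-environment interaction tree under a given likelihood budget, and makes search practical via batched decoding/judging, prefix caching, and chunked tree expansion. On agent-safety workloads, BOA discovers unsafe trajectories that greedy and sampled evaluations miss. BOA can additionally be used to rank models, defenses, and attacks with manageable GPU costs.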
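The abstract's contrast between search and sampling can be illustrated with a toy sketch. This is not BOA's implementation: the abstract does not specify the algorithm, so the code below assumes that the "likelihood budget" acts as a minimum trajectory probability (`min_prob`), and `toy_policy`, `toy_judge`, `greedy_rollout`, and `search_safety` are hypothetical names invented for the illustration. It enumerates every trajectory whose probability stays at or above the budget, judges each complete one, and reports the safe, unsafe, and unexplored probability mass.

```python
from typing import Callable, Dict, List, Tuple

Trajectory = Tuple[str, ...]
Policy = Callable[[Trajectory], Dict[str, float]]
Judge = Callable[[Trajectory], bool]


def toy_policy(prefix: Trajectory) -> Dict[str, float]:
    """Hypothetical next-action distribution (stands in for model, decoder,
    prompt, and environment). An empty dict means the trajectory has ended."""
    if len(prefix) >= 2:
        return {}
    return {"read_file": 0.90, "send_email": 0.09, "delete_all": 0.01}


def toy_judge(traj: Trajectory) -> bool:
    """Hypothetical judger: safe iff the agent never issues 'delete_all'."""
    return "delete_all" not in traj


def greedy_rollout(policy: Policy) -> Trajectory:
    """Follow the single most likely action at every step."""
    traj: Trajectory = ()
    while True:
        dist = policy(traj)
        if not dist:
            return traj
        traj += (max(dist, key=dist.get),)


def search_safety(policy: Policy, judge: Judge, min_prob: float) -> dict:
    """Depth-first enumeration of all trajectories with probability >= min_prob
    (the assumed form of the likelihood budget)."""
    safe = unsafe = 0.0
    unsafe_found: List[Trajectory] = []
    stack: List[Tuple[Trajectory, float]] = [((), 1.0)]
    while stack:
        prefix, p = stack.pop()
        dist = policy(prefix)
        if not dist:  # complete trajectory: judge it and record its mass
            if judge(prefix):
                safe += p
            else:
                unsafe += p
                unsafe_found.append(prefix)
            continue
        for action, q in dist.items():
            if p * q >= min_prob:  # branches below the budget are pruned
                stack.append((prefix + (action,), p * q))
    return {
        "safe": safe,                      # lower bound on P(agent stays safe)
        "unsafe": unsafe,                  # mass of unsafe trajectories found
        "unexplored": 1.0 - safe - unsafe,  # mass pruned by the budget
        "unsafe_trajectories": unsafe_found,
    }


if __name__ == "__main__":
    print("greedy rollout safe:", toy_judge(greedy_rollout(toy_policy)))
    print(search_safety(toy_policy, toy_judge, min_prob=0.005))
```

In this toy setup the greedy rollout is judged safe, while the search with `min_prob=0.005` finds two unsafe trajectories carrying about 0.018 of the probability mass and leaves about 0.0019 unexplored. The safe mass (about 0.9801) is therefore a lower bound on the safety score, and one minus the unsafe mass is an upper bound.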
