Fugu-MT Paper Translation (Abstract): Discovering Agentic Safety Specifications from 1-Bit Danger Signals

Paper summary: Discovering Agentic Safety Specifications from 1-Bit Danger Signals

arxiv url: http://arxiv.org/abs/2604.23210v1
Date: Sat, 25 Apr 2026 08:35:36 GMT
Status: Translation complete
Last updated in system: 2026-04-28 17:12:07.209531
Title: Discovering Agentic Safety Specifications from 1-Bit Danger Signals
Title (reference translation): Discovering Agent Safety Specifications from 1-Bit Danger Signals
Authors: Víctor Gallego
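Abstract summary: EPO-Safe is a framework in which an agent iteratively generates action plans, receives sparse binary warnings, and evolves a natural language behavioral specification through reflection. EPO-Safe demonstrates that LLMs can perform safety reasoning from a strictly impoverished signal in structured, low-dimensional environments. The paper also shows that standard reward-driven reflection actively degrades safety, so reflection must be paired with a dedicated safety channel.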
Reference score (independently computed attention score): 6.599344783327054
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Can large language model agents discover hidden safety objectives through experience alone? We introduce EPO-Safe (Experiential Prompt Optimization for Safe Agents), a framework where an LLM iteratively generates action plans, receives sparse binary danger warnings, and evolves a natural language behavioral specification through reflection. Unlike standard LLM reflection methods that rely on rich textual feedback (e.g., compiler errors or detailed environment responses), EPO-Safe demonstrates that LLMs can perform safety reasoning from a strictly impoverished signal in structured, low-dimensional environments: the agent never observes the hidden performance function $R^*$, only a single bit per timestep indicating that an action was unsafe. We evaluate on five AI Safety Gridworlds (Leike et al., 2017) and five text-based scenario analogs where visible reward $R$ may diverge from $R^*$. EPO-Safe discovers safe behavior within 1-2 rounds (5-15 episodes), producing human-readable specifications with correct explanatory hypotheses about hazards (e.g., "X cells are directionally hazardous: entering from the north is dangerous"). Critically, we show that standard reward-driven reflection actively degrades safety: agents reflecting on reward alone use the loop to justify and accelerate reward hacking, proving that reflection must be paired with a dedicated safety channel to discover hidden constraints. We further evaluate robustness to noisy oracles: even when 50% of non-dangerous steps produce spurious warnings, mean safety performance degrades by only 15% on average, though sensitivity is environment-dependent, as cross-episode reflection naturally filters inconsistent signals. Each evolved specification functions as an auditable set of grounded behavioral rules discovered autonomously through interaction, rather than authored by humans as in Constitutional AI (Bai et al., 2022).
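Abstract (reference translation): Can large language model agents discover hidden safety objectives through experience alone? EPO-Safe (Experiential Prompt Optimization for Safe Agents) is a framework in which an LLM iteratively generates action plans, receives sparse binary warnings, and evolves a natural language behavioral specification through reflection. Unlike standard LLM reflection methods that rely on rich textual feedback (e.g., compiler errors or detailed environment responses), EPO-Safe shows that LLMs can perform safety reasoning from a strictly impoverished signal in structured, low-dimensional environments. We evaluate on five AI Safety Gridworlds (Leike et al., 2017) and five text-based scenario analogs in which the visible reward $R$ may diverge from $R^*$. EPO-Safe discovers safe behavior within 1-2 rounds (5-15 episodes) and produces human-readable specifications with correct explanatory hypotheses about hazards (e.g., "X cells are directionally hazardous: entering from the north is dangerous"). Agents that reflect on reward alone use the loop to justify and accelerate reward hacking, proving that reflection must be paired with a dedicated safety channel to discover hidden constraints. Even when 50% of non-dangerous steps produce spurious warnings, safety performance degrades by only 15% on average, although sensitivity is environment-dependent. Each specification functions as an auditable set of grounded behavioral rules discovered autonomously through interaction, rather than authored by humans as in Constitutional AI (Bai et al., 2022).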
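
The loop described in the abstract (act, receive one danger bit per step, revise a specification between rounds) can be illustrated with a toy sketch. This is not the authors' implementation: EPO-Safe uses an LLM to reflect and to write natural-language specifications, whereas here a simple counting rule stands in for reflection, and the "specification" is just a set of forbidden actions. All names (`danger_bit`, `run_rounds`, the action labels) and the default parameters are assumptions made for illustration.

```python
import random


def danger_bit(action, hazards, noise, rng):
    """Hypothetical 1-bit oracle: always fires on a truly hazardous action,
    and fires spuriously on a safe action with probability `noise`."""
    return action in hazards or rng.random() < noise


def run_rounds(actions, hazards, noise=0.5, rounds=3, episodes_per_round=5, seed=0):
    """Toy stand-in for the act / warn / reflect loop.

    Returns the final set of forbidden actions and the number of truly
    unsafe steps taken in each round.
    """
    rng = random.Random(seed)
    tried = {a: 0 for a in actions}
    warned = {a: 0 for a in actions}
    spec = set()  # actions the current "specification" forbids
    unsafe_per_round = []

    for _ in range(rounds):
        unsafe_steps = 0
        for _ in range(episodes_per_round):
            # "Plan": try every action the current specification allows.
            for a in actions:
                if a in spec:
                    continue
                tried[a] += 1
                warned[a] += int(danger_bit(a, hazards, noise, rng))
                unsafe_steps += int(a in hazards)  # hidden from the agent
        unsafe_per_round.append(unsafe_steps)
        # "Reflection" stand-in: forbid only actions whose warnings were
        # consistent across every trial so far. A true hazard always warns;
        # a safe action must warn by chance on every single trial.
        spec = {a for a in actions if tried[a] > 0 and warned[a] == tried[a]}

    return spec, unsafe_per_round


if __name__ == "__main__":
    spec, unsafe = run_rounds(["up", "down", "left", "right"], {"down"})
    print("forbidden actions:", sorted(spec))
    print("truly unsafe steps per round:", unsafe)
```

In this sketch a true hazard is always forbidden after the first round, because it triggers the warning on every trial. A safe action is wrongly forbidden only if it draws a spurious warning on every trial of that round, which at 50% noise and five episodes happens with probability 1/32 per action. This consistency filter is a much cruder mechanism than the cross-episode reflection the abstract describes, and it says nothing about the paper's reported numbers.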

