Fugu-MT Paper Translation (Abstract): Detecting Safety Violations Across Many Agent Traces

Paper summary: Detecting Safety Violations Across Many Agent Traces

arxiv url: http://arxiv.org/abs/2604.11806v1
Date: Mon, 13 Apr 2026 17:59:40 GMT
Status: Translation complete
Last updated in system: 2026-04-14 20:13:16.74956
Title: Detecting Safety Violations Across Many Agent Traces
Title (reference translation): Detecting Safety Violations in Agent Traces
Authors: Adam Stein, Davis Brown, Hamed Hassani, Mayur Naik, Eric Wong
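Abstract summary: This paper introduces Meerkat, which combines clustering with agentic search to uncover violations specified in natural language. Across misuse, misalignment, and task gaming settings, Meerkat significantly improves detection of safety violations over baseline monitors.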
Reference score (independently computed attention metric): 41.40594315855062
License: http://creativecommons.org/licenses/by/4.0/
Abstract: To identify safety violations, auditors often search over large sets of agent traces. This search is difficult because failures are often rare, complex, and sometimes even adversarially hidden and only detectable when multiple traces are analyzed together. These challenges arise in diverse settings such as misuse campaigns, covert sabotage, reward hacking, and prompt injection. Existing approaches struggle here for several reasons. Per-trace judges miss failures that only become visible across traces, naive agentic auditing does not scale to large trace collections, and fixed monitors are brittle to unanticipated behaviors. We introduce Meerkat, which combines clustering with agentic search to uncover violations specified in natural language. Through structured search and adaptive investigation of promising regions, Meerkat finds sparse failures without relying on seed scenarios, fixed workflows, or exhaustive enumeration. Across misuse, misalignment, and task gaming settings, Meerkat significantly improves detection of safety violations over baseline monitors, discovers widespread developer cheating on a top agent benchmark, and finds nearly 4x more examples of reward hacking on CyBench than previous audits.
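Abstract (reference translation): To identify safety violations, auditors often search through large volumes of agent traces. This search is hard because failures are rare, complex, and sometimes adversarially hidden, detectable only when multiple traces are analyzed together. These challenges arise in a variety of settings, including misuse campaigns, covert sabotage, reward hacking, and prompt injection. Existing approaches struggle here for several reasons. Per-trace judges miss failures that are visible only across traces, naive agentic auditing does not scale to large trace collections, and fixed monitors are brittle to unanticipated behaviors. This paper introduces Meerkat, which combines clustering with agentic search to uncover violations specified in natural language. Through structured search and adaptive investigation of promising regions, Meerkat finds sparse failures without relying on seed scenarios, fixed workflows, or exhaustive enumeration. Across misuse, misalignment, and task gaming settings, Meerkat significantly improves detection of safety violations over baseline monitors, discovers widespread developer cheating on a top agent benchmark, and finds nearly 4x more examples of reward hacking on CyBench than previous audits.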

Related papers