Fugu-MT Paper Translation (Abstract): Oversight Has a Capacity: Calibrating Agent Guards to a Subjective, Fatiguing Human

Paper abstract: Oversight Has a Capacity: Calibrating Agent Guards to a Subjective, Fatiguing Human

arxiv url: http://arxiv.org/abs/2606.08919v1
Date: Mon, 08 Jun 2026 01:52:22 GMT
Status: translation complete
Last updated in system: 2026-06-09 14:42:06.563809
Title: Oversight Has a Capacity: Calibrating Agent Guards to a Subjective, Fatiguing Human
Authors: Emre Turan
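Abstract summary: LLM agents take irreversible actions, so the standard safety pattern is a human-in-the-loop approval gate. The gate is the easy part; the hard part is the judgment of which actions to stop, which the field evaluates against two false assumptions: that "risky" has a ground-truth definition, and that the human reviewer is a perfect, always-available oracle. The paper's contribution is an open-source agent-oversight system that operationalizes and measures known deferral and reviewer-fatigue mechanisms in the LLM-agent action-gating setting.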
Reference score (independently computed attention score): 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: As LLM agents begin to take real, irreversible actions (shell commands, file edits, deploys), the standard safety pattern is a human-in-the-loop approval gate: risky actions pause and wait for a person. We argue the gate is the easy part; the hard part is the judgment - which actions to stop - which the field evaluates against two false assumptions: that there is a ground-truth notion of "risky," and that the human reviewer is a perfect, infinitely-available oracle. On a hand-labeled set of 125 adversarially-weighted agent actions we show that (i) reviewers only moderately agree on what is risky (Fleiss' kappa = 0.52), so there is no single correct label; (ii) framing the guard as selective classification under asymmetric cost makes its operating limits measurable, and on hard inputs the guard cannot safely auto-decide; and (iii) when the reviewer is modeled as endogenous (fatiguing as escalation load grows), realized safety becomes an inverted-U in the escalation rate: more human oversight can make a system less safe, and the safety-optimal guard escalates below full escalation - a setting a load-aware policy also uses to resist a flooding attack that slips a malicious action past a fatigued reviewer. Agent oversight, framed this way, is not only a classification problem but a resource-allocation one: human attention is finite, and the guard's escalation policy spends it. We claim none of these mechanisms as novel - fatigue-aware learning-to-defer (FALCON), cost-sensitive deferral under workload constraints (DeCCaF), trajectory-level guarding, and reviewer-fatigue/flooding attacks are all prior art we cite. Our contribution is an open-source agent-oversight system that operationalizes and measures them in the LLM-agent action-gating setting, turning "is my guard good?" from a guess into a curve. The inverted-U and the flooding attack are modeling results that motivate a human study.
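The abstract reports inter-reviewer agreement as Fleiss' kappa = 0.52. As a reminder of what that statistic measures, here is a minimal sketch of the standard Fleiss' kappa formula. It is not the paper's code, and the toy ratings at the bottom are invented for illustration; they are not the 125-action dataset.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a table where counts[i][j] is the number of
    reviewers who assigned item i to category j.

    Every item must have the same total number of ratings.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("need at least two ratings per item")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number of ratings")
    n_categories = len(counts[0])

    # Observed agreement: fraction of agreeing reviewer pairs per item, averaged.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    observed = sum(per_item) / n_items

    # Chance agreement from the overall share of each category.
    shares = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    expected = sum(s * s for s in shares)

    if expected == 1.0:
        # Only one category was ever used; kappa is undefined, report 1.0 by convention here.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical toy data: 4 actions, 3 reviewers, categories = (safe, risky).
toy = [[3, 0], [2, 1], [1, 2], [0, 3]]
kappa = fleiss_kappa(toy)
```

A value of 1 means reviewers always agree and 0 means agreement is no better than chance, so 0.52 sits in the range usually described as moderate agreement.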
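Abstract (reference translation): As LLM agents begin to take real, irreversible actions (shell commands, file edits, deploys), the standard safety pattern is a human-in-the-loop approval gate: risky actions pause and wait for a person. The gate is the easy part. The hard part is the judgment of which actions to stop, which the field evaluates against two false assumptions: that there is a ground-truth notion of "risky," and that the human reviewer is a perfect, infinitely available oracle. On a hand-labeled set of 125 adversarially weighted agent actions: (i) reviewers only moderately agree on what is risky (Fleiss' kappa = 0.52), so there is no single correct label; (ii) framing the guard as selective classification under asymmetric cost makes its operating limits measurable, and on hard inputs the guard cannot safely auto-decide; (iii) when the reviewer is modeled as endogenous (fatiguing as escalation load grows), realized safety becomes an inverted-U in the escalation rate, so more human oversight can make a system less safe, and the safety-optimal guard escalates below full escalation, a setting that a load-aware policy also uses to resist a flooding attack that slips a malicious action past a fatigued reviewer. Framed this way, agent oversight is not only a classification problem but also a resource-allocation problem: human attention is finite, and the guard's escalation policy spends it. The authors claim none of these mechanisms as novel; fatigue-aware learning-to-defer (FALCON), cost-sensitive deferral under workload constraints (DeCCaF), trajectory-level guarding, and reviewer-fatigue/flooding attacks are all cited as prior art. The contribution is an open-source agent-oversight system that operationalizes and measures them in the LLM-agent action-gating setting, turning "is my guard good?" from a guess into a curve. The inverted-U and the flooding attack are modeling results that motivate a human study.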
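The inverted-U claim can be made concrete with a toy model. The sketch below is an illustration under invented assumptions, not the paper's model: reviewer accuracy is assumed to fall linearly with the escalation rate, the guard's own accuracy on auto-decided actions is assumed constant, and all names and parameter values (`fresh`, `fatigue`, `guard_accuracy`) are hypothetical.

```python
def reviewer_accuracy(rate: float, fresh: float = 0.98, fatigue: float = 0.35) -> float:
    """Assumed reviewer accuracy: starts at `fresh` and drops linearly
    as the fraction of escalated actions (`rate`) grows."""
    return fresh - fatigue * rate


def realized_safety(rate: float, guard_accuracy: float = 0.70) -> float:
    """Share of actions handled correctly when a fraction `rate` goes to the
    (fatiguing) human and the rest is auto-decided by the guard."""
    return rate * reviewer_accuracy(rate) + (1.0 - rate) * guard_accuracy


# Sweep escalation rates from 0% to 100% and pick the safest one.
rates = [i / 100 for i in range(101)]
best_rate = max(rates, key=realized_safety)

for rate in (0.0, best_rate, 1.0):
    print(f"escalation rate {rate:.2f}: realized safety {realized_safety(rate):.3f}")
```

With these made-up numbers the curve peaks at an interior escalation rate and full escalation scores below no escalation at all, which is the qualitative shape the abstract describes. The actual curve depends entirely on how fatigue is modeled, which is why the abstract frames the result as motivation for a human study.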
