Fugu-MT paper translation (abstract): Human-Guided Harm Recovery for Computer Use Agents

Paper overview: Human-Guided Harm Recovery for Computer Use Agents

arxiv url: http://arxiv.org/abs/2604.18847v1
Date: Mon, 20 Apr 2026 21:12:40 GMT
Status: translation complete
Last updated in system: 2026-04-22 22:41:49.501804
Title: Human-Guided Harm Recovery for Computer Use Agents
Title (reference translation): Human-Guided Harm Recovery for Computer Use Agents
Authors: Christy Li, Sky CH-Wang, Andi Peng, Andreea Bobu
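Abstract summary: LM agents are gaining the ability to execute actions on real computer systems. We need ways not only to prevent harmful actions at scale but also to effectively remediate harm when prevention fails. We formalize a solution to this neglected challenge in post-execution safeguards as harm recovery.
Reference score (independently computed attention score): 7.834133575906748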
License: http://creativecommons.org/licenses/by/4.0/
Abstract: As LM agents gain the ability to execute actions on real computer systems, we need ways to not only prevent harmful actions at scale but also effectively remediate harm when prevention fails. We formalize a solution to this neglected challenge in post-execution safeguards as harm recovery: the problem of optimally steering an agent from a harmful state back to a safe one in alignment with human preferences. We ground preference-aligned recovery through a formative user study that identifies valued recovery dimensions and produces a natural language rubric. Our dataset of 1,150 pairwise judgments reveals context-dependent shifts in attribute importance, such as preferences for pragmatic, targeted strategies over comprehensive long-term approaches. We operationalize these learned insights in a reward model, re-ranking multiple candidate recovery plans generated by an agent scaffold at test time. To evaluate recovery capabilities systematically, we introduce BackBench, a benchmark of 50 computer-use tasks that test an agent's ability to recover from harmful states. Human evaluation shows our reward model scaffold yields higher-quality recovery trajectories than base agents and rubric-based scaffolds. Together, these contributions lay the foundation for a new class of agent safety methods -- ones that confront harm not only by preventing it, but by navigating its aftermath with alignment and intent.
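Abstract (reference translation): As LM agents gain the ability to execute actions on real computer systems, we need methods that not only prevent harmful actions at scale but also effectively remediate harm when prevention fails. We formalize a solution to this neglected challenge as harm recovery: the problem of optimally steering an agent from a harmful state back to a safe one in line with human preferences. We ground preference-aligned recovery in a formative user study that identifies the recovery dimensions people value and produces a natural language rubric. A dataset of 1,150 pairwise judgments reveals context-dependent shifts in attribute importance, such as a preference for pragmatic, targeted strategies over comprehensive long-term approaches. We operationalize these insights in a reward model that re-ranks multiple candidate recovery plans generated by an agent scaffold at test time. To evaluate recovery capabilities systematically, we introduce BackBench, a benchmark of 50 computer-use tasks that test an agent's ability to recover from harmful states. Human evaluation shows that our reward model scaffold yields higher-quality recovery trajectories than base agents and rubric-based scaffolds. Together, these contributions lay the foundation for a new class of agent safety methods that confront harm not only by preventing it but also by navigating its aftermath with alignment and intent.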
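
Illustrative sketch (not from the paper): the abstract describes re-ranking multiple candidate recovery plans with a reward model at test time. The Python below shows only the general shape of that step. The names rerank_plans and toy_reward and the example plans are hypothetical, and the scoring rule is a stand-in chosen for illustration, not the authors' learned reward model.

```python
from typing import Callable, List, Tuple


def rerank_plans(
    plans: List[str],
    reward_fn: Callable[[str], float],
) -> List[Tuple[float, str]]:
    """Score each candidate plan and return (score, plan) pairs, best first."""
    scored = [(reward_fn(plan), plan) for plan in plans]
    # sorted() is stable, so tied plans keep their generation order.
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def toy_reward(plan: str) -> float:
    """Hypothetical stand-in for a learned reward model (an assumption).

    It simply favors shorter plans, loosely echoing the abstract's finding
    that people preferred pragmatic, targeted strategies.
    """
    return -float(len(plan.split()))


# Hypothetical candidate recovery plans, as an agent scaffold might propose.
candidates = [
    "Restore the deleted file from the trash",
    "Reinstall the operating system and rebuild every user account from backups",
    "Restore the file, then notify the user about it",
]

ranked = rerank_plans(candidates, toy_reward)
best_score, best_plan = ranked[0]
print(best_plan)
```

In a real system the reward function would be a model trained on human preference data such as the pairwise judgments the abstract mentions; only the "generate several plans, score them, keep the best" structure is meant to carry over.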

Related papers