Fugu-MT Paper Translation (Abstract): Hardening Agent Benchmarks with Adversarial Hacker-Fixer Loops

Paper summary: Hardening Agent Benchmarks with Adversarial Hacker-Fixer Loops

arxiv url: http://arxiv.org/abs/2606.08960v1
Date: Mon, 08 Jun 2026 03:00:56 GMT
Status: Translation complete
Last updated in system: 2026-06-09 14:42:06.668452
Title: Hardening Agent Benchmarks with Adversarial Hacker-Fixer Loops
Authors: Ziqian Zhong, Ivgeni Segal, Ivan Bercovich, Shashwat Saxena, Kexun Zhang, Aditi Raghunathan
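Abstract summary: Agent benchmarks score submissions with outcome verifiers that are typically hand-written and brittle, leaving them open to reward hacking. The authors audit 1,968 tasks across five terminal-agent benchmarks and find 323 (16%) hackable by frontier models. They introduce the hacker-fixer loop, a method for building exploit-resistant verifiers without per-task manual patching.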
Reference score (independently computed attention score): 30.90132709192538
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Agent benchmarks score submissions with outcome verifiers that are typically hand-written and brittle, leaving them open to reward hacking. We audit 1,968 tasks across five terminal-agent benchmarks and find 323 (16%) hackable by frontier models given only the task description. This corrupts both leaderboard rankings and RL training signal, yet the standard response is manual and reactive. We introduce the hacker-fixer loop, a method for building exploit-resistant verifiers without per-task manual patching. The loop alternates three LLM agents: a hacker tries to pass the verifier without solving the task, a fixer patches the verifier to reject each discovered exploit, and a solver confirms the patched verifier still admits legitimate solutions. The loop iterates: each patch reshapes what the verifier rewards, surfacing the next exploit. We further add verifier access, and let patches transfer across tasks, to broaden the exploits the loop discovers. On KernelBench, the loop drives the attack success rate from 62% to 0% on a held-out corpus of publicly reported exploits. We also find that weaker agents in the loop can defend against much stronger hackers: Gemini 3 Flash's loop drives the stronger Gemini 3.1 Pro and Claude Opus 4.7's attack success rate from 76% and 61% to 0% on KernelBench, and Gemini 3.1 Pro's from 39% to 17% on Terminal Bench across 77 tasks. We release Terminal Wrench (323 hackable environments, 3,632 hack trajectories) as a snapshot of the current attack surface, our patched verifiers, the exploits the loop discovered, and our implementation as a basis for future work.
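The alternation described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the three LLM agents are replaced by hypothetical deterministic stubs (`toy_hacker`, `toy_fixer`, `toy_solver`), the task (return `[3, 1, 2]` sorted) and the check pool `EXTRA_CHECKS` are invented for illustration, and verifier access and cross-task patch transfer are not modeled. It only shows the control flow: find an exploit, patch the verifier, confirm a legitimate solution still passes, repeat.

```python
from typing import Callable, List, Optional, Tuple

Submission = List[int]
Verifier = Callable[[Submission], bool]


def hacker_fixer_loop(
    verifier: Verifier,
    hacker: Callable[[Verifier], Optional[Submission]],
    fixer: Callable[[Verifier, Submission], Optional[Verifier]],
    solver: Callable[[], Submission],
    max_rounds: int = 10,
) -> Tuple[Verifier, List[Submission]]:
    """Alternate hacker, fixer, and solver until no exploit is found."""
    exploits: List[Submission] = []
    for _ in range(max_rounds):
        exploit = hacker(verifier)  # try to pass without solving the task
        if exploit is None:
            break  # hacker found nothing against the current verifier
        patched = fixer(verifier, exploit)
        # Keep a patch only if it rejects the exploit AND still admits
        # the solver's legitimate solution; otherwise stop with the old one.
        if patched is None or patched(exploit) or not patched(solver()):
            break
        exploits.append(exploit)
        verifier = patched  # the patched verifier faces the next round
    return verifier, exploits


# --- Hypothetical toy stubs standing in for the LLM agents ---
REFERENCE = [1, 2, 3]            # legitimate answer: [3, 1, 2] sorted
CHEATS = [[0, 0, 0], [3, 1, 2]]  # shortcuts the toy hacker knows
EXTRA_CHECKS = [
    lambda s: sorted(s) == REFERENCE,  # same elements as the reference
    lambda s: s == sorted(s),          # elements are in ascending order
]


def brittle_verifier(s: Submission) -> bool:
    return len(s) == 3  # only checks the output length


def toy_hacker(verifier: Verifier) -> Optional[Submission]:
    return next((c for c in CHEATS if verifier(c)), None)


def toy_fixer(verifier: Verifier, exploit: Submission) -> Optional[Verifier]:
    for check in EXTRA_CHECKS:
        if not check(exploit):  # this check would have caught the exploit
            return lambda s, v=verifier, c=check: v(s) and c(s)
    return None


def toy_solver() -> Submission:
    return list(REFERENCE)


hardened, found = hacker_fixer_loop(
    brittle_verifier, toy_hacker, toy_fixer, toy_solver
)
print(found)  # [[0, 0, 0], [3, 1, 2]]
```

In this toy run the first patch (an element check) blocks `[0, 0, 0]` but leaves the unsorted input `[3, 1, 2]` passing, so the second round surfaces it and adds an ordering check. That mirrors, in miniature, the abstract's point that each patch reshapes what the verifier rewards.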
