Fugu-MT Paper Translation (Abstract): Goal-Autopilot: A Verifiable Anti-Fabrication Firewall for Unattended Long-Horizon Agents

Paper overview: Goal-Autopilot: A Verifiable Anti-Fabrication Firewall for Unattended Long-Horizon Agents

arxiv url: http://arxiv.org/abs/2606.11688v1
Date: Wed, 10 Jun 2026 06:01:23 GMT
Status: Translation complete
Last updated in system: 2026-06-11 16:42:38.318617
Title: Goal-Autopilot: A Verifiable Anti-Fabrication Firewall for Unattended Long-Horizon Agents
Authors: Youwang Deng
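Abstract summary: Autopilot is an execution model that makes silent fabricated success structurally impossible. A hard floor forbids any terminal "done" claim whose falsifiable gate did not actually execute and pass. On SWE-bench Lite, the firewall reduces fabrication from 33.7% (StateFlow) to 0.67%, a paired difference of $-33.07$ pp.
Reference score (independently computed attention score): 0.0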
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Long-horizon LLM agents are not trusted to run unattended: with no human watching, they confidently report success they never verified. We treat honesty -- bounding what an agent may claim at termination -- as a first-class metric for unattended autonomy, distinct from capability. We present Autopilot, an execution model that makes silent fabricated success structurally impossible rather than merely rarer. Autopilot externalizes all working state into a durable, gated finite-state machine that a scheduler advances one stateless tick at a time; a hard floor forbids any terminal "done" claim whose falsifiable gate did not actually execute and pass. We prove a No-False-Success theorem -- under gate soundness, floor enforcement, and plan coverage, termination implies the goal holds -- whose only trust points are empirically measurable, and show the worst case degrades to an honest stall, never a fabricated success. Because each tick rehydrates only the state machine, per-step context cost is constant in the horizon. Across a 3,150-cell paired corpus (70 tasks $\times$ 3 systems $\times$ 3 models $\times$ 5 seeds, including 50 SWE-bench Lite tasks across 11 OSS repos), Autopilot fabricates on 0.95% of cells [95% CI 0.38--1.62] while Reflexion and StateFlow baselines fabricate on 8.10% [6.48--9.81] and 25.05% [22.48--27.62] respectively. The headline contrast lives in the hard regime: on SWE-bench Lite, the firewall reduces fabrication from 33.7% (StateFlow) to 0.67%, a paired difference of $-33.07$ pp [95% CI $-36.53, -29.73$]. The mechanism is the gate, not the model: all ten Autopilot fabrications come from the strongest model, while two weaker mid-tier models never fabricate across 700 paired cells. The firewall trades coverage for honesty by design -- an honest stall is recoverable; a confident wrong output shipped downstream is not.
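
The gated state machine and hard floor described in the abstract can be illustrated with a minimal Python sketch. This is not the paper's implementation: the names `Step`, `Machine`, `tick`, `run`, and the `max_attempts` retry limit are hypothetical, and the gates are plain callables standing in for whatever falsifiable checks a real system would run. The sketch assumes that all working state lives in a serialized JSON string, that each tick rehydrates only that string, and that a terminal "done" is reachable only when every gate has actually executed and passed; otherwise the run ends in an honest stall.

```python
import json
from dataclasses import asdict, dataclass
from typing import Callable, Dict, List


@dataclass
class Step:
    name: str
    status: str = "pending"  # pending | passed | stalled
    attempts: int = 0


@dataclass
class Machine:
    steps: List[Step]
    terminal: str = "running"  # running | done | stalled


def tick(state_json: str, gates: Dict[str, Callable[[], bool]],
         max_attempts: int = 2) -> str:
    """One stateless tick: rehydrate the state, run one gate, persist the result."""
    raw = json.loads(state_json)
    m = Machine(steps=[Step(**s) for s in raw["steps"]], terminal=raw["terminal"])
    if m.terminal != "running":
        return state_json  # terminal states are never rewritten

    pending = next((s for s in m.steps if s.status != "passed"), None)
    if pending is not None:
        pending.attempts += 1
        try:
            # A missing gate or a raising gate counts as a failure, never a pass.
            ok = bool(gates[pending.name]())
        except Exception:
            ok = False
        if ok:
            pending.status = "passed"
        elif pending.attempts >= max_attempts:
            pending.status = "stalled"
            m.terminal = "stalled"  # honest stall instead of a claimed success

    # Hard floor (hypothetical form): "done" requires a non-empty plan in which
    # every step's gate is recorded as executed and passed.
    if m.terminal == "running" and m.steps and all(s.status == "passed" for s in m.steps):
        m.terminal = "done"
    return json.dumps(asdict(m))


def run(step_names: List[str], gates: Dict[str, Callable[[], bool]],
        max_ticks: int = 10) -> str:
    """Scheduler loop: advance one tick at a time until a terminal state or the tick budget."""
    state = json.dumps(asdict(Machine(steps=[Step(n) for n in step_names])))
    for _ in range(max_ticks):
        state = tick(state, gates)
        if json.loads(state)["terminal"] != "running":
            break
    return json.loads(state)["terminal"]
```

In this sketch, `run(["patch", "tests"], {"patch": lambda: True, "tests": lambda: False})` ends in `"stalled"` rather than `"done"`, which mirrors the abstract's point that the worst case degrades to a stall. Because `tick` receives only the serialized state, the context passed per step does not grow with the number of earlier ticks.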
