Fugu-MT paper translation (abstract): Step-level Optimization for Efficient Computer-use Agents

Paper overview: Step-level Optimization for Efficient Computer-use Agents

arxiv url: http://arxiv.org/abs/2604.27151v1
Date: Wed, 29 Apr 2026 19:59:36 GMT
Status: translation complete
Last updated in system: 2026-05-01 16:31:53.784858
Title: Step-level Optimization for Efficient Computer-use Agents
Authors: Jinbiao Wei, Kangqi Ni, Yilun Zhao, Guo Gan, Arman Cohan
Abstract summary: The authors argue that strong computer-use agents remain expensive and slow in practice. The paper proposes an event-driven, step-level cascade for computer-use agents.
Reference score (independently computed attention score): 51.29573359027217
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Computer-use agents provide a promising path toward general software automation because they can interact directly with arbitrary graphical user interfaces instead of relying on brittle, application-specific integrations. Despite recent advances in benchmark performance, strong computer-use agents remain expensive and slow in practice, since most systems invoke large multimodal models at nearly every interaction step. We argue that this uniform allocation of compute is fundamentally inefficient for long-horizon GUI tasks. Such trajectories are highly heterogeneous: many steps are routine and can be handled reliably by smaller, cheaper policies, while errors tend to concentrate at a relatively small number of high-risk moments. Across computer-use benchmarks, these failures repeatedly take two forms: progress stalls, where the agent loops, repeats ineffective actions, or fails to make meaningful progress, and silent semantic drift, where the agent continues taking locally plausible actions after already deviating from the user's true goal. To address this inefficiency, we propose an event-driven, step-level cascade for computer-use agents that runs a small policy by default and escalates to a stronger model only when lightweight learned monitors detect elevated risk. Our framework combines two complementary signals: a Stuck Monitor that detects degraded progress from recent reasoning-action history and triggers recovery, and a Milestone Monitor that identifies semantically meaningful checkpoints where sparse verification is most informative for catching drift. This design turns always-on frontier-model inference into adaptive, on-demand compute allocation over the course of an evolving interaction. The framework is modular and deployment-oriented: it can be layered on top of existing computer-use agents without changing the underlying agent architecture or retraining the large model.
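The cascade described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class `StepCascade`, the `Policy` and `Monitor` type aliases, and the two toy monitors are hypothetical names introduced here, and the hand-written rules stand in for the lightweight learned monitors the paper describes. As a simplification, both monitors route the step to the stronger model, whereas the abstract distinguishes recovery (Stuck Monitor) from sparse verification (Milestone Monitor).

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical interfaces, assumed for illustration only.
Policy = Callable[[str, List[str]], str]  # (observation, action history) -> action
Monitor = Callable[[List[str]], bool]     # action history -> True if risk is elevated


@dataclass
class StepCascade:
    small_policy: Policy
    large_policy: Policy
    stuck_monitor: Monitor
    milestone_monitor: Monitor
    history: List[str] = field(default_factory=list)
    large_calls: int = 0

    def step(self, observation: str) -> str:
        # Run the small policy by default; escalate only when a monitor fires.
        escalate = self.stuck_monitor(self.history) or self.milestone_monitor(self.history)
        if escalate:
            self.large_calls += 1
            action = self.large_policy(observation, self.history)
        else:
            action = self.small_policy(observation, self.history)
        self.history.append(action)
        return action


# Toy stand-ins for learned monitors (assumptions, not the paper's models).
def repeated_last_three(history: List[str]) -> bool:
    # "Stuck": the last three actions are identical.
    return len(history) >= 3 and len(set(history[-3:])) == 1


def after_submit(history: List[str]) -> bool:
    # "Milestone": the previous action looks like a checkpoint worth verifying.
    return bool(history) and history[-1].startswith("submit")


def small(observation: str, history: List[str]) -> str:
    return "click_next"  # cheap policy that happens to loop


def large(observation: str, history: List[str]) -> str:
    return "replan"  # stronger model breaks the loop


cascade = StepCascade(small, large, repeated_last_three, after_submit)
actions = [cascade.step(f"screen_{i}") for i in range(5)]
print(actions, cascade.large_calls)
```

In this toy run the stronger policy is called once in five steps, after the small policy repeats the same action three times. The cascade wraps the two policies as plain callables, which mirrors the abstract's claim that the framework can be layered on existing agents without changing their architecture.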

Related papers