Fugu-MT Paper Translation (Abstract): Video-Based Reward Modeling for Computer-Use Agents

Paper overview: Video-Based Reward Modeling for Computer-Use Agents

arxiv url: http://arxiv.org/abs/2603.10178v1
Date: Tue, 10 Mar 2026 19:17:22 GMT
Status: Translation complete
Last updated in system: 2026-03-12 16:22:32.663514
Title: Video-Based Reward Modeling for Computer-Use Agents
Authors: Linxin Song, Jieyu Zhang, Huanxin Sheng, Taiwei Shi, Gupta Rahul, Yang Liu, Ranjay Krishna, Jian Kang, Jieyu Zhao
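Abstract summary: This work studies reward modeling from execution video, a sequence of keyframes from an agent trajectory that is independent of the agent's internal reasoning or actions. It introduces Execution Video Reward 53k (ExeVR-53k), a dataset of 53k high-quality video-task-reward triplets. Building on these components, the Execution Video Reward Model (ExeVRM) needs only a user instruction and a video-execution sequence to predict task success.
Reference score (independently computed attention metric): 40.27314571412647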
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Computer-using agents (CUAs) are becoming increasingly capable; however, it remains difficult to scale evaluation of whether a trajectory truly fulfills a user instruction. In this work, we study reward modeling from execution video: a sequence of keyframes from an agent trajectory that is independent of the agent's internal reasoning or actions. Although video-execution modeling is method-agnostic, it presents key challenges, including highly redundant layouts and subtle, localized cues that determine success. We introduce Execution Video Reward 53k (ExeVR-53k), a dataset of 53k high-quality video-task-reward triplets. We further propose adversarial instruction translation to synthesize negative samples with step-level annotations. To enable learning from long, high-resolution execution videos, we design spatiotemporal token pruning, which removes homogeneous regions and persistent tokens while preserving decisive UI changes. Building on these components, we fine-tune an Execution Video Reward Model (ExeVRM) that takes only a user instruction and a video-execution sequence to predict task success. Our ExeVRM 8B achieves 84.7% accuracy and 87.7% recall on video-execution assessment, outperforming strong proprietary models such as GPT-5.2 and Gemini-3 Pro across Ubuntu, macOS, Windows, and Android, while providing more precise temporal attribution. These results show that video-execution reward modeling can serve as a scalable, model-agnostic evaluator for CUAs.
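The abstract describes spatiotemporal token pruning only at a high level (dropping homogeneous regions and persistent tokens while keeping decisive UI changes). The following is a minimal illustrative sketch of that general idea, not the authors' implementation. The function name `prune_tokens`, the patch representation (lists of pixel intensities in [0, 1]), the variance test for homogeneity, the previous-frame comparison for persistence, and both thresholds are assumptions made for this example.

```python
from statistics import pvariance


def prune_tokens(frames, var_threshold=1e-3, diff_threshold=1e-3):
    """Return (frame_index, row, col) for every patch token that is kept.

    Hypothetical input format: `frames` is a list of keyframes, each keyframe
    is a grid (list of rows) of patches, and each patch is a list of pixel
    intensities in [0, 1]. All frames are assumed to share one grid shape.
    """
    kept = []
    previous = None
    for t, frame in enumerate(frames):
        for r, row in enumerate(frame):
            for c, patch in enumerate(row):
                # Spatial rule (assumed): drop near-uniform patches such as
                # blank backgrounds, which carry little information.
                if pvariance(patch) < var_threshold:
                    continue
                # Temporal rule (assumed): drop patches that are unchanged
                # relative to the same position in the previous keyframe.
                if previous is not None:
                    old = previous[r][c]
                    change = sum(abs(a - b) for a, b in zip(patch, old)) / len(patch)
                    if change < diff_threshold:
                        continue
                kept.append((t, r, c))
        previous = frame
    return kept


# Toy example: a static icon and a dialog that appears in the second keyframe.
flat = [0.5, 0.5, 0.5, 0.5]      # homogeneous background patch
icon = [0.0, 1.0, 0.0, 1.0]      # textured patch that never changes
dialog = [0.2, 0.9, 0.4, 0.7]    # new UI content in the second keyframe

frame0 = [[flat, icon], [flat, flat]]
frame1 = [[flat, icon], [flat, dialog]]

print(prune_tokens([frame0, frame1]))  # [(0, 0, 1), (1, 1, 1)]
```

In this toy input, the icon is kept once in the first keyframe and dropped as persistent in the second, while the newly appearing dialog patch survives. A real system would operate on vision-encoder tokens rather than raw pixel lists, and the paper may define homogeneity and persistence differently.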