Fugu-MT Paper Translation (Abstract): Video2GUI: Synthesizing Large-Scale Interaction Trajectories for Generalized GUI Agent Pretraining

Paper overview: Video2GUI: Synthesizing Large-Scale Interaction Trajectories for Generalized GUI Agent Pretraining

arxiv url: http://arxiv.org/abs/2605.14747v1
Date: Thu, 14 May 2026 12:14:24 GMT
Status: Translation complete
System update date: 2026-05-15 21:45:34.810691
Title: Video2GUI: Synthesizing Large-Scale Interaction Trajectories for Generalized GUI Agent Pretraining
Title (reference translation): Video2GUI: Synthesis of Large-Scale Interaction Trajectories for General-Purpose GUI Agent Pretraining
Authors: Weimin Xiong, Shuhao Gu, Bowen Ye, Zihao Yue, Lei Li, Feifan Song, Sujian Li, Hao Tian
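Abstract summary: Video2GUI is a framework that extracts GUI interaction trajectories directly from unlabeled Internet videos. The authors construct WildGUI, a large-scale dataset containing 12 million interaction trajectories spanning over 1,500 applications and websites. They will release both the WildGUI dataset and the Video2GUI pipeline to support future research on GUI agents.
Reference score (independently computed attention measure): 29.391227440359287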
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent advances in multimodal large language models have driven growing interest in graphical user interface (GUI) agents, yet their generalization remains constrained by the scarcity of large-scale training data spanning diverse real-world applications. Existing datasets rely heavily on costly manual annotations and are typically confined to narrow domains. To address this challenge, we propose Video2GUI, a fully automated framework that extracts grounded GUI interaction trajectories directly from unlabeled Internet videos. Video2GUI employs a coarse-to-fine filtering strategy to identify high-quality GUI tutorial videos and convert them into structured agent trajectories. Applying this pipeline to 500 million video metadata entries, we construct WildGUI, a large-scale dataset containing 12 million interaction trajectories spanning over 1,500 applications and websites. Pre-training Qwen2.5-VL and Mimo-VL on WildGUI yields consistent improvements of 5-20% across multiple GUI grounding and action benchmarks, matching or surpassing state-of-the-art performance. We will release both the WildGUI dataset and the Video2GUI pipeline to support future research of GUI agents.
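Abstract (reference translation): Recent advances in multimodal large language models have fueled growing interest in graphical user interface (GUI) agents, but their generalization is still constrained by a shortage of large-scale training data spanning diverse real-world applications. Existing datasets depend heavily on costly manual annotation and are usually limited to narrow domains. To address this challenge, we propose Video2GUI, a fully automated framework that extracts grounded GUI interaction trajectories directly from unlabeled Internet videos. Video2GUI uses a coarse-to-fine filtering strategy to identify high-quality GUI tutorial videos and converts them into structured agent trajectories. By applying this pipeline to 500 million video metadata entries, we build WildGUI, a large-scale dataset containing 12 million interaction trajectories across more than 1,500 applications and websites. Pre-training Qwen2.5-VL and Mimo-VL on WildGUI yields consistent improvements of 5-20% on multiple GUI grounding and action benchmarks, matching or exceeding state-of-the-art performance. We will release both the WildGUI dataset and the Video2GUI pipeline to support future research on GUI agents.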

Related papers