Fugu-MT Paper Translation (Abstract): ShowUI-$π$: Flow-based Generative Models as GUI Dexterous Hands

Paper overview: ShowUI-$π$: Flow-based Generative Models as GUI Dexterous Hands

arxiv url: http://arxiv.org/abs/2512.24965v1
Date: Wed, 31 Dec 2025 16:51:14 GMT
Status: Translation complete
Last updated in system: 2026-01-01 23:27:28.713744
Title: ShowUI-$π$: Flow-based Generative Models as GUI Dexterous Hands
Title (reference translation): ShowUI-$π$: Flow-based Generative Models as GUI Dexterous Hands
Authors: Siyuan Hu, Kevin Qinghong Lin, Mike Zheng Shou
Abstract summary: We develop ShowUI-$π$, the first flow-based generative model as a GUI dexterous hand. ShowUI-$π$ achieves 26.98 with only 450M parameters.
Reference score (attention score computed by this site): 59.222064425122795
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Building intelligent agents capable of dexterous manipulation is essential for achieving human-like automation in both robotics and digital environments. However, existing GUI agents rely on discrete click predictions (x,y), which prohibit free-form, closed-loop trajectories (e.g., dragging a progress bar) that require continuous, on-the-fly perception and adjustment. In this work, we develop ShowUI-$π$, the first flow-based generative model as a GUI dexterous hand, featuring the following designs: (i) Unified Discrete-Continuous Actions, integrating discrete clicks and continuous drags within a shared model, enabling flexible adaptation across diverse interaction modes; (ii) Flow-based Action Generation for drag modeling, which predicts incremental cursor adjustments from continuous visual observations via a lightweight action expert, ensuring smooth and stable trajectories; (iii) Drag Training Data and Benchmark, where we manually collect and synthesize 20K drag trajectories across five domains (e.g., PowerPoint, Adobe Premiere Pro), and introduce ScreenDrag, a benchmark with comprehensive online and offline evaluation protocols for assessing GUI agents' drag capabilities. Our experiments show that proprietary GUI agents still struggle on ScreenDrag (e.g., Operator scores 13.27, and the best, Gemini-2.5-CUA, reaches 22.18). In contrast, ShowUI-$π$ achieves 26.98 with only 450M parameters, underscoring both the difficulty of the task and the effectiveness of our approach. We hope this work advances GUI agents toward human-like dexterous control in the digital world. The code is available at https://github.com/showlab/showui-pi.
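Abstract (reference translation): To achieve human-like automation in both robotics and digital environments, it is essential to build intelligent agents capable of dexterous manipulation. Existing GUI agents, however, depend on discrete click predictions (x,y), which rules out free-form, closed-loop trajectories (such as dragging a progress bar) that require continuous, on-the-fly perception and adjustment. In this paper, we develop ShowUI-$π$, the first flow-based generative model as a GUI dexterous hand. (i) Unified Discrete-Continuous Actions integrate discrete clicks and continuous drags into a shared model, enabling flexible adaptation across diverse interaction modes. (ii) Flow-based Action Generation for drag modeling predicts incremental cursor adjustments from continuous visual observations via a lightweight action expert, ensuring smooth and stable trajectories. (iii) Drag Training Data and Benchmark: we manually collect and synthesize 20K drag trajectories across five domains (PowerPoint, Adobe Premiere Pro, and others) and introduce ScreenDrag, a benchmark with comprehensive online and offline evaluation protocols for assessing the drag capabilities of GUI agents. Experiments show that proprietary GUI agents still struggle on ScreenDrag (for example, Operator scores 13.27 and Gemini-2.5-CUA scores 22.18). By contrast, ShowUI-$π$ achieves 26.98 with only 450M parameters, underscoring both the difficulty of the task and the effectiveness of the approach. We hope this work advances GUI agents toward human-like dexterous control in the digital world. The code is available at https://github.com/showlab/showui-pi.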
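
The abstract describes flow-based action generation only at a high level: an action expert predicts incremental cursor adjustments, and the drag proceeds in a closed loop. The following is a minimal toy sketch of that general idea, not the authors' implementation. It assumes a rectified-flow style sampler integrated with Euler steps. All names (`toy_velocity`, `sample_increment`, `drag`) and parameters are hypothetical. The learned, vision-conditioned action expert is replaced by an analytic velocity field that already knows the desired increment.

```python
import random


def toy_velocity(action, t, target):
    """Hypothetical stand-in for a learned action expert.

    Returns the conditional rectified-flow velocity that carries `action`
    to `target` by flow time t = 1. A real model would predict this from
    visual observations instead of being handed the target.
    """
    return [(g - a) / (1.0 - t) for a, g in zip(action, target)]


def sample_increment(target, num_steps=10, rng=random):
    """Integrate da/dt = v(a, t) from t = 0 (noise) to t = 1 with Euler steps."""
    action = [rng.gauss(0.0, 1.0) for _ in target]  # start from Gaussian noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so the division is safe
        v = toy_velocity(action, t, target)
        action = [a + dt * vi for a, vi in zip(action, v)]
    return action


def drag(start, goal, max_step=25.0, tol=0.5, max_iters=200):
    """Closed-loop drag: re-observe the cursor, then generate one small move."""
    cursor = list(start)
    path = [tuple(cursor)]
    for _ in range(max_iters):
        error = [g - c for g, c in zip(goal, cursor)]
        dist = sum(e * e for e in error) ** 0.5
        if dist <= tol:
            break
        scale = min(1.0, max_step / dist)
        target = [e * scale for e in error]  # desired increment, capped in length
        dx, dy = sample_increment(target)
        cursor = [cursor[0] + dx, cursor[1] + dy]
        path.append(tuple(cursor))
    return path


if __name__ == "__main__":
    # Drag a hypothetical slider handle 300 pixels to the right.
    for point in drag((100.0, 300.0), (400.0, 300.0)):
        print(point)
```

Because each increment is generated from the current cursor position, an error in one step is corrected by the next one. This is the closed-loop property that a single discrete click prediction cannot provide.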

Related papers