Fugu-MT Paper Translation (Abstract): Covering Human Action Space for Computer Use: Data Synthesis and Benchmark

Paper abstract: Covering Human Action Space for Computer Use: Data Synthesis and Benchmark

arxiv url: http://arxiv.org/abs/2605.12501v1
Date: Tue, 12 May 2026 17:59:58 GMT
Status: Translation complete
Last updated in system: 2026-05-13 21:48:57.089661
Title: Covering Human Action Space for Computer Use: Data Synthesis and Benchmark
Title (reference translation): Human Action Space for Computer Use: Data Synthesis and Benchmark
Authors: Miaosen Zhang, Xiaohan Zhao, Zhihong Tan, Zhou Huoshen, Yijia Fan, Yifan Yang, Kai Qiu, Bei Liu, Justin Wagle, Chenzhong Yin, Mingxi Cheng, Ji Li, Qi Dai, Chong Luo, Xu Yang, Xin Geng, Baining Guo
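Abstract summary: Computer-use agents (CUAs) automate on-screen work, as illustrated by GPT-5.4 and Claude. However, their reliability on complex, low-frequency interactions is still poor, limiting user trust. The paper proposes CUActSpot, a new benchmark for evaluating models' capabilities on complex interactions.
Reference score (independently computed attention score): 59.01879944842542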
License: http://creativecommons.org/publicdomain/zero/1.0/
Abstract: Computer-use agents (CUAs) automate on-screen work, as illustrated by GPT-5.4 and Claude. Yet their reliability on complex, low-frequency interactions is still poor, limiting user trust. Our analysis of failure cases from advanced models suggests a long-tail pattern in GUI operations, where a relatively small fraction of complex and diverse interactions accounts for a disproportionate share of task failures. We hypothesize that this issue largely stems from the scarcity of data for complex interactions. To address this problem, we propose a new benchmark CUActSpot for evaluating models' capabilities on complex interactions across five modalities: GUI, text, table, canvas, and natural image, as well as a variety of actions (click, drag, draw, etc.), covering a broader range of interaction types than prior click-centric benchmarks that focus mainly on GUI widgets. We also design a renderer-based data-synthesis pipeline: scenes are automatically generated for each modality, screenshots and element coordinates are recorded, and an LLM produces matching instructions and action traces. After training on this corpus, our Phi-Ground-Any-4B outperforms open-source models with fewer than 32B parameters. We will release our benchmark, data, code, and models at https://github.com/microsoft/Phi-Ground.git
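Abstract (reference translation): Computer-use agents (CUAs) automate on-screen work, as illustrated by GPT-5.4 and Claude. However, their reliability on complex, low-frequency interactions is still poor, limiting user trust. An analysis of failure cases from advanced models suggests a long-tail pattern in GUI operations. We hypothesize that this issue stems from the scarcity of data for complex interactions. To address this problem, we propose CUActSpot, a benchmark for evaluating models' capabilities on complex interactions across five modalities: GUI, text, table, canvas, and natural image. Scenes are automatically generated for each modality, screenshots and element coordinates are recorded, and an LLM produces matching instructions and action traces. After training on this corpus, Phi-Ground-Any-4B outperforms open-source models with fewer than 32B parameters. We will release the benchmark, data, code, and models at https://github.com/microsoft/Phi-Ground.git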
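
The renderer-based synthesis step mentioned in the abstract (generate a scene, record element coordinates, then pair an instruction with an action trace) can be illustrated with a small sketch. This is not the authors' pipeline: the names `Element`, `generate_table_scene`, and `synthesize_drag_sample` are hypothetical, the screenshot rendering is omitted, and a fixed text template stands in for the LLM that writes the instruction.

```python
import random
from dataclasses import dataclass


@dataclass
class Element:
    """One on-screen element with its pixel bounding box (hypothetical type)."""
    name: str
    x: int
    y: int
    w: int
    h: int

    def center(self):
        # Integer pixel coordinates of the box center.
        return (self.x + self.w // 2, self.y + self.h // 2)


def generate_table_scene(rows, cols, cell_w=120, cell_h=32, seed=0):
    """Lay out a rows x cols table and record each cell's coordinates.

    Assumption: a real pipeline would also render the scene and save a
    screenshot; this sketch only produces the layout metadata.
    """
    rng = random.Random(seed)
    origin_x, origin_y = rng.randint(0, 200), rng.randint(0, 200)
    elements = []
    for r in range(rows):
        for c in range(cols):
            elements.append(Element(
                name=f"cell({r},{c})",
                x=origin_x + c * cell_w,
                y=origin_y + r * cell_h,
                w=cell_w,
                h=cell_h,
            ))
    return elements


def synthesize_drag_sample(elements, rng):
    """Pair an instruction with a drag action trace.

    Assumption: a fixed template replaces the LLM call described in the
    abstract, so the example runs without any model.
    """
    src, dst = rng.sample(elements, 2)  # two distinct cells
    instruction = f"Drag from {src.name} to {dst.name} to select the range."
    trace = [
        {"action": "mouse_down", "pos": src.center()},
        {"action": "mouse_move", "pos": dst.center()},
        {"action": "mouse_up", "pos": dst.center()},
    ]
    return {"instruction": instruction, "trace": trace}


scene = generate_table_scene(rows=3, cols=4, seed=1)
sample = synthesize_drag_sample(scene, random.Random(1))
print(sample["instruction"])
for step in sample["trace"]:
    print(step)
```

Because the layout is generated programmatically, the ground-truth coordinates for every action come directly from the scene description and need no manual annotation. That is presumably what makes a renderer-based approach attractive for rare actions such as drag or draw.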


List of related papers