Fugu-MT Paper Translation (Abstract): GUI-CEval: A Hierarchical and Comprehensive Chinese Benchmark for Mobile GUI Agents

Paper overview: GUI-CEval: A Hierarchical and Comprehensive Chinese Benchmark for Mobile GUI Agents

arxiv url: http://arxiv.org/abs/2603.15039v1
Date: Mon, 16 Mar 2026 09:45:33 GMT
Status: Translation complete
System update date: 2026-03-17 18:28:57.992155
Title: GUI-CEval: A Hierarchical and Comprehensive Chinese Benchmark for Mobile GUI Agents
Title (reference translation): GUI-CEval: A Hierarchical and Comprehensive Chinese Benchmark for Mobile GUI Agents
Authors: Yang Li, Yuchen Liu, Haoyu Lu, Zhiqiang Xia, Hongzhen Wang, Kaiyang Han, Changpeng Yang, Jinyang Wu, Jiaming Xu, Runyu Shi, Ying Huang
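Abstract summary: GUI-CEval is the first comprehensive benchmark for Chinese mobile GUI agents, built on physical device environments. It spans 201 mainstream apps across four device types and adopts a two-level structure that evaluates atomic abilities and realistic application-level performance along five dimensions (perception, planning, reflection, execution, and evaluation).
Reference score (attention score computed independently by the site): 19.27396264271709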
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Recent progress in Multimodal Large Language Models (MLLMs) has enabled mobile GUI agents capable of visual perception, cross-modal reasoning, and interactive control. However, existing benchmarks are largely English-centric and fail to capture the linguistic and interaction characteristics of the Chinese mobile ecosystem. They also focus on isolated skills such as GUI grounding or offline agent, lacking a unified and fine-grained framework to assess the full capability chain from perception to execution. To address this gap, we introduce GUI-CEval, the first comprehensive benchmark for Chinese mobile GUI agents, built entirely on physical device environments. GUI-CEval spans 201 mainstream apps across four device types and adopts a two-level structure that evaluates both atomic abilities and realistic application-level performance along five dimensions: perception, planning, reflection, execution, and evaluation. All data are collected and verified through multi-stage manual processes to ensure authenticity and reproducibility. Extensive experiments on 20 representative MLLMs and multi-agent systems show that while models such as Qwen2.5-VL and UI-TARS perform competitively, most MLLMs still exhibit clear weaknesses in reflective decision-making and post-action self-evaluation, limiting their reliability in real-world interactions. We hope GUI-CEval provides a comprehensive and interpretable benchmark to guide capability diagnosis and advance the development of Chinese mobile GUI agents.
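Abstract (reference translation): Recent progress in Multimodal Large Language Models (MLLMs) has enabled mobile GUI agents capable of visual perception, cross-modal reasoning, and interactive control. However, existing benchmarks are mostly English-centric and do not capture the linguistic and interaction characteristics of the Chinese mobile ecosystem. They also concentrate on isolated skills such as GUI grounding or offline agents, and lack a unified, fine-grained framework for assessing the full capability chain from perception to execution. To address this gap, we introduce GUI-CEval, the first comprehensive benchmark for Chinese mobile GUI agents. GUI-CEval spans 201 mainstream apps across four device types and adopts a two-level structure that evaluates both atomic abilities and realistic application-level performance along five dimensions: perception, planning, reflection, execution, and evaluation. All data are collected and verified through multi-stage manual processes to ensure authenticity and reproducibility. Extensive experiments on 20 representative MLLMs and multi-agent systems, including Qwen2.5-VL and UI-TARS, show that most MLLMs still exhibit clear weaknesses in reflective decision-making and post-action self-evaluation, which limits their reliability in real-world interactions. We hope GUI-CEval provides a comprehensive and interpretable benchmark to guide capability diagnosis and advance the development of Chinese mobile GUI agents.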

Related papers