Fugu-MT Paper Translation (Abstract): Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields

Paper overview: Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields

arxiv url: http://arxiv.org/abs/2606.11042v2
Date: Wed, 10 Jun 2026 15:20:26 GMT
Status: Translation complete
Last updated in system: 2026-06-11 14:23:44.404871
Title: Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields
Title (reference translation): Workflow-GYM: Toward Long-Horizon Evaluation of Computer-Use Agent Tasks in Real-World Professional Fields
Authors: Liya Zhu, Jingzhe Ding, Jian Zhang, Jianbo Xue, Shihao Liang, Ge Zhang, Yi Zhu, Duju Zeng, Xiang Gao, Qingshui Gu, Mailun Gao, Huimin Che, Yan Zhao, Peiheng Zhou, Haojun Wang, Chaobo Xian, Lili Le, Chi Wu, Yiwei Liu, Shengda Long, Jiale Yang, Fangzhi Xu, Sijin Wu, Haodong Duan, Chao He, Zhaojian Li, Minchao Wang, Huan Zhou, Jiani Hou, Chuqian Yu, Weiran Shi, Hongwan Gao, Jiamin Chen, Guanhong Chen, Tingqin Luo, Kaiyuan Zhang, Zhixin Yao, Qing Hua, Yuhao Jiang, Jin Chen, Pu Chen, Zhenyu Hu, Xingyu Li, Zhengxuan Jiang, Meng Cao, Tianfeng Long, Haozhe Wang, Mingzhang Wang, Yichen Zhang, Yiming Dai, Chenchen Zhang, Jiaying Wang, Xinying Liu, Xingzu Liu, Lingling Zhang, Xinjie Chen, Yujia Qin, Wangchunshu Zhou, Zhiyong Wu, Yang Liu, Jiaheng Liu, Lei Zhang, Shen Yan, Wenhao Huang, Zaiyuan Wang, Xiaolong Chang
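Abstract summary: This paper introduces a benchmark for long-horizon GUI tasks centered on professional domains and specialized software environments. Even the strongest models achieve success rates only slightly above 30%. The findings provide important insights into the limitations of current agent systems and suggest key directions for the next generation of GUI-agent research.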
Reference score (independently computed attention score): 91.11458140756208
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Recent years have witnessed the rapid evolution of AI agents toward handling increasingly complex, real-world tasks. However, existing benchmarks rarely evaluate whether agents can operate graphical user interfaces to complete long-horizon, high-value professional workflows across diverse domains. Current GUI benchmarks still predominantly focus on general-purpose software, relatively simple applications, and short-horizon tasks, leaving it largely unknown whether modern agents can follow user instructions to autonomously operate domain-specific professional software and accomplish economically valuable work in an end-to-end manner. To bridge this gap, we introduce Workflow-GYM, a benchmark for long-horizon GUI tasks centered on professional domains and specialized software environments. Through extensive experiments on state-of-the-art models, we find that even the strongest models achieve only slightly above 30% success rates, highlighting that professional long-horizon GUI workflows remain highly challenging for current GUI agents. Further analysis reveals that current agents struggle to maintain long-horizon workflow consistency, frequently exhibiting workflow stage omission, error propagation, objective drift, and insufficient understanding of professional software environments. Our findings provide important insights into the limitations of current agent systems and suggest key directions for the next generation of GUI-agent research.
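Abstract (reference translation): In recent years, AI agents have evolved rapidly toward handling increasingly complex, real-world tasks. However, existing benchmarks rarely evaluate whether agents can operate graphical user interfaces to complete long-horizon, high-value professional workflows across diverse domains. Current GUI benchmarks still focus mainly on general-purpose software, relatively simple applications, and short-horizon tasks, so it remains largely unknown whether modern agents can follow user instructions to autonomously operate domain-specific professional software and accomplish economically valuable work end to end. To bridge this gap, we introduce Workflow-GYM, a benchmark for long-horizon GUI tasks centered on professional domains and specialized software environments. Extensive experiments on state-of-the-art models show that even the strongest models achieve success rates only slightly above 30%. Further analysis shows that current agents struggle to maintain consistency across long-horizon workflows, frequently exhibiting workflow stage omission, error propagation, objective drift, and insufficient understanding of professional software environments. The findings provide important insights into the limitations of current agent systems and suggest key directions for the next generation of GUI-agent research.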

Related papers