Fugu-MT Paper Translation (Summary): MacAgentBench: Benchmarking AI Agents on Real-World macOS Desktop

Paper summary: MacAgentBench: Benchmarking AI Agents on Real-World macOS Desktop

arxiv url: http://arxiv.org/abs/2606.22557v1
Date: Sun, 21 Jun 2026 15:32:57 GMT
Status: Translation complete
Last updated in system: 2026-06-25 17:39:32.469152
Title: MacAgentBench: Benchmarking AI Agents on Real-World macOS Desktop
Title (reference translation): MacAgentBench: Benchmarking AI Agents on Real-World macOS Desktop
Authors: Yikun Fu, Bowen Fu, Zhenyu Wu, Shuang Cheng, Xiaowei Sun, Bowen Yang, Zehao Li, Yibo Zhao, Zichen Ding, Zhoumianze Liu, Shijie Wang, Biqing Qi, Bowen Zhou
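Abstract summary: Computer use agents (CUAs) have advanced rapidly in desktop automation, and a growing number of users deploy CUAs for always-on automation. Existing benchmarks, including those for macOS, evaluate agents without framework augmentation and rely on binary evaluation. We present MacAgentBench, a comprehensive agent benchmark comprising 676 tasks across 25 applications.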
Reference score (independently computed attention score): 32.14321067864185
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Computer use agents (CUAs) have advanced rapidly in desktop automation, and a growing number of users deploy CUAs such as OpenClaw on Mac Mini for always-on automation. However, existing benchmarks, including those for macOS, evaluate agents without framework augmentation and rely on binary evaluation. As a result, they fail to capture both the framework capabilities leveraged by modern CUAs and the partial progress on long-horizon, multi-application tasks. We present MacAgentBench, a comprehensive macOS agent benchmark comprising 676 tasks across 25 applications, with nearly 60% involving both GUI and CLI interaction. The benchmark adopts deterministic rule-based evaluation and introduces fine-grained multi-checkpoint scoring with capability annotations for multi-application tasks. Experiments across three frameworks and 16 models show that the best configuration, Claude Opus 4.6 on OpenClaw, attains 73.7% Pass@1, while this advantage is primarily driven by the skill library rather than by framework design. Fine-grained metrics further reveal that models with similar Pass@1 can differ substantially in sub-goal completion. Our code and data are publicly available at https://github.com/JetAstra/MacAgentBench.
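Abstract (reference translation): Computer use agents (CUAs) are advancing rapidly in desktop automation, and a growing number of users run CUAs such as OpenClaw on a Mac Mini around the clock. However, existing benchmarks, including those for macOS, evaluate agents without framework augmentation and rely on binary evaluation. As a result, they fail to capture both the framework capabilities that modern CUAs leverage and partial progress on long-horizon, multi-application tasks. We present MacAgentBench, a comprehensive macOS agent benchmark consisting of 676 tasks across 25 applications. The benchmark adopts deterministic rule-based evaluation and introduces fine-grained multi-checkpoint scoring with capability annotations for multi-application tasks. Experiments across three frameworks and 16 models show that the best configuration, Claude Opus 4.6 on OpenClaw, reaches 73.7% Pass@1. Fine-grained metrics further show that models with similar Pass@1 can differ substantially in sub-goal completion. Our code and data are publicly available at https://github.com/JetAstra/MacAgentBench.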
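
Illustrative note: the abstract contrasts binary evaluation with multi-checkpoint scoring but does not give the scoring formula. The sketch below is a minimal illustration under stated assumptions, not the benchmark's implementation: it assumes a task passes only when every checkpoint is satisfied, and that partial progress is the fraction of checkpoints satisfied. The names `TaskResult`, `pass_at_1`, and `mean_checkpoint_rate` are hypothetical, and the sample data is made up.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TaskResult:
    """One task attempt (hypothetical structure); one boolean per sub-goal checkpoint."""
    checkpoints: List[bool]  # assumed non-empty

    @property
    def passed(self) -> bool:
        # Binary view (assumption): the task counts only if every checkpoint is met.
        return all(self.checkpoints)

    @property
    def checkpoint_rate(self) -> float:
        # Fine-grained view (assumption): fraction of checkpoints met.
        return sum(self.checkpoints) / len(self.checkpoints)


def pass_at_1(results: List[TaskResult]) -> float:
    """Share of tasks fully completed on a single attempt."""
    return sum(r.passed for r in results) / len(results)


def mean_checkpoint_rate(results: List[TaskResult]) -> float:
    """Average fraction of checkpoints completed per task."""
    return sum(r.checkpoint_rate for r in results) / len(results)


# Made-up data: both models fully solve one of two tasks,
# but model_a gets much further on the task it fails.
model_a = [TaskResult([True, True, True]), TaskResult([True, True, False])]
model_b = [TaskResult([True, True, True]), TaskResult([False, False, False])]

for name, results in (("model_a", model_a), ("model_b", model_b)):
    print(name, pass_at_1(results), round(mean_checkpoint_rate(results), 3))
```

In this toy data both models have the same Pass@1, yet their average checkpoint completion differs, which is the kind of gap the abstract says fine-grained metrics expose.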

Related papers