Fugu-MT Paper Translation (Abstract): MyPCBench: A Benchmark for Personally Intelligent Computer-Use Agents

Paper overview: MyPCBench: A Benchmark for Personally Intelligent Computer-Use Agents

arxiv url: http://arxiv.org/abs/2606.16748v1
Date: Mon, 15 Jun 2026 14:08:09 GMT
Status: Translation complete
System update date: 2026-06-16 16:21:34.599097
Title: MyPCBench: A Benchmark for Personally Intelligent Computer-Use Agents
Title (reference translation): MyPCBench: A Benchmark for Personally Intelligent Computer-Use Agents
Authors: Lawrence Keunho Jang, Andrew Keunwoo Jang, Jing Yu Koh, Ruslan Salakhutdinov
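Abstract summary: Current benchmarks for computer-use agents evaluate models in impersonal environments. MyPCBench tests computer-use agents as personal assistants on a Linux desktop populated with 17 simulated real-world web applications.
Reference score (independently computed attention metric): 43.32396184134805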
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Current benchmarks for computer-use agents evaluate models in impersonal environments. This leaves a gap between evaluation and deployment where personal assistants are expected to work across a user's whole digital life, including their context, historical data, and logged-in accounts. This gap is widest on web tasks, where live web evaluations cannot exercise sites that require logging in or personal information, the kind of site a real personal assistant has to drive. We introduce MyPCBench, which tests computer-use agents as personal assistants on a Linux desktop populated with 17 simulated real-world web applications and a full desktop stack, all seeded for one canonical persona, Michael Scott from The Office. We define 184 tasks in this environment, each inspired by a real request drawn from the OpenClaw community, and benchmark six closed and open-weight models with a uniform computer+bash tool surface. We find that the best model, Claude Opus 4.6, fully solves 55.4% of the tasks, the only model above 50%. Model failures cluster on tasks that span many applications and on long trajectories, where personalization stresses an assistant the most. We release the environment, task set, and agent harness at https://mypcbench.com.
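Abstract (reference translation): Current benchmarks for computer-use agents evaluate models in impersonal environments. This leaves a gap between evaluation and deployment, where personal assistants are expected to work across a user's whole digital life, including their context, historical data, and logged-in accounts. The gap is widest on web tasks: live web evaluations cannot exercise sites that require logging in or personal information, which are exactly the sites a real personal assistant has to operate. We introduce MyPCBench, which tests computer-use agents as personal assistants on a Linux desktop populated with 17 simulated real-world web applications and a full desktop stack, all seeded for one canonical persona, Michael Scott from The Office. We define 184 tasks in this environment, each inspired by a real request drawn from the OpenClaw community, and benchmark six closed and open-weight models with a uniform computer+bash tool surface. The best model, Claude Opus 4.6, fully solves 55.4% of the tasks and is the only model above 50%. Model failures cluster on tasks that span many applications and on long trajectories, where personalization stresses an assistant the most. The environment, task set, and agent harness are released at https://mypcbench.com.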

List of related papers