Fugu-MT Paper Translation (Abstract): OS-Marathon: Benchmarking Computer-Use Agents on Long-Horizon Repetitive Tasks

Paper summary: OS-Marathon: Benchmarking Computer-Use Agents on Long-Horizon Repetitive Tasks

arxiv url: http://arxiv.org/abs/2601.20650v2
Date: Mon, 02 Feb 2026 12:28:52 GMT
Status: Translation complete
Last updated in system: 2026-02-03 15:03:50.692617
Title: OS-Marathon: Benchmarking Computer-Use Agents on Long-Horizon Repetitive Tasks
Authors: Jing Wu, Daphne Barretto, Yiye Chen, Nicholas Gydé, Yanan Jian, Yuhang He, Vibhav Vineet,
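Abstract summary: Long-horizon, repetitive tasks are common in professional settings. They are often tedious for humans because they can extend to extreme lengths proportional to the size of the data to process. The authors establish OS-Marathon, comprising 242 long-horizon, repetitive tasks across 2 domains, to evaluate state-of-the-art (SOTA) agents.
Reference score (independently computed attention score): 36.99798674847767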
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Long-horizon, repetitive workflows are common in professional settings, such as processing expense reports from receipts and entering student grades from exam papers. These tasks are often tedious for humans since they can extend to extreme lengths proportional to the size of the data to process. However, they are ideal for Computer-Use Agents (CUAs) due to their structured, recurring sub-workflows with logic that can be systematically learned. Identifying the absence of an evaluation benchmark as a primary bottleneck, we establish OS-Marathon, comprising 242 long-horizon, repetitive tasks across 2 domains to evaluate state-of-the-art (SOTA) agents. We then introduce a cost-effective method to construct a condensed demonstration using only few-shot examples to teach agents the underlying workflow logic, enabling them to execute similar workflows effectively on larger, unseen data collections. Extensive experiments demonstrate both the inherent challenges of these tasks and the effectiveness of our proposed method. Project website: https://os-marathon.github.io/.

Related papers