Fugu-MT Paper Translation (Abstract): Gym-Anything: Turn any Software into an Agent Environment

Paper summary: Gym-Anything: Turn any Software into an Agent Environment

arxiv url: http://arxiv.org/abs/2604.06126v1
Date: Tue, 07 Apr 2026 17:38:15 GMT
Status: Translation complete
Last updated in system: 2026-04-08 17:42:09.967698
Title: Gym-Anything: Turn any Software into an Agent Environment
Title (reference translation): Gym-Anything: Turn any Software into an Agent Environment
Authors: Pranjal Aggarwal, Graham Neubig, Sean Welleck
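Abstract summary: Gym-Anything is a framework for converting any software into an interactive computer-use environment. CUA-World is a collection of over 10K long-horizon tasks spanning domains such as medical science, astronomy, engineering, and enterprise systems.
Reference score (independently computed attention score): 67.2443447990221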
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Computer-use agents hold the promise of assisting in a wide range of digital economic activities. However, current research has largely focused on short-horizon tasks over a limited set of software with limited economic value, such as basic e-commerce and OS-configuration tasks. A key reason is that creating environments for complex software requires significant time and human effort, and therefore does not scale. To address this, we introduce Gym-Anything, a framework for converting any software into an interactive computer-use environment. We frame environment creation itself as a multi-agent task: a coding agent writes setup scripts, downloads real-world data, and configures the software, while producing evidence of correct setup. An independent audit agent then verifies evidence for the environment setup against a quality checklist. Using a taxonomy of economically valuable occupations grounded in U.S. GDP data, we apply this pipeline to 200 software applications with broad occupational coverage. The result is CUA-World, a collection of over 10K long-horizon tasks spanning domains from medical science and astronomy to engineering and enterprise systems, each configured with realistic data along with train and test splits. CUA-World also includes CUA-World-Long, a challenging long-horizon benchmark with tasks often requiring over 500 steps, far exceeding existing benchmarks. Distilling successful trajectories from the training split into a 2B vision-language model outperforms models 2$\times$ its size. We also apply the same auditing principle at test time: a separate VLM reviews completed trajectories and provides feedback on what remains, improving Gemini-3-Flash on CUA-World-Long from 11.5% to 14.0%. We release all code, infrastructure, and benchmark data to facilitate future research in realistic computer-use agents.
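Abstract (reference translation): Computer-use agents promise to assist with a wide range of digital economic activities. However, current research has focused mainly on short-horizon tasks over a limited set of software with limited economic value, such as basic e-commerce and OS-configuration tasks. A key reason is that creating environments for complex software takes considerable time and human effort, and so does not scale. To address this, we introduce Gym-Anything, a framework for converting any software into an interactive computer-use environment. Environment creation itself is framed as a multi-agent task: a coding agent writes setup scripts, downloads real-world data, and configures the software, while producing evidence that the setup is correct. An independent audit agent then verifies that evidence against a quality checklist. Using a taxonomy of economically valuable occupations grounded in U.S. GDP data, we apply this pipeline to 200 software applications with broad occupational coverage. The result is CUA-World, a collection of over 10K long-horizon tasks spanning domains from medical science and astronomy to engineering and enterprise systems, each configured with realistic data and with train and test splits. CUA-World also includes CUA-World-Long, a challenging long-horizon benchmark whose tasks often require over 500 steps, far exceeding existing benchmarks. Distilling successful trajectories from the training split into a 2B vision-language model yields a model that outperforms models 2$\times$ its size. We also apply the same auditing principle at test time: a separate VLM reviews completed trajectories and gives feedback on what remains, improving Gemini-3-Flash on CUA-World-Long from 11.5% to 14.0%. We release all code, infrastructure, and benchmark data to facilitate future research on realistic computer-use agents.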
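
The create-then-audit idea in the abstract can be illustrated with a small Python sketch. This is not the Gym-Anything API: every name here (`SetupResult`, `AuditReport`, `audit`, `build_environment`, `toy_coding_agent`) is hypothetical, the substring-based checklist check is a stand-in for a real audit agent, and the retry loop that feeds failed checklist items back to the coding agent is an assumption the abstract does not state.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class SetupResult:
    """Output of a (hypothetical) coding agent: scripts plus setup evidence."""
    scripts: List[str]
    evidence: List[str]


@dataclass
class AuditReport:
    """Verdict of a (hypothetical) audit step against a quality checklist."""
    passed: bool
    failed_items: List[str] = field(default_factory=list)


def audit(evidence: List[str], checklist: List[str]) -> AuditReport:
    # Placeholder check: an item passes if some evidence string mentions it.
    # A real audit agent would inspect screenshots, logs, and so on.
    failed = [item for item in checklist
              if not any(item in e for e in evidence)]
    return AuditReport(passed=not failed, failed_items=failed)


def build_environment(
    coding_agent: Callable[[List[str]], SetupResult],
    checklist: List[str],
    max_rounds: int = 3,
) -> Optional[SetupResult]:
    # Assumed loop: failed checklist items are passed back as feedback.
    feedback: List[str] = []
    for _ in range(max_rounds):
        result = coding_agent(feedback)
        report = audit(result.evidence, checklist)
        if report.passed:
            return result
        feedback = report.failed_items
    return None  # setup never satisfied the checklist


def toy_coding_agent(feedback: List[str]) -> SetupResult:
    # Toy stand-in: always reports a launch, then "fixes" whatever failed.
    evidence = ["app launches"] + [f"fixed: {item}" for item in feedback]
    return SetupResult(scripts=["install.sh"], evidence=evidence)
```

Keeping the audit step separate from the agent that produced the setup mirrors the abstract's point that the auditor is independent; the same shape would apply to the test-time variant, with a completed trajectory in place of setup evidence.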
