Fugu-MT Paper Translation (Abstract): CLI-Universe: Towards Verifiable Task Synthesis Engine for Terminal Agents

Paper overview: CLI-Universe: Towards Verifiable Task Synthesis Engine for Terminal Agents

arxiv url: http://arxiv.org/abs/2606.22883v1
Date: Mon, 22 Jun 2026 05:50:23 GMT
Status: Translation complete
System update date: 2026-06-25 03:51:48.806038
Title: CLI-Universe: Towards Verifiable Task Synthesis Engine for Terminal Agents
Authors: Zhanbo Hua, Yifan Yao, Weihao Xie, Yongchi Zhao, Minghao Liu, Ruizhi Qiu, Zhewei Huang, Zun Wang, Yiyan Ji, Yunhai Ye, Letian Zhu, Xinping Lei, Han Li, Zhiyuan Ma, Zili Wang, Zhaoxiang Zhang, Jiaheng Liu
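Abstract summary: We introduce CLI-Universe, a synthesis engine that constructs terminal-agent tasks. We instantiate a dataset of 6,000 trajectories called CLI-Universe-6K. Notably, fine-tuning Qwen3-32B on CLI-Universe-6K achieves 33.4% on Terminal-Bench 2.0.
Reference score (independently computed attention score): 40.27594136040026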
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: While recent LLM-based terminal agents have demonstrated promising capabilities, the scarcity of high-quality, executable training data remains a critical bottleneck. Existing synthesis pipelines typically scale by retrofitting surface-level artifacts into tasks, frequently yielding ambiguous instructions, shallow execution paths, and brittle tests that provide weak learning signals. To overcome this, we introduce CLI-Universe, a principled synthesis engine that constructs terminal-agent tasks. CLI-Universe generates candidate tasks by sampling combinations across a multi-dimensional capability taxonomy (domain, skill type, capability, and engineering pillar), then grounds each candidate through evidence-guided deep research over real-world technical materials. To ensure rigorous supervision, validated blueprints are instantiated into Dockerized environments and subjected to a multi-stage executable verification pipeline featuring rubric-gated test construction, hint-conditional filtering, and strict fail-to-pass checking. Across the full pipeline, from candidate generation to verification, approximately two-thirds of candidates are discarded, retaining only those that are genuine, verifiable, and non-trivially challenging. To validate our framework, we instantiate a highly distilled dataset of 6,000 trajectories called CLI-Universe-6K. Remarkably, fine-tuning Qwen3-32B on CLI-Universe-6K achieves 33.4% on Terminal-Bench 2.0. This sets a new state-of-the-art for models trained on open-source data at or below 32B parameters, and outperforms several models an order of magnitude larger, demonstrating the profound data efficiency of structured, high-fidelity synthesis.
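
Two steps named in the abstract, sampling candidates across the four taxonomy dimensions and strict fail-to-pass checking, can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the taxonomy values, the `Candidate` class, and both function names are hypothetical, and the fail-to-pass rule is assumed here to mean that every test fails in the initial environment and passes after the reference solution is applied.

```python
import itertools
import random
from dataclasses import dataclass

# The four dimension names come from the abstract; the values are
# hypothetical placeholders, not the taxonomy used in the paper.
DIMENSIONS = ("domain", "skill_type", "capability", "engineering_pillar")
TAXONOMY = {
    "domain": ["networking", "data-processing", "build-systems"],
    "skill_type": ["debugging", "configuration", "scripting"],
    "capability": ["log-analysis", "dependency-resolution"],
    "engineering_pillar": ["reliability", "security"],
}


@dataclass(frozen=True)
class Candidate:
    """One sampled combination of taxonomy values (hypothetical structure)."""
    domain: str
    skill_type: str
    capability: str
    engineering_pillar: str


def sample_candidates(taxonomy, k, seed=0):
    """Sample up to k distinct combinations across all taxonomy dimensions."""
    rng = random.Random(seed)
    combos = list(itertools.product(*(taxonomy[d] for d in DIMENSIONS)))
    picked = rng.sample(combos, min(k, len(combos)))
    return [Candidate(*combo) for combo in picked]


def passes_fail_to_pass(results_before, results_after):
    """Assumed rule: every test fails before the fix and passes after it.

    Each argument is a list of booleans (True means the test passed),
    with one entry per test in the same order.
    """
    if not results_before or len(results_before) != len(results_after):
        return False
    return not any(results_before) and all(results_after)


candidates = sample_candidates(TAXONOMY, k=5)
for candidate in candidates:
    print(candidate)
```

A task whose tests already pass before the reference solution runs would be rejected by `passes_fail_to_pass`, since such tests give no signal about whether the agent did the work.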
