Fugu-MT Paper Translation (Abstract): SWE-AGI: Benchmarking Specification-Driven Software Construction with MoonBit in the Era of Autonomous Agents

Paper overview: SWE-AGI: Benchmarking Specification-Driven Software Construction with MoonBit in the Era of Autonomous Agents

arxiv url: http://arxiv.org/abs/2602.09447v1
Date: Tue, 10 Feb 2026 06:31:47 GMT
Status: Translation complete
Last updated in system: 2026-02-11 20:17:43.407417
Title: SWE-AGI: Benchmarking Specification-Driven Software Construction with MoonBit in the Era of Autonomous Agents
Title (reference translation): SWE-AGI: Benchmarking Specification-Driven Software Construction with MoonBit in the Era of Autonomous Agents
Authors: Zhirui Zhang, Hongbo Zhang, Haoxiang Fei, Zhiyuan Bao, Yubin Chen, Zhengyu Lei, Ziyue Liu, Yixuan Sun, Mingkun Xiao, Zihang Ye, Yu Zhang, Hongcheng Zhu, Yuxiang Wen, Heung-Yeung Shum
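Abstract summary: SWE-AGI is an open-source benchmark for evaluating end-to-end, specification-driven construction of software systems written in MoonBit. Each task requires implementing 1,000-10,000 lines of core logic.
Reference score (independently computed attention score): 21.8776989802963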
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Although large language models (LLMs) have demonstrated impressive coding capabilities, their ability to autonomously build production-scale software from explicit specifications remains an open question. We introduce SWE-AGI, an open-source benchmark for evaluating end-to-end, specification-driven construction of software systems written in MoonBit. SWE-AGI tasks require LLM-based agents to implement parsers, interpreters, binary decoders, and SAT solvers strictly from authoritative standards and RFCs under a fixed API scaffold. Each task involves implementing 1,000-10,000 lines of core logic, corresponding to weeks or months of engineering effort for an experienced human developer. By leveraging the nascent MoonBit ecosystem, SWE-AGI minimizes data leakage, forcing agents to rely on long-horizon architectural reasoning rather than code retrieval. Across frontier models, gpt-5.3-codex achieves the best overall performance (solving 19/22 tasks, 86.4%), outperforming claude-opus-4.6 (15/22, 68.2%), and kimi-2.5 exhibits the strongest performance among open-source models. Performance degrades sharply with increasing task difficulty, particularly on hard, specification-intensive systems. Behavioral analysis further reveals that as codebases scale, code reading, rather than writing, becomes the dominant bottleneck in AI-assisted development. Overall, while specification-driven autonomous software engineering is increasingly viable, substantial challenges remain before it can reliably support production-scale development.
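Abstract (reference translation): Large language models (LLMs) have shown impressive coding ability, but whether they can autonomously build production-scale software from explicit specifications remains an open question. We introduce SWE-AGI, an open-source benchmark for evaluating end-to-end, specification-driven construction of software systems written in MoonBit. SWE-AGI tasks require LLM-based agents to implement parsers, interpreters, binary decoders, and SAT solvers. Each task involves implementing 1,000-10,000 lines of core logic. By building on the nascent MoonBit ecosystem, SWE-AGI minimizes data leakage, so agents must rely on long-horizon architectural reasoning rather than code retrieval. Among frontier models, gpt-5.3-codex performs best overall (19/22 tasks, 86.4%), ahead of claude-opus-4.6 (15/22, 68.2%), and kimi-2.5 performs best among open-source models. Performance drops sharply as task difficulty increases, especially on hard, specification-intensive systems. Behavioral analysis shows that as codebases grow, reading code rather than writing it becomes the main bottleneck in AI-assisted development. Overall, specification-driven autonomous software engineering is increasingly viable, but substantial challenges remain before it can reliably support production-scale development.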

Related papers