Fugu-MT Paper Translation (Abstract): RepoZero: Can LLMs Generate a Code Repository from Scratch?

Paper summary: RepoZero: Can LLMs Generate a Code Repository from Scratch?

arxiv url: http://arxiv.org/abs/2605.07122v2
Date: Wed, 13 May 2026 09:42:33 GMT
Status: Translation complete
Last updated in system: 2026-05-14 17:13:58.785283
Title: RepoZero: Can LLMs Generate a Code Repository from Scratch?
Title (reference translation): RepoZero: Can LLMs Generate a Code Repository from Scratch?
Authors: Zhaoxi Zhang, Yiming Xu, Jiahui Liang, Weikang Li, Yunfang Wu
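Abstract summary: RepoZero is the first benchmark that enables fully automated, execution-based verification of repository-level generation from scratch. Our results establish RepoZero as a challenging, scalable, and reliable testbed for end-to-end code generation.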
Reference score (attention metric computed independently by this site): 13.87780777614509
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) have recently shown remarkable progress in code generation, yet their ability to construct complete software repositories from scratch remains poorly understood. A fundamental bottleneck is the lack of verifiable and scalable evaluation: existing benchmarks either focus on patch-based editing or rely on human or LLM-based judgments, which introduce bias and limit reproducibility. In this work, we present RepoZero, the first benchmark that enables fully automated, execution-based verification of repository-level generation from scratch. Our key idea is to reformulate generation as repository reproduction: given only API specifications, an agent must re-implement an entire repository such that its behavior matches the original implementation. This design allows for strict black-box validation via output equivalence, while naturally supporting large-scale construction by reusing existing open-source repositories. To further mitigate data leakage and shortcut solutions, we introduce cross-language constraints and a sandboxed evaluation protocol. Building on this benchmark, we propose an Agentic Code-Test Evolution (ACE) framework that performs iterative test generation and error-driven refinement, enabling effective test-time scaling for repository-level synthesis. Extensive experiments across multiple state-of-the-art LLMs and agent frameworks reveal that even the strongest LLM agents achieve only limited pass rates (30%-55%), exposing a substantial gap between current capabilities and real-world software development requirements. Our results establish RepoZero as a challenging, scalable, and reliable testbed for end-to-end code generation, and highlight self-verification via test generation as a critical direction for advancing LLM-based coding agents.
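Abstract (reference translation): Large language models (LLMs) have recently shown remarkable progress in code generation, but their ability to build a complete software repository from scratch is still not well understood. Existing benchmarks either focus on patch-based editing or rely on human or LLM-based judgments, which introduce bias and limit reproducibility. In this work, we introduce RepoZero, the first benchmark that enables fully automated, execution-based verification of repository-level generation from scratch. Given only API specifications, an agent must re-implement an entire repository so that its behavior matches the original implementation. This design allows strict black-box validation via output equivalence and naturally supports large-scale construction by reusing existing open-source repositories. To further mitigate data leakage and shortcut solutions, we introduce cross-language constraints and a sandboxed evaluation protocol. Building on this benchmark, we propose the Agentic Code-Test Evolution (ACE) framework, which performs iterative test generation and error-driven refinement. Extensive experiments across multiple state-of-the-art LLMs and agent frameworks show that even the strongest LLM agents achieve only limited pass rates (30%-55%), revealing a large gap between current capabilities and real-world software development requirements. These results establish RepoZero as a challenging, scalable, and reliable testbed for end-to-end code generation, and highlight self-verification via test generation as an important direction for advancing LLM-based coding agents.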
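As a rough illustration of what "black-box validation via output equivalence" could look like, the following minimal Python sketch runs a reference program and a candidate re-implementation on the same inputs and reports the fraction of inputs where exit code and stdout agree. It is not the RepoZero harness: the function names (`run_program`, `pass_rate`), the stdin/stdout interface, the timeout, and the three toy commands are all assumptions made for this example, and no sandboxing is attempted.

```python
from __future__ import annotations

import subprocess
import sys
from typing import Sequence


def run_program(cmd: Sequence[str], stdin_text: str, timeout: float = 10.0) -> tuple[int, str]:
    """Run a command as a black box and return (exit code, stdout)."""
    proc = subprocess.run(
        list(cmd), input=stdin_text, capture_output=True, text=True, timeout=timeout
    )
    return proc.returncode, proc.stdout


def pass_rate(
    reference_cmd: Sequence[str],
    candidate_cmd: Sequence[str],
    test_inputs: Sequence[str],
) -> float:
    """Fraction of inputs on which the candidate's output equals the reference's."""
    if not test_inputs:
        return 0.0
    passed = 0
    for stdin_text in test_inputs:
        expected = run_program(reference_cmd, stdin_text)
        try:
            actual = run_program(candidate_cmd, stdin_text)
        except subprocess.TimeoutExpired:
            continue  # assumption: a candidate timeout counts as a failure
        if actual == expected:
            passed += 1
    return passed / len(test_inputs)


# Hypothetical stand-ins for an original implementation and two re-implementations.
REFERENCE = [sys.executable, "-c",
             "import sys; print(sum(int(x) for x in sys.stdin.read().split()))"]
CANDIDATE_OK = [sys.executable, "-c",
                "import sys; print(sum(map(int, sys.stdin.read().split())))"]
CANDIDATE_BAD = [sys.executable, "-c",
                 "import sys; print(len(sys.stdin.read().split()))"]

if __name__ == "__main__":
    inputs = ["1 2 3", "1 1"]
    print(pass_rate(REFERENCE, CANDIDATE_OK, inputs))
    print(pass_rate(REFERENCE, CANDIDATE_BAD, inputs))
```

Because only the commands' observable behavior is compared, the candidate could in principle be written in a different language from the reference, which is in the spirit of the cross-language constraint the abstract mentions.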

Related papers