Fugu-MT paper translation (abstract): RealBench: A Repo-Level Code Generation Benchmark Aligned with Real-World Software Development Practices

Paper overview: RealBench: A Repo-Level Code Generation Benchmark Aligned with Real-World Software Development Practices

arxiv url: http://arxiv.org/abs/2604.22659v1
Date: Fri, 24 Apr 2026 15:35:54 GMT
Status: translation complete
Last updated in system: 2026-04-27 15:36:26.521694
Title: RealBench: A Repo-Level Code Generation Benchmark Aligned with Real-World Software Development Practices
Title (reference translation): RealBench: a repository-level code generation benchmark aligned with real-world software development practices
Authors: Jia Li, Hongyi Deng, Yiran Zhang, Kechi Zhang, Tianqi Shao, Tiankuo Zhao, Weinan Wang, Zhi Jin, Ge Li, Yang Liu, Yingtao Fang, Yihong Dong
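Abstract summary: Researchers have made substantial progress using Large Language Models (LLMs) for code generation. However, developers typically write code based on structured designs or specifications rather than raw natural language descriptions. This gap between existing benchmarks and real industry development practices means that current benchmark scores may not accurately reflect how much code generation can help automate development tasks.
Reference score (independently computed attention measure): 54.956760584923295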
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Writing code requires significant time and effort in software development. To automate this process, researchers have made substantial progress using Large Language Models (LLMs) for code generation. Many benchmarks like HumanEval and EvoCodeBench have been created to evaluate LLMs by requiring them to generate code from natural language requirements. However, in enterprise applications and team development, developers typically write code based on structured designs or specifications rather than raw natural language descriptions. This gap between existing benchmarks and real industry development practices means that current benchmark scores may not accurately reflect how much code generation can help automate software development tasks. To address this gap, we propose RealBench, a repository-level code generation benchmark aligned with real-world industry software development practices. Each example includes both natural language requirements and UML diagrams as system design, matching how developers typically receive specifications. Based on the constructed benchmarks, we conduct a systematic evaluation of advanced LLMs' code generation capabilities when provided with structured system designs. The experimental results reveal key insights in current LLMs' capabilities for repo-level code generation aligned with real-world software development practices. First, we notice that regarding repo-level code generation, LLMs show much worse performance and there are significant performance gaps among LLMs. Second, LLMs are good at finding and creating modules defined in UML diagrams, but the quality of generated modules is often poor due to grammar and logic errors. Third, generating the entire repository at once is the best generation strategy on smaller repositories, while generating a complex repository with the module-by-module strategy works better compared to other strategies.
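Abstract (reference translation): Writing code takes significant time and effort in software development. To automate this process, researchers have made substantial progress using Large Language Models (LLMs) for code generation. Many benchmarks, such as HumanEval and EvoCodeBench, have been created to evaluate LLMs by requiring them to generate code from natural language requirements. However, in enterprise applications and team development, developers typically write code based on structured designs or specifications rather than raw natural language descriptions. This gap between existing benchmarks and real industry development practices means that current benchmark scores may not accurately reflect how much code generation can help automate software development tasks. To address this gap, we propose RealBench, a repository-level code generation benchmark aligned with real-world industry software development practices. Each example includes both natural language requirements and UML diagrams as the system design, matching how developers typically receive specifications. Based on the constructed benchmark, we systematically evaluate the code generation capabilities of advanced LLMs when they are provided with structured system designs. The experimental results reveal key insights into the capabilities of current LLMs for repo-level code generation aligned with real-world software development practices. First, on repo-level code generation, LLMs perform much worse, and there are significant performance gaps among LLMs. Second, LLMs are good at finding and creating the modules defined in UML diagrams, but the quality of the generated modules is often poor because of grammar and logic errors. Third, generating the entire repository at once is the best strategy for smaller repositories, while the module-by-module strategy works better than other strategies for complex repositories.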


Related papers