Fugu-MT Paper Translation (Summary): ComBench: A Repo-level Real-world Benchmark for Compilation Error Repair

Paper summary: ComBench: A Repo-level Real-world Benchmark for Compilation Error Repair

arxiv url: http://arxiv.org/abs/2603.27333v1
Date: Sat, 28 Mar 2026 16:35:34 GMT
Status: Translation complete
Updated in system: 2026-03-31 23:18:44.908038
Title: ComBench: A Repo-level Real-world Benchmark for Compilation Error Repair
Authors: Jia Li, Zeyang Zhuang, Zhuangbin Chen, Yuxin Su, Wei Meng, Michael R. Lyu
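Abstract summary: ComBench is the first repository-level, reproducible real-world benchmark for C/C++ compilation error repair. It is constructed through a novel, automated framework that mines real-world failures from GitHub CI histories. The experiments reveal a significant gap between a model's ability to achieve syntactic correctness and its ability to ensure semantic correctness.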
Reference score (independently calculated attention score): 36.10273400046946
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Compilation errors pose pervasive and critical challenges in software development, significantly hindering productivity. Therefore, Automated Compilation Error Repair (ACER) techniques are proposed to mitigate these issues. Despite recent advancements in ACER, its real-world performance remains poorly evaluated. This can be largely attributed to the limitations of existing benchmarks, i.e., decontextualized single-file data, lack of authentic source diversity, and biased local task modeling that ignores crucial repository-level complexities. To bridge this critical gap, we propose ComBench, the first repository-level, reproducible real-world benchmark for C/C++ compilation error repair. ComBench is constructed through a novel, automated framework that systematically mines real-world failures from the GitHub CI histories of large-scale open-source projects. Our framework contributes techniques for the high-precision identification of ground-truth repair patches from complex version histories and a high-fidelity mechanism for reproducing the original, ephemeral build environments. To ensure data quality, all samples in ComBench are execution-verified, guaranteeing reproducible failures and build success with ground-truth patches. Using ComBench, we conduct a comprehensive evaluation of 12 modern LLMs under both direct and agent-based repair settings. Our experiments reveal a significant gap between a model's ability to achieve syntactic correctness (a 73% success rate for GPT-5) and its ability to ensure semantic correctness (only 41% of its patches are valid). We also find that different models exhibit distinct specializations for different error types. ComBench provides a robust and realistic platform to guide the future development of ACER techniques capable of addressing the complexities of modern software development.
