Fugu-MT Paper Translation (Abstract): Assessing and Improving the Representativeness of Code Generation Benchmarks Using Knowledge Units (KUs) of Programming Languages -- An Empirical Study

Paper overview: Assessing and Improving the Representativeness of Code Generation Benchmarks Using Knowledge Units (KUs) of Programming Languages -- An Empirical Study

arxiv url: http://arxiv.org/abs/2601.03780v1
Date: Wed, 07 Jan 2026 10:23:33 GMT
Status: Translation complete
Last updated in system: 2026-01-09 02:15:23.470395
Title: Assessing and Improving the Representativeness of Code Generation Benchmarks Using Knowledge Units (KUs) of Programming Languages -- An Empirical Study
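Title (reference translation): Evaluating and Improving the Representativeness of Code Generation Benchmarks Using Knowledge Units (KUs) of Programming Languages: An Empirical Study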
Authors: Md Ahasanuzzaman, Bram Adams, Emad Fallahzadeh, Gustavo A. Oliva, Ahmed E. Hassan
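Abstract summary: Large language models (LLMs) have shown impressive performance in code generation. LLMs must understand and apply a wide range of language concepts. If the concepts exercised in benchmarks are not representative of those used in real-world projects, evaluations may be incomplete.
Reference score (attention score computed by this site): 7.0773305889955616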
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) such as GPT-4, Claude and LLaMA have shown impressive performance in code generation, typically evaluated using benchmarks (e.g., HumanEval). However, effective code generation requires models to understand and apply a wide range of language concepts. If the concepts exercised in benchmarks are not representative of those used in real-world projects, evaluations may yield incomplete results. Despite this concern, the representativeness of code concepts in benchmarks has not been systematically examined. To address this gap, we present the first empirical study that analyzes the representativeness of code generation benchmarks through the lens of Knowledge Units (KUs) - cohesive sets of programming language capabilities provided by language constructs and APIs. We analyze KU coverage in two widely used Python benchmarks, HumanEval and MBPP, and compare them with 30 real-world Python projects. Our results show that each benchmark covers only half of the 20 identified KUs, whereas projects exercise all KUs with relatively balanced distributions. In contrast, benchmark tasks exhibit highly skewed KU distributions. To mitigate this misalignment, we propose a prompt-based LLM framework that synthesizes KU-based tasks to rebalance benchmark KU distributions and better align them with real-world usage. Using this framework, we generate 440 new tasks and augment existing benchmarks. The augmented benchmarks substantially improve KU coverage and achieve over a 60% improvement in distributional alignment. Evaluations of state-of-the-art LLMs on these augmented benchmarks reveal consistent and statistically significant performance drops (12.54-44.82%), indicating that existing benchmarks overestimate LLM performance due to their limited KU coverage. Our findings provide actionable guidance for building more realistic evaluations of LLM code-generation capabilities.
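Abstract (reference translation): Large language models (LLMs) such as GPT-4, Claude, and LLaMA have shown impressive performance in code generation, which is typically evaluated using benchmarks such as HumanEval. Effective code generation, however, requires understanding and applying a wide range of language concepts. If the concepts exercised in benchmarks do not represent those used in real-world projects, evaluations may be incomplete. Despite this concern, the representativeness of code concepts in benchmarks has not been systematically examined. To address this gap, we present the first empirical study that analyzes the representativeness of code generation benchmarks through Knowledge Units (KUs), cohesive sets of programming language capabilities provided by language constructs and APIs. We analyze KU coverage in two widely used Python benchmarks, HumanEval and MBPP, and compare it with 30 real-world Python projects. The results show that each benchmark covers only half of the 20 KUs, whereas the projects exercise all KUs with relatively balanced distributions. Benchmark tasks, by contrast, exhibit highly skewed KU distributions. To mitigate this misalignment, we propose a prompt-based LLM framework that synthesizes KU-based tasks to rebalance benchmark KU distributions and align them with real-world usage. Using this framework, we generate 440 new tasks and augment the existing benchmarks. The augmented benchmarks substantially improve KU coverage and achieve an improvement of over 60% in distributional alignment. Evaluations of state-of-the-art LLMs on the augmented benchmarks show consistent and statistically significant performance drops (12.54-44.82%), indicating that existing benchmarks overestimate LLM performance because of their limited KU coverage. This study provides actionable guidance for building more realistic evaluations of LLM code-generation capabilities.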
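The abstract refers to two quantities: KU coverage (how many of the identified KUs a task set exercises) and distributional alignment between benchmark and project KU usage. The sketch below illustrates one way these could be computed. It is not the authors' implementation: the abstract does not name the alignment metric, so Jensen-Shannon divergence is used here as an assumption, and the KU names and task lists are hypothetical toy data.

```python
import math
from collections import Counter

def ku_distribution(task_kus, all_kus):
    """Share of KU occurrences per KU across a task set (sums to 1)."""
    counts = Counter(ku for kus in task_kus for ku in kus)
    total = sum(counts.values())
    return {ku: (counts[ku] / total if total else 0.0) for ku in all_kus}

def ku_coverage(dist):
    """Fraction of KUs that occur at least once."""
    return sum(1 for p in dist.values() if p > 0) / len(dist)

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) over the same keys.

    0 means identical distributions, 1 means disjoint support.
    Assumption: the paper's actual alignment metric may differ.
    """
    def kl(a, b):
        return sum(a[k] * math.log2(a[k] / b[k]) for k in a if a[k] > 0)
    m = {k: 0.5 * (p[k] + q[k]) for k in p}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical KU names and toy task annotations (not from the paper).
ALL_KUS = ["string", "collections", "exceptions", "concurrency"]
benchmark_tasks = [["string"], ["string", "collections"], ["collections"], ["string"]]
project_files = [["string", "exceptions"], ["collections", "concurrency"],
                 ["exceptions", "string"], ["concurrency", "collections"]]
# Synthesized tasks that target the KUs the benchmark is missing.
new_tasks = [["exceptions"], ["concurrency"], ["exceptions", "concurrency"]]

bench = ku_distribution(benchmark_tasks, ALL_KUS)
proj = ku_distribution(project_files, ALL_KUS)
augmented = ku_distribution(benchmark_tasks + new_tasks, ALL_KUS)

print("coverage before:", ku_coverage(bench))
print("coverage after: ", ku_coverage(augmented))
print("JS divergence before:", js_divergence(bench, proj))
print("JS divergence after: ", js_divergence(augmented, proj))
```

In this toy setup the benchmark exercises two of the four KUs, and adding tasks for the missing KUs both raises coverage and moves the benchmark distribution closer to the project distribution, which mirrors the rebalancing idea described in the abstract.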


Related papers