Fugu-MT Paper Translation (Abstract): A Multi-Language Object-Oriented Programming Benchmark for Large Language Models

Paper overview: A Multi-Language Object-Oriented Programming Benchmark for Large Language Models

arxiv url: http://arxiv.org/abs/2509.26111v1
Date: Tue, 30 Sep 2025 11:30:08 GMT
Status: Translation complete
Last updated in system: 2025-10-01 17:09:04.520931
Title: A Multi-Language Object-Oriented Programming Benchmark for Large Language Models
Title (reference translation): A Multi-Language Object-Oriented Programming Benchmark for Large Language Models
Authors: Shuai Wang, Liang Ding, Li Shen, Yong Luo, Han Hu, Lefei Zhang, Fu Lin
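Abstract summary: A survey of 35 existing benchmarks reveals three major imbalances: 85.7% focus on a single programming language; 94.3% target only function-level or statement-level tasks; and over 80% include fewer than ten test cases on average.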
Reference score (independently computed attention level): 61.267115598083315
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Establishing fair and robust benchmarks is essential for evaluating intelligent code generation by large language models (LLMs). Our survey of 35 existing benchmarks uncovers three major imbalances: 85.7% focus on a single programming language; 94.3% target only function-level or statement-level tasks; and over 80% include fewer than ten test cases on average. To address these gaps, we propose MultiOOP, a multi-language object-oriented programming benchmark covering six popular languages (Python, PHP, C++, C#, Java, JavaScript) with 267 tasks per language. We design a translator that extends an existing single-language OOP benchmark and the pass@o metric to a multilingual setting. Moreover, we propose an automated framework for augmenting test cases to ensure the reliability of the evaluation results. We evaluate 14 mainstream LLMs under zero-shot prompting and report three key findings: 1) Substantial performance degradation: pass@1 scores on MultiOOP drop by up to 65.6 percentage points compared to function-level tasks (e.g., HumanEval). 2) Cross-language variability: GPT-4o mini achieves pass@1 of 48.06% in Python but only 0.12%-15.26% in other languages, indicating limited multilingual generalization. 3) Conceptual gaps: pass@o scores are consistently 1.1-19.2 points lower than pass@k, demonstrating that LLMs often generate executable code without fully capturing core OOP concepts. Our benchmark, metric extensions, and evaluation scripts will be publicly released to foster a more balanced and comprehensive assessment of LLMs in object-oriented code generation. Our code and data will be released at https://github.com/alphadl/OOP-eval and https://huggingface.co/datasets/codeai-dteam/MultiOOP respectively.
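Abstract (reference translation): Establishing fair and robust benchmarks is essential for evaluating intelligent code generation by large language models (LLMs). A survey of 35 existing benchmarks reveals three major imbalances: 85.7% focus on a single programming language, 94.3% target only function-level or statement-level tasks, and more than 80% include fewer than ten test cases on average. To address these gaps, MultiOOP is proposed as a multi-language object-oriented programming benchmark covering six popular languages (Python, PHP, C++, C#, Java, JavaScript) with 267 tasks per language. A translator is designed to extend an existing single-language OOP benchmark and the pass@o metric to a multilingual setting, and an automated framework for augmenting test cases is proposed to ensure the reliability of the evaluation results. Fourteen mainstream LLMs are evaluated under zero-shot prompting, with three key findings. 1) Substantial performance degradation: compared with function-level tasks (e.g., HumanEval), pass@1 scores on MultiOOP drop by up to 65.6 percentage points. 2) Cross-language variability: GPT-4o mini achieves pass@1 of 48.06% in Python but only 0.12%-15.26% in other languages, indicating limited multilingual generalization. 3) Conceptual gaps: pass@o scores are consistently 1.1-19.2 points lower than pass@k, showing that LLMs often generate executable code without fully capturing core OOP concepts. The benchmark, metric extensions, and evaluation scripts will be publicly released to foster a more balanced and comprehensive assessment of LLMs in object-oriented code generation. The code and data will be released at https://github.com/alphadl/OOP-eval and https://huggingface.co/datasets/codeai-dteam/MultiOOP, respectively.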
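
Note on the metric: the pass@1 and pass@k figures quoted in the abstract are usually computed with the unbiased estimator introduced alongside HumanEval, 1 - C(n-c, k) / C(n, k), where n samples are generated per task and c of them pass every test. The following is a minimal sketch under that assumption. The function names `pass_at_k` and `benchmark_pass_at_k` are hypothetical and are not taken from the paper's evaluation scripts. pass@o, which additionally checks OOP concepts, is not reproduced here because the abstract does not define it.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task.

    n: number of samples generated for the task
    c: number of those samples that pass all test cases
    k: sample budget being evaluated (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Probability that a random size-k subset contains no passing sample
    # is C(n-c, k) / C(n, k); math.comb returns 0 when k > n - c.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over tasks; each entry is (n, c) for one task."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Illustrative numbers only, not results from the paper.
    per_task = [(10, 3), (10, 0), (10, 10)]
    print(f"pass@1 = {benchmark_pass_at_k(per_task, 1):.4f}")
    print(f"pass@5 = {benchmark_pass_at_k(per_task, 5):.4f}")
```

With k = 1 the estimator reduces to c / n per task, so a benchmark-level pass@1 is simply the mean fraction of passing samples.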

Related papers