Fugu-MT Paper Translation (Abstract): OODEval: Evaluating Large Language Models on Object-Oriented Design

Paper overview: OODEval: Evaluating Large Language Models on Object-Oriented Design

arxiv url: http://arxiv.org/abs/2601.07602v1
Date: Mon, 12 Jan 2026 14:51:31 GMT
Status: Translation complete
Last updated in system: 2026-01-13 19:08:01.468579
Title: OODEval: Evaluating Large Language Models on Object-Oriented Design
Title (reference translation): OODEval: Evaluating Large Language Models on Object-Oriented Design
Authors: Bingxu Xiao, Yunwei Dong, Yiqi Tang, Manqing Zhang, Yifan Zhou, Chunyan Ma, Yepang Liu
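Abstract summary: This study evaluates 29 large language models (LLMs) on object-oriented design tasks. Top-performing LLMs nearly match the average performance of undergraduates but remain significantly below the level of the best human designers.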
Reference score (independently computed attention score): 10.295093285299403
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent advances in large language models (LLMs) have driven extensive evaluations in software engineering. However, most prior work concentrates on code-level tasks, leaving software design capabilities underexplored. To fill this gap, we conduct a comprehensive empirical study evaluating 29 LLMs on object-oriented design (OOD) tasks. Owing to the lack of standardized benchmarks and metrics, we introduce OODEval, a manually constructed benchmark comprising 50 OOD tasks of varying difficulty, and OODEval-Human, the first human-rated OOD benchmark, which includes 940 undergraduate-submitted class diagrams evaluated by instructors. We further propose CLUE (Class Likeness Unified Evaluation), a unified metric set that assesses both global correctness and fine-grained design quality in class diagram generation. Using these benchmarks and metrics, we investigate five research questions: overall correctness, comparison with humans, model dimension analysis, task feature analysis, and bad case analysis. The results indicate that while LLMs achieve high syntactic accuracy, they exhibit substantial semantic deficiencies, particularly in method and relationship generation. Among the evaluated models, Qwen3-Coder-30B achieves the best overall performance, rivaling DeepSeek-R1 and GPT-4o, while Gemma3-4B-IT outperforms GPT-4o-Mini despite its smaller parameter scale. Although top-performing LLMs nearly match the average performance of undergraduates, they remain significantly below the level of the best human designers. Further analysis shows that parameter scale, code specialization, and instruction tuning strongly influence performance, whereas increased design complexity and lower requirement readability degrade it. Bad case analysis reveals common failure modes, including keyword misuse, missing classes or relationships, and omitted methods.
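Abstract (reference translation): Recent advances in large language models (LLMs) have prompted extensive evaluations in software engineering. However, most prior work concentrates on code-level tasks, and software design capabilities remain underexplored. To fill this gap, we conducted a comprehensive empirical study evaluating 29 LLMs on object-oriented design (OOD) tasks. Because standardized benchmarks and metrics are lacking, we introduce OODEval, a manually constructed benchmark of 50 OOD tasks of varying difficulty, and OODEval-Human, the first human-rated OOD benchmark, which contains 940 class diagrams submitted by undergraduates and evaluated by instructors. We further propose CLUE (Class Likeness Unified Evaluation), a unified metric set that assesses both global correctness and fine-grained design quality in class diagram generation. Using these benchmarks and metrics, we investigate five research questions: overall correctness, comparison with humans, model dimension analysis, task feature analysis, and bad case analysis. The results indicate that LLMs achieve high syntactic accuracy but show substantial semantic deficiencies, particularly in method and relationship generation. Among the evaluated models, Qwen3-Coder-30B achieves the best overall performance, rivaling DeepSeek-R1 and GPT-4o, while Gemma3-4B-IT outperforms GPT-4o-Mini despite its smaller parameter scale. Top-performing LLMs nearly match the average performance of undergraduates but remain significantly below the level of the best human designers. Further analysis shows that parameter scale, code specialization, and instruction tuning strongly influence performance, whereas increased design complexity and lower requirement readability degrade it. Bad case analysis reveals common failure modes, including keyword misuse, missing classes or relationships, and omitted methods.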

Related papers