OODEval: Evaluating Large Language Models on Object-Oriented Design
- URL: http://arxiv.org/abs/2601.07602v1
- Date: Mon, 12 Jan 2026 14:51:31 GMT
- Title: OODEval: Evaluating Large Language Models on Object-Oriented Design
- Authors: Bingxu Xiao, Yunwei Dong, Yiqi Tang, Manqing Zhang, Yifan Zhou, Chunyan Ma, Yepang Liu
- Abstract summary: We evaluate 29 large language models (LLMs) on object-oriented design tasks. Top-performing LLMs nearly match the average performance of undergraduates, but remain significantly below the level of the best human designers.
- Score: 10.295093285299403
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in large language models (LLMs) have driven extensive evaluations in software engineering. However, most prior work concentrates on code-level tasks, leaving software design capabilities underexplored. To fill this gap, we conduct a comprehensive empirical study evaluating 29 LLMs on object-oriented design (OOD) tasks. Owing to the lack of standardized benchmarks and metrics, we introduce OODEval, a manually constructed benchmark comprising 50 OOD tasks of varying difficulty, and OODEval-Human, the first human-rated OOD benchmark, which includes 940 undergraduate-submitted class diagrams evaluated by instructors. We further propose CLUE (Class Likeness Unified Evaluation), a unified metric set that assesses both global correctness and fine-grained design quality in class diagram generation. Using these benchmarks and metrics, we investigate five research questions: overall correctness, comparison with humans, model dimension analysis, task feature analysis, and bad case analysis. The results indicate that while LLMs achieve high syntactic accuracy, they exhibit substantial semantic deficiencies, particularly in method and relationship generation. Among the evaluated models, Qwen3-Coder-30B achieves the best overall performance, rivaling DeepSeek-R1 and GPT-4o, while Gemma3-4B-IT outperforms GPT-4o-Mini despite its smaller parameter scale. Although top-performing LLMs nearly match the average performance of undergraduates, they remain significantly below the level of the best human designers. Further analysis shows that parameter scale, code specialization, and instruction tuning strongly influence performance, whereas increased design complexity and lower requirement readability degrade it. Bad case analysis reveals common failure modes, including keyword misuse, missing classes or relationships, and omitted methods.
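The abstract does not spell out how CLUE scores a generated class diagram against a reference. As a rough illustration of the kind of element-level comparison such a metric might perform, the sketch below computes exact-match F1 over classes, methods, and relationships; the `ClassDiagram` structure, the exact-match rule, and the unweighted averaging are assumptions made here for illustration, not the paper's definition of CLUE.

```python
# Illustrative sketch only: element-level F1 between a generated and a
# reference class diagram. Data structures and matching rules are assumptions,
# not the CLUE metric as defined in the paper.
from dataclasses import dataclass, field


@dataclass
class ClassDiagram:
    classes: set[str] = field(default_factory=set)
    # (class, method) pairs, e.g. ("Order", "addItem()")
    methods: set[tuple[str, str]] = field(default_factory=set)
    # (source, kind, target) triples, e.g. ("Order", "aggregation", "Item")
    relationships: set[tuple[str, str, str]] = field(default_factory=set)


def f1(pred: set, gold: set) -> float:
    """Set-level F1: harmonic mean of precision and recall on exact matches."""
    if not pred and not gold:
        return 1.0
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    return 0.0 if total == 0 else 2 * precision * recall / total


def diagram_scores(pred: ClassDiagram, gold: ClassDiagram) -> dict[str, float]:
    """Fine-grained scores per element type plus an unweighted overall average."""
    scores = {
        "classes": f1(pred.classes, gold.classes),
        "methods": f1(pred.methods, gold.methods),
        "relationships": f1(pred.relationships, gold.relationships),
    }
    scores["overall"] = sum(scores.values()) / len(scores)
    return scores


if __name__ == "__main__":
    gold = ClassDiagram(
        classes={"Order", "Item"},
        methods={("Order", "addItem()"), ("Order", "total()")},
        relationships={("Order", "aggregation", "Item")},
    )
    pred = ClassDiagram(
        classes={"Order", "Item"},
        methods={("Order", "addItem()")},  # omitted method: a failure mode noted in the abstract
        relationships=set(),               # missing relationship: another noted failure mode
    )
    print(diagram_scores(pred, gold))
```

The deliberately degraded `pred` diagram mirrors the failure modes the abstract reports (omitted methods, missing relationships), so the per-element scores drop while the class score stays at 1.0.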
Related papers
- Holistic Evaluation of State-of-the-Art LLMs for Code Generation [5.504955093712013]
DeepSeek-R1 and GPT-4.1 consistently outperform others in terms of correctness, efficiency, and robustness. We identify common failure scenarios such as syntax errors, logical flaws, and suboptimal algorithms.
arXiv Detail & Related papers (2025-12-19T23:29:05Z) - When Models Can't Follow: Testing Instruction Adherence Across 256 LLMs [0.0]
This paper presents a streamlined evaluation framework using twenty carefully designed prompts to assess instruction-following. We demonstrate this framework through a large-scale empirical study conducted on October 14, 2025. Our findings reveal consistent failure modes and identify specific instruction types posing particular challenges.
arXiv Detail & Related papers (2025-10-18T16:33:15Z) - Sustainability via LLM Right-sizing [21.17523328451591]
Large language models (LLMs) have become increasingly embedded in organizational workflows. This study offers an empirical answer by evaluating eleven proprietary and open-weight LLMs across ten everyday occupational tasks. Results show that GPT-4o delivers consistently superior performance but at a significantly higher cost and environmental footprint.
arXiv Detail & Related papers (2025-04-17T04:00:40Z) - IHEval: Evaluating Language Models on Following the Instruction Hierarchy [67.33509094445104]
The instruction hierarchy establishes a priority order from system messages to user messages, conversation history, and tool outputs. Despite its importance, this topic receives limited attention, and there is a lack of comprehensive benchmarks for evaluating models' ability to follow the instruction hierarchy. We bridge this gap by introducing IHEval, a novel benchmark covering cases where instructions in different priorities either align or conflict.
arXiv Detail & Related papers (2025-02-12T19:35:28Z) - Evaluating Mathematical Reasoning Beyond Accuracy [50.09931172314218]
We introduce ReasonEval, a new methodology for evaluating the quality of reasoning steps. We show that ReasonEval consistently outperforms baseline methods in the meta-evaluation datasets. We observe that ReasonEval can play a significant role in data selection.
arXiv Detail & Related papers (2024-04-08T17:18:04Z) - F-Eval: Assessing Fundamental Abilities with Refined Evaluation Methods [102.98899881389211]
We propose F-Eval, a bilingual evaluation benchmark to evaluate the fundamental abilities, including expression, commonsense and logic.
For reference-free subjective tasks, we devise new evaluation methods, serving as alternatives to scoring by API models.
arXiv Detail & Related papers (2024-01-26T13:55:32Z) - MR-GSM8K: A Meta-Reasoning Benchmark for Large Language Model Evaluation [60.65820977963331]
We introduce a novel evaluation paradigm for Large Language Models (LLMs).
This paradigm shifts the emphasis from result-oriented assessments, which often neglect the reasoning process, to a more comprehensive evaluation.
By applying this paradigm in the GSM8K dataset, we have developed the MR-GSM8K benchmark.
arXiv Detail & Related papers (2023-12-28T15:49:43Z) - FollowEval: A Multi-Dimensional Benchmark for Assessing the Instruction-Following Capability of Large Language Models [42.72420855478716]
FollowEval benchmark is composed of instances in both English and Chinese.
Each test example is designed to evaluate more than one dimension.
We have evaluated various LLMs using the FollowEval benchmark and found that their performance significantly lags behind that of humans.
arXiv Detail & Related papers (2023-11-16T11:53:31Z) - Don't Make Your LLM an Evaluation Benchmark Cheater [142.24553056600627]
Large language models (LLMs) have greatly advanced the frontiers of artificial intelligence, attaining remarkable improvement in model capacity.
To assess the model performance, a typical approach is to construct evaluation benchmarks for measuring the ability level of LLMs.
We discuss the potential risk and impact of inappropriately using evaluation benchmarks and misleadingly interpreting the evaluation results.
arXiv Detail & Related papers (2023-11-03T14:59:54Z) - Large Language Models are Not Yet Human-Level Evaluators for Abstractive Summarization [66.08074487429477]
We investigate the stability and reliability of large language models (LLMs) as automatic evaluators for abstractive summarization.
We find that while ChatGPT and GPT-4 outperform the commonly used automatic metrics, they are not ready as human replacements.
arXiv Detail & Related papers (2023-05-22T14:58:13Z) - Evaluating the Performance of Large Language Models on GAOKAO Benchmark [53.663757126289795]
This paper introduces GAOKAO-Bench, an intuitive benchmark that employs questions from the Chinese GAOKAO examination as test samples.
With human evaluation, we obtain the converted total score of LLMs, including GPT-4, ChatGPT and ERNIE-Bot.
We also use LLMs to grade the subjective questions, and find that model scores achieve a moderate level of consistency with human scores.
arXiv Detail & Related papers (2023-05-21T14:39:28Z)