Holistic Evaluation of State-of-the-Art LLMs for Code Generation
- URL: http://arxiv.org/abs/2512.18131v1
- Date: Fri, 19 Dec 2025 23:29:05 GMT
- Title: Holistic Evaluation of State-of-the-Art LLMs for Code Generation
- Authors: Le Zhang, Suresh Kothari
- Abstract summary: DeepSeek-R1 and GPT-4.1 consistently outperform others in terms of correctness, efficiency, and robustness. We identify common failure scenarios such as syntax errors, logical flaws, and suboptimal algorithms.
- Score: 5.504955093712013
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This study presents a comprehensive empirical evaluation of six state-of-the-art large language models (LLMs) for code generation, including both general-purpose and code-specialized models. Using a dataset of 944 real-world LeetCode problems across five programming languages, we assess model performance using rigorous metrics: compile-time errors, runtime errors, functional failures, and algorithmic suboptimalities. The results reveal significant performance variations, with DeepSeek-R1 and GPT-4.1 consistently outperforming others in terms of correctness, efficiency, and robustness. Through detailed case studies, we identify common failure scenarios such as syntax errors, logical flaws, and suboptimal algorithms, highlighting the critical role of prompt engineering and human oversight in improving results. Based on these findings, we provide actionable recommendations for developers and practitioners, emphasizing that successful LLM deployment depends on careful model selection, effective prompt design, and context-aware usage to ensure reliable code generation in real-world software development tasks.
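The four metric categories named in the abstract (compile-time errors, runtime errors, functional failures, and algorithmic suboptimalities) can be illustrated with a small evaluation harness. The sketch below is a minimal, assumed example of how a generated Python solution might be bucketed into those categories against LeetCode-style test cases; the function name, the timeout-as-suboptimality heuristic, and the stdin/stdout test format are illustrative assumptions, not the authors' actual harness.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Hypothetical harness illustrating the four outcome categories from the
# abstract: compile-time error, runtime error, functional failure, and
# algorithmic suboptimality. Names and thresholds are assumptions, not
# the paper's evaluation code.

def evaluate_solution(source: str, test_cases: list[tuple[str, str]],
                      time_limit_s: float = 2.0) -> str:
    """Classify one generated Python solution against (input, expected) tests."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(source)
        path = Path(f.name)

    # 1. Compile-time (syntax) check.
    try:
        compile(source, str(path), "exec")
    except SyntaxError:
        return "compile_error"

    for stdin_data, expected in test_cases:
        try:
            proc = subprocess.run(
                [sys.executable, str(path)],
                input=stdin_data, capture_output=True,
                text=True, timeout=time_limit_s,
            )
        except subprocess.TimeoutExpired:
            # Exceeding the time limit is treated here as a suboptimal
            # algorithm (e.g. O(n^2) where O(n log n) is expected).
            return "suboptimal_algorithm"

        # 2. Runtime error: the program crashed on valid input.
        if proc.returncode != 0:
            return "runtime_error"

        # 3. Functional failure: wrong answer on at least one test.
        if proc.stdout.strip() != expected.strip():
            return "functional_failure"

    return "passed"
```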
Related papers
- Readability-Robust Code Summarization via Meta Curriculum Learning [53.44612630063336]
In the real world, code is often poorly structured or obfuscated, significantly degrading model performance. We propose RoFTCodeSum, a novel fine-tuning method that enhances the robustness of code summarization against poorly readable code.
arXiv Detail & Related papers (2026-01-09T02:38:24Z) - Large Language Model enabled Mathematical Modeling [2.132096006921049]
This research investigates the potential of Large Language Models (LLMs) to bridge the formulation gap using natural language understanding and code generation. DeepSeek-R1 is a cost-efficient and high-performing model trained with reinforcement learning. Our methodology includes baseline assessments, the development of a hallucination taxonomy, and the application of mitigation strategies.
arXiv Detail & Related papers (2025-10-22T17:41:42Z) - Leveraging Test Driven Development with Large Language Models for Reliable and Verifiable Spreadsheet Code Generation: A Research Framework [0.0]
This position paper proposes a structured research framework that integrates the proven software engineering practice of Test-Driven Development (TDD) with Large Language Model (LLM) driven generation. By emphasising test-driven thinking, we aim to improve computational thinking, prompt engineering skills, and user engagement.
arXiv Detail & Related papers (2025-10-17T12:28:16Z) - Self-Evolving Critique Abilities in Large Language Models [59.861013614500024]
This paper explores enhancing critique abilities of Large Language Models (LLMs). We introduce SCRIT, a framework that trains LLMs with self-generated data to evolve their critique abilities. Our analysis reveals that SCRIT's performance scales positively with data and model size.
arXiv Detail & Related papers (2025-01-10T05:51:52Z) - Language Models for Code Optimization: Survey, Challenges and Future Directions [7.928856221466083]
Language models (LMs) built upon deep neural networks (DNNs) have recently demonstrated breakthrough effectiveness in software engineering tasks. This study aims to provide actionable insights and references for both researchers and practitioners in this rapidly evolving field.
arXiv Detail & Related papers (2025-01-02T14:20:36Z) - What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [92.62952504133926]
This study evaluated the performance of three leading closed-source LLMs and six popular open-source LLMs on three commonly used benchmarks. We developed a taxonomy of bugs for incorrect code and analyzed the root causes of common bug types. We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code.
arXiv Detail & Related papers (2024-07-08T17:27:17Z) - Towards Coarse-to-Fine Evaluation of Inference Efficiency for Large Language Models [95.96734086126469]
Large language models (LLMs) can serve as assistants that help users accomplish their jobs, and also support the development of advanced applications.
For the wide application of LLMs, inference efficiency is an essential concern, which has been widely studied in existing work.
We perform a detailed coarse-to-fine analysis of the inference performance of various code libraries.
arXiv Detail & Related papers (2024-04-17T15:57:50Z) - Exploring Data-Efficient Adaptation of Large Language Models for Code Generation [64.5583894165813]
We propose a novel adaptation approach named DEED, which stands for Data-Efficient adaptation with Error-Driven learning for code generation. Experimental results show that, compared to other mainstream fine-tuning approaches, DEED achieves superior performance with little training data.
arXiv Detail & Related papers (2024-02-29T16:09:02Z) - LLM4TDD: Best Practices for Test Driven Development Using Large Language Models [0.76146285961466]
This paper explores the concept of LLM4TDD, where we guide Large Language Models to generate code iteratively using a test-driven development methodology.
We conduct an empirical evaluation using ChatGPT and coding problems from LeetCode to investigate the impact of different test, prompt, and problem attributes on the efficacy of LLM4TDD (a generic sketch of this generate-test-refine loop appears after this list).
arXiv Detail & Related papers (2023-12-07T20:37:54Z) - L2CEval: Evaluating Language-to-Code Generation Capabilities of Large Language Models [102.00201523306986]
We present L2CEval, a systematic evaluation of the language-to-code generation capabilities of large language models (LLMs).
We analyze the factors that potentially affect their performance, such as model size, pretraining data, instruction tuning, and different prompting methods.
In addition to assessing model performance, we measure confidence calibration for the models and conduct human evaluations of the output programs.
arXiv Detail & Related papers (2023-09-29T17:57:00Z)
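Several of the related papers above (LLM4TDD, and the training-free self-critique method in "What's Wrong with Your Code Generated by Large Language Models?") revolve around the same iterative pattern: generate code, run the tests, and feed failures back to the model. The following is a generic, assumed sketch of such a generate-test-refine loop; `llm_generate` and `run_tests` are placeholder callbacks, not APIs from the cited papers.

```python
from typing import Callable

def generate_with_tests(problem: str,
                        run_tests: Callable[[str], list[str]],
                        llm_generate: Callable[[str], str],
                        max_rounds: int = 3) -> str:
    """Generic generate -> test -> refine loop for LLM code generation.

    Both callbacks are assumed placeholders: `llm_generate` wraps whatever
    model API is in use, and `run_tests` returns a list of failure
    messages (empty when every test passes).
    """
    prompt = f"Write a solution for the following problem:\n{problem}"
    code = llm_generate(prompt)

    for _ in range(max_rounds):
        failures = run_tests(code)
        if not failures:
            return code  # all tests pass
        # Feed the failing tests back so the model can critique and repair
        # its previous attempt (the self-critique / TDD idea above).
        failure_report = "\n".join(failures)
        prompt = (
            "The solution below fails these tests:\n"
            f"{failure_report}\n\nSolution:\n{code}\n\n"
            "Revise the solution so that all tests pass."
        )
        code = llm_generate(prompt)

    return code  # best effort after max_rounds refinements
```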
This list is automatically generated from the titles and abstracts of the papers on this site.