Related papers: Beyond Correctness: Benchmarking Multi-dimensional Code Generation for Large Language Models

Beyond Correctness: Benchmarking Multi-dimensional Code Generation for Large Language Models

URL: http://arxiv.org/abs/2407.11470v1
Date: Tue, 16 Jul 2024 08:08:48 GMT
Title: Beyond Correctness: Benchmarking Multi-dimensional Code Generation for Large Language Models
Authors: Jiasheng Zheng, Boxi Cao, Zhengzhao Ma, Ruotong Pan, Hongyu Lin, Yaojie Lu, Xianpei Han, Le Sun,
Abstract summary: This paper proposes the RACE benchmark, which comprehensively evaluates the quality of code generated by large language models (LLMs) We design various types of user requirements for each dimension to assess the model's ability to generate correct code that also meets user demands. We evaluate 18 representative LLMs on RACE and find that the current LLMs' ability to generate high-quality code on demand does not yet meet the requirements of software development.
Score: 43.56644186785491
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In recent years, researchers have proposed numerous benchmarks to evaluate the impressive coding capabilities of large language models (LLMs). However, existing benchmarks primarily focus on assessing the correctness of code generated by LLMs, while neglecting other critical dimensions that also significantly impact code quality. Therefore, this paper proposes the RACE benchmark, which comprehensively evaluates the quality of code generated by LLMs across 4 dimensions: Readability, mAintainability, Correctness, and Efficiency. Specifically, considering the demand-dependent nature of dimensions beyond correctness, we design various types of user requirements for each dimension to assess the model's ability to generate correct code that also meets user demands. We evaluate 18 representative LLMs on RACE and find that: 1) the current LLMs' ability to generate high-quality code on demand does not yet meet the requirements of software development; 2) readability serves as a critical indicator of the overall quality of generated code; 3) most LLMs exhibit an inherent preference for specific coding style. These findings can help researchers gain a deeper understanding of the coding capabilities of current LLMs and shed light on future directions for model improvement.

Related papers

Is LLM-Generated Code More Maintainable \& Reliable than Human-Written Code? [4.893345190925178]
This study compares the internal quality attributes of LLM-generated and human-written code.<n>Our analysis shows that LLM-generated code has fewer bugs and requires less effort to fix them overall.
arXiv Detail & Related papers (2025-08-01T15:17:34Z)
Evaluating Large Language Model with Knowledge Oriented Language Specific Simple Question Answering [73.73820209993515]
We introduce KoLasSimpleQA, the first benchmark evaluating the multilingual factual ability of Large Language Models (LLMs)<n>Inspired by existing research, we created the question set with features such as single knowledge point coverage, absolute objectivity, unique answers, and temporal stability.<n>Results show significant performance differences between the two domains.
arXiv Detail & Related papers (2025-05-22T12:27:02Z)
CoCo-Bench: A Comprehensive Code Benchmark For Multi-task Large Language Model Evaluation [19.071855537400463]
Large language models (LLMs) play a crucial role in software engineering, excelling in tasks like code generation and maintenance. CoCo-Bench is designed to evaluate LLMs across four critical dimensions: code understanding, code generation, code modification, and code review.
arXiv Detail & Related papers (2025-04-29T11:57:23Z)
LaRA: Benchmarking Retrieval-Augmented Generation and Long-Context LLMs -- No Silver Bullet for LC or RAG Routing [70.35888047551643]
We present LaRA, a novel benchmark specifically designed to rigorously compare RAG and LC LLMs. LaRA encompasses 2326 test cases across four practical QA task categories and three types of naturally occurring long texts. We find that the optimal choice between RAG and LC depends on a complex interplay of factors, including the model's parameter size, long-text capabilities, context length, task type, and the characteristics of the retrieved chunks.
arXiv Detail & Related papers (2025-02-14T08:04:22Z)
Correctness Assessment of Code Generated by Large Language Models Using Internal Representations [4.32362000083889]
We introduce OPENIA, a novel framework to assess the correctness of code generated by Large Language Models (LLMs) Our empirical analysis reveals that these internal representations encode latent information, which strongly correlates with the correctness of the generated code. OPENIA consistently outperforms baseline models, achieving higher accuracy, precision, recall, and F1-Scores with up to a 2X improvement in standalone code generation.
arXiv Detail & Related papers (2025-01-22T15:04:13Z)
A Real-World Benchmark for Evaluating Fine-Grained Issue Solving Capabilities of Large Language Models [11.087034068992653]
FAUN-Eval is a benchmark specifically designed to evaluate the Fine-grAined issUe solviNg capabilities of LLMs. It is constructed using a dataset curated from 30 well-known GitHub repositories. We evaluate ten LLMs with FAUN-Eval, including four closed-source and six open-source models.
arXiv Detail & Related papers (2024-11-27T03:25:44Z)
Precision or Peril: Evaluating Code Quality from Quantized Large Language Models [0.5249805590164902]
Quantization has emerged as a way to mitigate the memory overhead of Large Language Models. This study aims to evaluate the current code generation capabilities of smaller LLMs using various metrics.
arXiv Detail & Related papers (2024-11-16T01:31:29Z)
OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models [70.72097493954067]
Large language models (LLMs) for code have become indispensable in various domains, including code generation, reasoning tasks and agent systems. While open-access code LLMs are increasingly approaching the performance levels of proprietary models, high-quality code LLMs remain limited. We introduce OpenCoder, a top-tier code LLM that not only achieves performance comparable to leading models but also serves as an "open cookbook" for the research community.
arXiv Detail & Related papers (2024-11-07T17:47:25Z)
LLM-CI: Assessing Contextual Integrity Norms in Language Models [1.1715858161748576]
Large language models (LLMs) may inadvertently encode societal preferences and norms. This is especially challenging due to prompt sensitivity$-$small variations in prompts yield different responses. We present LLM-CI, the first open-sourced framework to assess encoded norms.
arXiv Detail & Related papers (2024-09-05T17:50:31Z)
A Survey on Evaluating Large Language Models in Code Generation Tasks [30.256255254277914]
This paper provides a comprehensive review of the current methods and metrics used to evaluate the performance of Large Language Models (LLMs) in code generation tasks. With the rapid growth in demand for automated software development, LLMs have demonstrated significant potential in the field of code generation.
arXiv Detail & Related papers (2024-08-29T12:56:06Z)
What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [80.18342600996601]
Large language models (LLMs) produce code that is shorter yet more complicated as compared to canonical solutions. We develop a taxonomy of bugs for incorrect codes that includes three categories and 12 sub-categories, and analyze the root cause for common bug types. We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback.
arXiv Detail & Related papers (2024-07-08T17:27:17Z)
CLAMBER: A Benchmark of Identifying and Clarifying Ambiguous Information Needs in Large Language Models [60.59638232596912]
We introduce CLAMBER, a benchmark for evaluating large language models (LLMs) Building upon the taxonomy, we construct 12K high-quality data to assess the strengths, weaknesses, and potential risks of various off-the-shelf LLMs. Our findings indicate the limited practical utility of current LLMs in identifying and clarifying ambiguous user queries.
arXiv Detail & Related papers (2024-05-20T14:34:01Z)
RepEval: Effective Text Evaluation with LLM Representation [55.26340302485898]
RepEval is a metric that leverages the projection of Large Language Models (LLMs) representations for evaluation. Our work underscores the richness of information regarding text quality embedded within LLM representations, offering insights for the development of new metrics.
arXiv Detail & Related papers (2024-04-30T13:50:55Z)
Flames: Benchmarking Value Alignment of LLMs in Chinese [86.73527292670308]
This paper proposes a value alignment benchmark named Flames. It encompasses both common harmlessness principles and a unique morality dimension that integrates specific Chinese values. Our findings indicate that all the evaluated LLMs demonstrate relatively poor performance on Flames.
arXiv Detail & Related papers (2023-11-12T17:18:21Z)
CodeApex: A Bilingual Programming Evaluation Benchmark for Large Language Models [43.655927559990616]
We propose CodeApex, a benchmark dataset focusing on the programming comprehension, code generation, and code correction abilities of LLMs. We evaluate 12 widely used LLMs, including both general-purpose and specialized models. GPT-4 exhibits the best programming capabilities, achieving approximate accuracy of 69%, 54%, and 66% on the three tasks, respectively.
arXiv Detail & Related papers (2023-09-05T04:12:01Z)

This list is automatically generated from the titles and abstracts of the papers in this site.