On Evaluating the Efficiency of Source Code Generated by LLMs
- URL: http://arxiv.org/abs/2404.06041v1
- Date: Tue, 9 Apr 2024 05:59:39 GMT
- Title: On Evaluating the Efficiency of Source Code Generated by LLMs
- Authors: Changan Niu, Ting Zhang, Chuanyi Li, Bin Luo, Vincent Ng,
- Abstract summary: More efficient code can lead to higher performance and execution efficiency of programs and software completed by LLM-assisted programming.
First, we evaluate the efficiency of the code generated by LLMs on two benchmarks, HumanEval and MBPP.
Then, we choose a set of programming problems from the online judge platform LeetCode to conduct a more difficult evaluation.
- Score: 31.8121544062256
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent years have seen the remarkable capabilities of large language models (LLMs) for code generation. Different from existing work that evaluate the correctness of the code generated by LLMs, we propose to further evaluate its efficiency. More efficient code can lead to higher performance and execution efficiency of programs and software completed by LLM-assisted programming. First, we evaluate the efficiency of the code generated by LLMs on two benchmarks, HumanEval and MBPP. Then, we choose a set of programming problems from the online judge platform LeetCode to conduct a more difficult evaluation. Finally, we explore several prompts that would enable LLMs to generate more efficient code.
Related papers
- Rethinking Code Refinement: Learning to Judge Code Efficiency [60.04718679054704]
Large Language Models (LLMs) have demonstrated impressive capabilities in understanding and generating codes.
We propose a novel method based on the code language model that is trained to judge the efficiency between two different codes.
We validate our method on multiple programming languages with multiple refinement steps, demonstrating that the proposed method can effectively distinguish between more and less efficient versions of code.
arXiv Detail & Related papers (2024-10-29T06:17:37Z) - AIME: AI System Optimization via Multiple LLM Evaluators [79.03422337674664]
AIME is an evaluation protocol that utilizes multiple LLMs that each independently generate an evaluation on separate criteria and then combine them via concatenation.
We show AIME outperforming baseline methods in code generation tasks, with up to $62%$ higher error detection rate and up to $16%$ higher success rate than a single LLM evaluation protocol on LeetCodeHard and HumanEval datasets.
arXiv Detail & Related papers (2024-10-04T04:03:24Z) - A Survey on Evaluating Large Language Models in Code Generation Tasks [30.256255254277914]
This paper provides a comprehensive review of the current methods and metrics used to evaluate the performance of Large Language Models (LLMs) in code generation tasks.
With the rapid growth in demand for automated software development, LLMs have demonstrated significant potential in the field of code generation.
arXiv Detail & Related papers (2024-08-29T12:56:06Z) - EPiC: Cost-effective Search-based Prompt Engineering of LLMs for Code Generation [8.009881267479189]
Large Language Models (LLMs) have seen increasing use in various software development tasks, especially in code generation.
We propose an alternative approach named Evolutionary Prompt Engineering for Code (EPiC) to evolve the original prompts toward better ones that produce high-quality code.
Our evaluation against state-of-the-art (SOTA) LLM-based code generation models shows that EPiC outperforms all the baselines in terms of cost-effectiveness.
arXiv Detail & Related papers (2024-08-20T21:15:36Z) - A Performance Study of LLM-Generated Code on Leetcode [1.747820331822631]
This study evaluates the efficiency of code generation by Large Language Models (LLMs)
We compare 18 LLMs, considering factors such as model temperature and success rate, and their impact on code performance.
We find that LLMs are capable of generating code that is, on average, more efficient than the code written by humans.
arXiv Detail & Related papers (2024-07-31T13:10:03Z) - How Efficient is LLM-Generated Code? A Rigorous & High-Standard Benchmark [39.13045037676502]
Development of large language models (LLMs) has significantly pushed the frontiers of program synthesis.
Most evaluation frameworks focus on the (functional) correctness of generated code; efficiency, as an important measure of code quality, has been overlooked in existing evaluations.
We develop ENAMEL, a rigorous and high-standard benchmark for evaluating the capability of LLMs in generating efficient code.
arXiv Detail & Related papers (2024-06-10T04:19:20Z) - InfiBench: Evaluating the Question-Answering Capabilities of Code Large Language Models [56.723509505549536]
InfiBench is the first large-scale freeform question-answering (QA) benchmark for code to our knowledge.
It comprises 234 carefully selected high-quality Stack Overflow questions that span across 15 programming languages.
We conduct a systematic evaluation for over 100 latest code LLMs on InfiBench, leading to a series of novel and insightful findings.
arXiv Detail & Related papers (2024-03-11T02:06:30Z) - StepCoder: Improve Code Generation with Reinforcement Learning from
Compiler Feedback [58.20547418182074]
We introduce StepCoder, a novel framework for code generation, consisting of two main components.
CCCS addresses the exploration challenge by breaking the long sequences code generation task into a Curriculum of Code Completion Subtasks.
FGO only optimize the model by masking the unexecuted code segments to provide Fine-Grained Optimization.
Our method improves the ability to explore the output space and outperforms state-of-the-art approaches in corresponding benchmarks.
arXiv Detail & Related papers (2024-02-02T13:14:31Z) - If LLM Is the Wizard, Then Code Is the Wand: A Survey on How Code
Empowers Large Language Models to Serve as Intelligent Agents [81.60906807941188]
Large language models (LLMs) are trained on a combination of natural language and formal language (code)
Code translates high-level goals into executable steps, featuring standard syntax, logical consistency, abstraction, and modularity.
arXiv Detail & Related papers (2024-01-01T16:51:20Z) - CodeApex: A Bilingual Programming Evaluation Benchmark for Large
Language Models [43.655927559990616]
We propose CodeApex, a benchmark dataset focusing on the programming comprehension, code generation, and code correction abilities of LLMs.
We evaluate 12 widely used LLMs, including both general-purpose and specialized models.
GPT-4 exhibits the best programming capabilities, achieving approximate accuracy of 69%, 54%, and 66% on the three tasks, respectively.
arXiv Detail & Related papers (2023-09-05T04:12:01Z) - LEVER: Learning to Verify Language-to-Code Generation with Execution [64.36459105535]
We propose LEVER, a simple approach to improve language-to-code generation by learning to verify the generated programs with their execution results.
Specifically, we train verifiers to determine whether a program sampled from the LLMs is correct or not based on the natural language input, the program itself and its execution results.
LEVER consistently improves over the base code LLMs(4.6% to 10.9% with code-davinci) and achieves new state-of-the-art results on all of them.
arXiv Detail & Related papers (2023-02-16T18:23:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.