Evaluating the Energy-Efficiency of the Code Generated by LLMs
- URL: http://arxiv.org/abs/2505.20324v1
- Date: Fri, 23 May 2025 18:13:27 GMT
- Title: Evaluating the Energy-Efficiency of the Code Generated by LLMs
- Authors: Md Arman Islam, Devi Varaprasad Jonnala, Ritika Rekhi, Pratik Pokharel, Siddharth Cilamkoti, Asif Imran, Tevfik Kosar, Bekir Turkkan
- Abstract summary: This paper investigates the energy efficiency of the code generated by 20 popular Large Language Models for 878 programming problems. Among the studied LLMs, DeepSeek-v3 and GPT-4o generate the most energy-efficient code. For specific algorithmic groups such as dynamic programming, backtracking, and bit manipulation, LLM-generated code can consume up to 450 times more energy than human-written canonical solutions.
- Score: 2.1983110147455482
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As the quality of code generated by Large Language Models (LLMs) improves, their adoption in the software industry for automated code generation continues to grow. Researchers primarily focus on enhancing the functional correctness of the generated code while commonly overlooking its energy efficiency and environmental impact. This paper investigates the energy efficiency of the code generated by 20 popular LLMs for 878 programming problems of varying difficulty levels and diverse algorithmic categories, selected from the LeetCode platform, by comparing the generated solutions against canonical human-written ones. Although LLMs can produce functionally correct results in most cases, our findings show that the performance and energy efficiency of LLM-produced solutions are often far below those of human-written solutions. Among the studied LLMs, DeepSeek-v3 and GPT-4o generate the most energy-efficient code, whereas Grok-2 and Gemini-1.5-Pro are among the least energy-efficient models. On average, human-written canonical solutions are approximately 1.17 times more energy efficient than DeepSeek-v3's code, 1.21 times more energy efficient than GPT-4o's, and over 2 times more energy efficient than Grok-2's and Gemini-1.5-Pro's. For specific algorithmic groups such as dynamic programming, backtracking, and bit manipulation, LLM-generated code can consume up to 450 times more energy than human-written canonical solutions.
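The paper's core measurement, recording the energy a solution consumes while it runs and comparing it against the canonical solution's, can be illustrated with a minimal sketch. This is not the authors' harness: it assumes a Linux machine that exposes Intel RAPL counters through the powercap sysfs interface, and the two toy functions below are hypothetical stand-ins for an LLM-generated and a human-written solution to the same problem.

```python
# Minimal sketch of per-solution energy measurement via Intel RAPL (assumed
# environment: Linux with a readable /sys/class/powercap energy counter).
# Counter wraparound and idle-power subtraction are ignored for brevity.
import time
from pathlib import Path

RAPL_COUNTER = Path("/sys/class/powercap/intel-rapl:0/energy_uj")

def read_energy_uj() -> int:
    """Current CPU-package energy counter, in microjoules."""
    return int(RAPL_COUNTER.read_text())

def measure(fn, *args, repeats: int = 100):
    """Average energy (joules) and wall time (seconds) per call of fn(*args)."""
    e0, t0 = read_energy_uj(), time.perf_counter()
    for _ in range(repeats):
        fn(*args)
    e1, t1 = read_energy_uj(), time.perf_counter()
    return (e1 - e0) / repeats / 1e6, (t1 - t0) / repeats

# Hypothetical pair of solutions to one problem: a naive exponential recursion
# standing in for LLM output, and a linear-time canonical version.
def generated_fib(n: int) -> int:
    return n if n < 2 else generated_fib(n - 1) + generated_fib(n - 2)

def canonical_fib(n: int) -> int:
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

if __name__ == "__main__":
    for name, fn in [("generated", generated_fib), ("canonical", canonical_fib)]:
        joules, seconds = measure(fn, 20)
        print(f"{name}: {joules * 1e3:.3f} mJ, {seconds * 1e3:.3f} ms per call")
```

Whole-machine counters are noisy, so a real harness would pin CPU frequency, repeat runs, and subtract idle draw; the sketch only shows the before/after counter pattern behind energy-per-solution comparisons like those reported above.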
Related papers
- On the Effectiveness of LLM-as-a-judge for Code Generation and Summarization [54.965787768076254]
Large Language Models have recently been exploited as judges for complex natural language processing tasks, such as Q&A. We study the effectiveness of LLMs-as-a-judge for two code-related tasks, namely code generation and code summarization.
arXiv Detail & Related papers (2025-07-22T13:40:26Z)
- Can We Make Code Green? Understanding Trade-Offs in LLMs vs. Human Code Optimizations [45.243401722182554]
Large language models (LLMs) claim to assist developers in optimizing code for performance and energy efficiency. This work focuses on software written in MATLAB, widely used in both academia and industry for scientific and engineering applications. We analyze energy-focused optimizations on 400 scripts across 100 top GitHub repositories.
arXiv Detail & Related papers (2025-03-26T00:27:29Z)
- AI-Powered, But Power-Hungry? Energy Efficiency of LLM-Generated Code [45.77395425799378]
This paper presents the first study analyzing the energy efficiency and performance of LLM-generated code for three programming languages: Python, Java, and C++. Our results show that the models are much more successful in generating Python and Java than C++ code.
arXiv Detail & Related papers (2025-02-04T15:32:34Z)
- PerfCodeGen: Improving Performance of LLM Generated Code with Execution Feedback [78.89596149768458]
Large Language Models (LLMs) are widely adopted for assisting in software development tasks. We propose PerfCodeGen, a training-free framework that enhances the performance of LLM-generated code.
arXiv Detail & Related papers (2024-11-18T06:22:38Z)
- Rethinking Code Refinement: Learning to Judge Code Efficiency [60.04718679054704]
Large Language Models (LLMs) have demonstrated impressive capabilities in understanding and generating code.
We propose a novel method based on a code language model that is trained to judge the relative efficiency of two different pieces of code.
We validate our method on multiple programming languages with multiple refinement steps, demonstrating that the proposed method can effectively distinguish between more and less efficient versions of code.
arXiv Detail & Related papers (2024-10-29T06:17:37Z)
- Large Language Models for Energy-Efficient Code: Emerging Results and Future Directions [2.848398051763324]
We propose a novel application of large language models (LLMs): as code optimizers for energy efficiency.
We describe and evaluate a prototype, finding that, over 6 small programs, our system can improve energy efficiency in 3 of them, by up to 2x compared to compiler optimizations alone.
arXiv Detail & Related papers (2024-10-11T20:35:40Z)
- What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [80.18342600996601]
Large language models (LLMs) produce code that is shorter yet more complicated than canonical solutions.
We develop a taxonomy of bugs for incorrect code that includes three categories and 12 sub-categories, and analyze the root causes of common bug types.
We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback.
arXiv Detail & Related papers (2024-07-08T17:27:17Z)
- Navigating the Labyrinth: Evaluating and Enhancing LLMs' Ability to Reason About Search Problems [59.72548591120689]
We introduce a new benchmark, SearchBench, containing 11 unique search problem types.
We show that even the most advanced LLMs fail to solve these problems end-to-end in text.
Instructing LLMs to generate code that solves the problem helps, but only slightly, e.g., GPT-4's performance rises to 11.7%.
arXiv Detail & Related papers (2024-06-18T00:44:58Z)
- How Efficient is LLM-Generated Code? A Rigorous & High-Standard Benchmark [39.13045037676502]
Development of large language models (LLMs) has significantly pushed the frontiers of program synthesis. Most evaluation frameworks focus on the (functional) correctness of generated code; efficiency, as an important measure of code quality, has been overlooked in existing evaluations. We develop ENAMEL, a rigorous and high-standard benchmark for evaluating the capability of LLMs in generating efficient code.
arXiv Detail & Related papers (2024-06-10T04:19:20Z)
- A Controlled Experiment on the Energy Efficiency of the Source Code Generated by Code Llama [4.937787069991124]
83% of software developers use Large Language Models (LLMs) to generate code.
This paper assesses the energy efficiency of Code Llama with respect to human-written source code.
arXiv Detail & Related papers (2024-05-06T16:32:29Z)
- On Evaluating the Efficiency of Source Code Generated by LLMs [31.8121544062256]
More efficient code can lead to higher performance and execution efficiency in programs and software produced with LLM-assisted programming.
First, we evaluate the efficiency of the code generated by LLMs on two benchmarks, HumanEval and MBPP.
Then, we choose a set of programming problems from the online judge platform LeetCode to conduct a more difficult evaluation.
arXiv Detail & Related papers (2024-04-09T05:59:39Z)
- EffiBench: Benchmarking the Efficiency of Automatically Generated Code [16.19693502619949]
EffiBench is a benchmark with 1,000 efficiency-critical coding problems. Each problem is paired with an executable human-written canonical solution. We empirically examine the ability of 42 large language models to generate efficient code; a sketch of this kind of runtime comparison against canonical solutions appears after this list.
arXiv Detail & Related papers (2024-02-03T05:24:39Z)
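Several of the benchmarks above (ENAMEL, EffiBench, and the LeetCode-based evaluations) score generated code by timing it against a human-written reference. The sketch below shows that comparison pattern in a generic form; the names and the simple ratio score are illustrative assumptions, not the exact metric any of these papers defines.

```python
# Illustrative time-based efficiency comparison between a candidate solution
# and a human-written reference; a real benchmark would first gate this on
# functional correctness and enforce per-run timeouts.
import statistics
import time

def median_runtime(fn, args, repeats: int = 30) -> float:
    """Median wall-clock seconds for one call of fn(*args)."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

def efficiency_ratio(candidate, reference, args) -> float:
    """Reference time divided by candidate time: > 1.0 means the candidate is faster."""
    return median_runtime(reference, args) / median_runtime(candidate, args)

if __name__ == "__main__":
    data = (list(range(100_000)),)

    def candidate_sum(xs):  # e.g., an LLM-generated explicit loop
        total = 0
        for x in xs:
            total += x
        return total

    def reference_sum(xs):  # e.g., the canonical idiomatic solution
        return sum(xs)

    print(f"efficiency ratio: {efficiency_ratio(candidate_sum, reference_sum, data):.2f}")
```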