Energy-Aware Code Generation with LLMs: Benchmarking Small vs. Large Language Models for Sustainable AI Programming
- URL: http://arxiv.org/abs/2508.08332v1
- Date: Sun, 10 Aug 2025 14:44:06 GMT
- Title: Energy-Aware Code Generation with LLMs: Benchmarking Small vs. Large Language Models for Sustainable AI Programming
- Authors: Humza Ashraf, Syed Muhammad Danish, Aris Leivadeas, Yazan Otoum, Zeeshan Sattar
- Abstract summary: We evaluate open-source Small Language Models (SLMs) trained explicitly for code generation against large commercial Large Language Models (LLMs) and efficient human-written Python code. We evaluate 150 coding problems from LeetCode, evenly distributed across three difficulty levels: easy, medium, and hard. LLMs achieve the highest correctness across all difficulty levels, but SLMs are often more energy-efficient when their outputs are correct.
- Score: 2.588812622437082
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) are widely used for code generation. However, commercial models like ChatGPT require significant computing power, which leads to high energy use and carbon emissions. This has raised concerns about their environmental impact. In this study, we evaluate open-source Small Language Models (SLMs) trained explicitly for code generation and compare their performance and energy efficiency against large LLMs and efficient human-written Python code. The goal is to investigate whether SLMs can match the performance of LLMs on certain types of programming problems while producing more energy-efficient code. We evaluate 150 coding problems from LeetCode, evenly distributed across three difficulty levels: easy, medium, and hard. Our comparison includes three small open-source models, StableCode-3B, StarCoderBase-3B, and Qwen2.5-Coder-3B-Instruct, and two large commercial models, GPT-4.0 and DeepSeek-Reasoner. The generated code is evaluated using four key metrics: run-time, memory usage, energy consumption, and correctness. We use human-written solutions as a baseline to assess the quality and efficiency of the model-generated code. Results indicate that LLMs achieve the highest correctness across all difficulty levels, but SLMs are often more energy-efficient when their outputs are correct. In over 52% of the evaluated problems, SLMs consumed the same or less energy than LLMs.
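To make the evaluation protocol concrete, the sketch below shows how a single LeetCode-style candidate solution could be scored on run-time, peak memory, and correctness in Python. This is a minimal illustration, not the authors' harness: the function `benchmark_solution`, the `two_sum` example, and the choice of `time.perf_counter` and `tracemalloc` are assumptions, and the abstract does not name the energy-measurement tooling, so energy readings (e.g., from CPU RAPL counters) are only indicated in a comment.

```python
import time
import tracemalloc
from typing import Any, Callable, Iterable, Tuple


def benchmark_solution(
    solution: Callable[..., Any],
    test_cases: Iterable[Tuple[tuple, Any]],
) -> dict:
    """Score one candidate on run-time, peak memory, and correctness.

    Energy would additionally be sampled around the same timed region via
    hardware counters (e.g. RAPL on Intel CPUs); that part is omitted here
    because it is platform-specific and not described in the abstract.
    """
    passed, total = 0, 0
    elapsed = 0.0
    peak_bytes = 0

    for args, expected in test_cases:
        total += 1
        tracemalloc.start()                      # track Python allocations
        start = time.perf_counter()
        result = solution(*args)
        elapsed += time.perf_counter() - start   # accumulate wall-clock time
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        peak_bytes = max(peak_bytes, peak)
        passed += int(result == expected)

    return {
        "runtime_s": elapsed,
        "peak_memory_bytes": peak_bytes,
        "pass_rate": passed / total if total else 0.0,
        "correct": passed == total,
    }


# Hypothetical model-generated candidate for the classic "two sum" problem.
def two_sum(nums, target):
    seen = {}
    for i, x in enumerate(nums):
        if target - x in seen:
            return [seen[target - x], i]
        seen[x] = i
    return []


if __name__ == "__main__":
    cases = [(([2, 7, 11, 15], 9), [0, 1]), (([3, 2, 4], 6), [1, 2])]
    print(benchmark_solution(two_sum, cases))
```

Human-written reference solutions would be run through the same harness to provide the baseline the abstract describes.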
Related papers
- Toward Green Code: Prompting Small Language Models for Energy-Efficient Code Generation [0.5486463492959637]
There is a growing concern about the environmental impact of large language models (LLMs) in software development. This study investigates whether prompt engineering can improve the energy efficiency of SLMs in code generation. We evaluate four open-source SLMs, StableCode-Instruct-3B, Qwen2.5-Coder-3B-Instruct, CodeLlama-7B-Instruct, and Phi-3-Mini-4K-Instruct, across 150 Python problems from LeetCode.
arXiv Detail & Related papers (2025-09-12T03:38:15Z) - On the Effectiveness of LLM-as-a-judge for Code Generation and Summarization [54.965787768076254]
Large Language Models have recently been exploited as judges for complex natural language processing tasks, such as Q&A. We study the effectiveness of LLMs-as-a-judge for two code-related tasks, namely code generation and code summarization.
arXiv Detail & Related papers (2025-07-22T13:40:26Z) - Evaluating the Energy-Efficiency of the Code Generated by LLMs [2.1983110147455482]
This paper investigates the energy efficiency of the code generated by 20 popular Large Language Models for 878 programming problems. Among the studied LLMs, DeepSeek-v3 and GPT-4o generate the most energy-efficient code. For specific algorithmic groups such as dynamic programming, backtracking, and bit manipulation, LLM-generated code can consume up to 450 times more energy than human-generated canonical solutions.
arXiv Detail & Related papers (2025-05-23T18:13:27Z) - Quantizing Large Language Models for Code Generation: A Differentiated Replication [51.85505914274633]
Large Language Models (LLMs) have shown an impressive capability in code generation and, specifically, in automatically implementing requirements described in natural language. LLMs pose significant challenges related to their memory (and, consequently, carbon) footprint. The new frontier for LLM quantization is 4-bit precision, resulting in an average memory footprint reduction of 70%.
arXiv Detail & Related papers (2025-03-10T09:26:08Z) - AI-Powered, But Power-Hungry? Energy Efficiency of LLM-Generated Code [45.77395425799378]
This paper presents the first study analyzing the energy efficiency and performance of LLM-generated code for three programming languages: Python, Java, and C++. Our results show that the models are much more successful in generating Python and Java than C++ code.
arXiv Detail & Related papers (2025-02-04T15:32:34Z) - GREEN-CODE: Learning to Optimize Energy Efficiency in LLM-based Code Generation [1.5749416770494706]
This work proposes a framework for energy-aware code generation in Large Language Models (LLMs). We train a Reinforcement Learning (RL) agent that learns to balance the trade-offs between accuracy, latency, and energy consumption. Results show that our method reduces energy consumption by 23-50% on average for code generation tasks without significantly affecting accuracy.
arXiv Detail & Related papers (2025-01-19T10:44:03Z) - PerfCodeGen: Improving Performance of LLM Generated Code with Execution Feedback [78.89596149768458]
Large Language Models (LLMs) are widely adopted for assisting in software development tasks. We propose PerfCodeGen, a training-free framework that enhances the performance of LLM-generated code.
arXiv Detail & Related papers (2024-11-18T06:22:38Z) - What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [80.18342600996601]
Large language models (LLMs) produce code that is shorter yet more complicated than canonical solutions.
We develop a taxonomy of bugs for incorrect codes that includes three categories and 12 sub-categories, and analyze the root cause for common bug types.
We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback.
arXiv Detail & Related papers (2024-07-08T17:27:17Z) - A Controlled Experiment on the Energy Efficiency of the Source Code Generated by Code Llama [4.937787069991124]
83% of software developers use Large Language Models (LLMs) to generate code.
This paper assesses the energy efficiency of Code Llama with respect to human-written source code.
arXiv Detail & Related papers (2024-05-06T16:32:29Z) - InfiBench: Evaluating the Question-Answering Capabilities of Code Large Language Models [56.723509505549536]
InfiBench is, to our knowledge, the first large-scale freeform question-answering (QA) benchmark for code.
It comprises 234 carefully selected high-quality Stack Overflow questions spanning 15 programming languages.
We conduct a systematic evaluation for over 100 latest code LLMs on InfiBench, leading to a series of novel and insightful findings.
arXiv Detail & Related papers (2024-03-11T02:06:30Z) - Exploring Data-Efficient Adaptation of Large Language Models for Code Generation [64.5583894165813]
We propose a novel adaptation approach named DEED, which stands for Data-Efficient adaptation with Error-Driven learning for code generation. Experimental results show that, compared to other mainstream fine-tuning approaches, DEED achieves superior performance with limited training data.
arXiv Detail & Related papers (2024-02-29T16:09:02Z)