Related papers: Towards Green AI: Decoding the Energy of LLM Inference in Software Development

URL: http://arxiv.org/abs/2602.05712v1
Date: Thu, 05 Feb 2026 14:38:19 GMT
Title: Towards Green AI: Decoding the Energy of LLM Inference in Software Development
Authors: Lola Solovyeva, Fernando Castor
Abstract summary: AI-assisted tools are increasingly integrated into software development, but their reliance on large language models (LLMs) introduces substantial computational and energy costs. We conduct a phase-level analysis of LLM inference energy consumption, distinguishing between (1) prefill, where the model processes the input and builds internal representations, and (2) decoding, where output tokens are generated using the stored state.
Score: 46.879983975894135
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Context: AI-assisted tools are increasingly integrated into software development workflows, but their reliance on large language models (LLMs) introduces substantial computational and energy costs. Understanding and reducing the energy footprint of LLM inference is therefore essential for sustainable software development. Objective: In this study, we conduct a phase-level analysis of LLM inference energy consumption, distinguishing between (1) prefill, where the model processes the input and builds internal representations, and (2) decoding, where output tokens are generated using the stored state. Method: We investigate six 6B-7B and four 3B-4B transformer-based models, evaluating them on two code-centric benchmarks: HumanEval for code generation and LongBench for code understanding. Results: Our findings show that, within both parameter groups, models exhibit distinct energy patterns across phases. Furthermore, we observed that increases in prefill cost amplify the energy cost per token during decoding, with amplifications ranging from 1.3% to 51.8% depending on the model. Lastly, three out of ten models demonstrate babbling behavior, adding excessive content to the output that unnecessarily inflates energy consumption. We implemented babbling suppression for code generation, achieving energy savings ranging from 44% to 89% without affecting generation accuracy. Conclusion: These findings show that prefill costs influence decoding, which dominates energy consumption, and that babbling suppression can yield up to 89% energy savings. Reducing inference energy therefore requires both mitigating babbling behavior and limiting the impact of prefill on decoding.
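Illustration (not from the paper): the abstract's two quantitative claims can be read as simple arithmetic. The sketch below assumes a deliberately simplified model in which decoding energy is the number of output tokens times a per-token cost, and the prefill effect is a multiplicative amplification of that cost. The function names, the per-token cost of 0.5 J, and the token counts are all hypothetical; only the 51.8% amplification figure is taken from the abstract.

```python
def decoding_energy(n_tokens: int, joules_per_token: float,
                    amplification: float = 0.0) -> float:
    """Decoding energy in joules under a simplified linear model.

    amplification is the fractional increase in per-token cost attributed
    to prefill (e.g. 0.518 for the 51.8% upper bound quoted in the abstract).
    """
    return n_tokens * joules_per_token * (1.0 + amplification)


def relative_saving(baseline_j: float, reduced_j: float) -> float:
    """Fraction of energy saved when going from baseline_j to reduced_j."""
    return (baseline_j - reduced_j) / baseline_j


if __name__ == "__main__":
    # Hypothetical numbers: a babbling model emits 500 tokens where
    # 100 tokens would have been enough to answer the task.
    babbling = decoding_energy(500, joules_per_token=0.5)
    suppressed = decoding_energy(100, joules_per_token=0.5)
    print(f"saving from suppression: {relative_saving(babbling, suppressed):.0%}")

    # Same output length, with and without prefill-induced amplification.
    plain = decoding_energy(100, joules_per_token=0.5)
    amplified = decoding_energy(100, joules_per_token=0.5, amplification=0.518)
    print(f"amplified vs. plain: {amplified:.1f} J vs. {plain:.1f} J")
```

Under this model, cutting output length by a given fraction cuts decoding energy by the same fraction, which is why trimming excess output can produce savings of the size the abstract reports. Real measurements need not be this linear.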

Related papers

ODAR: Principled Adaptive Routing for LLM Reasoning via Active Inference [60.958331943869126]
ODAR-Expert is an adaptive routing framework that optimizes the accuracy-efficiency trade-off via principled resource allocation. We show strong and consistent gains, including 98.2% accuracy on MATH and 54.8% on Humanity's Last Exam.
arXiv Detail & Related papers (2026-02-27T05:22:01Z)
Determining Energy Efficiency Sweet Spots in Production LLM Inference [1.633285971584668]
Existing approaches estimate energy consumption through simple linear functions of input and output sequence lengths. We propose an analytical model derived from the computational and memory-access complexity of the Transformer architecture. Our results show that aligning sequence lengths with these efficiency "Sweet Spots" can substantially reduce energy usage.
arXiv Detail & Related papers (2026-02-05T14:21:00Z)
Understanding Efficiency: Quantization, Batching, and Serving Strategies in LLM Energy Use [4.513690948889834]
Large Language Models (LLMs) are increasingly deployed in production, shifting the burden of computational resources and energy demands from training to inference. We show how system-level design choices can lead to orders-of-magnitude differences in energy consumption for the same model. Our findings motivate phase-aware energy profiling and system-level optimizations for greener AI services.
arXiv Detail & Related papers (2026-01-29T22:16:25Z)
Energy Scaling Laws for Diffusion Models: Quantifying Compute and Carbon Emissions in Image Generation [50.21021246855702]
We propose an adaptation of Kaplan scaling laws to predict GPU energy consumption for diffusion models based on computational complexity (FLOPs). Our approach decomposes diffusion model inference into text encoding, iterative denoising, and decoding components, with the hypothesis that denoising operations dominate energy consumption due to their repeated execution across multiple inference steps. Our results validate the compute-bound nature of diffusion inference and provide a foundation for sustainable AI deployment planning and carbon footprint estimation.
arXiv Detail & Related papers (2025-11-21T08:12:47Z)
Learning to Rank Chain-of-Thought: Using a Small Model [77.75522308463667]
This paper introduces the Energy Outcome Reward Model (EORM), a highly efficient, lightweight verifier designed to address this challenge. EORM uses an energy-based framework to rank Chain-of-Thought (CoT) solutions, learning to distinguish correct from incorrect reasoning using only simple outcome labels. With only 55M parameters, over 127 times smaller than typical reward models, EORM boosts the accuracy of Llama 3 8B to 90.7% on GSM8k and 63.7% on MATH.
arXiv Detail & Related papers (2025-05-21T01:06:29Z)
EfficientLLM: Efficiency in Large Language Models [64.3537131208038]
Large Language Models (LLMs) have driven significant progress, yet their growing parameter counts and context windows incur prohibitive compute, energy, and monetary costs. We introduce EfficientLLM, a novel benchmark and the first comprehensive empirical study evaluating efficiency techniques for LLMs at scale.
arXiv Detail & Related papers (2025-05-20T02:27:08Z)
Green MLOps to Green GenOps: An Empirical Study of Energy Consumption in Discriminative and Generative AI Operations [2.2765705959685234]
This study investigates the energy consumption of Discriminative and Generative AI models within real-world MLOps pipelines. We employ software-based power measurements to ensure ease of replication across diverse configurations, models, and datasets.
arXiv Detail & Related papers (2025-03-31T10:28:04Z)
Prompt engineering and its implications on the energy consumption of Large Language Models [4.791072577881446]
Large language models (LLMs) in software engineering pose severe challenges regarding computational resources, data centers, and carbon emissions. In this paper, we investigate how prompt engineering techniques (PETs) can impact the carbon emissions of the Llama 3 model for the code generation task.
arXiv Detail & Related papers (2025-01-10T11:49:31Z)
Energy-Aware Dynamic Neural Inference [39.04688735618206]
We introduce an on-device adaptive inference system equipped with an energy-harvester and finite-capacity energy storage. We show that, as the rate of the ambient energy increases, energy- and confidence-aware control schemes show approximately 5% improvement in accuracy. We derive a principled policy with theoretical guarantees for confidence-aware and -agnostic controllers.
arXiv Detail & Related papers (2024-11-04T16:51:22Z)
A Comparative Study of Machine Learning Algorithms for Anomaly Detection in Industrial Environments: Performance and Environmental Impact [62.997667081978825]
This study seeks to reconcile the demands of high-performance machine learning models with environmental sustainability. Traditional machine learning algorithms, such as Decision Trees and Random Forests, demonstrate robust efficiency and performance. However, superior outcomes were obtained with optimised configurations, albeit with a commensurate increase in resource consumption.
arXiv Detail & Related papers (2023-07-01T15:18:00Z)
Attention Mechanism with Energy-Friendly Operations [61.58748425876866]
We rethink the attention mechanism from the perspective of energy consumption. We build a novel attention model by replacing multiplications with either selective operations or additions. Empirical results on three machine translation tasks demonstrate that the proposed model achieves competitive accuracy.
arXiv Detail & Related papers (2022-04-28T08:50:09Z)

This list is automatically generated from the titles and abstracts of the papers on this site.