Energy Use of AI Inference: Efficiency Pathways and Test-Time Compute
- URL: http://arxiv.org/abs/2509.20241v1
- Date: Wed, 24 Sep 2025 15:32:01 GMT
- Title: Energy Use of AI Inference: Efficiency Pathways and Test-Time Compute
- Authors: Felipe Oviedo, Fiodar Kazhamiaka, Esha Choukse, Allen Kim, Amy Luers, Melanie Nakagawa, Ricardo Bianchini, Juan M. Lavista Ferres
- Abstract summary: Non-production estimates and assumptions can overstate energy use by 4-20x. We quantify achievable efficiency gains at the model, serving platform, and hardware levels. We estimate the baseline daily energy use of a deployment serving 1 billion queries to be 0.8 GWh/day.
- Score: 4.8312457834136175
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: As AI inference scales to billions of queries and emerging reasoning and agentic workflows increase token demand, reliable estimates of per-query energy use are increasingly important for capacity planning, emissions accounting, and efficiency prioritization. Many public estimates are inconsistent and overstate energy use, because they extrapolate from limited benchmarks and fail to reflect efficiency gains achievable at scale. In this perspective, we introduce a bottom-up methodology to estimate the per-query energy of large-scale LLM systems based on token throughput. For models running on an H100 node under realistic workloads, GPU utilization and PUE constraints, we estimate a median energy per query of 0.34 Wh (IQR: 0.18-0.67) for frontier-scale models (>200 billion parameters). These results are consistent with measurements using production-scale configurations and show that non-production estimates and assumptions can overstate energy use by 4-20x. Extending to test-time scaling scenarios with 15x more tokens per typical query, the median energy rises 13x to 4.32 Wh, indicating that targeting efficiency in this regime will deliver the largest fleet-wide savings. We quantify achievable efficiency gains at the model, serving platform, and hardware levels, finding individual median reductions of 1.5-3.5x in energy per query, while combined advances can plausibly deliver 8-20x reductions. To illustrate the system-level impact, we estimate the baseline daily energy use of a deployment serving 1 billion queries to be 0.8 GWh/day. If 10% are long queries, demand could grow to 1.8 GWh/day. With targeted efficiency interventions, it falls to 0.9 GWh/day, similar to the energy footprint of web search at that scale. This echoes how data centers historically tempered energy growth through efficiency gains during the internet and cloud build-up.
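The arithmetic behind a throughput-based, bottom-up estimate can be illustrated with a minimal sketch. The function names, the node power, the PUE, and the throughput values below are hypothetical placeholders chosen for illustration, not the paper's inputs; only the 0.34 Wh and 4.32 Wh medians come from the abstract.

```python
# Minimal sketch of a throughput-based per-query energy estimate.
# All function names and example inputs are illustrative assumptions,
# not the paper's actual model or parameters.

def energy_per_query_wh(node_power_w, pue, tokens_per_query, node_tokens_per_s):
    """Energy attributed to one query, in watt-hours.

    node_power_w      -- average electrical power of the serving node (W)
    pue               -- data-center power usage effectiveness (>= 1.0)
    tokens_per_query  -- tokens processed for one query
    node_tokens_per_s -- aggregate token throughput of the node (tokens/s)
    """
    node_seconds = tokens_per_query / node_tokens_per_s  # node time consumed
    joules = node_power_w * pue * node_seconds           # W * s = J
    return joules / 3600.0                               # J -> Wh


def fleet_energy_gwh(per_query_wh, queries_per_day):
    """Daily fleet energy in GWh for a uniform per-query energy."""
    return per_query_wh * queries_per_day / 1e9          # Wh -> GWh


# Medians quoted in the abstract.
MEDIAN_WH = 0.34   # typical query, frontier-scale model
LONG_WH = 4.32     # test-time scaling scenario

if __name__ == "__main__":
    # Hypothetical node: 10 kW, PUE 1.2, 10,000 tokens/s, 1,000-token query.
    print(energy_per_query_wh(10_000, 1.2, 1_000, 10_000))
    # Ratio of the two medians (the abstract rounds this to 13x).
    print(LONG_WH / MEDIAN_WH)
    # Naive fleet product for 1 billion typical queries per day.
    print(fleet_energy_gwh(MEDIAN_WH, 1e9))
```

Note that the naive product of the 0.34 Wh median and 1 billion queries gives 0.34 GWh/day, which is lower than the 0.8 GWh/day baseline stated in the abstract; the fleet-level estimate presumably rests on assumptions that the abstract does not spell out.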
Related papers
- Towards Green AI: Decoding the Energy of LLM Inference in Software Development [46.879983975894135]
AI-assisted tools are increasingly integrated into software development, but their reliance on large language models (LLMs) introduces substantial computational and energy costs. We conduct a phase-level analysis of LLM inference energy consumption, distinguishing between the (1) prefill, where the model processes the input and builds internal representations, and (2) decoding, where output tokens are generated using the stored state.
arXiv Detail & Related papers (2026-02-05T14:38:19Z) - Green MLOps: Closed-Loop, Energy-Aware Inference with NVIDIA Triton, FastAPI, and Bio-Inspired Thresholding [0.0]
Bio-inspired framework maps protein-folding energy basins to inference cost landscapes. A request is admitted only when the expected utility-to-energy trade-off is favorable. Results connect biophysical energy models to Green MLOps and offer a practical, auditable basis for closed-loop energy-aware inference in production.
arXiv Detail & Related papers (2026-01-06T15:50:11Z) - Energy Scaling Laws for Diffusion Models: Quantifying Compute and Carbon Emissions in Image Generation [50.21021246855702]
We propose an adaptation of Kaplan scaling laws to predict GPU energy consumption for diffusion models based on computational complexity (FLOPs). Our approach decomposes diffusion model inference into text encoding, iterative denoising, and decoding components, with the hypothesis that denoising operations dominate energy consumption due to their repeated execution across multiple inference steps. Our results validate the compute-bound nature of diffusion inference and provide a foundation for sustainable AI deployment planning and carbon footprint estimation.
arXiv Detail & Related papers (2025-11-21T08:12:47Z) - EfficientLLM: Efficiency in Large Language Models [64.3537131208038]
Large Language Models (LLMs) have driven significant progress, yet their growing parameter counts and context windows incur prohibitive compute, energy, and monetary costs. We introduce EfficientLLM, a novel benchmark and the first comprehensive empirical study evaluating efficiency techniques for LLMs at scale.
arXiv Detail & Related papers (2025-05-20T02:27:08Z) - How Hungry is AI? Benchmarking Energy, Water, and Carbon Footprint of LLM Inference [0.0]
This paper introduces a novel infrastructure-aware benchmarking framework for quantifying the environmental footprint of AI inference across 30 state-of-the-art models as deployed in commercial data centers. Our results show that o3 and DeepSeek-R1 emerge as the most energy-intensive models, consuming over 33 Wh per long prompt, more than 70 times the consumption of GPT-4.1 nano, and that Claude-3.7 Sonnet ranks highest in eco-efficiency. These findings illustrate a growing paradox: although AI is becoming cheaper and faster, its global adoption drives disproportionate resource consumption.
arXiv Detail & Related papers (2025-05-14T17:47:00Z) - Value-Based Deep RL Scales Predictably [100.21834069400023]
We show that value-based off-policy RL methods are predictable despite community lore regarding their pathological behavior. We validate our approach using three algorithms: SAC, BRO, and PQL on DeepMind Control, OpenAI gym, and IsaacGym.
arXiv Detail & Related papers (2025-02-06T18:59:47Z) - Power Hungry Processing: Watts Driving the Cost of AI Deployment? [74.19749699665216]
Generative, multi-purpose AI systems promise a unified approach to building machine learning (ML) models into technology.
This ambition of "generality" comes at a steep cost to the environment, given the amount of energy these systems require and the amount of carbon that they emit.
We measure deployment cost as the amount of energy and carbon required to perform 1,000 inferences on a representative benchmark dataset using these models.
We conclude with a discussion around the current trend of deploying multi-purpose generative ML systems, and caution that their utility should be more intentionally weighed against increased costs in terms of energy and emissions.
arXiv Detail & Related papers (2023-11-28T15:09:36Z) - Energy Efficient Deep Multi-Label ON/OFF Classification of Low Frequency Metered Home Appliances [0.16777183511743468]
Non-intrusive load monitoring (NILM) is the process of obtaining appliance-level data from a single metering point.
We introduce a novel DL model aimed at enhanced multi-label classification of NILM with improved computation and energy efficiency.
Compared to the state-of-the-art, the proposed model has its energy consumption reduced by more than 23%.
arXiv Detail & Related papers (2023-07-18T13:23:23Z) - PhAST: Physics-Aware, Scalable, and Task-specific GNNs for Accelerated Catalyst Design [102.9593507372373]
Catalyst materials play a crucial role in the electrochemical reactions involved in industrial processes.
Machine learning holds the potential to efficiently model materials properties from large amounts of data.
We propose task-specific innovations applicable to most architectures, enhancing both computational efficiency and accuracy.
arXiv Detail & Related papers (2022-11-22T05:24:30Z) - Carbon Emissions and Large Neural Network Training [19.233899715628073]
We calculate the energy use and carbon footprint of several recent large models: T5, Meena, GShard, Switch Transformer, and GPT-3.
We highlight the following opportunities to improve energy efficiency and CO2 equivalent emissions (CO2e).
To help reduce the carbon footprint of ML, we believe energy usage and CO2e should be a key metric in evaluating models.
arXiv Detail & Related papers (2021-04-21T04:44:25Z) - WattScale: A Data-driven Approach for Energy Efficiency Analytics of Buildings at Scale [2.771897351607068]
Buildings consume over 40% of the total energy in modern societies.
We present WattScale, a data-driven approach to identify the least energy-efficient buildings.
arXiv Detail & Related papers (2020-07-02T20:45:33Z) - Multi-Agent Meta-Reinforcement Learning for Self-Powered and Sustainable Edge Computing Systems [87.4519172058185]
An effective energy dispatch mechanism for self-powered wireless networks with edge computing capabilities is studied.
A novel multi-agent meta-reinforcement learning (MAMRL) framework is proposed to solve the formulated problem.
Experimental results show that the proposed MAMRL model can reduce non-renewable energy usage by up to 11% and the energy cost by 22.4%.
arXiv Detail & Related papers (2020-02-20T04:58:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.