How Hungry is AI? Benchmarking Energy, Water, and Carbon Footprint of LLM Inference
- URL: http://arxiv.org/abs/2505.09598v3
- Date: Tue, 15 Jul 2025 18:07:27 GMT
- Title: How Hungry is AI? Benchmarking Energy, Water, and Carbon Footprint of LLM Inference
- Authors: Nidhal Jegham, Marwan Abdelatti, Lassad Elmoubarki, Abdeltawab Hendawi,
- Abstract summary: This paper introduces a novel infrastructure-aware benchmarking framework for quantifying the environmental footprint of AI inference across 30 state-of-the-art models as deployed in commercial data centers.<n>Our results show that o3 and DeepSeek-R1 emerge as the most energy-intensive models, consuming over 33 Wh per long prompt, more than 70 times the consumption of GPT-4.1 nano, and that Claude-3.7 Sonnet ranks highest in eco-efficiency.<n>These findings illustrate a growing paradox: Although AI is becoming cheaper and faster, its global adoption drives disproportionate resource consumption.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper introduces a novel infrastructure-aware benchmarking framework for quantifying the environmental footprint of LLM inference across 30 state-of-the-art models as deployed in commercial data centers. Our framework combines public API performance data with region-specific environmental multipliers and statistical inference of hardware configurations. We additionally utilize cross-efficiency Data Envelopment Analysis (DEA) to rank models by performance relative to environmental cost. Our results show that o3 and DeepSeek-R1 emerge as the most energy-intensive models, consuming over 33 Wh per long prompt, more than 70 times the consumption of GPT-4.1 nano, and that Claude-3.7 Sonnet ranks highest in eco-efficiency. While a single short GPT-4o query consumes 0.42 Wh, scaling this to 700 million queries/day results in substantial annual environmental impacts. These include electricity use comparable to 35,000 U.S. homes, freshwater evaporation matching the annual drinking needs of 1.2 million people, and carbon emissions requiring a Chicago-sized forest to offset. These findings illustrate a growing paradox: Although AI is becoming cheaper and faster, its global adoption drives disproportionate resource consumption. Our study provides a standardized, empirically grounded methodology for benchmarking the sustainability of LLM deployments, laying a foundation for future environmental accountability in AI development and sustainability standards.
Related papers
- Accurate and Energy Efficient: Local Retrieval-Augmented Generation Models Outperform Commercial Large Language Models in Medical Tasks [0.0]
Local Large Language Models (LLMs) can be leveraged to develop RAGs that outperform commercial, online LLMs in medical tasks.<n>Our modular framework promotes sustainable AI development, reducing electricity usage and aligning with the UNs Sustainable Development Goals.
arXiv Detail & Related papers (2025-06-24T20:56:03Z) - EfficientLLM: Efficiency in Large Language Models [64.3537131208038]
Large Language Models (LLMs) have driven significant progress, yet their growing counts and context windows incur prohibitive compute, energy, and monetary costs.<n>We introduce EfficientLLM, a novel benchmark and the first comprehensive empirical study evaluating efficiency techniques for LLMs at scale.
arXiv Detail & Related papers (2025-05-20T02:27:08Z) - Holistically Evaluating the Environmental Impact of Creating Language Models [26.846990296567267]
We estimate the real-world environmental impact of developing a series of language models, ranging from 20 million to 13 billion active parameters, trained on up to 5.6 trillion tokens each.<n>We find that our series of models released 493 metric tons of carbon emissions, equivalent to powering about 98 homes in the United States for one year.<n>We also find that power usage throughout training is not consistent, fluctuating between 15% and 85% of our hardware's maximum power draw.
arXiv Detail & Related papers (2025-03-03T22:16:15Z) - The Dual-use Dilemma in LLMs: Do Empowering Ethical Capacities Make a Degraded Utility? [54.18519360412294]
Large Language Models (LLMs) must balance between rejecting harmful requests for safety and accommodating legitimate ones for utility.<n>This paper presents a Direct Preference Optimization (DPO) based alignment framework that achieves better overall performance.<n>We analyze experimental results obtained from testing DeepSeek-R1 on our benchmark and reveal the critical ethical concerns raised by this highly acclaimed model.
arXiv Detail & Related papers (2025-01-20T06:35:01Z) - Environmental large language model Evaluation (ELLE) dataset: A Benchmark for Evaluating Generative AI applications in Eco-environment Domain [6.246205449407889]
Generative AI holds significant potential for ecological and environmental applications.<n>The Environmental Large Language model Evaluation (ELLE) dataset is the first benchmark designed to assess large language models.<n>ELLE dataset includes 1,130 question answer pairs across 16 environmental topics, categorized by domain, difficulty, and type.
arXiv Detail & Related papers (2025-01-10T12:48:29Z) - Software Frugality in an Accelerating World: the Case of Continuous Integration [2.73028688816111]
We conduct the first large-scale analysis of the energy footprint of Continuous Integration pipelines implemented with GitHub.
We observe that the average unitary energy cost of a pipeline is relatively low, at 10 Wh.
When evaluating CO2 emissions based on regional Wh-to-CO2 estimates, we observe that the average aggregated CO2 emissions are significant, averaging 10.5 kg.
arXiv Detail & Related papers (2024-10-21T09:29:50Z) - Reporting and Analysing the Environmental Impact of Language Models on the Example of Commonsense Question Answering with External Knowledge [7.419725234099729]
ChatGPT sparked social interest in Large Language Models (LLMs)
LLMs demand substantial computational resources and are very costly to train, both financially and environmentally.
In this study, we infused T5 LLM with external knowledge and fine-tuned the model for Question-Answering task.
arXiv Detail & Related papers (2024-07-24T16:16:16Z) - Efficiency Pentathlon: A Standardized Arena for Efficiency Evaluation [82.85015548989223]
Pentathlon is a benchmark for holistic and realistic evaluation of model efficiency.
Pentathlon focuses on inference, which accounts for a majority of the compute in a model's lifecycle.
It incorporates a suite of metrics that target different aspects of efficiency, including latency, throughput, memory overhead, and energy consumption.
arXiv Detail & Related papers (2023-07-19T01:05:33Z) - A Comparative Study of Machine Learning Algorithms for Anomaly Detection
in Industrial Environments: Performance and Environmental Impact [62.997667081978825]
This study seeks to address the demands of high-performance machine learning models with environmental sustainability.
Traditional machine learning algorithms, such as Decision Trees and Random Forests, demonstrate robust efficiency and performance.
However, superior outcomes were obtained with optimised configurations, albeit with a commensurate increase in resource consumption.
arXiv Detail & Related papers (2023-07-01T15:18:00Z) - Towards Environmentally Equitable AI via Geographical Load Balancing [40.142341503145275]
This paper takes a first step toward addressing AI's environmental inequity by balancing its regional negative environmental impact.
We run trace-based simulations by considering a set of 10 geographically-distributed data centers that serve inference requests for a large language AI model.
The results demonstrate that existing GLB approaches may amplify environmental inequity while our proposed equity-aware GLB can significantly reduce the regional disparity in terms of carbon and water footprints.
arXiv Detail & Related papers (2023-06-20T17:13:33Z) - Green Federated Learning [7.003870178055125]
Federated Learning (FL) is a machine learning technique for training a centralized model using data of decentralized entities.
FL may leverage as many as hundreds of millions of globally distributed end-user devices with diverse energy sources.
We propose the concept of Green FL, which involves optimizing FL parameters and making design choices to minimize carbon emissions.
arXiv Detail & Related papers (2023-03-26T02:23:38Z) - Estimating the Carbon Footprint of BLOOM, a 176B Parameter Language
Model [72.65502770895417]
We quantify the carbon footprint of BLOOM, a 176-billion parameter language model, across its life cycle.
We estimate that BLOOM's final training emitted approximately 24.7 tonnes ofcarboneqif we consider only the dynamic power consumption.
We conclude with a discussion regarding the difficulty of precisely estimating the carbon footprint of machine learning models.
arXiv Detail & Related papers (2022-11-03T17:13:48Z) - Eco2AI: carbon emissions tracking of machine learning models as the
first step towards sustainable AI [47.130004596434816]
In eco2AI we put emphasis on accuracy of energy consumption tracking and correct regional CO2 emissions accounting.
The motivation also comes from the concept of AI-based green house gases sequestrating cycle with both Sustainable AI and Green AI pathways.
arXiv Detail & Related papers (2022-07-31T09:34:53Z) - A Framework for Energy and Carbon Footprint Analysis of Distributed and
Federated Edge Learning [48.63610479916003]
This article breaks down and analyzes the main factors that influence the environmental footprint of distributed learning policies.
It models both vanilla and decentralized FL policies driven by consensus.
Results show that FL allows remarkable end-to-end energy savings (30%-40%) for wireless systems characterized by low bit/Joule efficiency.
arXiv Detail & Related papers (2021-03-18T16:04:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.