Determining Energy Efficiency Sweet Spots in Production LLM Inference
- URL: http://arxiv.org/abs/2602.05695v1
- Date: Thu, 05 Feb 2026 14:21:00 GMT
- Title: Determining Energy Efficiency Sweet Spots in Production LLM Inference
- Authors: Hiari Pizzini Cavagna, Andrea Proia, Giacomo Madella, Giovanni B. Esposito, Francesco Antici, Daniele Cesarini, Zeynep Kiziltan, Andrea Bartolini,
- Abstract summary: Existing approaches estimate energy consumption through simple linear functions of input and output sequence lengths.<n>We propose an analytical model derived from the computational and memory-access complexity of the Transformer architecture.<n>Our results show that aligning sequence lengths with these efficiency "Sweet Spots" can substantially reduce energy usage.
- Score: 1.633285971584668
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) inference is central in modern AI applications, making it critical to understand their energy footprint. Existing approaches typically estimate energy consumption through simple linear functions of input and output sequence lengths, yet our observations reveal clear Energy Efficiency regimes: peak efficiency occurs with short-to-moderate inputs and medium-length outputs, while efficiency drops sharply for long inputs or very short outputs, indicating a non-linear dependency. In this work, we propose an analytical model derived from the computational and memory-access complexity of the Transformer architecture, capable of accurately characterizing the efficiency curve as a function of input and output lengths. To assess its accuracy, we evaluate energy consumption using TensorRT-LLM on NVIDIA H100 GPUs across a diverse set of LLMs ranging from 1B to 9B parameters, including OPT, LLaMA, Gemma, Falcon, Qwen2, and Granite, tested over input and output lengths from 64 to 4096 tokens, achieving a mean MAPE of 1.79%. Our results show that aligning sequence lengths with these efficiency "Sweet Spots" can substantially reduce energy usage, supporting informed truncation, summarization, and adaptive generation strategies in production systems.
Related papers
- Predicting LLM Output Length via Entropy-Guided Representations [13.351384070796747]
We introduce a lightweight framework that reuses the main model's internal hidden states for efficient length prediction.<n>Our framework features two core components: 1) Entropy-Guided Token Pooling (EGTP), which uses on-the-fly activations and token entropy for highly accurate static prediction.
arXiv Detail & Related papers (2026-02-12T10:49:04Z) - Understanding Efficiency: Quantization, Batching, and Serving Strategies in LLM Energy Use [4.513690948889834]
Large Language Models (LLMs) are increasingly deployed in production, contributing towards shifting the burden in terms of computational resources and energy demands from training to inference.<n>We show how emphsystem-level design choices can lead to orders-of-magnitude differences in energy consumption for the same model.<n>Our findings motivate phase-aware energy profiling and system-level optimizations for greener AI services.
arXiv Detail & Related papers (2026-01-29T22:16:25Z) - EconProver: Towards More Economical Test-Time Scaling for Automated Theorem Proving [64.15371139980802]
Large Language Models (LLMs) have recently advanced the field of Automated Theorem Proving (ATP)<n>We show that different test-time scaling strategies for ATP models introduce significant computational overhead for inference.<n>We propose two complementary methods that can be integrated into a unified EconRL pipeline for amplified benefits.
arXiv Detail & Related papers (2025-09-16T03:00:13Z) - EfficientLLM: Efficiency in Large Language Models [64.3537131208038]
Large Language Models (LLMs) have driven significant progress, yet their growing counts and context windows incur prohibitive compute, energy, and monetary costs.<n>We introduce EfficientLLM, a novel benchmark and the first comprehensive empirical study evaluating efficiency techniques for LLMs at scale.
arXiv Detail & Related papers (2025-05-20T02:27:08Z) - Leveraging LLMs to Automate Energy-Aware Refactoring of Parallel Scientific Codes [1.2178992475191555]
Large language models (LLMs) are increasingly used for generating parallel scientific codes.<n>We propose LASSI-EE, an automated screening framework that generates energy-efficient parallel codes.<n>We introduce energy-reduction@k, a novel metric that quantifies expected energy reduction when generating k code candidates.
arXiv Detail & Related papers (2025-05-04T17:05:34Z) - R-Sparse: Rank-Aware Activation Sparsity for Efficient LLM Inference [77.47238561728459]
R-Sparse is a training-free activation sparsity approach capable of achieving high sparsity levels in advanced LLMs.<n> Experiments on Llama-2/3 and Mistral models across ten diverse tasks demonstrate that R-Sparse achieves comparable performance at 50% model-level sparsity.
arXiv Detail & Related papers (2025-04-28T03:30:32Z) - Sustainable LLM Inference for Edge AI: Evaluating Quantized LLMs for Energy Efficiency, Output Accuracy, and Inference Latency [6.306413686006502]
We conduct a comprehensive analysis of 28 quantized Large Language Models (LLMs) from the Ollama library.<n>We evaluate energy efficiency, inference performance, and output accuracy across multiple quantization levels and task types.<n>Our findings reveal the trade-offs between energy efficiency, inference speed, and accuracy in different quantization settings.
arXiv Detail & Related papers (2025-04-04T11:29:30Z) - Accelerating Large Language Models through Partially Linear Feed-Forward Network [6.916037014703829]
Large language models (LLMs) demonstrate remarkable capabilities but face deployment challenges due to their massive parameter counts.<n>We present a novel perspective inspired by constant folding in compiler optimization.<n>We propose TARDIS, which enables optimization of LLMs with non-linear activations by partially approximating them with linear functions in frequently occurring input ranges.<n>Experiments demonstrate that TARDIS achieves 80% parameter reduction in feed-forward networks, while significantly outperforming state-of-the-art pruning methods Wanda and RIA with up to 65% higher accuracy.
arXiv Detail & Related papers (2025-01-17T09:20:56Z) - Investigating Energy Efficiency and Performance Trade-offs in LLM Inference Across Tasks and DVFS Settings [1.781045155774463]
Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of natural language processing (NLP) tasks.<n>However, their inference workloads are computationally and energy intensive, raising concerns about sustainability and environmental impact.
arXiv Detail & Related papers (2025-01-14T16:02:33Z) - CARE Transformer: Mobile-Friendly Linear Visual Transformer via Decoupled Dual Interaction [77.8576094863446]
We propose a new detextbfCoupled dutextbfAl-interactive lineatextbfR atttextbfEntion (CARE) mechanism.
We first propose an asymmetrical feature decoupling strategy that asymmetrically decouples the learning process for local inductive bias and long-range dependencies.
By adopting a decoupled learning way and fully exploiting complementarity across features, our method can achieve both high efficiency and accuracy.
arXiv Detail & Related papers (2024-11-25T07:56:13Z) - Parameter and Computation Efficient Transfer Learning for
Vision-Language Pre-trained Models [79.34513906324727]
In this paper, we aim at parameter and efficient transfer learning (PCETL) for vision-language pre-trained models.
We propose a novel dynamic architecture skipping (DAS) approach towards effective PCETL.
arXiv Detail & Related papers (2023-09-04T09:34:33Z) - LoRAPrune: Structured Pruning Meets Low-Rank Parameter-Efficient Fine-Tuning [56.88751562302793]
Low-rank adaption (LoRA) has emerged to fine-tune large language models (LLMs)
LoRAPrune is a new framework that delivers an accurate structured pruned model in a highly memory-efficient manner.
LoRAPrune achieves a reduction in perplexity by 4.81 on WikiText2 and 3.46 on PTB, while also decreasing memory usage by 52.6%.
arXiv Detail & Related papers (2023-05-28T15:15:48Z) - UNETR++: Delving into Efficient and Accurate 3D Medical Image Segmentation [93.88170217725805]
We propose a 3D medical image segmentation approach, named UNETR++, that offers both high-quality segmentation masks as well as efficiency in terms of parameters, compute cost, and inference speed.
The core of our design is the introduction of a novel efficient paired attention (EPA) block that efficiently learns spatial and channel-wise discriminative features.
Our evaluations on five benchmarks, Synapse, BTCV, ACDC, BRaTs, and Decathlon-Lung, reveal the effectiveness of our contributions in terms of both efficiency and accuracy.
arXiv Detail & Related papers (2022-12-08T18:59:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.