Benchmarking Energy Efficiency of Large Language Models Using vLLM
- URL: http://arxiv.org/abs/2509.08867v1
- Date: Wed, 10 Sep 2025 11:03:08 GMT
- Title: Benchmarking Energy Efficiency of Large Language Models Using vLLM
- Authors: K. Pronk, Q. Zhao
- Abstract summary: We introduce the LLM Efficiency Benchmark, designed to simulate real-world usage conditions. We examine how factors such as model size, architecture, and concurrent request volume affect inference energy efficiency.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The prevalence of Large Language Models (LLMs) is having a growing impact on the climate due to the substantial energy required for their deployment and use. To raise awareness among developers who are implementing LLMs in their products, there is a strong need to collect more information about the energy efficiency of LLMs. While existing research has evaluated the energy efficiency of various models, these benchmarks often fall short of representing realistic production scenarios. In this paper, we introduce the LLM Efficiency Benchmark, designed to simulate real-world usage conditions. Our benchmark utilizes vLLM, a high-throughput, production-ready LLM serving backend that optimizes model performance and efficiency. We examine how factors such as model size, architecture, and concurrent request volume affect inference energy efficiency. Our findings demonstrate that it is possible to create energy efficiency benchmarks that better reflect practical deployment conditions, providing valuable insights for developers aiming to build more sustainable AI systems.
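The benchmark loop the abstract describes (concurrent requests against a vLLM server, energy normalized per generated token) might be sketched roughly as follows. This is an assumption on our part: the paper does not specify its measurement harness, and `send_request` / `read_power_w` are hypothetical hooks (e.g. an HTTP call to a vLLM OpenAI-compatible endpoint, and an NVML or RAPL power reading).

```python
import threading
import time

def energy_joules(power_samples_w, interval_s):
    """Integrate fixed-interval power readings (watts) into energy (joules)
    using a simple rectangle rule."""
    return sum(power_samples_w) * interval_s

def joules_per_token(total_joules, total_tokens):
    """Normalize measured energy by the number of generated tokens."""
    if total_tokens == 0:
        raise ValueError("no tokens generated")
    return total_joules / total_tokens

def run_benchmark(send_request, read_power_w, n_requests=8, interval_s=0.1):
    """Issue n_requests concurrently (simulating production load) while a
    background thread polls instantaneous power draw. send_request() should
    block until its response completes and return the number of generated
    tokens; read_power_w() returns the current power draw in watts."""
    samples = [read_power_w()]  # take one sample up front
    done = threading.Event()

    def sampler():
        while not done.is_set():
            time.sleep(interval_s)
            samples.append(read_power_w())

    poller = threading.Thread(target=sampler)
    poller.start()

    results = []
    workers = [threading.Thread(target=lambda: results.append(send_request()))
               for _ in range(n_requests)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()

    done.set()
    poller.join()
    return joules_per_token(energy_joules(samples, interval_s), sum(results))
```

In a real run, `read_power_w` could poll `pynvml.nvmlDeviceGetPowerUsage` (which reports milliwatts) and `send_request` could POST to the vLLM server's OpenAI-compatible completions route.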
Related papers
- Information Capacity: Evaluating the Efficiency of Large Language Models via Text Compression [53.39128997308138]
We introduce information capacity, a measure of model efficiency based on text compression performance. Empirical evaluations on mainstream open-source models show that models of varying sizes within a series exhibit consistent information capacity. A distinctive feature of information capacity is that it incorporates tokenizer efficiency, which affects both input and output token counts.
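A compression-based metric of this kind rests on the standard equivalence between language modeling and compression: the bits an ideal entropy coder needs for a text equal the model's summed negative log-probabilities. The toy sketch below illustrates that idea only; the paper's exact definition and normalization are not given here, so the function names are assumptions.

```python
import math

def bits_to_encode(token_probs):
    """Total bits an ideal entropy coder needs for a sequence whose tokens
    the model assigned the given probabilities: sum of -log2 p(token)."""
    return sum(-math.log2(p) for p in token_probs)

def bits_per_byte(token_probs, n_bytes):
    """Compression rate of the original text: fewer bits per byte means a
    stronger model. The tokenizer matters because it determines how many
    tokens (and thus probability terms) n_bytes of text becomes."""
    return bits_to_encode(token_probs) / n_bytes
```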
arXiv Detail & Related papers (2025-11-11T10:07:32Z)
- Fun-ASR Technical Report [89.84148151617022]
We present Fun-ASR, a large-scale, LLM-based ASR system that combines massive data, large model capacity, LLM integration, and reinforcement learning. Fun-ASR is specifically optimized for practical deployment, with enhancements in streaming capability, noise robustness, code-switching, and hotword customization, along with other real-world application requirements. Thanks to production-oriented optimizations, Fun-ASR achieves state-of-the-art performance on real application datasets, demonstrating its effectiveness and robustness in practical settings.
arXiv Detail & Related papers (2025-09-15T23:19:36Z)
- Comparing energy consumption and accuracy in text classification inference [0.9208007322096533]
This study systematically evaluates the trade-offs between model accuracy and energy consumption in text classification inference. The best-performing model in terms of accuracy can also be energy-efficient, while larger LLMs tend to consume significantly more energy with lower classification accuracy.
arXiv Detail & Related papers (2025-08-19T18:00:08Z)
- Energy Considerations of Large Language Model Inference and Efficiency Optimizations [28.55549828393871]
As large language models (LLMs) scale in size and adoption, their computational and environmental costs continue to rise. We systematically analyze the energy implications of common inference efficiency optimizations across diverse NLP and AI workloads. Our findings reveal that the proper application of relevant inference efficiency optimizations can reduce total energy use by up to 73% from unoptimized baselines.
arXiv Detail & Related papers (2025-04-24T15:45:05Z)
- Green MLOps to Green GenOps: An Empirical Study of Energy Consumption in Discriminative and Generative AI Operations [2.2765705959685234]
This study investigates the energy consumption of Discriminative and Generative AI models within real-world MLOps pipelines. We employ software-based power measurements to ensure ease of replication across diverse configurations, models, and datasets.
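One common way to take software-based power measurements on Linux is to poll the kernel's Intel RAPL energy counters under `/sys/class/powercap`; whether this particular study used RAPL, NVML, or another tool is not stated in the summary, so treat the sketch below as an assumption. The cumulative counter wraps around, which the delta helper corrects for.

```python
from pathlib import Path

# Package-0 RAPL domain on a typical Intel Linux machine (an assumption;
# the path varies by platform and requires read permission).
RAPL = Path("/sys/class/powercap/intel-rapl:0")

def read_energy_uj(domain=RAPL):
    """Read the cumulative energy counter (microjoules) for one RAPL domain."""
    return int((domain / "energy_uj").read_text())

def max_range_uj(domain=RAPL):
    """Counter wrap point; energy_uj rolls over to zero past this value."""
    return int((domain / "max_energy_range_uj").read_text())

def delta_uj(before, after, wrap):
    """Energy consumed between two readings, correcting for counter wraparound."""
    d = after - before
    return d if d >= 0 else d + wrap
```

A measurement is then two `read_energy_uj` calls bracketing the workload, with `delta_uj(before, after, max_range_uj())` giving the energy consumed.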
arXiv Detail & Related papers (2025-03-31T10:28:04Z)
- Investigating Energy Efficiency and Performance Trade-offs in LLM Inference Across Tasks and DVFS Settings [1.5749416770494706]
Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of natural language processing (NLP) tasks. However, their inference workloads are computationally and energy intensive, raising concerns about sustainability and environmental impact.
arXiv Detail & Related papers (2025-01-14T16:02:33Z)
- Densing Law of LLMs [81.06644243978101]
Large Language Models (LLMs) have emerged as a milestone in artificial intelligence, and their performance can improve as the model size increases. This paper introduces the concept of "capacity density" as a new metric to evaluate the quality of LLMs across different scales.
arXiv Detail & Related papers (2024-12-05T16:31:13Z)
- Impact of ML Optimization Tactics on Greener Pre-Trained ML Models [46.78148962732881]
This study aims to (i) analyze image classification datasets and pre-trained models, (ii) improve inference efficiency by comparing optimized and non-optimized models, and (iii) assess the economic impact of the optimizations.
We conduct a controlled experiment to evaluate the impact of applying various PyTorch optimization techniques (dynamic quantization, torch.compile, local pruning, and global pruning) to 42 Hugging Face models for image classification.
Dynamic quantization demonstrates significant reductions in inference time and energy consumption, making it highly suitable for large-scale systems.
arXiv Detail & Related papers (2024-09-19T16:23:03Z)
- DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency [7.073435885680335]
We propose DynamoLLM, the first energy-management framework for generative large language models.
At a service level, DynamoLLM conserves 53% of energy and 38% of operational carbon emissions, and reduces customer cost by 61%.
arXiv Detail & Related papers (2024-08-01T17:40:45Z)
- Towards Greener LLMs: Bringing Energy-Efficiency to the Forefront of LLM Inference [6.68507515624183]
Energy availability has come to the forefront as the biggest challenge for data center expansion to serve large language models.
We show that depending on the inputs, the model, and the service-level agreements, there are several knobs available to the LLM inference provider to use for being energy efficient.
arXiv Detail & Related papers (2024-03-29T17:22:48Z)
- Measuring and Improving Chain-of-Thought Reasoning in Vision-Language Models [61.28463542324576]
Vision-language models (VLMs) have recently demonstrated strong efficacy as visual assistants that can generate human-like outputs.
We evaluate existing state-of-the-art VLMs and find that even the best-performing model is unable to demonstrate strong visual reasoning capabilities and consistency.
We propose a two-stage training framework aimed at improving both the reasoning performance and consistency of VLMs.
arXiv Detail & Related papers (2023-09-08T17:49:44Z)
- Efficiency Pentathlon: A Standardized Arena for Efficiency Evaluation [82.85015548989223]
Pentathlon is a benchmark for holistic and realistic evaluation of model efficiency.
Pentathlon focuses on inference, which accounts for a majority of the compute in a model's lifecycle.
It incorporates a suite of metrics that target different aspects of efficiency, including latency, throughput, memory overhead, and energy consumption.
arXiv Detail & Related papers (2023-07-19T01:05:33Z)
- Unlocking the Potential of User Feedback: Leveraging Large Language Model as User Simulator to Enhance Dialogue System [65.93577256431125]
We propose an alternative approach called User-Guided Response Optimization (UGRO) to combine it with a smaller task-oriented dialogue model.
This approach uses LLM as annotation-free user simulator to assess dialogue responses, combining them with smaller fine-tuned end-to-end TOD models.
Our approach outperforms previous state-of-the-art (SOTA) results.
arXiv Detail & Related papers (2023-06-16T13:04:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.