Energy consumption of code small language models serving with runtime engines and execution providers
- URL: http://arxiv.org/abs/2412.15441v1
- Date: Thu, 19 Dec 2024 22:44:02 GMT
- Title: Energy consumption of code small language models serving with runtime engines and execution providers
- Authors: Francisco Durán, Matias Martinez, Patricia Lago, Silverio Martínez-Fernández
- Abstract summary: Small Language Models (SLMs) offer a promising solution to reduce resource demands.
Our goal is to analyze the impact of deep learning runtime engines and execution providers on energy consumption, execution time, and computing-resource utilization.
- Score: 11.998900897003997
- Abstract: Background. The rapid growth of Language Models (LMs), particularly in code generation, requires substantial computational resources, raising concerns about energy consumption and environmental impact. Optimizing LM inference for energy efficiency is crucial, and Small Language Models (SLMs) offer a promising solution to reduce resource demands. Aim. Our goal is to analyze the impact of deep learning runtime engines and execution providers on energy consumption, execution time, and computing-resource utilization from the point of view of software engineers conducting inference in the context of code SLMs. Method. We conducted a technology-oriented, multi-stage experimental pipeline using twelve code-generation SLMs to investigate energy consumption, execution time, and computing-resource utilization across the configurations. Results. Significant differences emerged across configurations. CUDA execution provider configurations outperformed CPU execution provider configurations in both energy consumption and execution time. Among the configurations, TORCH paired with CUDA demonstrated the greatest energy efficiency, achieving energy savings from 37.99% up to 89.16% compared to other serving configurations. Similarly, optimized runtime engines like ONNX with the CPU execution provider achieved from 8.98% up to 72.04% energy savings within CPU-based configurations. Also, TORCH paired with CUDA exhibited efficient computing-resource utilization. Conclusions. Serving configuration choice significantly impacts energy efficiency. While further research is needed, we recommend the above configurations as those best suited to software engineers' requirements for enhancing serving efficiency in energy and performance.
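To make the notion of a serving configuration concrete, here is a hedged sketch pairing a runtime engine with an execution provider, in the spirit of the TORCH+CUDA and ONNX+CPU configurations above; the model ID, the ONNX file, and the input names are illustrative assumptions rather than the paper's exact setup.
```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Salesforce/codegen-350M-mono"  # example code SLM, not necessarily one studied

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
enc = tokenizer("def quicksort(xs):", return_tensors="pt")

# --- Configuration 1: TORCH runtime engine + CUDA execution provider ---
model = AutoModelForCausalLM.from_pretrained(MODEL_ID).to("cuda").eval()
cuda_inputs = {k: v.to("cuda") for k, v in enc.items()}
start = time.perf_counter()
with torch.no_grad():
    model.generate(**cuda_inputs, max_new_tokens=64)
print(f"TORCH+CUDA: {time.perf_counter() - start:.2f}s")

# --- Configuration 2: ONNX runtime engine + CPU execution provider ---
# Assumes the model was exported to ONNX beforehand (e.g., with optimum);
# the "model.onnx" path and the input names depend on that export.
import onnxruntime as ort

session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
start = time.perf_counter()
session.run(None, {k: v.numpy() for k, v in enc.items()})  # one forward pass
print(f"ONNX+CPU: {time.perf_counter() - start:.2f}s")

# Energy would be sampled alongside these timers, e.g. via RAPL counters on
# the CPU or nvidia-smi polling on the GPU.
```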
Related papers
- Large Language Models for Energy-Efficient Code: Emerging Results and Future Directions [2.848398051763324]
We propose a novel application of large language models (LLMs) as code optimizers for energy efficiency.
We describe and evaluate a prototype, finding that, over 6 small programs, our system can improve energy efficiency in 3 of them, up to 2x better than compiler optimizations alone.
arXiv Detail & Related papers (2024-10-11T20:35:40Z)
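A minimal sketch of the "LLM as code optimizer" idea from the entry above; the model choice and prompt are illustrative assumptions, not the paper's actual setup.
```python
from transformers import pipeline

# Small instruction-tuned code model as a stand-in optimizer.
generator = pipeline("text-generation", model="Qwen/Qwen2.5-Coder-0.5B-Instruct")

snippet = "total = 0\nfor i in range(len(xs)):\n    total = total + xs[i]\n"
prompt = (
    "Rewrite this Python so it does less work per element while keeping "
    "behavior identical, to reduce energy use:\n" + snippet
)
print(generator(prompt, max_new_tokens=128)[0]["generated_text"])
```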
- Impact of ML Optimization Tactics on Greener Pre-Trained ML Models [46.78148962732881]
This study aims to (i) analyze image classification datasets and pre-trained models, (ii) improve inference efficiency by comparing optimized and non-optimized models, and (iii) assess the economic impact of the optimizations.
We conduct a controlled experiment to evaluate the impact of various PyTorch optimization techniques (dynamic quantization, torch.compile, local pruning, and global pruning) applied to 42 Hugging Face models for image classification.
Dynamic quantization demonstrates significant reductions in inference time and energy consumption, making it highly suitable for large-scale systems.
arXiv Detail & Related papers (2024-09-19T16:23:03Z)
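A hedged sketch of three of the tactics named above (dynamic quantization, torch.compile, and local pruning) applied to a stand-in torchvision classifier; the model is an illustrative choice, not necessarily one of the 42 studied.
```python
import copy

import torch
import torch.nn.utils.prune as prune
import torchvision

model = torchvision.models.resnet18(weights="DEFAULT").eval()
x = torch.randn(1, 3, 224, 224)

# Dynamic quantization: store weights in int8 and dequantize on the fly;
# here only nn.Linear modules are rewritten.
quantized = torch.ao.quantization.quantize_dynamic(
    copy.deepcopy(model), {torch.nn.Linear}, dtype=torch.qint8
)

# Local (per-module) L1 pruning of 30% of the classifier head's weights;
# global pruning would instead rank weights across several modules at once.
pruned = copy.deepcopy(model)
prune.l1_unstructured(pruned.fc, name="weight", amount=0.3)

# torch.compile: graph capture and JIT compilation (PyTorch >= 2.0).
compiled = torch.compile(model)

with torch.no_grad():
    for m in (quantized, pruned, compiled):
        m(x)
```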
- Green AI: A Preliminary Empirical Study on Energy Consumption in DL Models Across Different Runtime Infrastructures [56.200335252600354]
It is common practice to deploy pre-trained models on environments distinct from their native development settings.
This led to the introduction of interchange formats such as ONNX, which works as a standard format for representing models and comes with its own runtime infrastructure, ONNX Runtime.
arXiv Detail & Related papers (2024-02-21T09:18:44Z)
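A minimal sketch of the interchange workflow this entry refers to, assuming a stand-in torchvision model: export once to ONNX, then serve with ONNX Runtime on a chosen execution provider.
```python
import torch
import torchvision
import onnxruntime as ort

model = torchvision.models.mobilenet_v2(weights="DEFAULT").eval()
dummy = torch.randn(1, 3, 224, 224)

# One-off export to the standard interchange format.
torch.onnx.export(model, dummy, "mobilenet_v2.onnx",
                  input_names=["input"], output_names=["logits"])

# The same .onnx file can then be served on a different runtime
# infrastructure; swapping the providers list (e.g. to
# ["CUDAExecutionProvider"]) changes the hardware backend without
# touching the model file.
session = ort.InferenceSession("mobilenet_v2.onnx",
                               providers=["CPUExecutionProvider"])
logits = session.run(None, {"input": dummy.numpy()})[0]
print(logits.shape)  # (1, 1000)
```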
- LitE-SNN: Designing Lightweight and Efficient Spiking Neural Network through Spatial-Temporal Compressive Network Search and Joint Optimization [48.41286573672824]
Spiking Neural Networks (SNNs) mimic the information-processing mechanisms of the human brain and are highly energy-efficient.
We propose a new approach named LitE-SNN that incorporates both spatial and temporal compression into the automated network design process.
arXiv Detail & Related papers (2024-01-26T05:23:11Z)
- A Reinforcement Learning Approach for Performance-aware Reduction in Power Consumption of Data Center Compute Nodes [0.46040036610482665]
We use Reinforcement Learning to design a power capping policy on cloud compute nodes.
We show how a trained agent running on actual hardware can take actions by balancing power consumption and application performance.
arXiv Detail & Related papers (2023-08-15T23:25:52Z)
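Not the paper's method, but a toy illustration of the underlying idea in the entry above: an agent learning a power cap that balances power draw against performance. Below, a single-state Q-learning (bandit) loop over entirely synthetic dynamics.
```python
import random

CAPS = [100, 120, 140, 160]    # candidate power caps in watts (illustrative)
q = {c: 0.0 for c in CAPS}     # single-state Q-table: estimated value per cap
alpha, eps = 0.1, 0.2          # learning rate, exploration rate

def reward(cap):
    # Synthetic stand-in for measured behavior: throughput saturates above
    # ~150 W, while the power penalty keeps growing with the cap.
    perf = min(cap, 150) / 150 + random.gauss(0, 0.02)
    return perf - 0.004 * cap

for _ in range(2000):
    # Epsilon-greedy action selection, then incremental Q update.
    cap = random.choice(CAPS) if random.random() < eps else max(q, key=q.get)
    q[cap] += alpha * (reward(cap) - q[cap])

print("preferred power cap:", max(q, key=q.get), "W")
```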
- Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures [67.47328776279204]
This work introduces a framework to develop efficient, portable Deep Learning and High Performance Computing kernels.
We decompose kernel development into two steps: 1) expressing the computational core using Tensor Processing Primitives (TPPs) and 2) expressing the logical loops around TPPs in a high-level, declarative fashion.
We demonstrate the efficacy of our approach using standalone kernels and end-to-end workloads that outperform state-of-the-art implementations on diverse CPU platforms.
arXiv Detail & Related papers (2023-04-25T05:04:44Z)
- Energy-efficient Task Adaptation for NLP Edge Inference Leveraging Heterogeneous Memory Architectures [68.91874045918112]
adapter-ALBERT is proposed as an efficient model optimization that enables maximal data reuse across different tasks.
We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv Detail & Related papers (2023-03-25T14:40:59Z)
- Power Constrained Autotuning using Graph Neural Networks [1.7188280334580197]
We propose a novel Graph Neural Network based auto-tuning approach to improve the performance, power, and energy efficiency of scientific applications on modern processors.
Our approach identifies OpenMP configurations at different power constraints that yield a mean geometric performance improvement of more than 25% and 13% over the default OpenMP configuration.
arXiv Detail & Related papers (2023-02-22T16:06:00Z)
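For contrast with the learned autotuner above, a naive exhaustive baseline that sweeps OpenMP thread counts and keeps the fastest configuration; the `./benchmark` binary is a hypothetical placeholder.
```python
import os
import subprocess
import time

best = None
for threads in (1, 2, 4, 8, 16):
    # OMP_NUM_THREADS is the standard OpenMP thread-count knob.
    env = dict(os.environ, OMP_NUM_THREADS=str(threads))
    start = time.perf_counter()
    subprocess.run(["./benchmark"], env=env, check=True)  # hypothetical binary
    elapsed = time.perf_counter() - start
    if best is None or elapsed < best[1]:
        best = (threads, elapsed)

# A real power-constrained search would also fix a power cap (e.g. via RAPL)
# before each run and optimize time or energy under that cap.
print(f"fastest: OMP_NUM_THREADS={best[0]} ({best[1]:.2f} s)")
```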
- U-Boost NAS: Utilization-Boosted Differentiable Neural Architecture Search [50.33956216274694]
Optimizing resource utilization on target platforms is key to achieving high performance during DNN inference.
We propose a novel hardware-aware NAS framework that optimizes not only for task accuracy and inference latency, but also for resource utilization.
We achieve a 2.8-4x speedup for DNN inference compared to prior hardware-aware NAS methods.
arXiv Detail & Related papers (2022-03-23T13:44:15Z)
- Source Code Classification for Energy Efficiency in Parallel Ultra Low-Power Microcontrollers [5.4352987210173955]
This paper aims at making the software toolchain smarter, so that modern architectures are exploited in the best way.
In the case of low-power, parallel embedded architectures, this means finding the configuration, for instance in terms of the number of cores, leading to minimum energy consumption.
Experiments show that using machine learning models on the source code to select the best energy scaling configuration automatically is viable and has the potential to be used in the context of automatic system configuration for energy minimisation.
arXiv Detail & Related papers (2020-12-12T15:12:03Z)
- The Case for Learning Application Behavior to Improve Hardware Energy Efficiency [2.4425948078034847]
We propose to learn application behavior at runtime and use the harvested knowledge to tune hardware configurations.
Our proposed approach, called FORECASTER, uses a deep learning model to learn what configuration of hardware resources provides the optimal energy efficiency for a certain behavior of an application.
Our results show that FORECASTER can save as much as 18.4% system power over a baseline setup with all resources enabled.
arXiv Detail & Related papers (2020-04-27T18:11:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.