Green AI: A Preliminary Empirical Study on Energy Consumption in DL
Models Across Different Runtime Infrastructures
- URL: http://arxiv.org/abs/2402.13640v2
- Date: Wed, 28 Feb 2024 00:48:06 GMT
- Title: Green AI: A Preliminary Empirical Study on Energy Consumption in DL
Models Across Different Runtime Infrastructures
- Authors: Negar Alizadeh and Fernando Castor
- Abstract summary: It is common practice to deploy pre-trained models on environments distinct from their native development settings.
This led to the introduction of interchange formats such as ONNX, which comes with its own runtime infrastructure, ONNX Runtime, and serves as a standard format across DL frameworks and languages.
- Score: 56.200335252600354
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep Learning (DL) frameworks such as PyTorch and TensorFlow include runtime
infrastructures responsible for executing trained models on target hardware,
managing memory, data transfers, and multi-accelerator execution, if
applicable. Additionally, it is a common practice to deploy pre-trained models
on environments distinct from their native development settings. This led to
the introduction of interchange formats such as ONNX, which comes with its own
runtime infrastructure, ONNX Runtime, and serves as a standard format that can
be used across diverse DL frameworks and languages. Even though these
runtime infrastructures have a great impact on inference performance, no
previous paper has investigated their energy efficiency. In this study, we
monitor the energy consumption and inference time in the runtime
infrastructures of three well-known DL frameworks as well as ONNX, using three
different DL models. To add nuance to our investigation, we also examine the
impact of using different execution providers. We find that the performance
and energy efficiency of DL models are difficult to predict. One framework, MXNet,
outperforms both PyTorch and TensorFlow for the computer vision models using
batch size 1, due to efficient GPU usage and thus low CPU usage. However, batch
size 64 makes PyTorch and MXNet practically indistinguishable, while TensorFlow
is outperformed consistently. For BERT, PyTorch exhibits the best performance.
Converting the models to ONNX yields significant performance improvements in
the majority of cases. Finally, in our preliminary investigation of execution
providers, we observe that TensorRT always outperforms CUDA.
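As a rough illustration of the study's setup, the sketch below exports a vision model to ONNX and then times inference under different ONNX Runtime execution providers (TensorRT, CUDA, CPU) while sampling GPU power through NVML. This is a minimal sketch, not the authors' harness; the model choice, run counts, and the instantaneous-power proxy for energy are all illustrative assumptions.

```python
# Minimal sketch (not the authors' harness): export a model to ONNX, then
# time inference under different ONNX Runtime execution providers while
# sampling GPU power via NVML.
import time

import numpy as np
import onnxruntime as ort
import pynvml                      # pip install pynvml
import torch
import torchvision

# Export an (untrained) ResNet-50 to ONNX.
model = torchvision.models.resnet50().eval()
torch.onnx.export(model, torch.randn(1, 3, 224, 224), "resnet50.onnx",
                  input_names=["input"], output_names=["logits"])

pynvml.nvmlInit()
gpu = pynvml.nvmlDeviceGetHandleByIndex(0)

def benchmark(provider: str, runs: int = 100) -> None:
    # TensorrtExecutionProvider requires an onnxruntime-gpu build with TensorRT.
    sess = ort.InferenceSession("resnet50.onnx", providers=[provider])
    x = np.random.rand(1, 3, 224, 224).astype(np.float32)
    sess.run(None, {"input": x})   # warm-up (builds TensorRT engines, etc.)
    latencies, watts = [], []
    for _ in range(runs):
        t0 = time.perf_counter()
        sess.run(None, {"input": x})
        latencies.append(time.perf_counter() - t0)
        # Instantaneous board power in milliwatts: a coarse proxy, and only
        # meaningful for the GPU-backed providers.
        watts.append(pynvml.nvmlDeviceGetPowerUsage(gpu) / 1000.0)
    print(f"{provider}: {1e3 * np.mean(latencies):.2f} ms/inference, "
          f"~{np.mean(watts):.1f} W")

for ep in ["TensorrtExecutionProvider", "CUDAExecutionProvider",
           "CPUExecutionProvider"]:
    benchmark(ep)
```

Instantaneous NVML readings only approximate energy; a faithful replication would integrate power over the whole measurement window, as energy studies of this kind typically do.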
Related papers
- SeBS-Flow: Benchmarking Serverless Cloud Function Workflows [51.4200085836966]
We propose the first serverless workflow benchmarking suite SeBS-Flow.
SeBS-Flow includes six real-world application benchmarks and four microbenchmarks representing different computational patterns.
We conduct comprehensive evaluations on three major cloud platforms, assessing performance, cost, scalability, and runtime deviations.
arXiv Detail & Related papers (2024-10-04T14:52:18Z)
- Dynamic DNNs and Runtime Management for Efficient Inference on
Mobile/Embedded Devices [2.8851756275902476]
Deep neural network (DNN) inference is increasingly being executed on mobile and embedded platforms.
We co-designed novel Dynamic Super-Networks to maximise system-level performance and energy efficiency.
Compared with SOTA, our experimental results using ImageNet on the GPU of Jetson Xavier NX show our model is 2.4x faster for similar ImageNet Top-1 accuracy, or 5.1% higher accuracy at similar latency.
arXiv Detail & Related papers (2024-01-17T04:40:30Z)
- Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative
Model Inference with Unstructured Sparsity [12.663030430488922]
We propose Flash-LLM for enabling low-cost and highly-efficient large generative model inference with unstructured sparsity on high-performance Tensor Cores.
At the SpMM kernel level, Flash-LLM significantly outperforms the state-of-the-art libraries Sputnik and SparTA by an average of 2.9x and 1.5x, respectively.
arXiv Detail & Related papers (2023-09-19T03:20:02Z)
- Cheaply Evaluating Inference Efficiency Metrics for Autoregressive
Transformer APIs [66.30706841821123]
Large language models (LLMs) power many state-of-the-art systems in natural language processing.
LLMs are extremely computationally expensive, even at inference time.
We propose a new metric for comparing inference efficiency across models.
arXiv Detail & Related papers (2023-05-03T21:51:42Z)
- Tensor Slicing and Optimization for Multicore NPUs [2.670309629218727]
This paper proposes a compiler optimization pass for Multicore NPUs, called Tensor Slicing Optimization (TSO).
Results show that TSO is capable of identifying the best tensor slicing that minimizes execution time for a set of CNN models.
arXiv Detail & Related papers (2023-04-06T12:03:03Z)
- Energy-efficient Task Adaptation for NLP Edge Inference Leveraging
Heterogeneous Memory Architectures [68.91874045918112]
adapter-ALBERT is an efficient model optimization for maximal data reuse across different tasks.
We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv Detail & Related papers (2023-03-25T14:40:59Z)
- Automatic Tuning of Tensorflow's CPU Backend using Gradient-Free
Optimization Algorithms [0.6543507682026964]
Deep learning (DL) applications are built using DL libraries and frameworks such as TensorFlow and PyTorch.
These frameworks have complex parameters, and tuning them to obtain good training and inference performance is challenging for typical users.
In this paper, we treat the problem of tuning parameters of DL frameworks to improve training and inference performance as a black-box problem (a minimal sketch of this framing appears after this list).
arXiv Detail & Related papers (2021-09-13T19:10:23Z)
- Top-KAST: Top-K Always Sparse Training [50.05611544535801]
We propose Top-KAST, a method that preserves constant sparsity throughout training (a simplified sketch appears after this list).
We show that it performs comparably to or better than previous works when training models on the established ImageNet benchmark.
In addition to our ImageNet results, we also demonstrate our approach in the domain of language modeling.
arXiv Detail & Related papers (2021-06-07T11:13:05Z)
- Utilizing Ensemble Learning for Performance and Power Modeling and
Improvement of Parallel Cancer Deep Learning CANDLE Benchmarks [0.0]
In this paper, we utilize ensemble learning to combine linear, nonlinear, and tree-/rule-based machine learning methods (see the sketch after this list).
We use the datasets collected for two parallel cancer deep learning CANDLE benchmarks, NT3 and P1B2.
We achieve up to 61.15% performance improvement and up to 62.58% energy saving for P1B2 and up to 55.81% performance improvement and up to 52.60% energy saving for NT3 on up to 24,576 cores.
arXiv Detail & Related papers (2020-11-12T21:18:20Z)
- Real-Time Execution of Large-scale Language Models on Mobile [49.32610509282623]
We find the best model structure of BERT for a given computation size to match specific devices.
Our framework can guarantee the identified model to meet both resource and real-time specifications of mobile devices.
Specifically, our model is 5.2x faster on CPU and 4.1x faster on GPU with 0.5-2% accuracy loss compared with BERT-base.
arXiv Detail & Related papers (2020-09-15T01:59:17Z)
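The gradient-free tuning paper above frames framework configuration as black-box optimization. The sketch below is a rough illustration under assumed knobs (TensorFlow's inter-op and intra-op thread counts) and a toy MobileNetV2 workload; each trial runs in a fresh subprocess because TF thread pools can only be configured before the first op executes. The paper evaluates more sophisticated gradient-free optimizers than this plain random search.

```python
# Black-box tuning sketch: search over TensorFlow CPU threading knobs.
import json
import random
import subprocess
import sys

WORKER = r"""
import json, sys, time
import tensorflow as tf

# Thread pools must be configured before any op runs, hence the subprocess.
inter, intra = int(sys.argv[1]), int(sys.argv[2])
tf.config.threading.set_inter_op_parallelism_threads(inter)
tf.config.threading.set_intra_op_parallelism_threads(intra)

x = tf.random.normal([8, 224, 224, 3])
model = tf.keras.applications.MobileNetV2(weights=None)
model(x)  # warm-up
t0 = time.perf_counter()
for _ in range(10):
    model(x)
print(json.dumps({"latency": (time.perf_counter() - t0) / 10}))
"""

def trial(inter: int, intra: int) -> float:
    out = subprocess.run([sys.executable, "-c", WORKER, str(inter), str(intra)],
                         capture_output=True, text=True, check=True)
    return json.loads(out.stdout.strip().splitlines()[-1])["latency"]

# Plain random search over the two knobs; any gradient-free optimizer fits here.
best = (None, float("inf"))
for _ in range(10):
    cfg = (random.randint(1, 16), random.randint(1, 16))
    lat = trial(*cfg)
    if lat < best[1]:
        best = (cfg, lat)
    print(f"inter/intra={cfg} -> {lat * 1e3:.1f} ms/batch")
print(f"best: inter/intra={best[0]} at {best[1] * 1e3:.1f} ms/batch")
```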
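For the Top-KAST entry, the following simplified PyTorch sketch keeps only the top-K weights by magnitude active in every forward pass, so the executed network stays sparse throughout training. It is a toy version: real Top-KAST also routes gradients to a larger "backward" set so pruned weights can re-enter, whereas here pruned weights receive no gradient and stay frozen.

```python
# Simplified Top-KAST-style sparse training: recompute a top-K magnitude
# mask each step and run the forward pass through the masked weights.
import torch
import torch.nn as nn

def topk_mask(w: torch.Tensor, density: float) -> torch.Tensor:
    """1/0 mask keeping the `density` fraction of largest-magnitude weights."""
    k = max(1, int(density * w.numel()))
    threshold = w.abs().flatten().kthvalue(w.numel() - k + 1).values
    return (w.abs() >= threshold).float()

class TopKLinear(nn.Module):
    def __init__(self, d_in: int, d_out: int, density: float = 0.1):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(d_out, d_in) * d_in ** -0.5)
        self.bias = nn.Parameter(torch.zeros(d_out))
        self.density = density

    def forward(self, x):
        # Sparsity pattern is re-derived from current magnitudes every step.
        mask = topk_mask(self.weight.detach(), self.density)
        return x @ (self.weight * mask).t() + self.bias

# Toy usage: regression with a 90%-sparse forward pass.
model = nn.Sequential(TopKLinear(32, 64), nn.ReLU(), TopKLinear(64, 1))
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
x, y = torch.randn(128, 32), torch.randn(128, 1)
for step in range(100):
    loss = ((model(x) - y) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```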
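Finally, for the ensemble-learning entry, here is a minimal scikit-learn sketch combining a linear, a nonlinear, and a tree-based regressor into one performance/power predictor. The synthetic features (core count, batch size, frequency) are illustrative assumptions, not the CANDLE datasets.

```python
# Ensemble of heterogeneous regressors for performance/power modeling.
import numpy as np
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 3))          # e.g. cores, batch size, frequency
y = 2 * X[:, 0] + np.sin(4 * X[:, 1]) + 0.1 * rng.normal(size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
ensemble = VotingRegressor([
    ("linear", LinearRegression()),                       # linear
    ("mlp", MLPRegressor(hidden_layer_sizes=(32,),        # nonlinear
                         max_iter=2000, random_state=0)),
    ("forest", RandomForestRegressor(n_estimators=100,    # tree-based
                                     random_state=0)),
])
ensemble.fit(X_tr, y_tr)
print(f"R^2 on held-out data: {ensemble.score(X_te, y_te):.3f}")
```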