Benchmarking Resource Usage for Efficient Distributed Deep Learning
- URL: http://arxiv.org/abs/2201.12423v1
- Date: Fri, 28 Jan 2022 21:24:15 GMT
- Title: Benchmarking Resource Usage for Efficient Distributed Deep Learning
- Authors: Nathan C. Frey, Baolin Li, Joseph McDonald, Dan Zhao, Michael Jones,
David Bestor, Devesh Tiwari, Vijay Gadepally, Siddharth Samsi
- Abstract summary: We conduct over 3,400 experiments training an array of deep networks representing various domains/tasks.
We fit power law models that describe how training time scales with available compute resources and energy constraints.
- Score: 10.869092085691687
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep learning (DL) workflows demand an ever-increasing budget of compute and
energy in order to achieve outsized gains. Neural architecture searches,
hyperparameter sweeps, and rapid prototyping consume immense resources that can
prevent resource-constrained researchers from experimenting with large models
and carry considerable environmental impact. As such, it becomes essential to
understand how different deep neural networks (DNNs) and training strategies leverage
increasing compute and energy resources -- especially specialized,
computationally intensive models across different domains and applications.
In this paper, we conduct over 3,400 experiments training an array of deep
networks representing various domains/tasks -- natural language processing,
computer vision, and chemistry -- on up to 424 graphics processing units
(GPUs). During training, our experiments systematically vary compute resource
characteristics and energy-saving mechanisms such as power utilization and GPU
clock rate limits to capture and illustrate the different trade-offs and
scaling behaviors each representative model exhibits under various resource and
energy-constrained regimes. We fit power law models that describe how training
time scales with available compute resources and energy constraints. We
anticipate that these findings will help inform and guide high-performance
computing providers in optimizing resource utilization by selectively reducing
energy consumption for different deep learning tasks/workflows with minimal
impact on training.
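
The experiments described above vary energy-saving mechanisms such as power caps and GPU clock-rate limits. As a rough illustration of how such limits can be imposed programmatically, the sketch below uses the NVML Python bindings (the nvidia-ml-py package); this choice of interface, as well as the 200 W cap and 1200 MHz clock ceiling, are illustrative assumptions rather than details taken from the paper, and the calls typically require administrator privileges.

import pynvml

# A minimal sketch (not the authors' tooling): apply a power cap and a GPU
# clock-rate limit through NVML's Python bindings. The 200 W cap and the
# 1200 MHz ceiling are placeholder values.
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU

# Query the allowed power-limit range (values are reported in milliwatts).
min_mw, max_mw = pynvml.nvmlDeviceGetPowerManagementLimitConstraints(handle)
print(f"allowed power limit: {min_mw / 1000:.0f}-{max_mw / 1000:.0f} W")

# Cap board power at 200 W and lock GPU clocks to at most 1200 MHz
# (both calls usually require administrator privileges).
pynvml.nvmlDeviceSetPowerManagementLimit(handle, 200_000)
pynvml.nvmlDeviceSetGpuLockedClocks(handle, 0, 1200)

# ... run the training job under these constraints, then restore defaults.
pynvml.nvmlDeviceResetGpuLockedClocks(handle)
pynvml.nvmlDeviceSetPowerManagementLimit(
    handle, pynvml.nvmlDeviceGetPowerManagementDefaultLimit(handle)
)
pynvml.nvmlShutdown()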
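
The scaling analysis fits power laws of training time against available compute. The following sketch shows how such a fit can be performed; the GPU counts and training times are made-up placeholder numbers, not measurements from the paper, and the functional form T(n) ~ a * n**b is only the simplest variant of such a model.

import numpy as np

# Hypothetical measurements of training time versus GPU count
# (placeholder numbers for illustration -- not results from the paper).
gpus  = np.array([1, 2, 4, 8, 16, 32], dtype=float)
hours = np.array([96.0, 52.0, 28.5, 16.0, 9.5, 6.2])

# Fit T(n) ~ a * n**b by ordinary least squares in log-log space.
b, log_a = np.polyfit(np.log(gpus), np.log(hours), 1)
a = np.exp(log_a)
print(f"fitted scaling law: T(n) ~ {a:.1f} * n**{b:.2f}")

# An exponent b near -1 indicates near-linear scaling with GPU count;
# values closer to 0 indicate diminishing returns from adding more GPUs.
print(f"predicted training time on 64 GPUs: {a * 64**b:.1f} h")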
Related papers
- ssProp: Energy-Efficient Training for Convolutional Neural Networks with Scheduled Sparse Back Propagation [4.77407121905745]
Back-propagation (BP) is a major source of computational expense during the training of deep learning models.
We propose a general, energy-efficient convolution module that can be seamlessly integrated into any deep learning architecture.
arXiv Detail & Related papers (2024-08-22T17:22:59Z) - Resource Allocation and Workload Scheduling for Large-Scale Distributed Deep Learning: A Survey [48.06362354403557]
This survey reviews the literature, mainly from 2019 to 2024, on efficient resource allocation and workload scheduling strategies for large-scale distributed DL.
We highlight critical challenges for each topic and discuss key insights of existing technologies.
This survey aims to help researchers in computer science, artificial intelligence, and communications understand recent advances.
arXiv Detail & Related papers (2024-06-12T11:51:44Z) - Investigation of Energy-efficient AI Model Architectures and Compression Techniques for "Green" Fetal Brain Segmentation [42.52549987351643]
Fetal brain segmentation in medical imaging is challenging due to the small size of the fetal brain and the limited image quality of fast 2D sequences.
Deep neural networks are a promising method to overcome this challenge.
Our study aims to explore model architectures and compression techniques that promote energy efficiency.
arXiv Detail & Related papers (2024-04-03T15:11:53Z) - Mechanistic Neural Networks for Scientific Machine Learning [58.99592521721158]
We present Mechanistic Neural Networks, a neural network design for machine learning applications in the sciences.
It incorporates a new Mechanistic Block in standard architectures to explicitly learn governing differential equations as representations.
Central to our approach is a novel Relaxed Linear Programming solver (NeuRLP) inspired by a technique that reduces solving linear ODEs to solving linear programs.
arXiv Detail & Related papers (2024-02-20T15:23:24Z) - The Power of Training: How Different Neural Network Setups Influence the Energy Demand [5.526611783155303]
This work offers an evaluation of how variations in machine learning training regimes and learning paradigms affect the energy consumption of computing, especially on HPC hardware, from a life-cycle-aware perspective.
arXiv Detail & Related papers (2024-01-03T17:44:17Z) - Computation-efficient Deep Learning for Computer Vision: A Survey [121.84121397440337]
Deep learning models have reached or even exceeded human-level performance in a range of visual perception tasks.
Deep learning models usually demand significant computational resources, leading to impractical power consumption, latency, or carbon emissions in real-world scenarios.
A growing research focus is computationally efficient deep learning, which strives to achieve satisfactory performance while minimizing the computational cost during inference.
arXiv Detail & Related papers (2023-08-27T03:55:28Z) - Learnability with Time-Sharing Computational Resource Concerns [65.268245109828]
We present a theoretical framework that takes into account the influence of computational resources in learning theory.
This framework can be naturally applied to stream learning where the incoming data streams can be potentially endless.
It may also provide a theoretical perspective for the design of intelligent supercomputing operating systems.
arXiv Detail & Related papers (2023-05-03T15:54:23Z) - Sustainable AIGC Workload Scheduling of Geo-Distributed Data Centers: A
Multi-Agent Reinforcement Learning Approach [48.18355658448509]
Recent breakthroughs in generative artificial intelligence have triggered a surge in demand for machine learning training, which poses significant cost burdens and environmental challenges due to its substantial energy consumption.
Scheduling training jobs among geographically distributed cloud data centers unveils the opportunity to optimize the usage of computing capacity powered by inexpensive and low-carbon energy.
We propose an algorithm based on multi-agent reinforcement learning and actor-critic methods to learn the optimal collaborative scheduling strategy through interacting with a cloud system built with real-life workload patterns, energy prices, and carbon intensities.
arXiv Detail & Related papers (2023-04-17T02:12:30Z) - Sparsity in Deep Learning: Pruning and growth for efficient inference
and training in neural networks [78.47459801017959]
Sparsity can reduce the memory footprint of regular networks so that they fit on mobile devices.
We describe approaches to remove and add elements of neural networks, different training strategies to achieve model sparsity, and mechanisms to exploit sparsity in practice (a toy magnitude-pruning sketch follows this list).
arXiv Detail & Related papers (2021-01-31T22:48:50Z) - Resource-Efficient Neural Networks for Embedded Systems [23.532396005466627]
We provide an overview of the current state of the art of machine learning techniques.
We focus on resource-efficient inference based on deep neural networks (DNNs), the predominant machine learning models of the past decade.
We substantiate our discussion with experiments on well-known benchmark data sets using compression techniques.
arXiv Detail & Related papers (2020-01-07T14:17:09Z)
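
As referenced in the sparsity survey entry above, the sketch below illustrates one common mechanism that such work covers, magnitude-based weight pruning; the layer size and sparsity target are arbitrary illustrative choices, not values from any of the listed papers.

import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude weights until `sparsity` of them are zero."""
    k = int(sparsity * weights.size)
    if k == 0:
        return weights.copy()
    # The k-th smallest absolute value becomes the pruning threshold.
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    return np.where(np.abs(weights) > threshold, weights, 0.0)

# Toy example: prune 90% of a randomly initialized 256x256 weight matrix.
rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256))
w_pruned = magnitude_prune(w, sparsity=0.9)
print("fraction of zero weights:", float(np.mean(w_pruned == 0.0)))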