Efficient Algorithms for Device Placement of DNN Graph Operators
- URL: http://arxiv.org/abs/2006.16423v2
- Date: Thu, 29 Oct 2020 19:07:35 GMT
- Title: Efficient Algorithms for Device Placement of DNN Graph Operators
- Authors: Jakub Tarnawski, Amar Phanishayee, Nikhil R. Devanur, Divya Mahajan,
Fanny Nina Paravecino
- Abstract summary: Modern machine learning workloads use large models, with complex structures, that are very expensive to execute.
The devices that execute complex models are becoming increasingly heterogeneous as we see a flourishing of domain-specific hardware accelerators offered in addition to CPUs.
Recent work has shown that significant gains can be obtained with model parallelism, i.e., partitioning a neural network's computational graph onto multiple devices.
In this paper, we identify and isolate the structured optimization problem at the core of device placement of DNN operators, for both inference and training, especially in modern pipelined settings.
- Score: 12.871398348743591
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modern machine learning workloads use large models, with complex structures,
that are very expensive to execute. The devices that execute complex models are
becoming increasingly heterogeneous as we see a flourishing of domain-specific
hardware accelerators offered in addition to CPUs. These
trends necessitate distributing the workload across multiple devices. Recent
work has shown that significant gains can be obtained with model parallelism,
i.e., partitioning a neural network's computational graph onto multiple devices.
In particular, this form of parallelism assumes a pipeline of devices, which is
fed a stream of samples and yields high throughput for training and inference
of DNNs. However, for such settings (large models and multiple heterogeneous
devices), we require automated algorithms and toolchains that can partition the
ML workload across devices. In this paper, we identify and isolate the
structured optimization problem at the core of device placement of DNN
operators, for both inference and training, especially in modern pipelined
settings. We then provide algorithms that solve this problem to optimality. We
demonstrate the applicability and efficiency of our approaches using several
contemporary DNN computation graphs.
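To make the optimization problem concrete, here is a deliberately simplified toy version of it: split a linearized chain of operator costs into contiguous stages, one stage per device, so that the most expensive stage, which bounds pipeline throughput, is as cheap as possible. This is a hedged sketch only; it assumes a linear operator chain, identical devices, and no communication or memory constraints, and the per-operator costs are invented. It is not the paper's algorithm.

```python
import math
from functools import lru_cache

def split_into_stages(op_costs, num_devices):
    """Split a chain of operator costs into contiguous pipeline stages so the
    most expensive stage (the pipeline bottleneck) is as cheap as possible."""
    n = len(op_costs)
    prefix = [0.0]
    for c in op_costs:
        prefix.append(prefix[-1] + c)

    @lru_cache(maxsize=None)
    def best(i, k):
        # Cheapest achievable bottleneck for op_costs[i:] using k devices.
        if k == 1:
            return prefix[n] - prefix[i]
        return min(max(prefix[j] - prefix[i], best(j, k - 1))
                   for j in range(i + 1, n - k + 2))   # first stage = ops i..j-1

    bottleneck = best(0, num_devices)

    # Recover one optimal placement by walking the DP forward.
    stages, i = [], 0
    for k in range(num_devices, 1, -1):
        for j in range(i + 1, n - k + 2):
            if max(prefix[j] - prefix[i], best(j, k - 1)) <= bottleneck:
                stages.append(list(range(i, j)))
                i = j
                break
    stages.append(list(range(i, n)))
    return bottleneck, stages

# Invented per-operator costs (e.g. milliseconds) and a 3-device pipeline.
print(split_into_stages([4.0, 2.0, 7.0, 3.0, 3.0, 5.0, 1.0, 6.0], num_devices=3))
# -> (13.0, [[0, 1], [2, 3, 4], [5, 6, 7]])
```

The problem treated in the paper is considerably richer (general operator graphs, heterogeneous devices, memory limits, and training as well as inference), which is why the authors develop dedicated algorithms rather than a one-dimensional dynamic program like this.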
Related papers
- FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs).
We show that our system and method can achieve a 1.45-9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z)
- Harnessing Manycore Processors with Distributed Memory for Accelerated Training of Sparse and Recurrent Models [43.1773057439246]
Current AI training infrastructure is dominated by single instruction multiple data (SIMD) and systolic array architectures.
We explore sparse and recurrent model training on a massively parallel multiple instruction multiple data (MIMD) architecture with distributed local memory.
arXiv Detail & Related papers (2023-11-07T23:18:35Z)
- Partitioning Distributed Compute Jobs with Reinforcement Learning and Graph Neural Networks [58.720142291102135]
Large-scale machine learning models are bringing advances to a broad range of fields.
Many of these models are too large to be trained on a single machine, and must be distributed across multiple devices.
We show that maximum parallelisation is sub-optimal in relation to user-critical metrics such as throughput and blocking rate.
arXiv Detail & Related papers (2023-01-31T17:41:07Z)
- Intelligence Processing Units Accelerate Neuromorphic Learning [52.952192990802345]
Spiking neural networks (SNNs) have achieved orders-of-magnitude improvements in energy consumption and latency.
We present an IPU-optimized release of our custom SNN Python package, snnTorch.
arXiv Detail & Related papers (2022-11-19T15:44:08Z)
- Real-time Hyper-Dimensional Reconfiguration at the Edge using Hardware Accelerators [12.599871451119538]
HyDRATE can perform real-time reconfiguration at the edge using deep neural networks (DNNs) combined with hyperdimensional (HD) computing accelerators.
We describe the algorithm, trained quantized model generation, and simulated performance of a feature extractor free of multiply-accumulates.
We show that reconfigurability in the field is achieved by retraining only the feed-forward HD classifier, without gradient-descent backpropagation.
arXiv Detail & Related papers (2022-06-10T14:08:41Z)
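The HyDRATE summary above hinges on retraining only a hyperdimensional (HD) classifier, with no backpropagation. The sketch below is a minimal, generic HD classifier in NumPy, not HyDRATE's implementation: the dimensions, the random-projection encoder, and the synthetic "features" (standing in for frozen DNN embeddings) are all assumptions. Class prototypes are built by bundling (summing) encoded samples, so retraining in the field is just more accumulation.

```python
import numpy as np

rng = np.random.default_rng(0)
FEAT_DIM, HD_DIM = 64, 10_000           # illustrative sizes, not from the paper

# Fixed random bipolar projection: encodes a feature vector into HD space.
PROJECTION = rng.choice([-1.0, 1.0], size=(HD_DIM, FEAT_DIM))

def encode(features):
    """Map a feature vector to a bipolar hypervector (sign of a random projection)."""
    return np.sign(PROJECTION @ features)

class HDClassifier:
    """Class prototypes are bundled (summed) hypervectors; no backprop anywhere."""
    def __init__(self):
        self.prototypes = {}                     # class label -> accumulated hypervector

    def train(self, features, label):
        self.prototypes.setdefault(label, np.zeros(HD_DIM))
        self.prototypes[label] += encode(features)   # "retraining" = more bundling

    def predict(self, features):
        hv = encode(features)
        # Cosine-style similarity against each prototype; highest wins.
        return max(self.prototypes,
                   key=lambda c: hv @ self.prototypes[c]
                   / (np.linalg.norm(self.prototypes[c]) + 1e-9))

# Toy usage with synthetic class centres standing in for DNN embeddings.
centres = [rng.normal(size=FEAT_DIM) for _ in range(3)]
clf = HDClassifier()
for label, centre in enumerate(centres):
    for _ in range(20):
        clf.train(centre + 0.3 * rng.normal(size=FEAT_DIM), label)
print(clf.predict(centres[1] + 0.3 * rng.normal(size=FEAT_DIM)))   # expected: 1
```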
- Parallel and Distributed Graph Neural Networks: An In-Depth Concurrency Analysis [28.464210819376593]
Graph neural networks (GNNs) are among the most powerful tools in deep learning.
They routinely solve complex problems on unstructured networks, such as node classification, graph classification, or link prediction, with high accuracy.
However, both inference and training of GNNs are complex, and they uniquely combine the features of irregular graph processing with dense and regular computations.
This complexity makes it very challenging to execute GNNs efficiently on modern massively parallel architectures.
arXiv Detail & Related papers (2022-05-19T17:11:45Z)
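The concurrency-analysis summary above notes that GNNs combine irregular graph processing with dense, regular computation. As a concrete illustration (a generic graph-convolution step, not taken from that paper; the toy graph, dimensions, and mean aggregation are invented), the sketch below separates the two parts: an adjacency-driven neighbour aggregation with data-dependent, uneven work per node, followed by one dense matrix multiply that maps cleanly onto accelerators.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy graph: an adjacency list makes the irregular, data-dependent access explicit.
adjacency = {0: [1, 2], 1: [0], 2: [0, 3], 3: [2], 4: []}   # node -> neighbours
num_nodes, in_dim, out_dim = len(adjacency), 8, 4
X = rng.normal(size=(num_nodes, in_dim))      # node features
W = rng.normal(size=(in_dim, out_dim))        # layer weights

def gnn_layer(adjacency, X, W):
    # 1) Irregular part: gather and mean-aggregate neighbour features.
    #    Work per node depends on its degree, so it parallelises unevenly.
    agg = np.zeros_like(X)
    for node, neighbours in adjacency.items():
        if neighbours:
            agg[node] = X[neighbours].mean(axis=0)
    # 2) Regular part: one dense GEMM shared by every node.
    return np.maximum(0.0, (X + agg) @ W)     # ReLU((X + aggregate) W)

print(gnn_layer(adjacency, X, W).shape)       # (5, 4)
```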
- An Adaptive Device-Edge Co-Inference Framework Based on Soft Actor-Critic [72.35307086274912]
High-dimensional parameter models and large-scale mathematical calculations restrict execution efficiency, especially on Internet of Things (IoT) devices.
We propose a new Deep Reinforcement Learning (DRL) method, Soft Actor-Critic for discrete (SAC-d), which generates the exit point and compressing bits by soft policy iterations.
Based on a latency- and accuracy-aware reward design, such a computation can adapt well to complex environments such as dynamic wireless channels and arbitrary processing, and is capable of supporting the 5G URL
arXiv Detail & Related papers (2022-01-09T09:31:50Z)
- Collaborative Learning over Wireless Networks: An Introductory Overview [84.09366153693361]
We will mainly focus on collaborative training across wireless devices.
Many distributed optimization algorithms have been developed over the last decades.
They provide data locality; that is, a joint model can be trained collaboratively while the data available at each participating device remains local.
arXiv Detail & Related papers (2021-12-07T20:15:39Z)
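The collaborative-learning overview above stresses data locality: devices train a joint model without sharing raw samples. The sketch below uses federated averaging purely as one canonical example of that idea, not as the survey's specific method; the linear model, synthetic data, and hyperparameters are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
DIM, DEVICES, ROUNDS, LR = 5, 4, 20, 0.1
true_w = rng.normal(size=DIM)

# Each device holds its own data; raw samples never leave the device.
local_data = []
for _ in range(DEVICES):
    X = rng.normal(size=(50, DIM))
    y = X @ true_w + 0.01 * rng.normal(size=50)
    local_data.append((X, y))

def local_update(w, X, y, steps=5):
    """A few local gradient steps on one device's private least-squares data."""
    w = w.copy()
    for _ in range(steps):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)
        w -= LR * grad
    return w

w_global = np.zeros(DIM)
for _ in range(ROUNDS):
    # Only model updates travel to the server, which averages them.
    local_models = [local_update(w_global, X, y) for X, y in local_data]
    w_global = np.mean(local_models, axis=0)

print(np.linalg.norm(w_global - true_w))      # small after a few rounds
```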
- Quantized Neural Networks via {-1, +1} Encoding Decomposition and Acceleration [83.84684675841167]
We propose a novel encoding scheme using {-1, +1} to decompose quantized neural networks (QNNs) into multi-branch binary networks.
We validate the effectiveness of our method on large-scale image classification, object detection, and semantic segmentation tasks.
arXiv Detail & Related papers (2021-06-18T03:11:15Z)
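The entry above describes splitting a quantized network into multi-branch binary networks via a {-1, +1} encoding. The sketch below only illustrates the arithmetic identity behind such decompositions, not the paper's full scheme: every odd level in [-(2^M - 1), 2^M - 1] can be written as a sum of powers of two times {-1, +1} values, so an M-bit weight matrix splits into M binary branches combined with fixed scales. The bit width and matrix sizes are invented.

```python
import numpy as np

rng = np.random.default_rng(3)
M = 3                                              # bits per weight (illustrative)

# Pretend these are M-bit quantized weights: odd levels in [-(2^M-1), 2^M-1].
levels = np.arange(-(2**M - 1), 2**M, 2)
Wq = rng.choice(levels, size=(4, 6))

def binary_branches(Wq, M):
    """Split an odd-integer weight matrix into M {-1,+1} matrices so that
    Wq == sum_i 2**i * branches[i]."""
    idx = ((Wq + (2**M - 1)) // 2).astype(int)     # map levels to 0 .. 2^M - 1
    return [2 * ((idx >> i) & 1) - 1 for i in range(M)]   # i-th bit -> {-1,+1}

branches = binary_branches(Wq, M)
assert np.array_equal(sum((2**i) * b for i, b in enumerate(branches)), Wq)

# Multi-branch forward pass: each branch is a cheap {-1,+1} matmul, and the
# branch outputs are combined with fixed power-of-two scales.
x = rng.normal(size=6)
y = sum((2**i) * (b @ x) for i, b in enumerate(branches))
assert np.allclose(y, Wq @ x)
print(y)
```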
- Learning on Hardware: A Tutorial on Neural Network Accelerators and Co-Processors [0.0]
Deep neural networks (DNNs) can take a large number of parameters into account, which enables them to solve complex tasks.
In computer vision and speech recognition, they achieve better accuracy than conventional algorithms, and in some tasks they even surpass human experts.
With the progress of DNNs in recent years, many other fields of application, such as disease diagnosis and autonomous driving, are taking advantage of them.
arXiv Detail & Related papers (2021-04-19T12:50:27Z)