Integrated Hardware Architecture and Device Placement Search
- URL: http://arxiv.org/abs/2407.13143v1
- Date: Thu, 18 Jul 2024 04:02:35 GMT
- Title: Integrated Hardware Architecture and Device Placement Search
- Authors: Irene Wang, Jakub Tarnawski, Amar Phanishayee, Divya Mahajan,
- Abstract summary: Distributed execution of deep learning training involves a dynamic interplay between hardware accelerator architecture and device placement strategy.
This is the first work to explore the co-optimization of determining the optimal architecture and device placement strategy.
Our approach achieves higher throughput on large language models compared to the state-of-the-art TPUv4 and the Spotlight accelerator search framework.
- Score: 7.620610652090732
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Distributed execution of deep learning training involves a dynamic interplay between hardware accelerator architecture and device placement strategy. This is the first work to explore the co-optimization of determining the optimal architecture and device placement strategy through novel algorithms, improving the balance of computational resources, memory usage, and data distribution. Our architecture search leverages tensor and vector units, determining their quantity and dimensionality, and on-chip and off-chip memory configurations. It also determines the microbatch size and decides whether to recompute or stash activations, balancing the memory footprint of training and storage size. For each explored architecture configuration, we use an Integer Linear Program (ILP) to find the optimal schedule for executing operators on the accelerator. The ILP results then integrate with a dynamic programming solution to identify the most effective device placement strategy, combining data, pipeline, and tensor model parallelism across multiple accelerators. Our approach achieves higher throughput on large language models compared to the state-of-the-art TPUv4 and the Spotlight accelerator search framework. The entire source code of PHAZE is available at https://github.com/msr-fiddle/phaze.
Related papers
- Performance and Energy Consumption of Parallel Machine Learning
Algorithms [0.0]
Machine learning models have achieved remarkable success in various real-world applications.
Model training in machine learning requires large-scale data sets and multiple iterations before it can work properly.
Parallelization of training algorithms is a common strategy to speed up the process of training.
arXiv Detail & Related papers (2023-05-01T13:04:39Z) - Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures [67.47328776279204]
This work introduces a framework to develop efficient, portable Deep Learning and High Performance Computing kernels.
We decompose the kernel development in two steps: 1) Expressing the computational core using Processing Primitives (TPPs) and 2) Expressing the logical loops around TPPs in a high-level, declarative fashion.
We demonstrate the efficacy of our approach using standalone kernels and end-to-end workloads that outperform state-of-the-art implementations on diverse CPU platforms.
arXiv Detail & Related papers (2023-04-25T05:04:44Z) - Energy-efficient Task Adaptation for NLP Edge Inference Leveraging
Heterogeneous Memory Architectures [68.91874045918112]
adapter-ALBERT is an efficient model optimization for maximal data reuse across different tasks.
We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv Detail & Related papers (2023-03-25T14:40:59Z) - Partitioning Distributed Compute Jobs with Reinforcement Learning and
Graph Neural Networks [58.720142291102135]
Large-scale machine learning models are bringing advances to a broad range of fields.
Many of these models are too large to be trained on a single machine, and must be distributed across multiple devices.
We show that maximum parallelisation is sub-optimal in relation to user-critical metrics such as throughput and blocking rate.
arXiv Detail & Related papers (2023-01-31T17:41:07Z) - Shapley-NAS: Discovering Operation Contribution for Neural Architecture
Search [96.20505710087392]
We propose a Shapley value based method to evaluate operation contribution (Shapley-NAS) for neural architecture search.
We show that our method outperforms the state-of-the-art methods by a considerable margin with light search cost.
arXiv Detail & Related papers (2022-06-20T14:41:49Z) - DAAS: Differentiable Architecture and Augmentation Policy Search [107.53318939844422]
This work considers the possible coupling between neural architectures and data augmentation and proposes an effective algorithm jointly searching for them.
Our approach achieves 97.91% accuracy on CIFAR-10 and 76.6% Top-1 accuracy on ImageNet dataset, showing the outstanding performance of our search algorithm.
arXiv Detail & Related papers (2021-09-30T17:15:17Z) - Hard-ODT: Hardware-Friendly Online Decision Tree Learning Algorithm and
System [17.55491405857204]
In the era of big data, traditional decision tree induction algorithms are not suitable for learning large-scale datasets.
We introduce a new quantile-based algorithm to improve the induction of the Hoeffding tree, one of the state-of-the-art online learning models.
We present Hard-ODT, a high-performance, hardware-efficient and scalable online decision tree learning system on a field-programmable gate array (FPGA) with system-level optimization techniques.
arXiv Detail & Related papers (2020-12-11T12:06:44Z) - ConfuciuX: Autonomous Hardware Resource Assignment for DNN Accelerators
using Reinforcement Learning [5.251940442946459]
We propose an autonomous strategy called ConfuciuX to find optimized HW resource assignments for a given model and dataflow style.
It converges to the optimized hardware configuration 4.7 to 24 times faster than alternate techniques.
arXiv Detail & Related papers (2020-09-04T04:59:26Z) - Auto-MAP: A DQN Framework for Exploring Distributed Execution Plans for
DNN Workloads [11.646744408920764]
Auto-MAP is a framework for exploring distributed execution plans for workloads.
It can automatically discovering fast parallelization strategies through reinforcement learning on IR level of deep learning models.
Our evaluation shows that Auto-MAP can find the optimal solution in two hours, while achieving better throughput on several NLP and convolution models.
arXiv Detail & Related papers (2020-07-08T12:38:03Z) - Automated Design Space Exploration for optimised Deployment of DNN on
Arm Cortex-A CPUs [13.628734116014819]
Deep learning on embedded devices has prompted the development of numerous methods to optimise the deployment of deep neural networks (DNN)
There is a lack of research on cross-level optimisation as the space of approaches becomes too large to test and obtain a globally optimised solution.
We present a set of results for state-of-the-art DNNs on a range of Arm Cortex-A CPU platforms achieving up to 4x improvement in performance and over 2x reduction in memory.
arXiv Detail & Related papers (2020-06-09T11:00:06Z) - Stage-Wise Neural Architecture Search [65.03109178056937]
Modern convolutional networks such as ResNet and NASNet have achieved state-of-the-art results in many computer vision applications.
These networks consist of stages, which are sets of layers that operate on representations in the same resolution.
It has been demonstrated that increasing the number of layers in each stage improves the prediction ability of the network.
However, the resulting architecture becomes computationally expensive in terms of floating point operations, memory requirements and inference time.
arXiv Detail & Related papers (2020-04-23T14:16:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.