Related papers: PaSE: Parallelization Strategies for Efficient DNN Training

PaSE: Parallelization Strategies for Efficient DNN Training

URL: http://arxiv.org/abs/2407.04001v1
Date: Thu, 4 Jul 2024 15:21:20 GMT
Title: PaSE: Parallelization Strategies for Efficient DNN Training
Authors: Venmugil Elango,
Abstract summary: Training a deep neural network (DNN) requires substantial computational and memory requirements. Standard practice is to use data parallelism because of its simplicity. Expert-designed strategies have been proposed on a case-by-case basis using domain specific knowledge.
Score: 0.09889128046943638
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Training a deep neural network (DNN) requires substantial computational and memory requirements. It is common to use multiple devices to train a DNN to reduce the overall training time. There are several choices to parallelize each layer in a DNN. Exhaustively searching this list to find an optimal parallelization strategy is prohibitively time consuming and impractical. The standard practice is to use data parallelism because of its simplicity. However, data parallelism is often sub-optimal, and suffers from poor performance and high memory requirement. Expert-designed strategies have been proposed on a case-by-case basis using domain specific knowledge. These expert-designed strategies do not generalize well to DNNs other than the ones for which they were designed, and are not always necessarily the best choice. In this paper, we propose an approach to automatically find efficient parallelization strategies for DNNs from their computation graphs. We present an efficient algorithm to compute these strategies within a reasonable time in practice. We evaluate the effectiveness of our approach on various DNNs. We also compare the performance of the strategies identified by our approach against data parallelism, expert-designed strategies, and the state-of-the-art approaches. Our results show that the strategies found using our approach outperform the baseline data parallelism strategy in all the cases. In addition, our strategies achieve better performance than the expert-designed strategies and the state-of-the-art approaches.

Related papers

Do We Truly Need So Many Samples? Multi-LLM Repeated Sampling Efficiently Scales Test-Time Compute [55.330813919992465]
This paper presents a simple, effective, and cost-efficient strategy to improve LLM performance by scaling test-time compute. Our strategy builds upon the repeated-sampling-then-voting framework, with a novel twist: incorporating multiple models, even weaker ones, to leverage their complementary strengths.
arXiv Detail & Related papers (2025-04-01T13:13:43Z)
Deep Reinforcement Learning for Online Optimal Execution Strategies [49.1574468325115]
This paper tackles the challenge of learning non-Markovian optimal execution strategies in dynamic financial markets. We introduce a novel actor-critic algorithm based on Deep Deterministic Policy Gradient (DDPG) We show that our algorithm successfully approximates the optimal execution strategy.
arXiv Detail & Related papers (2024-10-17T12:38:08Z)
Scalable and Equitable Math Problem Solving Strategy Prediction in Big Educational Data [2.86829428083307]
We develop an embedding called MVec where we learn a representation based on the mastery of students. We then cluster these embeddings with a non-parametric clustering method. We show that our approach can scale up to achieve high accuracy by training on a small sample of a large dataset.
arXiv Detail & Related papers (2023-08-07T19:51:10Z)
Online Network Source Optimization with Graph-Kernel MAB [62.6067511147939]
We propose Grab-UCB, a graph- kernel multi-arms bandit algorithm to learn online the optimal source placement in large scale networks. We describe the network processes with an adaptive graph dictionary model, which typically leads to sparse spectral representations. We derive the performance guarantees that depend on network parameters, which further influence the learning curve of the sequential decision strategy.
arXiv Detail & Related papers (2023-07-07T15:03:42Z)
Towards a Better Theoretical Understanding of Independent Subnetwork Training [56.24689348875711]
We take a closer theoretical look at Independent Subnetwork Training (IST) IST is a recently proposed and highly effective technique for solving the aforementioned problems. We identify fundamental differences between IST and alternative approaches, such as distributed methods with compressed communication.
arXiv Detail & Related papers (2023-06-28T18:14:22Z)
Deep Reinforcement Learning for Exact Combinatorial Optimization: Learning to Branch [13.024115985194932]
We propose a new approach for solving the data labeling and inference issues in optimization based on the use of the reinforcement learning (RL) paradigm. We use imitation learning to bootstrap an RL agent and then use Proximal Policy (PPO) to further explore global optimal actions.
arXiv Detail & Related papers (2022-06-14T16:35:58Z)
Survey on Large Scale Neural Network Training [48.424512364338746]
Modern Deep Neural Networks (DNNs) require significant memory to store weight, activations, and other intermediate tensors during training. This survey provides a systematic overview of the approaches that enable more efficient DNNs training.
arXiv Detail & Related papers (2022-02-21T18:48:02Z)
Understanding Robust Generalization in Learning Regular Languages [85.95124524975202]
We study robust generalization in the context of using recurrent neural networks to learn regular languages. We propose a compositional strategy to address this. We theoretically prove that the compositional strategy generalizes significantly better than the end-to-end strategy.
arXiv Detail & Related papers (2022-02-20T02:50:09Z)
DistIR: An Intermediate Representation and Simulator for Efficient Neural Network Distribution [15.086401550425125]
DistIR is a representation for distributed computation that is tailored for efficient analyses. We show how DistIR and its simulator enable fast grid searches over complex distribution spaces spanning up to 1000+ configurations.
arXiv Detail & Related papers (2021-11-09T21:32:51Z)
Parallel Training of Deep Networks with Local Updates [84.30918922367442]
Local parallelism is a framework which parallelizes training of individual layers in deep networks by replacing global backpropagation with truncated layer-wise backpropagation. We show results in both vision and language domains across a diverse set of architectures, and find that local parallelism is particularly effective in the high-compute regime.
arXiv Detail & Related papers (2020-12-07T16:38:45Z)
Auto-MAP: A DQN Framework for Exploring Distributed Execution Plans for DNN Workloads [11.646744408920764]
Auto-MAP is a framework for exploring distributed execution plans for workloads. It can automatically discovering fast parallelization strategies through reinforcement learning on IR level of deep learning models. Our evaluation shows that Auto-MAP can find the optimal solution in two hours, while achieving better throughput on several NLP and convolution models.
arXiv Detail & Related papers (2020-07-08T12:38:03Z)
TensorOpt: Exploring the Tradeoffs in Distributed DNN Training with Auto-Parallelism [21.980316675614787]
A good parallelization strategy can significantly improve the efficiency or reduce the cost for the distributed training of deep neural networks (DNNs) We propose FT, an efficient algorithm that searches for an optimal set of parallelization strategies to allow the trade-off among different objectives. We also develop a user-friendly system, calledOpt, which allows users to run their distributed DNN training jobs without caring the details of parallelization strategies.
arXiv Detail & Related papers (2020-04-16T02:57:35Z)

This list is automatically generated from the titles and abstracts of the papers in this site.