Horizontally Fused Training Array: An Effective Hardware Utilization
Squeezer for Training Novel Deep Learning Models
- URL: http://arxiv.org/abs/2102.02344v1
- Date: Wed, 3 Feb 2021 23:56:55 GMT
- Title: Horizontally Fused Training Array: An Effective Hardware Utilization
Squeezer for Training Novel Deep Learning Models
- Authors: Shang Wang, Peiming Yang, Yuxuan Zheng, Xin Li, Gennady Pekhimenko
- Abstract summary: We show that single-accelerator training jobs can dominate the cluster-wide resource consumption when launched repetitively.
We propose Horizontally Fused Training Array (HFTA) to help DL researchers and practitioners effectively and easily improve the hardware utilization of their novel DL training workloads.
HFTA demonstrates strong effectiveness in squeezing out hardware utilization and achieves up to $15.1\times$ higher training throughput vs. the standard practice of running each job on a separate accelerator.
- Score: 8.055533378391814
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Driven by the tremendous effort in researching novel deep learning (DL)
algorithms, the training cost of developing new models has increased staggeringly
in recent years. To reduce this training cost and optimize the cluster-wide
hardware resource usage, we analyze GPU cluster usage statistics from a
well-known research institute. Our study reveals that single-accelerator
training jobs can dominate the cluster-wide resource consumption when launched
repetitively (e.g., for hyper-parameter tuning) while severely underutilizing
the hardware. This is because DL researchers and practitioners often lack the
required expertise to independently optimize their own workloads. Fortunately,
we observe that such workloads have the following unique characteristics: (i)
the models among jobs often have the same types of operators with the same
shapes, and (ii) the inter-model horizontal fusion of such operators is
mathematically equivalent to other already well-optimized operators. Thus, to
help DL researchers and practitioners effectively and easily improve the
hardware utilization of their novel DL training workloads, we propose
Horizontally Fused Training Array (HFTA). HFTA is a new DL framework extension
library that horizontally fuses the models from different repetitive jobs
deeply down to operators, and then trains those models simultaneously on a
shared accelerator. On three emerging DL training workloads and
state-of-the-art accelerators (GPUs and TPUs), HFTA demonstrates strong
effectiveness in squeezing out hardware utilization and achieves up to $15.1
\times$ higher training throughput vs. the standard practice of running each
job on a separate accelerator.
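To make the fusion claim concrete, the sketch below illustrates the kind of equivalence the abstract relies on: K convolutions with identical shapes, one per repetitive job, collapse into a single grouped convolution that produces the same outputs while keeping the accelerator far busier. This is a minimal PyTorch illustration of the idea, not HFTA's actual API; the layer choice, tensor shapes, and the value of K are assumptions made for the example.

```python
import torch
import torch.nn as nn

K = 4                                   # number of repetitive jobs (e.g., hyper-parameter trials)
N, C_in, C_out, H, W = 8, 16, 32, 28, 28

# K independent convolutions with the same operator type and shape,
# one per training job (the under-utilizing baseline).
convs = [nn.Conv2d(C_in, C_out, kernel_size=3, padding=1, bias=False) for _ in range(K)]

# One grouped convolution that horizontally fuses all K of them:
# groups=K keeps each job's channels independent of the others.
fused = nn.Conv2d(K * C_in, K * C_out, kernel_size=3, padding=1, bias=False, groups=K)
with torch.no_grad():
    fused.weight.copy_(torch.cat([c.weight for c in convs], dim=0))

# Each job's input batch, concatenated along the channel dimension.
xs = [torch.randn(N, C_in, H, W) for _ in range(K)]
x_fused = torch.cat(xs, dim=1)

# The fused operator reproduces the K separate outputs (up to float error).
y_separate = torch.cat([conv(x) for conv, x in zip(convs, xs)], dim=1)
y_fused = fused(x_fused)
print(torch.allclose(y_separate, y_fused, atol=1e-5))
```

HFTA automates fusion of this kind across entire models and then trains the fused models simultaneously on one shared accelerator, instead of spreading K mostly idle jobs across K devices.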
Related papers
- Harnessing Manycore Processors with Distributed Memory for Accelerated
Training of Sparse and Recurrent Models [43.1773057439246]
Current AI training infrastructure is dominated by single instruction multiple data (SIMD) and systolic array architectures.
We explore sparse and recurrent model training on a massively parallel multiple instruction multiple data architecture with distributed local memory.
arXiv Detail & Related papers (2023-11-07T23:18:35Z)
- Training Deep Surrogate Models with Large Scale Online Learning [48.7576911714538]
Deep learning algorithms have emerged as a viable alternative for obtaining fast solutions for PDEs.
Models are usually trained on synthetic data generated by solvers, stored on disk and read back for training.
The paper proposes an open-source framework for online training of deep surrogate models.
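For reference, the disk-versus-online contrast can be pictured with a generic loop in which each batch is consumed straight from a solver as it is produced, never touching disk. This is not the paper's framework; the stand-in solver, model, and hyper-parameters below are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy "solver": yields (parameters, solution) pairs on the fly,
# standing in for a numerical PDE solver running alongside training.
def solver_stream(batch_size=32, dim=8):
    while True:
        params = torch.randn(batch_size, dim)
        solution = torch.sin(params).sum(dim=1, keepdim=True)  # placeholder physics
        yield params, solution

surrogate = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(surrogate.parameters(), lr=1e-3)

# Online training: each batch is used as soon as the solver emits it,
# rather than being written to disk and read back later.
stream = solver_stream()
for step in range(1000):
    x, y = next(stream)
    loss = F.mse_loss(surrogate(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```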
arXiv Detail & Related papers (2023-06-28T12:02:27Z)
- Rethinking Closed-loop Training for Autonomous Driving [82.61418945804544]
We present the first empirical study which analyzes the effects of different training benchmark designs on the success of learning agents.
We propose trajectory value learning (TRAVL), an RL-based driving agent that performs planning with multistep look-ahead.
Our experiments show that TRAVL can learn much faster and produce safer maneuvers compared to all the baselines.
arXiv Detail & Related papers (2023-06-27T17:58:39Z)
- RAF: Holistic Compilation for Deep Learning Model Training [17.956035630476173]
In this paper, we present RAF, a deep learning compiler for training.
Unlike existing DLCs, RAF accepts a forward model and generates the training graph in-house.
RAF is able to systematically consolidate graph optimizations for performance, memory and distributed training.
arXiv Detail & Related papers (2023-03-08T17:51:13Z)
- Unifying Synergies between Self-supervised Learning and Dynamic Computation [53.66628188936682]
We present a novel perspective on the interplay between SSL and DC paradigms.
We show that it is feasible to simultaneously learn a dense and a gated sub-network from scratch in an SSL setting.
The co-evolution of the dense and gated encoders during pre-training offers a good accuracy-efficiency trade-off.
arXiv Detail & Related papers (2023-01-22T17:12:58Z)
- DL-DRL: A double-level deep reinforcement learning approach for large-scale task scheduling of multi-UAV [65.07776277630228]
We propose a double-level deep reinforcement learning (DL-DRL) approach based on a divide and conquer framework (DCF).
Particularly, we design an encoder-decoder structured policy network in our upper-level DRL model to allocate the tasks to different UAVs.
We also exploit another attention based policy network in our lower-level DRL model to construct the route for each UAV, with the objective to maximize the number of executed tasks.
arXiv Detail & Related papers (2022-08-04T04:35:53Z)
- Powerpropagation: A sparsity inducing weight reparameterisation [65.85142037667065]
We introduce Powerpropagation, a new weight-parameterisation for neural networks that leads to inherently sparse models.
Models trained in this manner exhibit similar performance, but have a distribution with markedly higher density at zero, allowing more parameters to be pruned safely.
Here, we combine Powerpropagation with a traditional weight-pruning technique as well as recent state-of-the-art sparse-to-sparse algorithms, showing superior performance on the ImageNet benchmark.
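A rough sketch of the reparameterisation idea follows. The summary above does not spell out the exact form, so the power map used here, w = theta * |theta|**(alpha - 1) with alpha > 1, and the magnitude-pruning step are assumptions chosen to illustrate how such a parameterisation concentrates weights near zero and makes pruning safer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PowerpropLinear(nn.Module):
    """Linear layer with a Powerpropagation-style reparameterisation (sketch).

    Effective weight: w = theta * |theta|**(alpha - 1), alpha > 1 (assumed form).
    Gradients through the power map scale with |theta|**(alpha - 1), so small
    parameters receive small updates and the learned weights pile up near zero.
    """

    def __init__(self, in_features, out_features, alpha=2.0):
        super().__init__()
        self.alpha = alpha
        self.theta = nn.Parameter(torch.empty(out_features, in_features))
        self.bias = nn.Parameter(torch.zeros(out_features))
        nn.init.kaiming_uniform_(self.theta, a=5 ** 0.5)

    def effective_weight(self):
        return self.theta * self.theta.abs() ** (self.alpha - 1.0)

    def forward(self, x):
        return F.linear(x, self.effective_weight(), self.bias)

# Combining with traditional magnitude pruning on the *effective* weights:
layer = PowerpropLinear(128, 64, alpha=2.0)
w = layer.effective_weight().detach()
threshold = w.abs().flatten().kthvalue(int(0.9 * w.numel())).values  # 90th-percentile magnitude
mask = (w.abs() >= threshold).float()  # keep roughly the largest 10% of weights
print(f"kept {int(mask.sum())} of {mask.numel()} weights")
```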
arXiv Detail & Related papers (2021-10-01T10:03:57Z)
- Regularizing Generative Adversarial Networks under Limited Data [88.57330330305535]
This work proposes a regularization approach for training robust GAN models on limited data.
We show a connection between the regularized loss and an f-divergence called LeCam-divergence, which we find is more robust under limited training data.
arXiv Detail & Related papers (2021-04-07T17:59:06Z)
- Multi-node Bert-pretraining: Cost-efficient Approach [6.5998084177955425]
Large scale Transformer-based language models have brought about exciting leaps in state-of-the-art results for many Natural Language Processing (NLP) tasks.
With the advent of large-scale unsupervised datasets, training time is further extended due to the increased number of data samples within a single training epoch.
We show that we are able to perform pre-training on BERT within a reasonable time budget (12 days) in an academic setting.
arXiv Detail & Related papers (2020-08-01T05:49:20Z)
- Effective Elastic Scaling of Deep Learning Workloads [3.345876096131764]
We examine the elastic scaling of Deep Learning (DL) jobs over large-scale training platforms.
We propose a novel resource allocation strategy for DL training jobs, resulting in improved job run time performance as well as increased cluster utilization.
arXiv Detail & Related papers (2020-06-24T17:01:09Z)
- Optimizing Memory-Access Patterns for Deep Learning Accelerators [6.931196464448543]
Deep learning (DL) workloads are moving towards accelerators for faster processing and lower cost.
Modern DL accelerators are good at handling the large-scale multiply-accumulate operations that dominate DL workloads.
It is challenging to make full use of the compute power of an accelerator since the data must be properly staged in a software-managed scratchpad memory.
This paper proposes a systematic approach which leverages the polyhedral model to analyze all operators of a DL model together to minimize the number of memory accesses.
arXiv Detail & Related papers (2020-02-27T05:06:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.