RAF: Holistic Compilation for Deep Learning Model Training
- URL: http://arxiv.org/abs/2303.04759v1
- Date: Wed, 8 Mar 2023 17:51:13 GMT
- Title: RAF: Holistic Compilation for Deep Learning Model Training
- Authors: Cody Hao Yu, Haozheng Fan, Guangtai Huang, Zhen Jia, Yizhi Liu, Jie
Wang, Zach Zheng, Yuan Zhou, Haichen Shen, Junru Shao, Mu Li, Yida Wang
- Abstract summary: In this paper, we present RAF, a deep learning compiler for training.
Unlike existing DLCs, RAF accepts a forward model and generates the training graph in house.
RAF is able to systematically consolidate graph optimizations for performance, memory, and distributed training.
- Score: 17.956035630476173
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As deep learning becomes pervasive in modern applications, many deep learning
frameworks have been developed so that practitioners can build and train
DNN models rapidly. Meanwhile, as training large deep learning models has become a
trend in recent years, training throughput and memory footprint are becoming
crucial. Accordingly, optimizing training workloads with compiler techniques is
inevitable and is attracting more and more attention. However, existing deep
learning compilers (DLCs) mainly target inference and do not incorporate
holistic optimizations, such as automatic differentiation and automatic mixed
precision, for training workloads.
In this paper, we present RAF, a deep learning compiler for training. Unlike
existing DLCs, RAF accepts a forward model and generates the training graph
in house. Accordingly, RAF is able to systematically consolidate graph
optimizations for performance, memory, and distributed training. In addition, to
match the state-of-the-art performance of hand-crafted kernel libraries as well
as tensor compilers, RAF proposes an operator dialect mechanism that seamlessly
integrates all available kernel implementations. We demonstrate that, with
in-house training graph generation and the operator dialect mechanism, RAF can
perform holistic optimizations and achieve either better training throughput or
larger batch sizes than PyTorch (eager and TorchScript modes), XLA, and
DeepSpeed for popular transformer models on GPUs.
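
To make the abstract's central idea concrete, the toy sketch below shows what it means to accept a forward-only model definition and derive the training graph (forward, backward, and weight update) automatically. It is a minimal reverse-mode autodiff over a tiny expression graph, written purely as a conceptual illustration; none of the names (`Node`, `matmul`, `backward`) correspond to RAF's actual API.

```python
# Conceptual illustration only -- NOT RAF's actual API. It shows the idea of
# taking a forward-only model definition and deriving the training graph
# (forward + backward + weight update) from it automatically.
import numpy as np

class Node:
    """A node in a tiny expression graph: stores the forward value, its parents,
    and one backward rule per parent (grad_out -> grad_parent)."""
    def __init__(self, value, parents=(), backward_rules=()):
        self.value = np.asarray(value, dtype=float)
        self.parents = parents
        self.backward_rules = backward_rules
        self.grad = np.zeros_like(self.value)

def matmul(a, b):
    return Node(a.value @ b.value, (a, b),
                (lambda g: g @ b.value.T, lambda g: a.value.T @ g))

def relu(x):
    mask = (x.value > 0).astype(float)
    return Node(x.value * mask, (x,), (lambda g: g * mask,))

def mse_loss(pred, target):
    diff = pred.value - target
    return Node((diff ** 2).mean(), (pred,),
                (lambda g: g * 2.0 * diff / diff.size,))

def backward(loss):
    """Derive the backward half of the graph: topological order, then reverse sweep."""
    order, seen = [], set()
    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for p in node.parents:
                visit(p)
            order.append(node)
    visit(loss)
    loss.grad = np.ones_like(loss.value)
    for node in reversed(order):
        for parent, rule in zip(node.parents, node.backward_rules):
            parent.grad = parent.grad + rule(node.grad)

# Forward-only "model" (what a user would hand to a training compiler).
rng = np.random.default_rng(0)
x = Node(rng.standard_normal((4, 8)))
w = Node(rng.standard_normal((8, 2)) * 0.1)
target = rng.standard_normal((4, 2))

loss = mse_loss(relu(matmul(x, w)), target)
backward(loss)                  # automatic differentiation over the graph
w.value -= 0.01 * w.grad        # SGD update completes one training step
```

Because the backward half is produced inside the compiler rather than by the framework at runtime, later compiler passes can see and optimize the entire training graph, which is what the abstract means by consolidating graph optimizations for performance, memory, and distributed training.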
Related papers
- CoMERA: Computing- and Memory-Efficient Training via Rank-Adaptive Tensor Optimization [10.319009303849109]
Training large AI models such as deep learning recommendation systems and foundation language (or multi-modal) models consumes massive GPU resources and computing time.
CoMERA achieves end-to-end rank-adaptive tensor-compressed training via a multi-objective optimization formulation.
CoMERA is $2\times$ faster per training epoch and $9\times$ more memory-efficient than GaLore on a tested six-encoder transformer with single-batch training (see the simplified sketch below).
arXiv Detail & Related papers (2024-05-23T09:52:15Z)
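
CoMERA keeps weights in compressed (factored) form throughout training rather than compressing them afterwards. The sketch below is a heavily simplified stand-in: it uses a fixed-rank matrix factorization and plain SGD instead of CoMERA's rank-adaptive tensor factorizations and multi-objective formulation, and every name in it (`U`, `V`, `rank`) is illustrative only.

```python
# Simplified sketch of training a layer directly in compressed (low-rank) form,
# in the spirit of CoMERA's tensor-compressed training. The weight W is never
# materialized; only the thin factors U and V are stored and updated, which is
# where the memory savings come from.
import numpy as np

d_in, d_out, rank = 1024, 1024, 16             # rank << d_in, d_out
rng = np.random.default_rng(0)
U = rng.standard_normal((d_in, rank)) * 0.02   # (d_in x rank) factor
V = rng.standard_normal((rank, d_out)) * 0.02  # (rank x d_out) factor

x = rng.standard_normal((32, d_in))            # a batch of activations
target = rng.standard_normal((32, d_out))

lr = 1e-2
for step in range(100):
    h = x @ U                                  # (32 x rank) intermediate
    y = h @ V                                  # (32 x d_out) output
    grad_y = 2.0 * (y - target) / y.size       # gradient of mean squared error
    grad_V = h.T @ grad_y                      # backprop through the two factors
    grad_U = x.T @ (grad_y @ V.T)
    U -= lr * grad_U
    V -= lr * grad_V

full_params = d_in * d_out
compressed_params = rank * (d_in + d_out)
print(f"parameter reduction: {full_params / compressed_params:.1f}x")
```

The rank-adaptive part of CoMERA additionally adjusts the ranks during training, which this fixed-rank sketch does not model.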
- Always-Sparse Training by Growing Connections with Guided Stochastic Exploration [46.4179239171213]
We propose an efficient always-sparse training algorithm with excellent scaling to larger and sparser models.
We evaluate our method on CIFAR-10/100 and ImageNet using VGG and ViT models, and compare it against a range of sparsification methods.
arXiv Detail & Related papers (2024-01-12T21:32:04Z)
- PILOT: A Pre-Trained Model-Based Continual Learning Toolbox [71.63186089279218]
This paper introduces a pre-trained model-based continual learning toolbox known as PILOT.
On the one hand, PILOT implements some state-of-the-art class-incremental learning algorithms based on pre-trained models, such as L2P, DualPrompt, and CODA-Prompt.
On the other hand, PILOT fits typical class-incremental learning algorithms within the context of pre-trained models to evaluate their effectiveness.
arXiv Detail & Related papers (2023-09-13T17:55:11Z)
- Slapo: A Schedule Language for Progressive Optimization of Large Deep Learning Model Training [17.556432199389615]
Slapo is a schedule language that decouples the execution of a tensor-level operator from its arithmetic definition.
We show that Slapo can improve training throughput by up to 2.92x on a single machine with 8 NVIDIA V100 GPUs.
arXiv Detail & Related papers (2023-02-16T00:34:53Z)
- Deep Learning Models on CPUs: A Methodology for Efficient Training [1.7150798380270715]
This paper makes several contributions to research on training deep learning models using CPUs.
It presents a method for optimizing the training of deep learning models on Intel CPUs and a toolkit called ProfileDNN.
arXiv Detail & Related papers (2022-06-20T22:42:14Z)
- Training Efficiency and Robustness in Deep Learning [2.6451769337566406]
We study approaches to improve the training efficiency and robustness of deep learning models.
We find that prioritizing learning on more informative training data increases convergence speed and improves generalization performance on test data.
We show that a redundancy-aware modification to the sampling of training data improves training speed, and we develop an efficient method for detecting the diversity of the training signal.
arXiv Detail & Related papers (2021-12-02T17:11:33Z)
- LCS: Learning Compressible Subspaces for Adaptive Network Compression at Inference Time [57.52251547365967]
We propose a method for training a "compressible subspace" of neural networks that contains a fine-grained spectrum of models.
We present results for achieving arbitrarily fine-grained accuracy-efficiency trade-offs at inference time for structured and unstructured sparsity.
Our algorithm extends to quantization at variable bit widths, achieving accuracy on par with individually trained networks.
arXiv Detail & Related papers (2021-10-08T17:03:34Z)
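
LCS's "compressible subspace" idea, a single trained object exposing a spectrum of accuracy-efficiency trade-offs at inference time, can be illustrated with a toy weight line whose interpolation coefficient also selects the compression level. The sketch below uses random stand-in endpoint weights (`w_lo`, `w_hi`) and simple magnitude pruning; it is not the paper's training procedure, which learns the subspace endpoints jointly under compression.

```python
# Toy illustration in the spirit of LCS: a line in weight space where the
# interpolation coefficient alpha also picks the sparsity level, so one object
# yields many models at inference time. Endpoints here are random placeholders.
import numpy as np

rng = np.random.default_rng(0)
w_lo = rng.standard_normal((512, 512))   # stand-in endpoint aimed at low sparsity
w_hi = rng.standard_normal((512, 512))   # stand-in endpoint aimed at high sparsity

def model_at(alpha, max_sparsity=0.9):
    """Pick a point on the weight line and prune it to a matching sparsity level."""
    w = (1.0 - alpha) * w_lo + alpha * w_hi          # linear subspace of weights
    sparsity = alpha * max_sparsity                  # compress more as alpha grows
    k = int(sparsity * w.size)
    if k > 0:
        threshold = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
        w = np.where(np.abs(w) <= threshold, 0.0, w) # drop smallest-magnitude weights
    return w

for alpha in (0.0, 0.5, 1.0):
    w = model_at(alpha)
    print(f"alpha={alpha:.1f}: zero fraction = {(w == 0).mean():.2f}")
```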
- M6-10T: A Sharing-Delinking Paradigm for Efficient Multi-Trillion Parameter Pretraining [55.16088793437898]
Training extreme-scale models requires an enormous amount of compute and memory.
We propose a simple training strategy called "Pseudo-to-Real" for large models that require a high memory footprint.
arXiv Detail & Related papers (2021-10-08T04:24:51Z)
- Simultaneous Training of Partially Masked Neural Networks [67.19481956584465]
We show that it is possible to train neural networks in such a way that a predefined 'core' subnetwork can be split off from the trained full network with remarkably good performance.
We show that training a Transformer with a low-rank core yields a low-rank model that performs better than training the low-rank model alone.
arXiv Detail & Related papers (2021-06-07T11:13:05Z)
- Top-KAST: Top-K Always Sparse Training [50.05611544535801]
We propose Top-KAST, a method that preserves constant sparsity throughout training.
We show that it performs comparably to or better than previous works when training models on the established ImageNet benchmark.
In addition to our ImageNet results, we also demonstrate our approach in the domain of language modeling.
arXiv Detail & Related papers (2021-06-07T11:13:05Z)
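
Top-KAST's constant-sparsity mechanism, selecting the largest-magnitude weights to participate in each step, can be sketched as below. This is a minimal illustration with a synthetic gradient placeholder; the full method additionally maintains a larger parameter set for the backward pass, which is omitted here.

```python
# Simplified sketch of top-K magnitude masking in the spirit of Top-KAST:
# a constant fraction of weights stays active at every training step.
import numpy as np

def topk_mask(weights, density):
    """Return a 0/1 mask keeping the `density` fraction of largest-magnitude weights."""
    k = max(1, int(density * weights.size))
    threshold = np.partition(np.abs(weights).ravel(), -k)[-k]
    return (np.abs(weights) >= threshold).astype(weights.dtype)

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256))

density = 0.1                      # keep 10% of weights active
for step in range(5):
    mask = topk_mask(W, density)   # recomputed every step: sparsity stays constant
    W_sparse = W * mask            # only these weights participate in the forward pass
    # ... a real forward/backward pass would use W_sparse here ...
    fake_grad = rng.standard_normal(W.shape) * mask   # illustrative gradient on the active set
    W -= 1e-2 * fake_grad
    print(f"step {step}: active fraction = {mask.mean():.3f}")
```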
This list is automatically generated from the titles and abstracts of the papers on this site.