Deep Learning Models on CPUs: A Methodology for Efficient Training
- URL: http://arxiv.org/abs/2206.10034v2
- Date: Sun, 18 Jun 2023 17:34:26 GMT
- Title: Deep Learning Models on CPUs: A Methodology for Efficient Training
- Authors: Quchen Fu, Ramesh Chukka, Keith Achorn, Thomas Atta-fosu, Deepak R.
Canchi, Zhongwei Teng, Jules White, and Douglas C. Schmidt
- Abstract summary: This paper makes several contributions to research on training deep learning models using CPUs.
It presents a method for optimizing the training of deep learning models on Intel CPUs and a toolkit called ProfileDNN.
- Score: 1.7150798380270715
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: GPUs have been favored for training deep learning models due to their highly
parallelized architecture. As a result, most studies on training optimization
focus on GPUs. There is often a trade-off, however, between cost and efficiency
when choosing the proper hardware for training. In particular, CPU servers can
be beneficial if training on CPUs were more efficient, as they incur fewer
hardware update costs and make better use of existing infrastructure.
This paper makes several contributions to research on training deep learning
models using CPUs. First, it presents a method for optimizing the training of
deep learning models on Intel CPUs and a toolkit called ProfileDNN, which we
developed to improve performance profiling. Second, we describe a generic
training optimization method that guides our workflow and explores several case
studies where we identified performance issues and then optimized the Intel
Extension for PyTorch, resulting in an overall 2x training performance increase
for the RetinaNet-ResNext50 model. Third, we show how to leverage the
visualization capabilities of ProfileDNN, which enabled us to pinpoint
bottlenecks and create a custom focal loss kernel that was two times faster
than the official reference PyTorch implementation.
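The sketch below is a minimal, hypothetical illustration of the CPU training path the abstract refers to: it applies Intel Extension for PyTorch's ipex.optimize to a model and optimizer, and uses torchvision's sigmoid_focal_loss as a stand-in for the kind of reference focal loss implementation the custom kernel is compared against. The ResNet-50 classifier and synthetic batch are placeholders, not the paper's RetinaNet-ResNext50 setup.

```python
# Minimal sketch of CPU training with Intel Extension for PyTorch (IPEX).
# The model and data below are placeholders, not the paper's detection setup.
import torch
import torchvision
from torchvision.ops import sigmoid_focal_loss
import intel_extension_for_pytorch as ipex

model = torchvision.models.resnet50(num_classes=10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
model.train()

# ipex.optimize applies CPU-oriented optimizations (e.g. operator fusion and
# CPU-friendly weight layouts) to the model and optimizer for training.
model, optimizer = ipex.optimize(model, optimizer=optimizer)

inputs = torch.randn(8, 3, 224, 224)                        # synthetic batch
targets = torch.zeros(8, 10)
targets[torch.arange(8), torch.randint(0, 10, (8,))] = 1.0  # one-hot labels

logits = model(inputs)
# torchvision's reference focal loss; the paper reports a custom kernel
# roughly 2x faster than the official reference implementation.
loss = sigmoid_focal_loss(logits, targets, reduction="mean")
loss.backward()
optimizer.step()
optimizer.zero_grad()
```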
Related papers
- Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures [67.47328776279204]
This work introduces a framework to develop efficient, portable Deep Learning and High Performance Computing kernels.
We decompose kernel development into two steps: 1) expressing the computational core using Tensor Processing Primitives (TPPs) and 2) expressing the logical loops around TPPs in a high-level, declarative fashion.
We demonstrate the efficacy of our approach using standalone kernels and end-to-end workloads that outperform state-of-the-art implementations on diverse CPU platforms.
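As a rough, hypothetical illustration of that two-step decomposition (not the framework's actual API), the sketch below expresses a blocked matrix multiplication as a small computational primitive plus declarative outer loops over tiles; the function names are invented for this example.

```python
# Hypothetical sketch of the "primitive + logical loops" decomposition;
# names like gemm_tpp are illustrative, not the framework's real API.
import numpy as np

def gemm_tpp(a_block, b_block, c_block):
    """Computational core: a small dense matmul on one tile."""
    c_block += a_block @ b_block

def blocked_matmul(a, b, tile=64):
    """Logical loops around the primitive, written as plain nested
    iteration over tiles; a real framework would reorder, parallelize,
    or fuse these loops from a declarative description."""
    m, k = a.shape
    _, n = b.shape
    c = np.zeros((m, n), dtype=a.dtype)
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            for p in range(0, k, tile):
                gemm_tpp(a[i:i+tile, p:p+tile],
                         b[p:p+tile, j:j+tile],
                         c[i:i+tile, j:j+tile])
    return c

a, b = np.random.rand(128, 96), np.random.rand(96, 64)
assert np.allclose(blocked_matmul(a, b), a @ b)
```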
arXiv Detail & Related papers (2023-04-25T05:04:44Z) - RAF: Holistic Compilation for Deep Learning Model Training [17.956035630476173]
In this paper, we present RAF, a deep learning compiler for training.
Unlike existing DLCs, RAF accepts a forward model and generates a training graph in-house.
RAF is able to systematically consolidate graph optimizations for performance, memory and distributed training.
arXiv Detail & Related papers (2023-03-08T17:51:13Z) - Slapo: A Schedule Language for Progressive Optimization of Large Deep
Learning Model Training [17.556432199389615]
Slapo is a schedule language that decouples the execution of a tensor-level operator from its arithmetic definition.
We show that Slapo can improve training throughput by up to 2.92x on a single machine with 8 NVIDIA V100 GPUs.
arXiv Detail & Related papers (2023-02-16T00:34:53Z) - Intelligence Processing Units Accelerate Neuromorphic Learning [52.952192990802345]
Spiking neural networks (SNNs) have achieved orders of magnitude improvement in terms of energy consumption and latency.
We present an IPU-optimized release of our custom SNN Python package, snnTorch.
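snnTorch is an open-source package; as a minimal sketch of its basic usage on a generic backend (the IPU-optimized release itself is not shown here), a leaky integrate-and-fire layer can be stepped over time as follows.

```python
# Minimal sketch of simulating a leaky integrate-and-fire (LIF) layer with
# snnTorch; this illustrates the package, not the IPU-specific release.
import torch
import torch.nn as nn
import snntorch as snn

fc = nn.Linear(784, 100)          # dense layer producing input current
lif = snn.Leaky(beta=0.9)         # LIF neuron with membrane decay beta

x = torch.rand(25, 784)           # 25 time steps of flattened input
mem = lif.init_leaky()            # initial membrane potential
spikes = []

for t in range(x.size(0)):        # unroll the network over time
    cur = fc(x[t].unsqueeze(0))
    spk, mem = lif(cur, mem)      # spike output and updated membrane
    spikes.append(spk)

print(torch.stack(spikes).sum().item())  # total spike count
```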
arXiv Detail & Related papers (2022-11-19T15:44:08Z) - VeLO: Training Versatile Learned Optimizers by Scaling Up [67.90237498659397]
We leverage the same scaling approach behind the success of deep learning to learn versatile optimizers.
We train an optimizer for deep learning which is itself a small neural network that ingests gradients and outputs parameter updates.
We open source our learned optimizers, meta-training code, the associated train and test data, and an extensive benchmark suite with baselines at velo-code.io.
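As a toy illustration of the idea described above (a small network that ingests gradients and outputs parameter updates), and not VeLO's actual architecture or released code, a hand-rolled "learned optimizer" in PyTorch might look like this.

```python
# Toy sketch of a "learned optimizer": a small network maps per-parameter
# gradients to updates. Purely illustrative; this is not VeLO itself.
import torch
import torch.nn as nn

class TinyLearnedOptimizer(nn.Module):
    def __init__(self, hidden=16):
        super().__init__()
        # per-parameter MLP: (gradient, parameter value) -> update
        self.net = nn.Sequential(nn.Linear(2, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def step(self, params):
        for p in params:
            if p.grad is None:
                continue
            feats = torch.stack([p.grad, p.detach()], dim=-1)  # [..., 2]
            update = self.net(feats).squeeze(-1)
            p.data.add_(-0.01 * update.detach())  # apply predicted update

# Usage on a throwaway model; in practice the learned optimizer itself is
# meta-trained over many tasks, which is omitted here.
model = nn.Linear(4, 1)
opt = TinyLearnedOptimizer()
loss = model(torch.randn(8, 4)).pow(2).mean()
loss.backward()
opt.step(list(model.parameters()))
```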
arXiv Detail & Related papers (2022-11-17T18:39:07Z) - Scheduling Optimization Techniques for Neural Network Training [3.1617796705744547]
This paper proposes out-of-order (ooo) backprop, an effective scheduling technique for neural network training.
We show that GPU utilization in single-GPU, data-parallel, and pipeline-parallel training can be commonly improved by applying ooo backprop.
arXiv Detail & Related papers (2021-10-03T05:45:06Z) - Computational Performance Predictions for Deep Neural Network Training:
A Runtime-Based Approach [1.5857983167543392]
We present a new practical technique to help users make informed and cost-efficient GPU selections.
We make predictions by scaling the execution time of each operation in a training iteration from one GPU to another using either (i) wave scaling, a technique based on a GPU's execution model, or (ii) pre-trained multilayer perceptrons.
We implement our technique into a Python library called Surfer and find that it makes accurate iteration execution time predictions on ResNet-50, Inception v3, the Transformer, GNMT, and DCGAN.
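A much-simplified, hypothetical sketch of the cross-GPU scaling idea (not the Surfer library or the paper's wave-scaling model): per-operation times measured on a source GPU are scaled by a compute or bandwidth ratio, depending on which resource bounds the operation.

```python
# Hypothetical, simplified sketch of cross-device runtime scaling; the real
# technique (wave scaling / pre-trained MLPs) is considerably more detailed.
from dataclasses import dataclass

@dataclass
class Device:
    peak_tflops: float      # peak compute throughput
    mem_bw_gbps: float      # peak memory bandwidth

@dataclass
class Op:
    name: str
    time_ms: float          # measured time on the source device
    compute_bound: bool     # rough classification of the bottleneck

def predict_iteration_ms(ops, src: Device, dst: Device) -> float:
    """Scale each op's measured time by the relevant hardware ratio."""
    total = 0.0
    for op in ops:
        ratio = (src.peak_tflops / dst.peak_tflops if op.compute_bound
                 else src.mem_bw_gbps / dst.mem_bw_gbps)
        total += op.time_ms * ratio
    return total

v100 = Device(peak_tflops=15.7, mem_bw_gbps=900)
a100 = Device(peak_tflops=19.5, mem_bw_gbps=1555)
ops = [Op("conv2d", 4.2, True), Op("batch_norm", 0.9, False)]
print(f"predicted iteration: {predict_iteration_ms(ops, v100, a100):.2f} ms")
```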
arXiv Detail & Related papers (2021-01-31T20:17:46Z) - Optimising the Performance of Convolutional Neural Networks across
Computing Systems using Transfer Learning [0.08594140167290096]
We propose to replace a lengthy profiling stage with a machine learning-based approach to performance modeling.
After training, our performance model can estimate the performance of convolutional primitives in any layer configuration.
The time to optimise the execution of large neural networks via primitive selection is reduced from hours to just seconds.
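As a minimal sketch of the general idea (a learned model predicting primitive runtime from the layer configuration, then picking the fastest primitive), and with the model architecture and input features assumed for illustration rather than taken from the paper:

```python
# Illustrative sketch: a small regressor predicts the runtime of each
# convolution primitive from the layer configuration, and the fastest
# predicted primitive is selected without exhaustive profiling.
import torch
import torch.nn as nn

PRIMITIVES = ["direct", "im2col_gemm", "winograd"]  # example choices

class RuntimeModel(nn.Module):
    """Maps (layer config, primitive id) to a predicted runtime in ms."""
    def __init__(self, n_primitives, n_features=6, hidden=32):
        super().__init__()
        self.embed = nn.Embedding(n_primitives, 4)
        self.mlp = nn.Sequential(nn.Linear(n_features + 4, hidden),
                                 nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, config, prim_id):
        x = torch.cat([config, self.embed(prim_id)], dim=-1)
        return self.mlp(x).squeeze(-1)

# Select the primitive with the lowest predicted runtime for one layer.
# Config features (assumed): [C_in, C_out, H, W, kernel, stride], normalized.
model = RuntimeModel(len(PRIMITIVES))          # would be trained offline
config = torch.tensor([[64., 128., 56., 56., 3., 1.]]) / 128.0
times = [model(config, torch.tensor([i])) for i in range(len(PRIMITIVES))]
best = min(range(len(PRIMITIVES)), key=lambda i: times[i].item())
print("selected primitive:", PRIMITIVES[best])
```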
arXiv Detail & Related papers (2020-10-20T20:58:27Z) - Tasks, stability, architecture, and compute: Training more effective
learned optimizers, and using them to train themselves [53.37905268850274]
We introduce a new, neural network parameterized, hierarchical optimizer with access to additional features such as validation loss to enable automatic regularization.
Most learned optimizers have been trained on only a single task or a small number of tasks.
We train ours on thousands of tasks, making use of orders of magnitude more compute, resulting in optimizers that generalize better to unseen tasks.
arXiv Detail & Related papers (2020-09-23T16:35:09Z) - Optimizing Memory Placement using Evolutionary Graph Reinforcement
Learning [56.83172249278467]
We introduce Evolutionary Graph Reinforcement Learning (EGRL), a method designed for large search spaces.
We train and validate our approach directly on the Intel NNP-I chip for inference.
We additionally achieve 28-78% speed-up compared to the native NNP-I compiler on all three workloads.
arXiv Detail & Related papers (2020-07-14T18:50:12Z) - PolyScientist: Automatic Loop Transformations Combined with Microkernels
for Optimization of Deep Learning Primitives [55.79741270235602]
We develop a hybrid approach to developing deep learning kernels.
We use the advanced polyhedral technology to automatically tune the outer loops for performance.
arXiv Detail & Related papers (2020-02-06T08:02:34Z)