Heterogeneous CPU+GPU Stochastic Gradient Descent Algorithms
- URL: http://arxiv.org/abs/2004.08771v1
- Date: Sun, 19 Apr 2020 05:21:20 GMT
- Title: Heterogeneous CPU+GPU Stochastic Gradient Descent Algorithms
- Authors: Yujing Ma and Florin Rusu
- Abstract summary: We study training algorithms for deep learning on heterogeneous CPU+GPU architectures.
Our two-fold objective -- maximize convergence rate and resource utilization simultaneously -- makes the problem challenging.
We show that the implementation of these algorithms achieves both faster convergence and higher resource utilization than TensorFlow on several real datasets.
- Score: 1.3249453757295084
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The widely-adopted practice is to train deep learning models with specialized
hardware accelerators, e.g., GPUs or TPUs, due to their superior performance on
linear algebra operations. However, this strategy does not effectively employ
the extensive CPU and memory resources -- which are used only for
preprocessing, data transfer, and scheduling -- available by default on the
accelerated servers. In this paper, we study training algorithms for deep
learning on heterogeneous CPU+GPU architectures. Our two-fold objective --
maximize convergence rate and resource utilization simultaneously -- makes the
problem challenging. In order to allow for a principled exploration of the
design space, we first introduce a generic deep learning framework that
exploits the difference in computational power and memory hierarchy between CPU
and GPU through asynchronous message passing. Based on insights gained through
experimentation with the framework, we design two heterogeneous asynchronous
stochastic gradient descent (SGD) algorithms. The first algorithm -- CPU+GPU
Hogbatch -- combines small batches on CPU with large batches on GPU in order to
maximize the utilization of both resources. However, this generates an
unbalanced model update distribution which hinders the statistical convergence.
The second algorithm -- Adaptive Hogbatch -- assigns batches with continuously
evolving size based on the relative speed of CPU and GPU. This balances the
model update ratio at the expense of a customizable decrease in utilization.
We show that the implementation of these algorithms in the proposed CPU+GPU
framework achieves both faster convergence and higher resource utilization than
TensorFlow on several real datasets and on two computing architectures -- an
on-premises server and a cloud instance.
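Below is a minimal sketch of the two Hogbatch variants described above, assuming only what the abstract states: two asynchronous workers stand in for the CPU and the GPU and apply Hogwild-style lock-free SGD updates to a shared logistic-regression model, while a coordinator periodically resizes the GPU batch based on observed update rates. All names, constants, and the rebalancing rule here are illustrative assumptions, not the authors' implementation.
```python
# Sketch of the Hogbatch idea (illustrative only, not the paper's code):
# asynchronous workers share one model and update it without locks, while
# a coordinator rebalances batch sizes from measured update rates.
import threading
import time

import numpy as np

rng = np.random.default_rng(0)
d = 20
w_true = rng.normal(size=d)
X = rng.normal(size=(4000, d))
y = (X @ w_true > 0).astype(float)

w = np.zeros(d)                    # shared model, updated without locks
batch = {"cpu": 32, "gpu": 1024}   # CPU+GPU Hogbatch: small CPU / large GPU batches
updates = {"cpu": 0, "gpu": 0}     # update counts per worker, per window
stop = threading.Event()

def worker(name, cost_per_1k, lr=0.05):
    """Asynchronous minibatch SGD; the sleep mimics per-batch compute cost."""
    local_rng = np.random.default_rng(hash(name) % 2**32)
    while not stop.is_set():
        b = batch[name]
        idx = local_rng.integers(0, len(X), size=b)
        Xb, yb = X[idx], y[idx]
        p = 1.0 / (1.0 + np.exp(-(Xb @ w)))   # sigmoid predictions
        grad = Xb.T @ (p - yb) / b            # logistic-loss gradient
        w[:] -= lr * grad                     # lock-free shared update
        updates[name] += 1
        time.sleep(cost_per_1k * b / 1024)    # larger batches take longer

threads = [threading.Thread(target=worker, args=("cpu", 0.02)),
           threading.Thread(target=worker, args=("gpu", 0.002))]
for t in threads:
    t.start()

# Adaptive Hogbatch flavor: each window, resize the GPU batch so the
# CPU-to-GPU model-update ratio drifts toward 1.
for _ in range(8):
    updates["cpu"] = updates["gpu"] = 0
    time.sleep(0.5)
    ratio = (updates["cpu"] + 1) / (updates["gpu"] + 1)
    batch["gpu"] = int(np.clip(batch["gpu"] / ratio, 128, 4096))

stop.set()
for t in threads:
    t.join()
acc = (((X @ w) > 0) == y).mean()
print(f"accuracy={acc:.3f}  updates/window={updates}  gpu batch={batch['gpu']}")
```
The fixed asymmetric batch sizes correspond to CPU+GPU Hogbatch; the periodic resizing loop mirrors the Adaptive Hogbatch goal of keeping the CPU-to-GPU model-update ratio near one.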
Related papers
- Benchmarking Edge AI Platforms for High-Performance ML Inference [0.0]
Edge computing's growing prominence, due to its ability to reduce communication latency and enable real-time processing, is promoting the rise of high-performance, heterogeneous System-on-Chip solutions.
While current approaches often involve scaling down modern hardware, the performance characteristics of neural network workloads can vary significantly.
We compare the latency and throughput of various linear algebra and neural network inference tasks across CPU-only, CPU/GPU, and CPU/NPU integrated solutions.
arXiv Detail & Related papers (2024-09-23T08:27:27Z)
- Hierarchical Resource Partitioning on Modern GPUs: A Reinforcement Learning Approach [1.076745840431781]
We propose a method for comprehensively co-optimizing the setup of hierarchical partitioning and the selection of co-scheduling groups from a given set of jobs.
This results in a maximum throughput improvement by a factor of 1.87 compared to time-sharing scheduling.
arXiv Detail & Related papers (2024-05-14T16:40:06Z)
- High Performance Computing Applied to Logistic Regression: A CPU and GPU Implementation Comparison [0.0]
We present a versatile GPU-based parallel version of Logistic Regression (LR).
Our implementation is a direct translation of the parallel Gradient Descent Logistic Regression algorithm proposed by X. Zou et al.
Our method is particularly advantageous for real-time prediction applications like image recognition, spam detection, and fraud detection.
arXiv Detail & Related papers (2023-08-19T14:49:37Z)
- Learning representations by forward-propagating errors [0.0]
Back-propagation (BP) is a widely used learning algorithm for neural network optimization.
Current neural network optimization is performed on graphics processing units (GPUs) with Compute Unified Device Architecture (CUDA) programming.
In this paper, we propose a lightweight, fast learning algorithm on CPU that is as fast as GPU acceleration.
arXiv Detail & Related papers (2023-08-17T13:56:26Z)
- INR-Arch: A Dataflow Architecture and Compiler for Arbitrary-Order Gradient Computations in Implicit Neural Representation Processing [66.00729477511219]
Given a function represented as a computation graph, traditional architectures face challenges in efficiently computing its nth-order gradient.
We introduce INR-Arch, a framework that transforms the computation graph of an nth-order gradient into a hardware-optimized dataflow architecture.
We present results that demonstrate 1.8-4.8x and 1.5-3.6x speedup compared to CPU and GPU baselines respectively.
arXiv Detail & Related papers (2023-08-11T04:24:39Z)
- Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures [67.47328776279204]
This work introduces a framework to develop efficient, portable Deep Learning and High Performance Computing kernels.
We decompose kernel development into two steps: 1) expressing the computational core using Tensor Processing Primitives (TPPs) and 2) expressing the logical loops around TPPs in a high-level, declarative fashion.
We demonstrate the efficacy of our approach using standalone kernels and end-to-end workloads that outperform state-of-the-art implementations on diverse CPU platforms.
arXiv Detail & Related papers (2023-04-25T05:04:44Z)
- Communication-Efficient TeraByte-Scale Model Training Framework for Online Advertising [32.5337643852876]
Click-Through Rate (CTR) prediction is a crucial component in the online advertising industry.
We identify two major challenges in existing GPU training for massive-scale ad models.
We propose a hardware-aware training workflow that couples the hardware topology into the algorithm design.
arXiv Detail & Related papers (2022-01-05T18:09:11Z)
- Adaptive Elastic Training for Sparse Deep Learning on Heterogeneous Multi-GPU Servers [65.60007071024629]
We show experimentally that Adaptive SGD outperforms four state-of-the-art solutions in time-to-accuracy.
arXiv Detail & Related papers (2021-10-13T20:58:15Z)
- Providing Meaningful Data Summarizations Using Examplar-based Clustering in Industry 4.0 [67.80123919697971]
We show, that our GPU implementation provides speedups of up to 72x using single-precision and up to 452x using half-precision compared to conventional CPU algorithms.
We apply our algorithm to real-world data from injection molding manufacturing processes and discuss how found summaries help with steering this specific process to cut costs and reduce the manufacturing of bad parts.
arXiv Detail & Related papers (2021-05-25T15:55:14Z)
- Kernel methods through the roof: handling billions of points efficiently [94.31450736250918]
Kernel methods provide an elegant and principled approach to nonparametric learning, but so far could hardly be used in large-scale problems.
Recent advances have shown the benefits of a number of algorithmic ideas, for example combining optimization, numerical linear algebra and random projections.
Here, we push these efforts further to develop and test a solver that takes full advantage of GPU hardware.
arXiv Detail & Related papers (2020-06-18T08:16:25Z)
- MPLP++: Fast, Parallel Dual Block-Coordinate Ascent for Dense Graphical Models [96.1052289276254]
This work introduces a new MAP-solver, based on the popular Dual Block-Coordinate Ascent principle.
Surprisingly, by making a small change to the low-performing solver, we derive the new solver MPLP++ that significantly outperforms all existing solvers by a large margin.
arXiv Detail & Related papers (2020-04-16T16:20:53Z)