Project CGX: Scalable Deep Learning on Commodity GPUs
- URL: http://arxiv.org/abs/2111.08617v2
- Date: Wed, 17 Nov 2021 14:00:02 GMT
- Title: Project CGX: Scalable Deep Learning on Commodity GPUs
- Authors: Ilia Markov, Hamidreza Ramezanikebrya, Dan Alistarh
- Abstract summary: This paper investigates whether hardware overprovisioning can be supplanted via algorithmic and system design.
We propose a framework called CGX, which provides efficient software support for communication compression.
We show that this framework is able to remove communication bottlenecks from consumer-grade multi-GPU systems.
- Score: 17.116792714097738
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The ability to scale out training workloads has been one of the key
performance enablers of deep learning. The main scaling approach is
data-parallel GPU-based training, which has been boosted by hardware and
software support for highly efficient inter-GPU communication, in particular
via bandwidth overprovisioning. This support comes at a price: there is an
order of magnitude cost difference between "cloud-grade" servers with such
support, relative to their "consumer-grade" counterparts, although server-grade
and consumer-grade GPUs can have similar computational envelopes. In this
paper, we investigate whether the expensive hardware overprovisioning approach
can be supplanted via algorithmic and system design, and propose a framework
called CGX, which provides efficient software support for communication
compression. We show that this framework is able to remove communication
bottlenecks from consumer-grade multi-GPU systems, in the absence of hardware
support: when training modern models and tasks to full accuracy, our framework
enables self-speedups of 2-3X on a commodity system using 8 consumer-grade
NVIDIA RTX 3090 GPUs, and enables it to surpass the throughput of an NVIDIA
DGX-1 server, which has similar peak FLOPS but benefits from bandwidth
overprovisioning.
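The core idea behind frameworks like CGX — compress gradients before the all-reduce so that a cheap interconnect stops being the bottleneck — can be illustrated with simple uniform quantization. The sketch below is illustrative only: the function names and the 4-bit scheme are assumptions for exposition, not CGX's actual API or compression algorithm.

```python
# Sketch: uniform gradient quantization of the kind communication-
# compression frameworks apply before an all-reduce. Illustrative only.
import numpy as np

np.random.seed(0)

def quantize(grad: np.ndarray, bits: int = 4):
    """Map float32 gradients to `bits`-bit integer levels plus a scale/offset."""
    levels = 2 ** bits - 1
    lo, hi = grad.min(), grad.max()
    scale = (hi - lo) / levels if hi > lo else 1.0
    q = np.round((grad - lo) / scale).astype(np.uint8)
    return q, lo, scale

def dequantize(q: np.ndarray, lo: float, scale: float) -> np.ndarray:
    """Reconstruct approximate gradients on the receiving side."""
    return q.astype(np.float32) * scale + lo

grad = np.random.randn(1024).astype(np.float32)
q, lo, scale = quantize(grad, bits=4)
restored = dequantize(q, lo, scale)
# 4-bit levels cut the payload roughly 8x vs float32 (ignoring the two
# scalars); round-to-nearest bounds the error by half a quantization step.
```

In a real data-parallel setup, only `q` plus the two scalars would travel over the wire; each worker dequantizes before applying the averaged update.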
Related papers
- Exploring GPU-to-GPU Communication: Insights into Supercomputer Interconnects [15.145701300309337]
This paper characterizes three supercomputers - Alps, Leonardo, and LUMI - each with a unique architecture and design.
We focus on performance evaluation of intra-node and inter-node interconnects on up to 4096 GPUs, using a mix of intra-node and inter-node benchmarks.
Our results show that there is untapped bandwidth, and there are still many opportunities for optimization.
arXiv Detail & Related papers (2024-08-26T08:20:50Z)
- PockEngine: Sparse and Efficient Fine-tuning in a Pocket [62.955793932377524]
We introduce PockEngine: a tiny, sparse and efficient engine to enable fine-tuning on various edge devices.
PockEngine supports sparse backpropagation and sparsely updates the model with measured memory saving and latency reduction.
Remarkably, PockEngine enables fine-tuning LLaMav2-7B on NVIDIA Jetson AGX Orin at 550 tokens/s, 7.9× faster than PyTorch.
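The sparse-update idea behind such engines — apply the optimizer step to only a small, high-magnitude subset of gradient entries — can be sketched as follows. The function name, the top-k selection rule, and the density parameter are illustrative assumptions, not PockEngine's actual implementation.

```python
# Sketch: a sparse SGD step that touches only the largest-magnitude
# gradient entries, shrinking update cost and memory traffic.
import numpy as np

def sparse_sgd_step(w, grad, lr=0.1, density=0.25):
    """Apply an SGD update to only the top `density` fraction of entries."""
    k = max(1, int(density * grad.size))
    idx = np.argpartition(np.abs(grad), -k)[-k:]   # indices of top-k |grad|
    w = w.copy()
    w[idx] -= lr * grad[idx]                       # update only k weights
    return w, idx

w = np.zeros(8)
grad = np.array([0.1, -2.0, 0.0, 0.5, 0.0, 3.0, 0.0, -0.2])
w2, idx = sparse_sgd_step(w, grad, lr=0.1, density=0.25)
# only 2 of 8 weights change: those with |grad| = 2.0 and 3.0
```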
arXiv Detail & Related papers (2023-10-26T19:46:11Z)
- FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs [57.12856172329322]
We envision a decentralized system that unlocks the vast, untapped potential of consumer-level GPUs.
This system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, and the heterogeneity and variability of peers and devices.
arXiv Detail & Related papers (2023-09-03T13:27:56Z)
- INR-Arch: A Dataflow Architecture and Compiler for Arbitrary-Order Gradient Computations in Implicit Neural Representation Processing [66.00729477511219]
Given a function represented as a computation graph, traditional architectures face challenges in efficiently computing its nth-order gradient.
We introduce INR-Arch, a framework that transforms the computation graph of an nth-order gradient into a hardware-optimized dataflow architecture.
We present results that demonstrate 1.8-4.8x and 1.5-3.6x speedup compared to CPU and GPU baselines respectively.
arXiv Detail & Related papers (2023-08-11T04:24:39Z)
- Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures [67.47328776279204]
This work introduces a framework to develop efficient, portable Deep Learning and High Performance Computing kernels.
We decompose kernel development into two steps: 1) expressing the computational core using Tensor Processing Primitives (TPPs) and 2) expressing the logical loops around TPPs in a high-level, declarative fashion.
We demonstrate the efficacy of our approach using standalone kernels and end-to-end workloads that outperform state-of-the-art implementations on diverse CPU platforms.
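The two-step decomposition above — a small, fixed-size compute primitive wrapped in high-level logical loops — can be sketched in miniature. The tile size, function names, and blocked-matmul choice are illustrative assumptions, not the paper's actual TPP interface.

```python
# Sketch: a "processing primitive" (a tiny fixed-size matmul tile)
# plus declarative outer loops that map it over blocks of the problem.
import numpy as np

TILE = 2

def matmul_tile(A_tile, B_tile, C_tile):
    """The primitive: all arithmetic lives here, on fixed-size tiles."""
    C_tile += A_tile @ B_tile   # += writes through the view into C

def matmul(A, B):
    """The logical loops: iterate tiles, delegating math to the primitive."""
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m))
    for i in range(0, n, TILE):
        for j in range(0, m, TILE):
            for p in range(0, k, TILE):
                matmul_tile(A[i:i+TILE, p:p+TILE],
                            B[p:p+TILE, j:j+TILE],
                            C[i:i+TILE, j:j+TILE])
    return C

A = np.arange(16.0).reshape(4, 4)
B = np.eye(4)
C = matmul(A, B)
```

Separating the loop nest from the primitive is what lets the same computational core be retargeted or retuned per platform without touching the surrounding iteration logic.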
arXiv Detail & Related papers (2023-04-25T05:04:44Z)
- A Frequency-aware Software Cache for Large Recommendation System Embeddings [11.873521953539361]
Deep learning recommendation models (DLRMs) have been widely applied in Internet companies.
We propose a GPU-based software cache approach to dynamically manage the embedding table across CPU and GPU memory.
Our proposed software cache efficiently trains entire DLRMs on the GPU with synchronized updates.
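A frequency-aware software cache of this kind can be sketched as a small hot set in device memory backed by the full table in host memory, with access counts driving eviction. The class name, eviction policy, and dict-based stores below are illustrative assumptions, not the paper's implementation.

```python
# Sketch: hot embedding rows stay in a small "GPU" store; cold rows
# fall back to "CPU" host memory; least-frequently-used rows are evicted.
from collections import Counter

class FreqCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.gpu = {}            # id -> embedding row (hot set, device)
        self.cpu = {}            # id -> embedding row (full table, host)
        self.freq = Counter()    # access counts drive eviction

    def put(self, key, row):
        self.cpu[key] = row      # the full table always lives on the host

    def get(self, key):
        self.freq[key] += 1
        if key in self.gpu:
            return self.gpu[key]             # hit: served from device
        row = self.cpu[key]                  # miss: fetch from host
        if len(self.gpu) >= self.capacity:   # evict least-frequent resident
            victim = min(self.gpu, key=lambda k: self.freq[k])
            del self.gpu[victim]
        self.gpu[key] = row
        return row

cache = FreqCache(capacity=2)
for i in range(4):
    cache.put(i, [float(i)] * 3)
for key in [0, 0, 0, 1, 2]:       # id 0 is hot; ids 1 and 2 are cold
    cache.get(key)
# after these accesses, the hot id 0 remains resident on the "GPU"
```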
arXiv Detail & Related papers (2022-08-08T12:08:05Z)
- PARIS and ELSA: An Elastic Scheduling Algorithm for Reconfigurable Multi-GPU Inference Servers [0.9854614058492648]
NVIDIA's Ampere GPU architecture provides features to "reconfigure" one large, monolithic GPU into multiple smaller "GPU partitions".
In this paper, we study this emerging GPU architecture with reconfigurability to develop a high-performance multi-GPU ML inference server.
arXiv Detail & Related papers (2022-02-27T23:30:55Z)
- PLSSVM: A (multi-)GPGPU-accelerated Least Squares Support Vector Machine [68.8204255655161]
Support Vector Machines (SVMs) are widely used in machine learning.
However, even modern and optimized implementations do not scale well for large non-trivial dense data sets on cutting-edge hardware.
PLSSVM can be used as a drop-in replacement for LIBSVM.
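The least-squares SVM formulation that such solvers accelerate replaces the usual QP with a single linear system over the kernel matrix. The sketch below shows that system for a linear kernel; the function names and the tiny dataset are illustrative assumptions, not PLSSVM's API.

```python
# Sketch: LS-SVM classification reduces to solving the bordered
# linear system [[0, 1^T], [1, K + I/gamma]] [b; alpha] = [0; y].
import numpy as np

def lssvm_fit(X, y, gamma=1.0):
    """Solve the LS-SVM dual system; returns bias b and dual weights alpha."""
    n = len(y)
    K = X @ X.T                        # linear kernel for brevity
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = K + np.eye(n) / gamma  # ridge term makes the system solvable
    rhs = np.concatenate(([0.0], y))
    sol = np.linalg.solve(A, rhs)
    return sol[0], sol[1:]

def lssvm_predict(X_train, alpha, b, X):
    """Decision function: f(x) = sum_i alpha_i k(x, x_i) + b."""
    return np.sign(X @ X_train.T @ alpha + b)

# Tiny linearly separable problem:
X = np.array([[2.0, 0.0], [1.5, 0.5], [-2.0, 0.0], [-1.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
b, alpha = lssvm_fit(X, y, gamma=10.0)
pred = lssvm_predict(X, alpha, b, X)
```

Because training is one dense linear solve rather than an iterative QP, it maps naturally onto GPU linear-algebra kernels, which is what makes large dense datasets tractable.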
arXiv Detail & Related papers (2022-02-25T13:24:23Z)
- GPU Domain Specialization via Composable On-Package Architecture [0.8240720472180706]
We propose a Composable On-Package GPU (COPA-GPU) architecture to provide domain-specialized GPU products.
We show how a COPA-GPU enables DL-specialized products by modular augmentation of the baseline GPU architecture with up to 4x higher off-die bandwidth, 32x larger on-package cache, 2.3x higher DRAM bandwidth and capacity, while conveniently supporting scaled-down HPC-oriented designs.
arXiv Detail & Related papers (2021-04-05T23:06:50Z)
- Efficient Video Semantic Segmentation with Labels Propagation and Refinement [138.55845680523908]
This paper tackles the problem of real-time semantic segmentation of high definition videos using a hybrid GPU / CPU approach.
We propose an Efficient Video Segmentation (EVS) pipeline that combines: (i) on the CPU, a very fast optical flow method that exploits the temporal aspect of the video and propagates semantic information from one frame to the next.
On the popular Cityscapes dataset with high resolution frames (2048 x 1024), the proposed operating points range from 80 to 1000 Hz on a single GPU and CPU.
arXiv Detail & Related papers (2019-12-26T11:45:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.