GPU Domain Specialization via Composable On-Package Architecture
- URL: http://arxiv.org/abs/2104.02188v1
- Date: Mon, 5 Apr 2021 23:06:50 GMT
- Title: GPU Domain Specialization via Composable On-Package Architecture
- Authors: Yaosheng Fu, Evgeny Bolotin, Niladrish Chatterjee, David Nellans,
Stephen W. Keckler
- Abstract summary: A Composable On-PAckage GPU (COPA-GPU) architecture provides domain-specialized GPU products.
We show how a COPA-GPU enables DL-specialized products by modular augmentation of the baseline GPU architecture with up to 4x higher off-die bandwidth, a 32x larger on-package cache, and 2.3x higher DRAM bandwidth and capacity, while conveniently supporting scaled-down HPC-oriented designs.
- Score: 0.8240720472180706
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: As GPUs scale their low-precision matrix math throughput to boost
deep learning (DL) performance, they upset the balance between math throughput
and memory system capabilities. We demonstrate that a converged GPU design that
tries to address the diverging architectural requirements of FP32 (or
larger)-based HPC and FP16 (or smaller)-based DL workloads results in a
configuration that is sub-optimal for both application domains. We argue that a
Composable On-PAckage GPU (COPA-GPU) architecture, which provides
domain-specialized GPU products, is the most practical solution to these
diverging requirements. A COPA-GPU leverages
multi-chip-module disaggregation to support maximal design reuse, along with
memory system specialization per application domain. We show how a COPA-GPU
enables DL-specialized products by modular augmentation of the baseline GPU
architecture with up to 4x higher off-die bandwidth, a 32x larger on-package
cache, and 2.3x higher DRAM bandwidth and capacity, while conveniently supporting
scaled-down HPC-oriented designs. This work explores the microarchitectural
design necessary to enable composable GPUs and evaluates the benefits
composability can provide to HPC, DL training, and DL inference. We show that
when compared to a converged GPU design, a DL-optimized COPA-GPU featuring a
combination of 16x larger cache capacity and 1.6x higher DRAM bandwidth scales
per-GPU training and inference performance by 31% and 35%, respectively, and
reduces the number of GPU instances by 50% in scale-out training scenarios.
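To see why cache capacity and DRAM bandwidth are the levers that matter for DL, consider a minimal two-level roofline model. This is an illustrative sketch with invented numbers (peak throughput, arithmetic intensity, hit rates), not the paper's simulation methodology:

```python
# Minimal two-level roofline sketch (illustrative numbers, not the paper's
# methodology). Effective bandwidth blends on-package cache and DRAM by hit
# rate; attainable throughput is capped by peak math or by memory.

def attainable_tflops(peak_tflops, intensity_flops_per_byte,
                      dram_tbs, cache_tbs, cache_hit_rate):
    eff_tbs = cache_tbs * cache_hit_rate + dram_tbs * (1.0 - cache_hit_rate)
    return min(peak_tflops, intensity_flops_per_byte * eff_tbs)

# Hypothetical converged design vs. a DL-specialized COPA-style variant with
# 1.6x DRAM bandwidth and a large on-package cache (hence a higher hit rate).
converged = attainable_tflops(312, 60, dram_tbs=2.0, cache_tbs=6.0, cache_hit_rate=0.3)
copa_dl   = attainable_tflops(312, 60, dram_tbs=3.2, cache_tbs=6.0, cache_hit_rate=0.7)
print(f"converged: {converged:.0f} TFLOP/s, DL-specialized: {copa_dl:.0f} TFLOP/s")
```

With these made-up inputs the converged design is memory-bound while the DL-specialized variant runs near its math peak, mirroring the paper's qualitative argument.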
Related papers
- MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs [55.95879347182669]
The MoE architecture is renowned for its ability to increase model capacity without a proportional increase in inference cost.
MoE-Lightning introduces a novel CPU-GPU-I/O pipelining schedule, CGOPipe, with paged weights to achieve high resource utilization.
MoE-Lightning can achieve up to 10.3x higher throughput than state-of-the-art offloading-enabled LLM inference systems for Mixtral 8x7B on a single T4 GPU (16GB).
arXiv Detail & Related papers (2024-11-18T01:06:12Z)
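The heart of a CPU-GPU-I/O pipelining schedule like CGOPipe is hiding weight-paging latency behind compute. The following double-buffered prefetch loop is a minimal sketch of that overlap pattern; the function names and single-worker I/O pool are assumptions for illustration, not MoE-Lightning's actual API:

```python
# Minimal sketch of CPU-GPU-I/O pipelining (illustrative only): while the
# GPU computes with expert e's weights, the next expert's paged weights are
# prefetched, so transfer latency hides behind compute.
from concurrent.futures import ThreadPoolExecutor
import time

def fetch_weights(expert_id):          # stands in for disk/CPU->GPU paging
    time.sleep(0.05)
    return f"weights[{expert_id}]"

def compute(expert_id, weights):       # stands in for the GPU kernel
    time.sleep(0.08)
    return f"out[{expert_id}] using {weights}"

experts = [0, 1, 2, 3]
with ThreadPoolExecutor(max_workers=1) as io:
    pending = io.submit(fetch_weights, experts[0])    # warm up the pipeline
    for i, e in enumerate(experts):
        w = pending.result()                          # wait for current weights
        if i + 1 < len(experts):                      # prefetch the next expert
            pending = io.submit(fetch_weights, experts[i + 1])
        print(compute(e, w))                          # overlaps with prefetch
```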
- Multi-GPU RI-HF Energies and Analytic Gradients - Towards High Throughput Ab Initio Molecular Dynamics [0.0]
This article presents an optimized algorithm and implementation for calculating resolution-of-the-identity Hartree-Fock energies and analytic gradients using multiple Graphics Processing Units (GPUs).
The algorithm is designed especially for high-throughput ab initio molecular dynamics simulations of small and medium-sized molecules (10-100 atoms).
arXiv Detail & Related papers (2024-07-29T00:14:10Z)
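For context, the resolution-of-the-identity approximation replaces four-center electron-repulsion integrals with two- and three-center quantities over an auxiliary basis, which is what yields the dense matrix operations that map well to GPUs (this is the standard RI factorization, not anything specific to this paper's implementation):

```latex
(ij\,|\,kl) \;\approx\; \sum_{P,Q} (ij\,|\,P)\,\bigl[\mathbf{J}^{-1}\bigr]_{PQ}\,(Q\,|\,kl),
\qquad J_{PQ} = (P\,|\,Q)
```

Here $P$ and $Q$ run over the auxiliary basis; the three-center tensors and the Coulomb metric $\mathbf{J}$ turn Fock builds into dense matrix multiplications that parallelize naturally across GPUs.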
- FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs [57.12856172329322]
We envision a decentralized system that unlocks the vast untapped potential of consumer-level GPUs.
This system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, and the variability of peers and device heterogeneity.
arXiv Detail & Related papers (2023-09-03T13:27:56Z)
- INR-Arch: A Dataflow Architecture and Compiler for Arbitrary-Order Gradient Computations in Implicit Neural Representation Processing [66.00729477511219]
Given a function represented as a computation graph, traditional architectures face challenges in efficiently computing its nth-order gradient.
We introduce INR-Arch, a framework that transforms the computation graph of an nth-order gradient into a hardware-optimized dataflow architecture.
We present results demonstrating 1.8-4.8x and 1.5-3.6x speedups over CPU and GPU baselines, respectively.
arXiv Detail & Related papers (2023-08-11T04:24:39Z)
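As a toy view of the arbitrary-order gradient problem that INR-Arch compiles to hardware, an nth-order derivative can be obtained by recursively nesting forward-mode automatic differentiation. This sketch shows only the underlying math, not the paper's dataflow architecture:

```python
# Toy nth-order derivative via nested forward-mode autodiff (dual numbers).
# INR-Arch compiles such gradient graphs into hardware; this is just the math.
class Dual:
    """Dual number a + b*eps with eps**2 == 0."""
    def __init__(self, a, b=0.0):
        self.a, self.b = a, b

    @staticmethod
    def lift(x):
        return x if isinstance(x, Dual) else Dual(x)

    def __add__(self, o):
        o = Dual.lift(o)
        return Dual(self.a + o.a, self.b + o.b)
    __radd__ = __add__

    def __mul__(self, o):
        o = Dual.lift(o)
        return Dual(self.a * o.a, self.a * o.b + self.b * o.a)
    __rmul__ = __mul__

def derivative(f, n):
    """Return the nth derivative of f by differentiating once, n times."""
    if n == 0:
        return f
    # Seed a unit perturbation; the eps coefficient of the result is f'(x).
    return derivative(lambda x: f(Dual(x, 1.0)).b, n - 1)

f = lambda x: x * x * x                 # f(x) = x^3
print(derivative(f, 2)(2.0))            # f''(x) = 6x -> 12.0
```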
- An Analysis of Collocation on GPUs for Deep Learning Training [0.0]
Multi-Instance GPU (MIG) is a new technology introduced by NVIDIA that can partition a GPU to better fit workloads.
In this paper, we examine the performance of a MIG-enabled A100 GPU under deep learning workloads containing various sizes and combinations of models.
arXiv Detail & Related papers (2022-09-13T14:13:06Z)
- A Frequency-aware Software Cache for Large Recommendation System Embeddings [11.873521953539361]
Deep learning recommendation models (DLRMs) have been widely applied in Internet companies.
We propose a GPU-based software cache approach to dynamically manage the embedding table across the CPU and GPU memory space.
Our proposed software cache is efficient for training entire DLRMs on the GPU with synchronized updates.
arXiv Detail & Related papers (2022-08-08T12:08:05Z)
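A minimal sketch of the frequency-aware idea, assuming an LFU-style eviction policy (the paper's actual cache design and CUDA implementation differ in detail): hot embedding rows stay in a small device-resident store, while cold rows live in the large host-side table:

```python
# Conceptual frequency-aware embedding cache (not the paper's implementation):
# hot rows live in a small "GPU" store, cold rows stay in the big "CPU" table,
# and eviction removes the least frequently used cached row.
import numpy as np

class FreqAwareCache:
    def __init__(self, cpu_table: np.ndarray, capacity: int):
        self.cpu_table = cpu_table            # full embedding table (host)
        self.capacity = capacity              # rows that fit on the device
        self.gpu_rows = {}                    # row_id -> vector ("device" copy)
        self.freq = {}                        # row_id -> access count

    def lookup(self, row_id: int) -> np.ndarray:
        self.freq[row_id] = self.freq.get(row_id, 0) + 1
        if row_id not in self.gpu_rows:       # miss: page the row in
            if len(self.gpu_rows) >= self.capacity:
                victim = min(self.gpu_rows, key=lambda r: self.freq[r])
                self.cpu_table[victim] = self.gpu_rows.pop(victim)  # write back
            self.gpu_rows[row_id] = self.cpu_table[row_id].copy()
        return self.gpu_rows[row_id]

table = np.random.randn(10_000, 64).astype(np.float32)
cache = FreqAwareCache(table, capacity=256)
vec = cache.lookup(42)                        # first access pages row 42 in
```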
- PARIS and ELSA: An Elastic Scheduling Algorithm for Reconfigurable Multi-GPU Inference Servers [0.9854614058492648]
NVIDIA's Ampere GPU architecture provides features to "reconfigure" one large, monolithic GPU into multiple smaller "GPU partitions".
In this paper, we study this emerging GPU architecture with reconfigurability to develop a high-performance multi-GPU ML inference server.
arXiv Detail & Related papers (2022-02-27T23:30:55Z)
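To give a flavor of scheduling over such reconfigurable partitions, here is a toy best-fit placement policy; it is a stand-in for illustration only, not the actual PARIS or ELSA algorithms:

```python
# Toy scheduler for MIG-style GPU partitions (illustrative only): each
# request goes to the smallest free partition that meets its compute
# demand, keeping big partitions free for big models.
from dataclasses import dataclass

@dataclass
class Partition:
    slices: int            # compute slices (e.g., 1g/2g/3g/7g on A100)
    busy: bool = False

@dataclass
class Request:
    model: str
    slices_needed: int

def schedule(requests, partitions):
    placements = []
    for req in requests:
        # best fit: smallest free partition that is large enough
        fits = [p for p in partitions
                if not p.busy and p.slices >= req.slices_needed]
        if not fits:
            placements.append((req.model, None))   # a real server would queue
            continue
        best = min(fits, key=lambda p: p.slices)
        best.busy = True
        placements.append((req.model, best.slices))
    return placements

parts = [Partition(1), Partition(2), Partition(4)]
reqs = [Request("bert-base", 1), Request("gpt-large", 4), Request("resnet", 1)]
print(schedule(reqs, parts))  # [('bert-base', 1), ('gpt-large', 4), ('resnet', 2)]
```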
- PLSSVM: A (multi-)GPGPU-accelerated Least Squares Support Vector Machine [68.8204255655161]
Support Vector Machines (SVMs) are widely used in machine learning.
However, even modern and optimized implementations do not scale well for large, non-trivial dense data sets on cutting-edge hardware.
PLSSVM can be used as a drop-in replacement for LIBSVM.
arXiv Detail & Related papers (2022-02-25T13:24:23Z)
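For reference, a least-squares SVM replaces the classic SVM quadratic program with a single dense linear system over the kernel matrix, which is exactly the workload that GPU linear algebra accelerates. A minimal NumPy sketch of that formulation (conceptual only, not PLSSVM's code):

```python
# Minimal least-squares SVM in NumPy (conceptual; PLSSVM solves this kind
# of system with GPU-accelerated solvers at much larger scale).
# LS-SVM dual: solve [[0, 1^T], [1, K + I/gamma]] @ [b, alpha] = [0, y].
import numpy as np

def lssvm_fit(X, y, gamma=1.0, sigma=1.0):
    n = len(y)
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq / (2 * sigma**2))              # RBF kernel matrix
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = K + np.eye(n) / gamma             # ridge-regularized kernel
    rhs = np.concatenate(([0.0], y))
    sol = np.linalg.solve(A, rhs)
    return sol[0], sol[1:]                        # bias b, dual coefficients

def lssvm_predict(X_train, b, alpha, X_test, sigma=1.0):
    sq = np.sum((X_test[:, None, :] - X_train[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq / (2 * sigma**2))
    return np.sign(K @ alpha + b)

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([-1., 1., 1., -1.])                  # XOR, needs the RBF kernel
b, alpha = lssvm_fit(X, y)
print(lssvm_predict(X, b, alpha, X))              # expect [-1, 1, 1, -1]
```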
- Project CGX: Scalable Deep Learning on Commodity GPUs [17.116792714097738]
This paper investigates whether hardware overprovisioning can be supplanted via algorithmic and system design.
We propose a framework called CGX, which provides efficient software support for communication compression.
We show that this framework is able to remove communication bottlenecks from consumer-grade multi-GPU systems.
arXiv Detail & Related papers (2021-11-16T17:00:42Z)
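Communication compression of this kind is commonly built from top-k gradient sparsification with error feedback. The sketch below shows the generic recipe; it is not CGX's actual implementation:

```python
# Generic top-k gradient compression with error feedback (a common recipe in
# compressed data-parallel training; a sketch, not CGX's actual code).
import numpy as np

class TopKCompressor:
    def __init__(self, k_ratio=0.01):
        self.k_ratio = k_ratio
        self.residual = None            # error-feedback buffer

    def compress(self, grad: np.ndarray):
        if self.residual is None:
            self.residual = np.zeros_like(grad)
        acc = grad + self.residual      # add back what was dropped last step
        k = max(1, int(acc.size * self.k_ratio))
        idx = np.argpartition(np.abs(acc), -k)[-k:]   # k largest magnitudes
        values = acc[idx]
        self.residual = acc.copy()
        self.residual[idx] = 0.0        # remember the dropped mass
        return idx, values              # only these go over the network

grad = np.random.randn(1_000_000).astype(np.float32)
comp = TopKCompressor(k_ratio=0.01)
idx, vals = comp.compress(grad)         # ~100x less traffic than a dense all-reduce
```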
- Adaptive Elastic Training for Sparse Deep Learning on Heterogeneous Multi-GPU Servers [65.60007071024629]
We show experimentally that Adaptive SGD outperforms four state-of-the-art solutions in time-to-accuracy.
arXiv Detail & Related papers (2021-10-13T20:58:15Z)
- Efficient and Generic 1D Dilated Convolution Layer for Deep Learning [52.899995651639436]
We introduce our efficient implementation of a generic 1D convolution layer covering a wide range of parameters.
It is optimized for x86 CPU architectures, in particular, for architectures containing Intel AVX-512 and AVX-512 BFloat16 instructions.
We demonstrate the performance of our optimized 1D convolution layer by utilizing it in the end-to-end neural network training with real genomics datasets.
arXiv Detail & Related papers (2021-04-16T09:54:30Z)
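For reference, the operation being optimized has a compact direct definition. A naive NumPy version of a 1D dilated convolution follows; the paper's contribution is an AVX-512/BF16-optimized kernel computing the same thing, not this loop:

```python
# Direct (naive) 1D dilated convolution for reference; optimized kernels
# vectorize and block this loop but compute the same result.
import numpy as np

def dilated_conv1d(x: np.ndarray, w: np.ndarray, dilation: int = 1) -> np.ndarray:
    """x: input of length n, w: kernel of length k, 'valid' padding."""
    n, k = len(x), len(w)
    span = (k - 1) * dilation + 1          # receptive field of one output
    out = np.empty(n - span + 1, dtype=x.dtype)
    for i in range(len(out)):
        # taps are spaced `dilation` samples apart
        out[i] = np.dot(x[i : i + span : dilation], w)
    return out

x = np.arange(10, dtype=np.float32)
print(dilated_conv1d(x, np.array([1., 0., -1.], dtype=np.float32), dilation=2))
# each output is x[i] - x[i+4]; prints [-4. -4. -4. -4. -4. -4.]
```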