GPU Domain Specialization via Composable On-Package Architecture
- URL: http://arxiv.org/abs/2104.02188v1
- Date: Mon, 5 Apr 2021 23:06:50 GMT
- Title: GPU Domain Specialization via Composable On-Package Architecture
- Authors: Yaosheng Fu, Evgeny Bolotin, Niladrish Chatterjee, David Nellans,
Stephen W. Keckler
- Abstract summary: A Composable On-PAckage GPU (COPA-GPU) architecture provides domain-specialized GPU products.
We show how a COPA-GPU enables DL-specialized products by modular augmentation of the baseline GPU architecture with up to 4x higher off-die bandwidth, a 32x larger on-package cache, and 2.3x higher DRAM bandwidth and capacity, while conveniently supporting scaled-down HPC-oriented designs.
- Score: 0.8240720472180706
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: As GPUs scale their low-precision matrix math throughput to boost
deep learning (DL) performance, they upset the balance between math throughput
and memory system capabilities. We demonstrate that a converged GPU design that
tries to address the diverging architectural requirements of FP32 (or
larger)-based HPC and FP16 (or smaller)-based DL workloads results in a
configuration that is sub-optimal for both application domains. We argue that a
Composable On-PAckage GPU (COPA-GPU) architecture, which provides
domain-specialized GPU products, is the most practical solution to these
diverging requirements. A COPA-GPU leverages
multi-chip-module disaggregation to support maximal design reuse, along with
memory system specialization per application domain. We show how a COPA-GPU
enables DL-specialized products by modular augmentation of the baseline GPU
architecture with up to 4x higher off-die bandwidth, a 32x larger on-package
cache, and 2.3x higher DRAM bandwidth and capacity, while conveniently supporting
scaled-down HPC-oriented designs. This work explores the microarchitectural
design necessary to enable composable GPUs and evaluates the benefits
composability can provide to HPC, DL training, and DL inference. We show that
when compared to a converged GPU design, a DL-optimized COPA-GPU featuring a
combination of 16x larger cache capacity and 1.6x higher DRAM bandwidth scales
per-GPU training and inference performance by 31% and 35%, respectively, and
reduces the number of GPU instances by 50% in scale-out training scenarios.
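To see why cache capacity and DRAM bandwidth are the levers that matter for DL, consider a minimal two-level roofline model. This is an illustrative sketch with invented numbers (peak throughput, arithmetic intensity, hit rates), not the paper's simulation methodology:

```python
# Minimal two-level roofline sketch (illustrative numbers, not the paper's
# methodology). Effective bandwidth blends on-package cache and DRAM by hit
# rate; attainable throughput is capped by peak math or by memory.

def attainable_tflops(peak_tflops, intensity_flops_per_byte,
                      dram_tbs, cache_tbs, cache_hit_rate):
    eff_tbs = cache_tbs * cache_hit_rate + dram_tbs * (1.0 - cache_hit_rate)
    return min(peak_tflops, intensity_flops_per_byte * eff_tbs)

# Hypothetical converged design vs. a DL-specialized COPA-style variant with
# 1.6x DRAM bandwidth and a large on-package cache (hence a higher hit rate).
converged = attainable_tflops(312, 60, dram_tbs=2.0, cache_tbs=6.0, cache_hit_rate=0.3)
copa_dl   = attainable_tflops(312, 60, dram_tbs=3.2, cache_tbs=6.0, cache_hit_rate=0.7)
print(f"converged: {converged:.0f} TFLOP/s, DL-specialized: {copa_dl:.0f} TFLOP/s")
```

With these made-up inputs the converged design is memory-bound while the DL-specialized variant runs near its math peak, mirroring the paper's qualitative argument.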
Related papers
- MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs [55.95879347182669]
The MoE architecture is renowned for its ability to increase model capacity without a proportional increase in inference cost.
MoE-Lightning introduces a novel CPU-GPU-I/O pipelining schedule, CGOPipe, with paged weights to achieve high resource utilization.
MoE-Lightning can achieve up to 10.3x higher throughput than state-of-the-art offloading-enabled LLM inference systems for Mixtral 8x7B on a single T4 GPU (16GB).
arXiv Detail & Related papers (2024-11-18T01:06:12Z)
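The heart of a CPU-GPU-I/O pipelining schedule like CGOPipe is hiding weight-paging latency behind compute. The following double-buffered prefetch loop is a minimal sketch of that overlap pattern; the function names and single-worker I/O pool are assumptions for illustration, not MoE-Lightning's actual API:

```python
# Minimal sketch of CPU-GPU-I/O pipelining (illustrative only): while the
# GPU computes with expert e's weights, the next expert's paged weights are
# prefetched, so transfer latency hides behind compute.
from concurrent.futures import ThreadPoolExecutor
import time

def fetch_weights(expert_id):          # stands in for disk/CPU->GPU paging
    time.sleep(0.05)
    return f"weights[{expert_id}]"

def compute(expert_id, weights):       # stands in for the GPU kernel
    time.sleep(0.08)
    return f"out[{expert_id}] using {weights}"

experts = [0, 1, 2, 3]
with ThreadPoolExecutor(max_workers=1) as io:
    pending = io.submit(fetch_weights, experts[0])    # warm up the pipeline
    for i, e in enumerate(experts):
        w = pending.result()                          # wait for current weights
        if i + 1 < len(experts):                      # prefetch the next expert
            pending = io.submit(fetch_weights, experts[i + 1])
        print(compute(e, w))                          # overlaps with prefetch
```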
- Multi-GPU RI-HF Energies and Analytic Gradients - Towards High Throughput Ab Initio Molecular Dynamics [0.0]
This article presents an optimized algorithm and implementation for calculating resolution-of-the-identity Hartree-Fock energies and analytic gradients using multiple Graphics Processing Units (GPUs).
The algorithm is designed especially for high-throughput ab initio molecular dynamics simulations of small and medium-sized molecules (10-100 atoms).
arXiv Detail & Related papers (2024-07-29T00:14:10Z)
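For context, the resolution-of-the-identity approximation replaces four-center electron-repulsion integrals with two- and three-center quantities over an auxiliary basis, which is what yields the dense matrix operations that map well to GPUs (this is the standard RI factorization, not anything specific to this paper's implementation):

```latex
(ij\,|\,kl) \;\approx\; \sum_{P,Q} (ij\,|\,P)\,\bigl[\mathbf{J}^{-1}\bigr]_{PQ}\,(Q\,|\,kl),
\qquad J_{PQ} = (P\,|\,Q)
```

Here $P$ and $Q$ run over the auxiliary basis; the three-center tensors and the Coulomb metric $\mathbf{J}$ turn Fock builds into dense matrix multiplications that parallelize naturally across GPUs.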
- FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs [57.12856172329322]
We envision a decentralized system that unlocks the vast untapped potential of consumer-level GPUs.
This system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, and the variability of peers and device heterogeneity.
arXiv Detail & Related papers (2023-09-03T13:27:56Z)
- INR-Arch: A Dataflow Architecture and Compiler for Arbitrary-Order Gradient Computations in Implicit Neural Representation Processing [66.00729477511219]
Given a function represented as a computation graph, traditional architectures face challenges in efficiently computing its nth-order gradient.
We introduce INR-Arch, a framework that transforms the computation graph of an nth-order gradient into a hardware-optimized dataflow architecture.
We present results demonstrating 1.8-4.8x and 1.5-3.6x speedups over CPU and GPU baselines, respectively.
arXiv Detail & Related papers (2023-08-11T04:24:39Z)
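As a toy view of the arbitrary-order gradient problem that INR-Arch compiles to hardware, an nth-order derivative can be obtained by recursively nesting forward-mode automatic differentiation. This sketch shows only the underlying math, not the paper's dataflow architecture:

```python
# Toy nth-order derivative via nested forward-mode autodiff (dual numbers).
# INR-Arch compiles such gradient graphs into hardware; this is just the math.
class Dual:
    """Dual number a + b*eps with eps**2 == 0."""
    def __init__(self, a, b=0.0):
        self.a, self.b = a, b

    @staticmethod
    def lift(x):
        return x if isinstance(x, Dual) else Dual(x)

    def __add__(self, o):
        o = Dual.lift(o)
        return Dual(self.a + o.a, self.b + o.b)
    __radd__ = __add__

    def __mul__(self, o):
        o = Dual.lift(o)
        return Dual(self.a * o.a, self.a * o.b + self.b * o.a)
    __rmul__ = __mul__

def derivative(f, n):
    """Return the nth derivative of f by differentiating once, n times."""
    if n == 0:
        return f
    # Seed a unit perturbation; the eps coefficient of the result is f'(x).
    return derivative(lambda x: f(Dual(x, 1.0)).b, n - 1)

f = lambda x: x * x * x                 # f(x) = x^3
print(derivative(f, 2)(2.0))            # f''(x) = 6x -> 12.0
```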
- An Analysis of Collocation on GPUs for Deep Learning Training [0.0]
Multi-Instance GPU (MIG) is a new technology introduced by NVIDIA that can partition a GPU to better fit workloads.
In this paper, we examine the performance of a MIG-enabled A100 GPU under deep learning workloads containing various sizes and combinations of models.
arXiv Detail & Related papers (2022-09-13T14:13:06Z)
- A Frequency-aware Software Cache for Large Recommendation System Embeddings [11.873521953539361]
Deep learning recommendation models (DLRMs) have been widely applied in Internet companies.
We propose a GPU-based software cache approach to dynamically manage the embedding table across the CPU and GPU memory space.
Our proposed software cache is efficient for training entire DLRMs on the GPU with synchronized updates.
arXiv Detail & Related papers (2022-08-08T12:08:05Z)
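A minimal sketch of the frequency-aware idea, assuming an LFU-style eviction policy (the paper's actual cache design and CUDA implementation differ in detail): hot embedding rows stay in a small device-resident store, while cold rows live in the large host-side table:

```python
# Conceptual frequency-aware embedding cache (not the paper's implementation):
# hot rows live in a small "GPU" store, cold rows stay in the big "CPU" table,
# and eviction removes the least frequently used cached row.
import numpy as np

class FreqAwareCache:
    def __init__(self, cpu_table: np.ndarray, capacity: int):
        self.cpu_table = cpu_table            # full embedding table (host)
        self.capacity = capacity              # rows that fit on the device
        self.gpu_rows = {}                    # row_id -> vector ("device" copy)
        self.freq = {}                        # row_id -> access count

    def lookup(self, row_id: int) -> np.ndarray:
        self.freq[row_id] = self.freq.get(row_id, 0) + 1
        if row_id not in self.gpu_rows:       # miss: page the row in
            if len(self.gpu_rows) >= self.capacity:
                victim = min(self.gpu_rows, key=lambda r: self.freq[r])
                self.cpu_table[victim] = self.gpu_rows.pop(victim)  # write back
            self.gpu_rows[row_id] = self.cpu_table[row_id].copy()
        return self.gpu_rows[row_id]

table = np.random.randn(10_000, 64).astype(np.float32)
cache = FreqAwareCache(table, capacity=256)
vec = cache.lookup(42)                        # first access pages row 42 in
```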
- PARIS and ELSA: An Elastic Scheduling Algorithm for Reconfigurable Multi-GPU Inference Servers [0.9854614058492648]
NVIDIA's Ampere GPU architecture provides features to "reconfigure" one large, monolithic GPU into multiple smaller "GPU partitions".
In this paper, we study this emerging GPU architecture with reconfigurability to develop a high-performance multi-GPU ML inference server.
arXiv Detail & Related papers (2022-02-27T23:30:55Z)
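To give a flavor of scheduling over such reconfigurable partitions, here is a toy best-fit placement policy; it is a stand-in for illustration only, not the actual PARIS or ELSA algorithms:

```python
# Toy scheduler for MIG-style GPU partitions (illustrative only): each
# request goes to the smallest free partition that meets its compute
# demand, keeping big partitions free for big models.
from dataclasses import dataclass

@dataclass
class Partition:
    slices: int            # compute slices (e.g., 1g/2g/3g/7g on A100)
    busy: bool = False

@dataclass
class Request:
    model: str
    slices_needed: int

def schedule(requests, partitions):
    placements = []
    for req in requests:
        # best fit: smallest free partition that is large enough
        fits = [p for p in partitions
                if not p.busy and p.slices >= req.slices_needed]
        if not fits:
            placements.append((req.model, None))   # a real server would queue
            continue
        best = min(fits, key=lambda p: p.slices)
        best.busy = True
        placements.append((req.model, best.slices))
    return placements

parts = [Partition(1), Partition(2), Partition(4)]
reqs = [Request("bert-base", 1), Request("gpt-large", 4), Request("resnet", 1)]
print(schedule(reqs, parts))  # [('bert-base', 1), ('gpt-large', 4), ('resnet', 2)]
```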
- PLSSVM: A (multi-)GPGPU-accelerated Least Squares Support Vector Machine [68.8204255655161]
Support Vector Machines (SVMs) are widely used in machine learning.
However, even modern and optimized implementations do not scale well for large, non-trivial dense data sets on cutting-edge hardware.
PLSSVM can be used as a drop-in replacement for LIBSVM.
arXiv Detail & Related papers (2022-02-25T13:24:23Z)
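For reference, a least-squares SVM replaces the classic SVM quadratic program with a single dense linear system over the kernel matrix, which is exactly the workload that GPU linear algebra accelerates. A minimal NumPy sketch of that formulation (conceptual only, not PLSSVM's code):

```python
# Minimal least-squares SVM in NumPy (conceptual; PLSSVM solves this kind
# of system with GPU-accelerated solvers at much larger scale).
# LS-SVM dual: solve [[0, 1^T], [1, K + I/gamma]] @ [b, alpha] = [0, y].
import numpy as np

def lssvm_fit(X, y, gamma=1.0, sigma=1.0):
    n = len(y)
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq / (2 * sigma**2))              # RBF kernel matrix
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = K + np.eye(n) / gamma             # ridge-regularized kernel
    rhs = np.concatenate(([0.0], y))
    sol = np.linalg.solve(A, rhs)
    return sol[0], sol[1:]                        # bias b, dual coefficients

def lssvm_predict(X_train, b, alpha, X_test, sigma=1.0):
    sq = np.sum((X_test[:, None, :] - X_train[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq / (2 * sigma**2))
    return np.sign(K @ alpha + b)

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([-1., 1., 1., -1.])                  # XOR, needs the RBF kernel
b, alpha = lssvm_fit(X, y)
print(lssvm_predict(X, b, alpha, X))              # expect [-1, 1, 1, -1]
```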
- Project CGX: Scalable Deep Learning on Commodity GPUs [17.116792714097738]
This paper investigates whether hardware overprovisioning can be supplanted via algorithmic and system design.
We propose a framework called CGX, which provides efficient software support for communication compression.
We show that this framework is able to remove communication bottlenecks from consumer-grade multi-GPU systems.
arXiv Detail & Related papers (2021-11-16T17:00:42Z)
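Communication compression of this kind is commonly built from top-k gradient sparsification with error feedback. The sketch below shows the generic recipe; it is not CGX's actual implementation:

```python
# Generic top-k gradient compression with error feedback (a common recipe in
# compressed data-parallel training; a sketch, not CGX's actual code).
import numpy as np

class TopKCompressor:
    def __init__(self, k_ratio=0.01):
        self.k_ratio = k_ratio
        self.residual = None            # error-feedback buffer

    def compress(self, grad: np.ndarray):
        if self.residual is None:
            self.residual = np.zeros_like(grad)
        acc = grad + self.residual      # add back what was dropped last step
        k = max(1, int(acc.size * self.k_ratio))
        idx = np.argpartition(np.abs(acc), -k)[-k:]   # k largest magnitudes
        values = acc[idx]
        self.residual = acc.copy()
        self.residual[idx] = 0.0        # remember the dropped mass
        return idx, values              # only these go over the network

grad = np.random.randn(1_000_000).astype(np.float32)
comp = TopKCompressor(k_ratio=0.01)
idx, vals = comp.compress(grad)         # ~100x less traffic than a dense all-reduce
```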
- Adaptive Elastic Training for Sparse Deep Learning on Heterogeneous Multi-GPU Servers [65.60007071024629]
We show experimentally that Adaptive SGD outperforms four state-of-the-art solutions in time-to-accuracy.
arXiv Detail & Related papers (2021-10-13T20:58:15Z)
- Efficient and Generic 1D Dilated Convolution Layer for Deep Learning [52.899995651639436]
We introduce our efficient implementation of a generic 1D convolution layer covering a wide range of parameters.
It is optimized for x86 CPU architectures, in particular, for architectures containing Intel AVX-512 and AVX-512 BFloat16 instructions.
We demonstrate the performance of our optimized 1D convolution layer by utilizing it in the end-to-end neural network training with real genomics datasets.
arXiv Detail & Related papers (2021-04-16T09:54:30Z)
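For reference, the operation being optimized has a compact direct definition. A naive NumPy version of a 1D dilated convolution follows; the paper's contribution is an AVX-512/BF16-optimized kernel computing the same thing, not this loop:

```python
# Direct (naive) 1D dilated convolution for reference; optimized kernels
# vectorize and block this loop but compute the same result.
import numpy as np

def dilated_conv1d(x: np.ndarray, w: np.ndarray, dilation: int = 1) -> np.ndarray:
    """x: input of length n, w: kernel of length k, 'valid' padding."""
    n, k = len(x), len(w)
    span = (k - 1) * dilation + 1          # receptive field of one output
    out = np.empty(n - span + 1, dtype=x.dtype)
    for i in range(len(out)):
        # taps are spaced `dilation` samples apart
        out[i] = np.dot(x[i : i + span : dilation], w)
    return out

x = np.arange(10, dtype=np.float32)
print(dilated_conv1d(x, np.array([1., 0., -1.], dtype=np.float32), dilation=2))
# each output is x[i] - x[i+4]; prints [-4. -4. -4. -4. -4. -4.]
```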