Demand Layering for Real-Time DNN Inference with Minimized Memory Usage
- URL: http://arxiv.org/abs/2210.04024v1
- Date: Sat, 8 Oct 2022 13:38:48 GMT
- Title: Demand Layering for Real-Time DNN Inference with Minimized Memory Usage
- Authors: Mingoo Ji, Saehanseul Yi, Changjin Koo, Sol Ahn, Dongjoo Seo, Nikil
Dutt, Jong-Chan Kim
- Abstract summary: Deep neural network (DNN) model parameters are loaded into GPU memory before execution.
We present Demand Layering, which exploits the layer-by-layer execution of DNNs.
Our implementation shows a 96.5% memory reduction with just 14.8% delay overhead on average.
- Score: 2.5768647103950357
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: When executing a deep neural network (DNN), its model parameters are loaded
into GPU memory before execution, incurring a significant GPU memory burden.
There are studies that reduce GPU memory usage by exploiting CPU memory as a
swap device. However, this approach is not applicable in most embedded systems
with integrated GPUs where CPU and GPU share a common memory. In this regard,
we present Demand Layering, which employs a fast solid-state drive (SSD) as a
co-running partner of a GPU and exploits the layer-by-layer execution of DNNs.
In our approach, a DNN is loaded and executed in a layer-by-layer manner,
minimizing memory usage to the order of a single layer. In addition, we developed
a pipeline architecture that hides most of the additional delay caused by
interleaving parameter loading with layer execution. Our implementation
shows a 96.5% memory reduction with just 14.8% delay overhead on average for
representative DNNs. Furthermore, by exploiting the memory-delay tradeoff,
near-zero delay overhead (under 1 ms) can be achieved with a slightly increased
memory usage (still an 88.4% reduction), showing the great potential of Demand
Layering.
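To make the pipeline concrete, below is a minimal Python sketch of the layer-by-layer loading scheme described in the abstract: a loader thread streams one layer's parameters at a time from storage while the previous layer executes, so peak parameter memory is bounded by a small prefetch buffer rather than the whole model. This is an illustration only; the names load_layer_params, run_layer, NUM_LAYERS, and PREFETCH_DEPTH are hypothetical stand-ins, not the authors' implementation, which overlaps SSD reads with GPU kernel execution on shared-memory embedded platforms.

    import queue
    import threading

    NUM_LAYERS = 10      # assumed model depth (illustrative)
    PREFETCH_DEPTH = 1   # keep roughly one layer of parameters resident at a time

    def load_layer_params(layer_idx):
        # Stand-in for reading one layer's weights from the SSD into a small
        # staging buffer; in the paper this read overlaps with GPU execution.
        return {"layer": layer_idx, "weights": b"..."}

    def run_layer(activations, params):
        # Stand-in for launching the layer's kernels; here it simply passes
        # the activations through unchanged.
        return activations

    def loader(param_queue):
        # Producer: stream parameters from storage, staying one layer ahead.
        for i in range(NUM_LAYERS):
            param_queue.put(load_layer_params(i))  # blocks when the buffer is full

    def infer(inputs):
        # Consumer: execute layers as their parameters arrive, so peak parameter
        # memory stays on the order of PREFETCH_DEPTH layers, not the full model.
        param_queue = queue.Queue(maxsize=PREFETCH_DEPTH)
        threading.Thread(target=loader, args=(param_queue,), daemon=True).start()
        activations = inputs
        for _ in range(NUM_LAYERS):
            params = param_queue.get()  # waits only if loading lags execution
            activations = run_layer(activations, params)
        return activations

    print(infer([0.0] * 8))

Raising PREFETCH_DEPTH buffers a few layers ahead at the cost of more memory, mirroring the memory-delay tradeoff the abstract uses to reach near-zero delay overhead.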
Related papers
- HeadInfer: Memory-Efficient LLM Inference by Head-wise Offloading [79.38548165722229]
HEADINFER offloads the KV cache to CPU RAM while avoiding the need to fully store the KV cache for any transformer layer on the GPU.
We demonstrate HEADINFER maintains computational efficiency while significantly reducing memory footprint.
arXiv Detail & Related papers (2025-02-18T06:26:05Z)
- Memory Layers at Scale [67.00854080570979]
This work takes memory layers beyond proof-of-concept, proving their utility at contemporary scale.
On downstream tasks, language models augmented with our improved memory layer outperform dense models with more than twice the computation budget, as well as mixture-of-expert models when matched for both compute and parameters.
We provide a fully parallelizable memory layer implementation, demonstrating scaling laws with up to 128B memory parameters, pretrained to 1 trillion tokens, compared against base models with up to 8B parameters.
arXiv Detail & Related papers (2024-12-12T23:56:57Z)
- Memory-Efficient Training for Deep Speaker Embedding Learning in Speaker Verification [50.596077598766975]
We explore a memory-efficient training strategy for deep speaker embedding learning in resource-constrained scenarios.
For activations, we design two types of reversible neural networks which eliminate the need to store intermediate activations.
For optimizer states, we introduce a dynamic quantization approach that replaces the original 32-bit floating-point values with a dynamic tree-based 8-bit data type.
arXiv Detail & Related papers (2024-12-02T06:57:46Z)
- Deep Optimizer States: Towards Scalable Training of Transformer Models Using Interleaved Offloading [2.8231000588510757]
Transformers and large language models (LLMs) have seen rapid adoption in all domains.
Training transformers is very expensive and often hits a "memory wall".
We propose a novel technique to split the LLM into subgroups, whose update phase is scheduled on either the CPU or the GPU.
arXiv Detail & Related papers (2024-10-26T00:43:59Z)
- Less Memory Means smaller GPUs: Backpropagation with Compressed Activations [1.7065506903618906]
The ever-growing scale of deep neural networks (DNNs) has led to an equally rapid growth in computational resource requirements.
Many recent architectures, most prominently Large Language Models, have to be trained using supercomputers with thousands of accelerators.
With this approach we are able to reduce the peak memory consumption by 29% at the cost of a longer training schedule.
arXiv Detail & Related papers (2024-09-18T11:57:05Z)
- LSM-GNN: Large-scale Storage-based Multi-GPU GNN Training by Optimizing Data Transfer Scheme [12.64360444043247]
Graph Neural Networks (GNNs) are widely used today in recommendation systems, fraud detection, and node/link classification tasks.
To address limited memory capacities, traditional GNN training approaches use graph partitioning and sharding techniques.
We propose the Large-scale Storage-based Multi-GPU GNN framework (LSM-GNN).
LSM-GNN incorporates a hybrid eviction policy that intelligently manages cache space by using both static and dynamic node information.
arXiv Detail & Related papers (2024-07-21T20:41:39Z)
- MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning [72.80896338009579]
We find that the memory bottleneck is due to the imbalanced memory distribution in convolutional neural network (CNN) designs.
We propose a generic patch-by-patch inference scheduling, which significantly cuts down the peak memory.
We automate the process with neural architecture search to jointly optimize the neural architecture and inference scheduling, leading to MCUNetV2.
arXiv Detail & Related papers (2021-10-28T17:58:45Z)
- AxoNN: An asynchronous, message-driven parallel framework for extreme-scale deep learning [1.5301777464637454]
AxoNN is a parallel deep learning framework that exploits asynchrony and message-driven execution to schedule neural network operations on each GPU.
By using the CPU memory as a scratch space for offloading data periodically during training, AxoNN is able to reduce GPU memory consumption by four times.
arXiv Detail & Related papers (2021-10-25T14:43:36Z)
- Efficient and Generic 1D Dilated Convolution Layer for Deep Learning [52.899995651639436]
We introduce our efficient implementation of a generic 1D convolution layer covering a wide range of parameters.
It is optimized for x86 CPU architectures, in particular, for architectures containing Intel AVX-512 and AVX-512 BFloat16 instructions.
We demonstrate the performance of our optimized 1D convolution layer by utilizing it in the end-to-end neural network training with real genomics datasets.
arXiv Detail & Related papers (2021-04-16T09:54:30Z)
- DistGNN: Scalable Distributed Training for Large-Scale Graph Neural Networks [58.48833325238537]
Full-batch training of Graph Neural Networks (GNNs) to learn the structure of large graphs is a critical problem that needs to scale to hundreds of compute nodes to be feasible.
In this paper, we present DistGNN, which optimizes the well-known Deep Graph Library (DGL) for full-batch training on CPU clusters.
Our results on four common GNN benchmark datasets show up to 3.7x speed-up using a single CPU socket and up to 97x speed-up using 128 CPU sockets.
arXiv Detail & Related papers (2021-04-14T08:46:35Z)
- TFApprox: Towards a Fast Emulation of DNN Approximate Hardware Accelerators on GPU [0.4817429789586127]
Energy efficiency of hardware accelerators of deep neural networks (DNN) can be improved by introducing approximate arithmetic circuits.
A software emulation of the DNN accelerator is usually executed on a CPU or GPU.
This emulation is typically two or three orders of magnitude slower than a software DNN implementation running on a CPU or GPU without such emulation.
arXiv Detail & Related papers (2020-02-21T08:22:56Z)