G10: Enabling An Efficient Unified GPU Memory and Storage Architecture
with Smart Tensor Migrations
- URL: http://arxiv.org/abs/2310.09443v1
- Date: Fri, 13 Oct 2023 23:32:28 GMT
- Title: G10: Enabling An Efficient Unified GPU Memory and Storage Architecture
with Smart Tensor Migrations
- Authors: Haoyang Zhang, Yirui Eric Zhou, Yuqi Xue, Yiqi Liu, and Jian Huang
- Abstract summary: G10 is a unified GPU memory and storage architecture.
G10 integrates the host memory, GPU memory, and flash memory into a unified memory space.
Experiments demonstrate that G10 outperforms state-of-the-art GPU memory solutions by up to 1.75$\times$.
- Score: 5.752074124514541
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: To break the GPU memory wall for scaling deep learning workloads, a variety
of architecture and system techniques have been proposed recently. Their
typical approaches include memory extension with flash memory and direct
storage access. However, these techniques still suffer from suboptimal
performance and introduce complexity to the GPU memory management, making them
hard to meet the scalability requirement of deep learning workloads today. In
this paper, we present a unified GPU memory and storage architecture named G10
driven by the fact that the tensor behaviors of deep learning workloads are
highly predictable. G10 integrates the host memory, GPU memory, and flash
memory into a unified memory space, to scale the GPU memory capacity while
enabling transparent data migrations. Based on this unified GPU memory and
storage architecture, G10 utilizes compiler techniques to characterize the
tensor behaviors in deep learning workloads. Therefore, it can schedule data
migrations in advance by considering the available bandwidth of flash memory
and host memory. The cooperative mechanism between deep learning compilers and
the unified memory architecture enables G10 to hide data transfer overheads in
a transparent manner. We implement G10 based on an open-source GPU simulator.
Our experiments demonstrate that G10 outperforms state-of-the-art GPU memory
solutions by up to 1.75$\times$, without code modifications to deep learning
workloads. With the smart data migration mechanism, G10 can reach 90.3\% of the
performance of the ideal case assuming unlimited GPU memory.
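As a rough illustration of the ahead-of-time migration scheduling the abstract describes, the Python sketch below plans prefetches so that flash- or host-resident tensors arrive on the GPU before their predicted next use, and flags transfers that cannot be fully hidden. The tensor names, sizes, and bandwidth figures are illustrative assumptions, not values from the paper.

from dataclasses import dataclass

@dataclass
class Tensor:
    name: str
    size_mb: float
    next_use_ms: float   # predicted time of the next kernel that reads the tensor
    tier: str            # current location: "gpu", "host", or "flash"

# Assumed sustained migration bandwidths in MB/ms (i.e. GB/s) per backing tier.
BANDWIDTH_MB_PER_MS = {"host": 25.0, "flash": 6.0}

def schedule_prefetches(tensors, now_ms=0.0):
    """Plan (tensor, start time, status) so each off-GPU tensor finishes
    transferring just before its predicted next use."""
    plan = []
    for t in tensors:
        if t.tier == "gpu":
            continue                       # already resident, nothing to migrate
        transfer_ms = t.size_mb / BANDWIDTH_MB_PER_MS[t.tier]
        start_ms = max(now_ms, t.next_use_ms - transfer_ms)
        # "stall" flags a transfer that cannot be fully overlapped with compute.
        status = "hidden" if start_ms + transfer_ms <= t.next_use_ms else "stall"
        plan.append((t.name, start_ms, status))
    return sorted(plan, key=lambda p: p[1])

demo = [
    Tensor("conv3.weight", 512.0, next_use_ms=40.0, tier="flash"),
    Tensor("fc1.activation", 2048.0, next_use_ms=120.0, tier="host"),
    Tensor("conv1.weight", 64.0, next_use_ms=5.0, tier="gpu"),
]
for name, start, status in schedule_prefetches(demo):
    print(f"prefetch {name} at t={start:.1f} ms ({status})")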
Related papers
- Mind the Memory Gap: Unveiling GPU Bottlenecks in Large-Batch LLM Inference [4.497936996651617]
Large language models have been widely adopted across different tasks, but their auto-regressive nature often leads to inefficient resource utilization during inference.
In this paper, through an in-depth GPU-level analysis, we reveal that large-batch inference remains memory-bound, with most GPU compute capabilities underutilized.
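A back-of-the-envelope roofline check illustrates why decode-time inference can remain memory-bound even at large batch sizes. The hardware numbers and the hypothetical 70B-parameter model below are assumptions for illustration, not figures from the paper.

# Rough roofline-style check of whether a decode step is memory-bound.
# Hardware numbers are illustrative assumptions (roughly A100-class);
# KV-cache traffic is ignored, which only makes real workloads more memory-bound.
PEAK_FLOPS = 312e12              # assumed dense FP16 tensor-core throughput (FLOP/s)
PEAK_MEM_BW = 2e12               # assumed HBM bandwidth (B/s)
RIDGE = PEAK_FLOPS / PEAK_MEM_BW # FLOP per byte at which compute saturates

def decode_intensity(n_params: float, batch: int) -> float:
    """FLOPs per byte for one auto-regressive decode step of a dense model:
    ~2*N FLOPs per generated token, while all N parameters (2 bytes each in
    FP16) must be streamed from memory once per step regardless of batch."""
    flops = 2.0 * n_params * batch
    bytes_moved = 2.0 * n_params
    return flops / bytes_moved

for batch in (1, 8, 64, 256):
    ai = decode_intensity(70e9, batch)   # hypothetical 70B-parameter model
    verdict = "memory-bound" if ai < RIDGE else "compute-bound"
    print(f"batch={batch:4d}  intensity={ai:6.1f} FLOP/B  (ridge={RIDGE:.0f})  -> {verdict}")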
arXiv Detail & Related papers (2025-03-11T11:21:35Z) - Memory Layers at Scale [67.00854080570979]
This work takes memory layers beyond proof-of-concept, proving their utility at contemporary scale.
On downstream tasks, language models augmented with our improved memory layer outperform dense models with more than twice the budget, as well as mixture-of-expert models when matched for both compute and parameters.
We provide a fully parallelizable memory layer implementation, demonstrating scaling laws with up to 128B memory parameters, pretrained to 1 trillion tokens, comparing against base models with up to 8B parameters.
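For readers unfamiliar with memory layers, here is a minimal PyTorch sketch of the general idea: a large trainable key-value table queried with a sparse top-k lookup, so only a few slots contribute to each token. The slot count, the dense scoring over all slots, and the residual combination are simplifying assumptions; the paper's actual lookup and architecture may differ.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMemoryLayer(nn.Module):
    def __init__(self, d_model=512, n_slots=16384, k=32):
        super().__init__()
        self.query_proj = nn.Linear(d_model, d_model)
        self.keys = nn.Parameter(torch.randn(n_slots, d_model) * 0.02)
        self.values = nn.Parameter(torch.randn(n_slots, d_model) * 0.02)
        self.k = k

    def forward(self, x):                          # x: (batch, seq, d_model)
        q = self.query_proj(x)
        scores = q @ self.keys.t()                 # (batch, seq, n_slots)
        top_scores, top_idx = scores.topk(self.k, dim=-1)
        weights = F.softmax(top_scores, dim=-1)    # (batch, seq, k)
        gathered = self.values[top_idx]            # (batch, seq, k, d_model)
        return x + (weights.unsqueeze(-1) * gathered).sum(dim=-2)

out = SimpleMemoryLayer()(torch.randn(2, 16, 512))
print(out.shape)   # torch.Size([2, 16, 512])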
arXiv Detail & Related papers (2024-12-12T23:56:57Z) - APOLLO: SGD-like Memory, AdamW-level Performance [61.53444035835778]
Large language models (LLMs) are notoriously memory-intensive during training.
Various memory-efficient optimizers have been proposed to reduce memory usage.
They face critical challenges: (i) costly SVD operations; (ii) significant performance trade-offs compared to AdamW; and (iii) still substantial memory overhead to maintain competitive performance.
arXiv Detail & Related papers (2024-12-06T18:55:34Z) - Memory-Efficient Training for Deep Speaker Embedding Learning in Speaker Verification [50.596077598766975]
We explore a memory-efficient training strategy for deep speaker embedding learning in resource-constrained scenarios.
For activations, we design two types of reversible neural networks which eliminate the need to store intermediate activations.
For states, we introduce a dynamic quantization approach that replaces the original 32-bit floating-point values with a dynamic tree-based 8-bit data type.
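The activation-side idea can be illustrated with a minimal RevNet-style reversible block in PyTorch: because inputs are exactly recoverable from outputs, intermediate activations can be recomputed during backpropagation instead of stored. The block structure below is a generic sketch, not the architecture from the paper.

import torch
import torch.nn as nn

class ReversibleBlock(nn.Module):
    """Partitioned residual block whose inputs are exactly recoverable from
    its outputs, so activations can be recomputed instead of cached."""
    def __init__(self, dim):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.g = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x1, x2):
        y1 = x1 + self.f(x2)
        y2 = x2 + self.g(y1)
        return y1, y2

    def inverse(self, y1, y2):
        x2 = y2 - self.g(y1)     # recompute the inputs from the outputs
        x1 = y1 - self.f(x2)
        return x1, x2

block = ReversibleBlock(64)
a, b = torch.randn(8, 64), torch.randn(8, 64)
with torch.no_grad():
    y1, y2 = block(a, b)
    r1, r2 = block.inverse(y1, y2)
print(torch.allclose(a, r1, atol=1e-5), torch.allclose(b, r2, atol=1e-5))  # True True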
arXiv Detail & Related papers (2024-12-02T06:57:46Z) - Deep Optimizer States: Towards Scalable Training of Transformer Models Using Interleaved Offloading [2.8231000588510757]
Transformers and large language models (LLMs) have seen rapid adoption in all domains.
Training of transformers is very expensive and often hits a "memory wall".
We propose a novel technique to split the LLM into subgroups, whose update phase is scheduled on either the CPU or the GPU.
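A minimal sketch of the subgroup idea, assuming a simple momentum optimizer rather than the paper's actual update rule and scheduler: parameters are partitioned into subgroups, and each subgroup keeps its optimizer state on, and runs its update on, a chosen device (GPU for hot groups, CPU for cold ones).

import torch

def make_subgroups(params, n_groups=2):
    """Round-robin parameters into subgroups (the grouping policy is an
    assumption for illustration)."""
    groups = [[] for _ in range(n_groups)]
    for i, p in enumerate(params):
        groups[i % n_groups].append(p)
    return groups

class SubgroupSGDM:
    """Momentum SGD whose state for one subgroup lives on a chosen device,
    so a 'cold' subgroup can keep its state and run its update on the CPU."""
    def __init__(self, group, device, lr=1e-3, beta=0.9):
        self.group, self.device, self.lr, self.beta = group, device, lr, beta
        self.momentum = [torch.zeros_like(p, device=device) for p in group]

    def step(self):
        for p, m in zip(self.group, self.momentum):
            if p.grad is None:
                continue
            g = p.grad.detach().to(self.device, non_blocking=True)
            m.mul_(self.beta).add_(g)   # optimizer state updated in place on self.device
            p.data.add_((-self.lr * m).to(p.device, non_blocking=True))

model = torch.nn.Sequential(torch.nn.Linear(128, 128), torch.nn.Linear(128, 10))
model(torch.randn(32, 128)).sum().backward()

gpu = torch.device("cuda" if torch.cuda.is_available() else "cpu")
hot, cold = make_subgroups(list(model.parameters()))
for opt in (SubgroupSGDM(hot, gpu), SubgroupSGDM(cold, torch.device("cpu"))):
    opt.step()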
arXiv Detail & Related papers (2024-10-26T00:43:59Z) - vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving [53.972175896814505]
Large Language Models (LLMs) are widely used across various domains, processing millions of daily requests.
arXiv Detail & Related papers (2024-07-22T14:37:58Z) - B'MOJO: Hybrid State Space Realizations of Foundation Models with Eidetic and Fading Memory [91.81390121042192]
We develop a class of models called B'MOJO to seamlessly combine eidetic and fading memory within a composable module.
B'MOJO's ability to modulate eidetic and fading memory results in better inference on longer sequences tested up to 32K tokens.
arXiv Detail & Related papers (2024-07-08T18:41:01Z) - Efficient Video Object Segmentation via Modulated Cross-Attention Memory [123.12273176475863]
We propose a transformer-based approach, named MAVOS, to model temporal smoothness without requiring frequent memory expansion.
Our MAVOS achieves a J&F score of 63.3% while operating at 37 frames per second (FPS) on a single V100 GPU.
arXiv Detail & Related papers (2024-03-26T17:59:58Z) - GEAR: A GPU-Centric Experience Replay System for Large Reinforcement
Learning Models [32.23853007467266]
GEAR is designed to perform scalable reinforcement learning (RL) with large sequence models (such as transformers).
It is equipped with GPU kernels capable of collecting trajectories using zero-copy access to host memory, along with remote direct memory access (RDMA) over InfiniBand.
GEAR can achieve performance levels up to 6x greater than Reverb when training state-of-the-art large RL models.
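Zero-copy access to host memory can be sketched with Numba's mapped (pinned) arrays, where a GPU kernel reads trajectory data that still resides in host DRAM over PCIe. This is a generic illustration that requires a CUDA-capable GPU; the buffer shapes and the reduction are assumptions, not GEAR's actual kernels.

import numpy as np
from numba import cuda

@cuda.jit
def sum_trajectories(traj, out):
    i = cuda.grid(1)
    if i < traj.shape[0]:
        acc = 0.0
        for t in range(traj.shape[1]):
            acc += traj[i, t]           # kernel reads host DRAM directly (zero-copy)
        out[i] = acc

n_traj, horizon = 1024, 256
traj = cuda.mapped_array((n_traj, horizon), dtype=np.float32)  # pinned + mapped host buffer
traj[:] = np.random.rand(n_traj, horizon).astype(np.float32)   # filled in place by CPU-side actors
out = cuda.device_array(n_traj, dtype=np.float32)

threads = 128
blocks = (n_traj + threads - 1) // threads
sum_trajectories[blocks, threads](traj, out)
print(out.copy_to_host()[:4])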
arXiv Detail & Related papers (2023-10-08T15:39:43Z) - XMem: Long-Term Video Object Segmentation with an Atkinson-Shiffrin
Memory Model [137.50614198301733]
We present XMem, a video object segmentation architecture for long videos with unified feature memory stores.
We develop an architecture that incorporates multiple independent yet deeply-connected feature memory stores.
XMem greatly exceeds state-of-the-art performance on long-video datasets.
arXiv Detail & Related papers (2022-07-14T17:59:37Z) - Hierarchical Memory Matching Network for Video Object Segmentation [38.24999776705497]
We propose two advanced memory read modules that enable us to perform memory reading at multiple scales while exploiting temporal smoothness.
We first propose a guided memory matching module that replaces the non-local dense memory read, commonly adopted in previous memory-based methods.
We introduce a hierarchical memory matching scheme and propose a top-k guided memory matching module in which memory read on a fine-scale is guided by that on a coarse-scale.
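A minimal PyTorch sketch of top-k guided memory reading: affinities computed at a coarse scale select the k most relevant memory positions, and the fine-scale read attends only to those positions. The shapes, the shared position count across scales, and the value of k are simplifying assumptions, not the paper's exact formulation.

import torch
import torch.nn.functional as F

def topk_guided_read(q_coarse, m_coarse, q_fine, m_fine, v_fine, k=16):
    # q_coarse, q_fine: (Nq, C) query features; m_*, v_fine: (Nm, C) memory
    # features (assume the same number of positions at both scales; real
    # methods upsample the selected indices between scales).
    coarse_aff = q_coarse @ m_coarse.t()              # (Nq, Nm) coarse affinities
    _, idx = coarse_aff.topk(k, dim=-1)               # (Nq, k) guidance indices
    fine_keys = m_fine[idx]                           # (Nq, k, C)
    fine_vals = v_fine[idx]                           # (Nq, k, C)
    fine_aff = torch.einsum("qc,qkc->qk", q_fine, fine_keys)
    w = F.softmax(fine_aff, dim=-1)                   # (Nq, k)
    return torch.einsum("qk,qkc->qc", w, fine_vals)   # (Nq, C) read-out

out = topk_guided_read(torch.randn(100, 64), torch.randn(500, 64),
                       torch.randn(100, 64), torch.randn(500, 64),
                       torch.randn(500, 64))
print(out.shape)  # torch.Size([100, 64])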
arXiv Detail & Related papers (2021-09-23T14:36:43Z) - TENSILE: A Tensor granularity dynamic GPU memory scheduler method
towards multiple dynamic workloads system [9.86589655261934]
TENSILE is a method for managing GPU memory at tensor granularity to reduce peak GPU memory usage.
We implement TENSILE in our own deep learning framework and evaluate its performance.
arXiv Detail & Related papers (2021-05-27T17:46:16Z) - Large Graph Convolutional Network Training with GPU-Oriented Data
Communication Architecture [19.2129567657739]
Graph Convolutional Networks (GCNs) are increasingly adopted in large-scale graph-based recommender systems.
Current GCN training systems keep the feature table in host memory and rely on the CPU to collect sparse features.
This approach, however, puts tremendous pressure on host memory bandwidth and the CPU.
We propose a novel GPU-oriented data communication approach for GCN training, where GPU threads directly access sparse features in host memory.
arXiv Detail & Related papers (2021-03-04T21:00:17Z) - Video Object Segmentation with Episodic Graph Memory Networks [198.74780033475724]
A graph memory network is developed to address the novel idea of "learning to update the segmentation model".
We exploit an episodic memory network, organized as a fully connected graph, to store frames as nodes and capture cross-frame correlations by edges.
The proposed graph memory network yields a neat yet principled framework, which generalizes well to both one-shot and zero-shot video object segmentation tasks.
arXiv Detail & Related papers (2020-07-14T13:19:19Z) - DMV: Visual Object Tracking via Part-level Dense Memory and Voting-based
Retrieval [61.366644088881735]
We propose a novel memory-based tracker via part-level dense memory and voting-based retrieval, called DMV.
We also propose a novel voting mechanism for the memory reading to filter out unreliable information in the memory.
arXiv Detail & Related papers (2020-03-20T10:05:30Z)