Training Personalized Recommendation Systems from (GPU) Scratch: Look Forward not Backwards
- URL: http://arxiv.org/abs/2205.04702v1
- Date: Tue, 10 May 2022 07:05:20 GMT
- Title: Training Personalized Recommendation Systems from (GPU) Scratch: Look Forward not Backwards
- Authors: Youngeun Kwon, Minsoo Rhu
- Abstract summary: Personalized recommendation models (RecSys) are one of the most popular machine learning workloads serviced by hyperscalers.
A critical challenge of training RecSys is its high memory capacity requirement, with model sizes reaching hundreds of GBs to TBs.
In RecSys, the so-called embedding layers account for the majority of memory usage, so current systems employ a hybrid CPU-GPU design in which the large CPU memory stores the memory-hungry embedding layers.
- Score: 1.7733623930581417
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Personalized recommendation models (RecSys) are one of the most popular
machine learning workloads serviced by hyperscalers. A critical challenge of
training RecSys is its high memory capacity requirement, with model sizes reaching
hundreds of GBs to TBs. In RecSys, the so-called embedding layers account for the
majority of memory usage, so current systems employ a hybrid CPU-GPU design in
which the large CPU memory stores the memory-hungry embedding layers.
Unfortunately, training embeddings involves several memory-bandwidth-intensive
operations that are at odds with the slow CPU memory, causing performance
overheads. Prior work proposed caching frequently accessed embeddings inside GPU
memory as a means to filter down the embedding-layer traffic to CPU memory, but
this paper observes several limitations of such a cache design. In this work, we
present a fundamentally different approach to designing embedding caches for
RecSys. Our proposed ScratchPipe architecture utilizes unique properties of RecSys
training to develop an embedding cache that sees not only past but also "future"
cache accesses. ScratchPipe exploits this property to guarantee that the active
working set of embedding layers can "always" be captured inside our proposed cache
design, enabling embedding-layer training to be conducted at GPU memory speed.
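As a rough illustration of the look-ahead idea described in the abstract, the sketch below shows a hypothetical GPU-resident embedding cache that inspects the mini-batches already queued by the input pipeline and prefetches the embedding rows they will touch before the corresponding training steps run. The class name, method names, and the eviction heuristic are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch (not the paper's implementation): a GPU-resident embedding
# cache that "looks forward" at queued mini-batches and stages the rows that
# near-future training steps will access. The dict stands in for GPU memory.
from collections import OrderedDict

class LookaheadEmbeddingCache:
    def __init__(self, capacity_rows, cpu_table):
        self.capacity = capacity_rows   # number of embedding rows the GPU cache can hold
        self.cpu_table = cpu_table      # dict-like: row id -> embedding vector in CPU memory
        self.gpu_cache = OrderedDict()  # row id -> embedding vector "resident" in GPU memory

    def prefetch(self, upcoming_batches):
        """Scan the already-queued future mini-batches and stage the rows they touch."""
        future_ids = [i for batch in upcoming_batches for i in batch]
        future_set = set(future_ids)
        for row_id in future_ids:
            if row_id in self.gpu_cache:
                self.gpu_cache.move_to_end(row_id)  # keep soon-needed rows hot
                continue
            if len(self.gpu_cache) >= self.capacity:
                # Prefer evicting a row no queued batch will touch; otherwise fall back to LRU.
                victim = next((k for k in self.gpu_cache if k not in future_set),
                              next(iter(self.gpu_cache)))
                del self.gpu_cache[victim]
            self.gpu_cache[row_id] = self.cpu_table[row_id]  # stands in for a CPU-to-GPU copy

    def lookup(self, batch_ids):
        """Serve the current batch; after prefetching, these should all be cache hits."""
        return [self.gpu_cache.get(i, self.cpu_table[i]) for i in batch_ids]

# Example: the input pipeline exposes the next two queued batches ahead of time.
cache = LookaheadEmbeddingCache(capacity_rows=3,
                                cpu_table={i: [float(i)] for i in range(10)})
cache.prefetch(upcoming_batches=[[1, 3], [3, 7]])
print(cache.lookup([1, 3]))  # both rows hit in the GPU-resident cache
```

Because the input pipeline already knows which embedding rows the queued batches will access, the cache can be filled before those batches are trained on, which is the "look forward" property the abstract refers to; a real system would additionally overlap the CPU-to-GPU copies with ongoing training steps.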
Related papers
- Compute Or Load KV Cache? Why Not Both? [6.982874528357836]
Cake is a novel KV cache loader that employs a bidirectional, parallelized KV cache generation strategy.
It simultaneously and dynamically loads saved KV cache from prefix cache locations and computes KV cache on the local GPU.
It offers up to a 68.1% Time To First Token (TTFT) reduction compared with a compute-only method and a 94.6% TTFT reduction compared with an I/O-only method.
arXiv Detail & Related papers (2024-10-04T01:11:09Z)
- Efficient Inference of Vision Instruction-Following Models with Elastic Cache [76.44955111634545]
We introduce Elastic Cache, a novel strategy for efficient deployment of instruction-following large vision-language models.
We propose an importance-driven cache merging strategy to prune redundant caches.
For instruction encoding, we use frequency to evaluate the importance of caches.
Results on a range of LVLMs demonstrate that Elastic Cache not only boosts efficiency but also notably outperforms existing pruning methods in language generation.
arXiv Detail & Related papers (2024-07-25T15:29:05Z)
- PyramidInfer: Pyramid KV Cache Compression for High-throughput LLM Inference [57.53291046180288]
Large Language Models (LLMs) have shown remarkable comprehension abilities but face challenges in GPU memory usage during inference.
We propose PyramidInfer, a method that compresses the KV cache by retaining crucial context layer by layer.
PyramidInfer achieves 2.2x higher throughput than Accelerate while reducing KV cache GPU memory by over 54%.
arXiv Detail & Related papers (2024-05-21T06:46:37Z)
- CORM: Cache Optimization with Recent Message for Large Language Model Inference [57.109354287786154]
We introduce an innovative method for optimizing the KV cache that considerably reduces its memory footprint.
CORM, a KV cache eviction policy, dynamically retains essential key-value pairs for inference without the need for model fine-tuning.
Our validation shows that CORM reduces the inference memory usage of KV cache by up to 70% with negligible performance degradation across six tasks in LongBench.
arXiv Detail & Related papers (2024-04-24T16:11:54Z)
- Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference [78.65321721142624]
We focus on a memory bottleneck imposed by the key-value (KV) cache.
Existing KV cache methods approach this problem by pruning or evicting large swaths of relatively less important KV pairs.
We propose LESS, a simple integration of a constant-sized cache with eviction-based cache methods.
arXiv Detail & Related papers (2024-02-14T18:54:56Z)
- Optimizing L1 cache for embedded systems through grammatical evolution [1.9371782627708491]
Grammatical Evolution (GE) is able to efficiently find the best cache configurations for a given set of benchmark applications.
Our proposal is able to find cache configurations that obtain an average improvement of 62% versus a real-world baseline configuration.
arXiv Detail & Related papers (2023-03-06T18:10:00Z)
- A Model or 603 Exemplars: Towards Memory-Efficient Class-Incremental Learning [56.450090618578]
Class-Incremental Learning (CIL) aims to train a model that continually learns new classes under a limited memory budget.
We show that when the model size is counted in the total budget and methods are compared with aligned memory sizes, saving models does not consistently work.
We propose a simple yet effective baseline, denoted as MEMO for Memory-efficient Expandable MOdel.
arXiv Detail & Related papers (2022-05-26T08:24:01Z)
- Recurrent Dynamic Embedding for Video Object Segmentation [54.52527157232795]
We propose a Recurrent Dynamic Embedding (RDE) to build a memory bank of constant size.
We propose an unbiased guidance loss during the training stage, which makes SAM more robust in long videos.
We also design a novel self-correction strategy so that the network can repair the embeddings of masks with different qualities in the memory bank.
arXiv Detail & Related papers (2022-05-08T02:24:43Z)
- High-Performance Training by Exploiting Hot-Embeddings in Recommendation Systems [2.708848417398231]
Recommendation models are commonly used learning models that suggest relevant items to a user for e-commerce and online advertisement-based applications.
These models use massive embedding tables to store numerical representations of item and user categorical variables.
Due to conflicting compute and memory requirements, the training process for recommendation models is divided across the CPU and GPU.
This paper leverages skewed embedding table accesses to use GPU resources efficiently during training.
arXiv Detail & Related papers (2021-03-01T01:43:26Z)
- CNN with large memory layers [2.368995563245609]
This work is centred around the recently proposed product key memory structure, implemented for a number of computer vision applications.
The memory structure can be regarded as a simple computation primitive suitable for augmenting nearly all neural network architectures.
arXiv Detail & Related papers (2021-01-27T20:58:20Z)
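For context on the product key memory structure mentioned in the last entry, below is a minimal sketch of how a product-key lookup is typically organized: the query is split into two halves, each half is scored against a small sub-key set, and the top candidates from the Cartesian product of the two sub-key sets select which values to retrieve. The function name, shapes, and softmax weighting are illustrative assumptions, not code from the cited paper.

```python
# Minimal sketch of a product-key memory lookup (assumed details). Full keys are
# the Cartesian product of two small sub-key sets, so searching |K1| * |K2| keys
# only needs two small top-k searches plus a k-by-k merge.
import numpy as np

def product_key_lookup(query, subkeys1, subkeys2, values, k=4):
    """query: (d,); subkeys1, subkeys2: (n, d/2); values: (n*n, d_v)."""
    q1, q2 = np.split(query, 2)                   # split the query into two halves
    s1 = subkeys1 @ q1                            # scores against the first sub-key set
    s2 = subkeys2 @ q2                            # scores against the second sub-key set
    top1 = np.argsort(s1)[-k:]                    # k best half-keys on each side
    top2 = np.argsort(s2)[-k:]
    cand = s1[top1][:, None] + s2[top2][None, :]  # scores of the k*k candidate full keys
    flat = np.argsort(cand, axis=None)[-k:]       # overall top-k among the candidates
    i, j = np.unravel_index(flat, cand.shape)
    idx = top1[i] * subkeys2.shape[0] + top2[j]   # row-major index into the n*n value table
    w = np.exp(cand[i, j] - cand[i, j].max())
    w /= w.sum()                                  # softmax weights over the selected keys
    return w @ values[idx]                        # weighted sum of the k retrieved values

# Tiny usage example with random data (shapes chosen only for illustration).
rng = np.random.default_rng(0)
n, d, d_v = 16, 8, 4
out = product_key_lookup(rng.normal(size=d),
                         rng.normal(size=(n, d // 2)),
                         rng.normal(size=(n, d // 2)),
                         rng.normal(size=(n * n, d_v)))
print(out.shape)  # (4,)
```

The point of the factorization is that the expensive nearest-key search over n*n full keys is replaced by two searches over n sub-keys each, which is what makes very large memory layers practical as a drop-in network component.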