Training Personalized Recommendation Systems from (GPU) Scratch: Look
Forward not Backwards
- URL: http://arxiv.org/abs/2205.04702v1
- Date: Tue, 10 May 2022 07:05:20 GMT
- Title: Training Personalized Recommendation Systems from (GPU) Scratch: Look
Forward not Backwards
- Authors: Youngeun Kwon, Minsoo Rhu
- Abstract summary: Personalized recommendation models (RecSys) are one of the most popular machine learning workloads serviced by hyperscalers.
A critical challenge of training RecSys is its high memory capacity requirement, reaching hundreds of GBs to TBs of model size.
In RecSys, the so-called embedding layers account for the majority of memory usage, so current systems employ a hybrid CPU-GPU design in which the large CPU memory stores the memory-hungry embedding layers.
- Score: 1.7733623930581417
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Personalized recommendation models (RecSys) are one of the most popular
machine learning workloads serviced by hyperscalers. A critical challenge of
training RecSys is its high memory capacity requirement, reaching hundreds of
GBs to TBs of model size. In RecSys, the so-called embedding layers account for
the majority of memory usage, so current systems employ a hybrid CPU-GPU design
in which the large CPU memory stores the memory-hungry embedding layers.
Unfortunately, training embeddings involves several memory-bandwidth-intensive
operations that are at odds with the slow CPU memory, causing performance
overheads. Prior work proposed caching frequently accessed embeddings inside
GPU memory as a means to filter down the embedding-layer traffic to CPU memory,
but this paper observes several limitations of such a cache design. In this
work, we present a fundamentally different approach to designing embedding
caches for RecSys. Our proposed ScratchPipe architecture utilizes unique
properties of RecSys training to develop an embedding cache that sees not only
past but also "future" cache accesses. ScratchPipe exploits this property to
guarantee that the active working set of embedding layers can "always" be
captured inside our proposed cache design, enabling embedding-layer training
to be conducted at GPU memory speed.
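The property the abstract exploits is that, in RecSys training, the sparse feature indices of upcoming mini-batches are already materialized in the input pipeline, so future embedding accesses can be inspected before they are executed. The snippet below is a minimal, hypothetical illustration of that lookahead idea, not ScratchPipe's actual implementation: a prefetcher scans the next few batches and stages the required embedding rows from a large CPU-resident table into a small cache that stands in for GPU memory, so every lookup in the training step is a cache hit.

```python
import numpy as np

EMB_DIM = 16
NUM_ROWS = 100_000       # embedding rows living in (slow) CPU memory
CACHE_CAPACITY = 4_096   # rows the GPU-resident cache can hold (illustrative)
LOOKAHEAD = 4            # future batches the prefetcher may inspect (illustrative)

cpu_table = np.random.randn(NUM_ROWS, EMB_DIM).astype(np.float32)
gpu_cache = {}           # row id -> embedding vector; stands in for GPU memory

def prefetch(batches, step):
    """Stage every row the next LOOKAHEAD batches will touch into the cache."""
    future_ids = set()
    for batch in batches[step:step + LOOKAHEAD]:
        future_ids.update(batch.tolist())
    for rid in list(gpu_cache):              # evict rows no upcoming batch needs
        if rid not in future_ids:
            del gpu_cache[rid]
    for rid in future_ids:                   # pull in the missing rows
        if rid not in gpu_cache and len(gpu_cache) < CACHE_CAPACITY:
            gpu_cache[rid] = cpu_table[rid]

def train_step(batch):
    """All lookups hit the cache, i.e. run at 'GPU memory speed' in this toy."""
    return np.stack([gpu_cache[rid] for rid in batch.tolist()]).mean()

# Assumes the lookahead working set fits in the cache; a real system would
# have to enforce this, e.g. by throttling the lookahead window.
batches = [np.random.randint(0, NUM_ROWS, size=256) for _ in range(32)]
for step, batch in enumerate(batches):
    prefetch(batches, step)                  # look forward, not backwards
    train_step(batch)
```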
Related papers
- Memory Layers at Scale [67.00854080570979]
This work takes memory layers beyond proof-of-concept, proving their utility at contemporary scale.
On downstream tasks, language models augmented with our improved memory layer outperform dense models with more than twice the compute budget, as well as mixture-of-experts models when matched for both compute and parameters.
We provide a fully parallelizable memory layer implementation, demonstrating scaling laws with up to 128B memory parameters, pretrained to 1 trillion tokens, compared against base models with up to 8B parameters.
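For context, a memory layer is essentially a very large, sparsely accessed key-value table: a query retrieves its top-k best-matching learned keys and returns a weighted sum of the corresponding values, so parameter count grows with the table while per-token compute stays near O(k·d). The numpy sketch below shows only that generic lookup; the paper's product-key factorization, training recipe, and parallel implementation are not reproduced here.

```python
import numpy as np

def memory_layer(x, keys, values, k=4):
    """Generic key-value memory lookup: x -> top-k keys -> weighted sum of values."""
    scores = keys @ x                        # similarity to every memory slot
    topk = np.argpartition(scores, -k)[-k:]  # k best-matching slots
    w = np.exp(scores[topk] - scores[topk].max())
    w /= w.sum()                             # softmax over the selected slots only
    return w @ values[topk]

rng = np.random.default_rng(0)
d, n_slots = 64, 65_536                      # toy sizes; the paper scales to 128B params
keys = rng.standard_normal((n_slots, d), dtype=np.float32)
values = rng.standard_normal((n_slots, d), dtype=np.float32)
out = memory_layer(rng.standard_normal(d, dtype=np.float32), keys, values)
print(out.shape)                             # (64,)
```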
arXiv Detail & Related papers (2024-12-12T23:56:57Z)
- XKV: Personalized KV Cache Memory Reduction for Long-Context LLM Inference [9.65524177141491]
Large Language Model (LLM) inference generates output tokens one by one, leading to many redundant computations.
The KV-Cache framework makes a compromise between time and space complexity.
Existing studies reduce memory consumption by evicting the portion of cached data that has less impact on inference accuracy.
We show that customizing the cache size for each layer in a personalized manner yields a significant memory reduction.
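The core idea, per-layer ("personalized") cache budgets instead of one uniform size, can be pictured with a simple allocator. The function below is a hypothetical stand-in for XKV's analysis: given per-layer importance scores, however they are measured, it splits a global KV-cache token budget proportionally, with a small floor per layer.

```python
import numpy as np

def personalize_cache_sizes(layer_scores, total_budget, min_tokens=32):
    """Split a global KV-cache token budget across layers in proportion to
    per-layer importance scores (illustrative; not XKV's actual algorithm)."""
    scores = np.asarray(layer_scores, dtype=np.float64)
    raw = scores / scores.sum() * total_budget
    sizes = np.maximum(raw.astype(int), min_tokens)
    # A real allocator would renormalize so the per-layer floors never push the
    # total above the budget; omitted here for brevity.
    return sizes

# Example: layers measured as less sensitive receive smaller caches.
print(personalize_cache_sizes([1.0, 0.9, 0.7, 0.4, 0.3, 0.2], total_budget=4096))
```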
arXiv Detail & Related papers (2024-12-08T11:32:08Z)
- Memory-Efficient Training for Deep Speaker Embedding Learning in Speaker Verification [50.596077598766975]
We explore a memory-efficient training strategy for deep speaker embedding learning in resource-constrained scenarios.
For activations, we design two types of reversible neural networks which eliminate the need to store intermediate activations.
For states, we introduce a dynamic quantization approach that replaces the original 32-bit floating-point values with a dynamic tree-based 8-bit data type.
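The memory saving from reversibility comes from the fact that a reversible block's inputs can be reconstructed exactly from its outputs, so intermediate activations need not be kept for backpropagation. Below is a minimal additive-coupling (RevNet-style) block in numpy as a generic illustration; the two reversible architectures designed in the paper and its dynamic tree-based 8-bit quantization of states are not reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W_f = rng.standard_normal((d, d)) * 0.1
W_g = rng.standard_normal((d, d)) * 0.1
F = lambda h: np.tanh(h @ W_f)    # any sub-networks work; these are toy ones
G = lambda h: np.tanh(h @ W_g)

def rev_forward(x1, x2):
    """Additive-coupling reversible block: the outputs fully determine the inputs."""
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def rev_inverse(y1, y2):
    """Recompute the inputs from the outputs, so no activations are stored."""
    x2 = y2 - G(y1)
    x1 = y1 - F(x2)
    return x1, x2

x1, x2 = rng.standard_normal(d), rng.standard_normal(d)
y1, y2 = rev_forward(x1, x2)
r1, r2 = rev_inverse(y1, y2)
print(np.allclose(x1, r1), np.allclose(x2, r2))   # True True
```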
arXiv Detail & Related papers (2024-12-02T06:57:46Z)
- InstCache: A Predictive Cache for LLM Serving [9.878166964839512]
We propose to predict user instructions with an instruction-aligned LLM and store them in a predictive cache, called InstCache.
Experimental results show that InstCache can achieve up to 51.34% hit rate on LMSys dataset, which corresponds to a 2x speedup, at a memory cost of only 4.5GB.
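A predictive cache of this kind can be pictured as an exact-match table populated offline with instructions the LLM itself deems likely, consulted before any decoding happens. The sketch below uses made-up helpers (predict_instructions, answer) purely for illustration and is not the paper's system.

```python
def predict_instructions(llm, n):
    """Stand-in for offline enumeration of likely user instructions by an LLM."""
    return [f"predicted instruction {i}" for i in range(n)]

def answer(llm, instruction):
    """Stand-in for a real (expensive) LLM call."""
    return f"response to: {instruction}"

class InstCache:
    def __init__(self, llm, n_predicted=1000):
        self.llm = llm
        # Pre-populated offline: predicted instruction -> pre-computed response.
        self.table = {inst: answer(llm, inst)
                      for inst in predict_instructions(llm, n_predicted)}
        self.hits = self.misses = 0

    def serve(self, instruction):
        if instruction in self.table:   # hit: skip decoding entirely
            self.hits += 1
            return self.table[instruction]
        self.misses += 1                # miss: fall back to the model
        return answer(self.llm, instruction)

cache = InstCache(llm=None)
print(cache.serve("predicted instruction 7"))   # served from the cache
print(cache.serve("something unexpected"))      # falls back to the LLM
```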
arXiv Detail & Related papers (2024-11-21T03:52:41Z)
- ProMoE: Fast MoE-based LLM Serving using Proactive Caching [4.4026892123375605]
We introduce ProMoE, a novel proactive caching system that utilizes intermediate results to predict subsequent expert usage.
ProMoE achieves an average speedup of 2.20x (up to 3.21x) and 2.07x (up to 5.02x) in the prefill and decode stages, respectively.
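The proactive part can be sketched as a predictor plus a small cache of expert weights that is warmed before the next MoE layer executes, taking expert loads off the critical path. Everything below (the predictor, the cache API, the expert ids) is hypothetical and only illustrates the caching pattern described in the summary.

```python
from collections import OrderedDict

class ExpertCache:
    """Tiny LRU cache of expert weights standing in for GPU memory."""
    def __init__(self, capacity, load_expert):
        self.capacity = capacity
        self.load_expert = load_expert          # fetches weights from CPU/SSD
        self.cache = OrderedDict()              # expert id -> weights

    def get(self, eid):
        if eid in self.cache:                   # hit: nothing to load
            self.cache.move_to_end(eid)
            return self.cache[eid]
        if len(self.cache) >= self.capacity:
            self.cache.popitem(last=False)      # evict least recently used
        self.cache[eid] = self.load_expert(eid)
        return self.cache[eid]

    def prefetch(self, predicted_ids):
        for eid in predicted_ids:               # warm the cache proactively,
            self.get(eid)                       # overlapping with the current layer

def predict_next_experts(hidden_state):
    """Stand-in predictor: a real one maps intermediate results to expert ids."""
    return [3, 5]

cache = ExpertCache(capacity=4, load_expert=lambda eid: f"weights[{eid}]")
hidden = "intermediate activation of the current layer"
cache.prefetch(predict_next_experts(hidden))    # issued ahead of the next layer
print(cache.get(3))                             # hit: load is off the critical path
```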
arXiv Detail & Related papers (2024-10-29T15:31:27Z)
- Compute Or Load KV Cache? Why Not Both? [6.982874528357836]
Cake is a novel KV cache loader, which employs a bidirectional parallelized KV cache generation strategy.
It simultaneously and dynamically loads saved KV cache from prefix cache locations and computes KV cache on local GPU.
It offers up to a 68.1% reduction in Time To First Token (TTFT) compared with the compute-only method and a 94.6% TTFT reduction compared with the I/O-only method.
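Why doing both at once helps can be seen with a toy scheduling model: one worker streams saved KV chunks in from the prefix cache while another recomputes chunks on the GPU, and the prompt is covered from both ends until the two meet. The chunk times and greedy schedule below are invented for illustration, and real attention dependencies between chunks are not modeled.

```python
def bidirectional_kv(num_chunks, load_time, compute_time):
    """Toy model: I/O claims prompt chunks from the front, the GPU from the back."""
    load_next, compute_next = 0, num_chunks - 1
    t_load = t_compute = 0.0
    while load_next <= compute_next:
        if t_load + load_time <= t_compute + compute_time:
            t_load += load_time          # I/O worker takes the next front chunk
            load_next += 1
        else:
            t_compute += compute_time    # GPU worker takes the next back chunk
            compute_next -= 1
    return max(t_load, t_compute)        # rough proxy for prefill time / TTFT

chunks, load_t, comp_t = 64, 1.0, 3.0    # made-up per-chunk costs
print("I/O only    :", chunks * load_t)
print("compute only:", chunks * comp_t)
print("both        :", bidirectional_kv(chunks, load_t, comp_t))
```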
arXiv Detail & Related papers (2024-10-04T01:11:09Z)
- Efficient Inference of Vision Instruction-Following Models with Elastic Cache [76.44955111634545]
We introduce Elastic Cache, a novel strategy for efficient deployment of instruction-following large vision-language models.
We propose an importance-driven cache merging strategy to prune redundant caches.
For instruction encoding, we use frequency to evaluate the importance of caches.
Results on a range of LVLMs demonstrate that Elastic Cache not only boosts efficiency but also notably outperforms existing pruning methods in language generation.
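Merging, as opposed to plain eviction, preserves some information from pruned entries by folding them into retained "anchor" entries. The numpy sketch below shows one simple way to do that: rank cached pairs by an importance score (e.g. frequency) and average each pruned pair into its most similar anchor. The paper's actual merging rule and its separate handling of instruction encoding versus output generation may differ.

```python
import numpy as np

def elastic_merge(keys, values, importance, keep):
    """Keep the `keep` most important KV pairs as anchors and fold every pruned
    pair into its nearest anchor (by key similarity) instead of discarding it."""
    order = np.argsort(importance)[::-1]
    anchors, pruned = order[:keep], order[keep:]
    new_k, new_v = keys[anchors].copy(), values[anchors].copy()
    counts = np.ones(keep)
    for i in pruned:
        j = np.argmax(keys[anchors] @ keys[i])   # most similar retained anchor
        new_k[j] += keys[i]
        new_v[j] += values[i]
        counts[j] += 1
    return new_k / counts[:, None], new_v / counts[:, None]

rng = np.random.default_rng(0)
T, d = 128, 32
keys, values = rng.standard_normal((T, d)), rng.standard_normal((T, d))
importance = rng.random(T)            # e.g. access frequency during encoding
k2, v2 = elastic_merge(keys, values, importance, keep=32)
print(k2.shape, v2.shape)             # (32, 32) (32, 32)
```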
arXiv Detail & Related papers (2024-07-25T15:29:05Z)
- PyramidInfer: Pyramid KV Cache Compression for High-throughput LLM Inference [57.53291046180288]
Large Language Models (LLMs) have shown remarkable comprehension abilities but face challenges in GPU memory usage during inference.
We propose PyramidInfer, a method that compresses the KV cache by retaining crucial context layer by layer.
PyramidInfer achieves 2.2x higher throughput than Accelerate while reducing KV cache GPU memory by over 54%.
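"Retaining crucial context layer by layer" can be pictured as a retention budget that shrinks with depth, keeping at each layer the tokens that received the most attention, which gives the retained KV cache its pyramid shape. The ratios and selection rule below are assumptions for illustration, not the paper's exact policy.

```python
import numpy as np

def pyramid_retain(attn_mass_per_layer, base_keep=1.0, decay=0.8):
    """Keep a shrinking fraction of KV entries at deeper layers, choosing at each
    layer the tokens with the largest attention mass (illustrative policy)."""
    kept = []
    for depth, mass in enumerate(attn_mass_per_layer):
        k = max(1, int(len(mass) * base_keep * decay ** depth))
        kept.append(np.argsort(mass)[::-1][:k])   # most-attended token indices
    return kept

rng = np.random.default_rng(0)
seq_len, n_layers = 1024, 8
attn_mass = [rng.random(seq_len) for _ in range(n_layers)]  # per-token attention
print([len(ix) for ix in pyramid_retain(attn_mass)])
# -> [1024, 819, 655, 524, 419, 335, 268, 214]: the "pyramid" across layers
```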
arXiv Detail & Related papers (2024-05-21T06:46:37Z)
- CORM: Cache Optimization with Recent Message for Large Language Model Inference [57.109354287786154]
We introduce an innovative method for optimizing the KV cache that considerably reduces its memory footprint.
CORM, a KV cache eviction policy, dynamically retains essential key-value pairs for inference without the need for model fine-tuning.
Our validation shows that CORM reduces the inference memory usage of KV cache by up to 70% with negligible performance degradation across six tasks in LongBench.
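An eviction policy driven by recent messages can be sketched as: score every cached key by the attention it receives from the most recent queries, and keep only the top-scoring pairs, with no fine-tuning involved. The scoring rule below (maximum attention over a window of recent queries) is an assumption for illustration, not necessarily CORM's exact criterion.

```python
import numpy as np

def evict_with_recent_queries(keys, values, recent_queries, budget):
    """Keep the `budget` KV pairs that recent queries attend to most strongly."""
    logits = recent_queries @ keys.T / np.sqrt(keys.shape[1])   # (R, T)
    attn = np.exp(logits - logits.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)                     # softmax per query
    score = attn.max(axis=0)              # strongest attention from any recent query
    keep = np.sort(np.argsort(score)[-budget:])                 # keep positional order
    return keys[keep], values[keep], keep

rng = np.random.default_rng(0)
T, R, d = 512, 8, 64
keys, values = rng.standard_normal((T, d)), rng.standard_normal((T, d))
recent_q = rng.standard_normal((R, d))    # queries of the most recent tokens
k2, v2, kept = evict_with_recent_queries(keys, values, recent_q, budget=128)
print(k2.shape, kept.shape)               # (128, 64) (128,)
```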
arXiv Detail & Related papers (2024-04-24T16:11:54Z)
- Recurrent Dynamic Embedding for Video Object Segmentation [54.52527157232795]
We propose a Recurrent Dynamic Embedding (RDE) to build a memory bank of constant size.
We propose an unbiased guidance loss during the training stage, which makes SAM more robust in long videos.
We also design a novel self-correction strategy so that the network can repair the embeddings of masks with different qualities in the memory bank.
arXiv Detail & Related papers (2022-05-08T02:24:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.