Mixed-Precision Embedding Using a Cache
- URL: http://arxiv.org/abs/2010.11305v2
- Date: Fri, 23 Oct 2020 01:37:34 GMT
- Title: Mixed-Precision Embedding Using a Cache
- Authors: Jie Amy Yang, Jianyu Huang, Jongsoo Park, Ping Tak Peter Tang, Andrew
Tulloch
- Abstract summary: We propose a novel change to embedding tables using a cache memory architecture, where the majority of rows in an embedding table are trained in low precision and the most frequently or recently accessed rows are cached in full precision.
For an open source deep learning recommendation model (DLRM) running on the Criteo-Kaggle dataset, we achieve a 3x memory reduction with INT8-precision embedding tables and a full-precision cache.
For an industrial-scale model and dataset, we achieve an even higher memory reduction of more than 7x with INT4 precision and a cache sized at 1% of the embedding tables.
- Score: 3.0298877977523144
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recommendation systems, practitioners have observed that increasing
the number and size of embedding tables often leads to significant improvements in
model performance. Given this and the business importance of these models to major
internet companies, embedding tables for personalization tasks have grown to
terabyte scale and continue to grow at a significant rate. Meanwhile, these
large-scale models are often trained on GPUs, where high-performance memory is a
scarce resource, motivating numerous works on embedding table compression during
training. We propose a novel change to embedding tables using a cache memory
architecture, where the majority of rows in an embedding table are trained in low
precision, and the most frequently or recently accessed rows are cached and trained
in full precision. The proposed architectural change works in conjunction with
standard precision-reduction and computer-arithmetic techniques such as
quantization and stochastic rounding. For an open source deep learning
recommendation model (DLRM) running on the Criteo-Kaggle dataset, we achieve a 3x
memory reduction with INT8-precision embedding tables and a full-precision cache
whose size is 5% of the embedding tables, while maintaining accuracy. For an
industrial-scale model and dataset, we achieve an even higher memory reduction of
more than 7x with INT4 precision and a cache sized at 1% of the embedding tables,
while maintaining accuracy, as well as a 16% end-to-end training speedup from
reduced GPU-to-host data transfers.
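To make the cache mechanism concrete, here is a minimal, hypothetical Python sketch assuming an INT8 row store with per-row scales, an LRU cache of FP32 rows for hot indices, and stochastic rounding when an evicted row is written back to low precision. Class and method names are invented for illustration; this is not the authors' implementation.
```python
import numpy as np
from collections import OrderedDict

class CachedMixedPrecisionEmbedding:
    """Hypothetical sketch: rows live in INT8 with a per-row scale, while the
    most recently used rows are kept and updated in an FP32 LRU cache."""

    def __init__(self, num_rows, dim, cache_rows, seed=0):
        self.rng = np.random.default_rng(seed)
        self.cache_rows = cache_rows
        self.q_rows = np.zeros((num_rows, dim), dtype=np.int8)        # low-precision store
        self.scales = np.full(num_rows, 1.0 / 127, dtype=np.float32)  # per-row scale
        self.cache = OrderedDict()                                    # row id -> FP32 row

    def _writeback(self, i, row):
        """Quantize an evicted row back to INT8 with stochastic rounding."""
        scale = max(float(np.abs(row).max()), 1e-8) / 127.0
        x = row / scale
        x = np.floor(x + self.rng.random(x.shape))    # round up with prob. frac(x)
        self.q_rows[i] = np.clip(x, -128, 127).astype(np.int8)
        self.scales[i] = scale

    def lookup(self, i):
        if i not in self.cache:                       # cold row: dequantize and promote
            self.cache[i] = self.q_rows[i].astype(np.float32) * self.scales[i]
        self.cache.move_to_end(i)                     # mark as most recently used
        if len(self.cache) > self.cache_rows:         # evict LRU row back to INT8
            old_i, old_row = self.cache.popitem(last=False)
            self._writeback(old_i, old_row)
        return self.cache[i]

    def sgd_update(self, i, grad, lr=0.01):
        self.cache[i] = self.lookup(i) - lr * grad    # full-precision update in cache
```
A training loop would call lookup(i) in the forward pass and sgd_update(i, grad) in the backward pass; the real system also handles batched lookups, other optimizers, and INT4 formats, none of which are shown here.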
Related papers
- DQRM: Deep Quantized Recommendation Models [34.73674946187648]
Large-scale recommendation models are the dominant workload for many large Internet companies.
The 1TB+ embedding tables of these models impose a severe memory bottleneck for both training and inference.
We propose a novel recommendation framework that is small, powerful, and efficient to run and train, based on the state-of-the-art Deep Learning Recommendation Model (DLRM).
arXiv Detail & Related papers (2024-10-26T02:33:52Z) - ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications.
This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference.
We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels.
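As a rough illustration of query-driven channel pruning on a key cache (the scoring rule below is a stand-in, not ThinK's exact criterion):
```python
import numpy as np

def prune_key_channels(keys, query, keep_ratio=0.6):
    """Toy query-driven pruning of a key cache: score each channel by the
    magnitude of its query-weighted keys and keep only the top channels.
    Function name and scoring rule are illustrative only."""
    n_tokens, dim = keys.shape
    scores = np.abs(keys * query).sum(axis=0)        # per-channel contribution to q.k
    keep = max(1, int(dim * keep_ratio))
    kept = np.argsort(scores)[-keep:]                # most significant channels
    pruned = np.zeros_like(keys)
    pruned[:, kept] = keys[:, kept]                  # zero (or drop) the rest
    return pruned, kept
```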
arXiv Detail & Related papers (2024-07-30T17:59:08Z) - CORM: Cache Optimization with Recent Message for Large Language Model Inference [57.109354287786154]
We introduce an innovative method for optimizing the KV cache that considerably reduces its memory footprint.
CORM, a KV cache eviction policy, dynamically retains essential key-value pairs for inference without the need for model fine-tuning.
Our validation shows that CORM reduces the inference memory usage of KV cache by up to 70% with negligible performance degradation across six tasks in LongBench.
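A heavily simplified sketch of attention-based KV-cache eviction in this spirit (the retention rule here is a stand-in, not CORM's actual policy):
```python
import numpy as np

def evict_kv(keys, values, recent_queries, budget):
    """Keep only the `budget` cached key-value pairs that recent queries attend
    to most strongly; everything else is evicted. Illustrative stand-in for a
    recency-aware eviction policy, not CORM's exact criterion."""
    scores = recent_queries @ keys.T                        # (n_recent, n_cached)
    attn = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)                 # softmax per recent query
    importance = attn.max(axis=0)                           # strongest recent attention
    keep = np.sort(np.argsort(importance)[-budget:])        # preserve original order
    return keys[keep], values[keep], keep
```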
arXiv Detail & Related papers (2024-04-24T16:11:54Z) - Fine-Grained Embedding Dimension Optimization During Training for Recommender Systems [17.602059421895856]
FIITED is a system to automatically reduce the memory footprint via FIne-grained In-Training Embedding Dimension pruning.
We show that FIITED can reduce DLRM embedding size by more than 65% while preserving model quality.
On public datasets, FIITED can reduce the size of embedding tables by 2.1x to 800x with negligible accuracy drop.
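The summary gives little mechanistic detail, but the general shape of per-row embedding-dimension pruning can be sketched as follows; the utility score and storage format are placeholders, and FIITED's in-training criterion differs:
```python
import numpy as np

def prune_embedding_dims(table, utility, keep_fraction=0.35):
    """Zero out the lowest-utility dimensions of each embedding row.
    `utility` is a per-(row, dim) importance score supplied by the caller;
    a real system would also reclaim the freed memory rather than store zeros."""
    num_rows, dim = table.shape
    keep = max(1, int(dim * keep_fraction))
    pruned = np.zeros_like(table)
    for r in range(num_rows):
        kept_dims = np.argsort(utility[r])[-keep:]   # highest-utility dims for this row
        pruned[r, kept_dims] = table[r, kept_dims]
    return pruned
```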
arXiv Detail & Related papers (2024-01-09T08:04:11Z) - Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators, called WTA-CRS, for estimating matrix products with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
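The plain column-row sampling estimator that WTA-CRS builds on can be written in a few lines; the winner-take-all modification that reduces its variance is not shown here:
```python
import numpy as np

def crs_matmul(A, B, k, seed=0):
    """Unbiased column-row sampling estimate of A @ B from k sampled index pairs:
    sample column i of A with row i of B proportionally to their norm product and
    reweight by 1 / (k * p_i), so the expectation equals the exact product."""
    rng = np.random.default_rng(seed)
    norms = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
    p = norms / norms.sum()                          # variance-minimizing probabilities
    idx = rng.choice(A.shape[1], size=k, p=p)
    return sum(np.outer(A[:, i], B[i, :]) / (k * p[i]) for i in idx)
```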
arXiv Detail & Related papers (2023-05-24T15:52:08Z) - DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures
using Lookup Tables [49.965024476651706]
DeepGEMM is a lookup table based approach for the execution of ultra low-precision convolutional neural networks on SIMD hardware.
Our implementation outperforms corresponding 8-bit integer kernels by up to 1.74x on x86 platforms.
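A toy version of the lookup-table idea for 2-bit weights and 8-bit activations; the real kernels operate on SIMD registers with a very different data layout, and all names here are illustrative:
```python
import numpy as np

def lut_matvec_2bit(w_codes, a_codes, w_levels, a_scale):
    """Toy lookup-table matrix-vector product: every possible product of a
    2-bit weight code and an 8-bit activation code is precomputed once, then
    the output is a gather-and-sum over the table."""
    lut = np.outer(w_levels, np.arange(256, dtype=np.float32) * a_scale)  # (4, 256)
    # w_codes: (out_dim, in_dim) integers in {0..3}; a_codes: (in_dim,) in {0..255}
    return lut[w_codes, a_codes].sum(axis=1)                              # (out_dim,)
```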
arXiv Detail & Related papers (2023-04-18T15:13:10Z) - Rediscovering Hashed Random Projections for Efficient Quantization of
Contextualized Sentence Embeddings [113.38884267189871]
Training and inference on edge devices often requires an efficient setup due to computational limitations.
Pre-computing data representations and caching them on a server can mitigate extensive edge device computation.
We propose a simple yet effective approach that uses random hyperplane projections.
We show that the hashed embeddings remain effective for training models on various English and German sentence classification tasks, retaining 94%--99% of their floating-point performance.
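The core trick is classic random-hyperplane (sign) quantization; a minimal sketch, with the bit width and packing chosen arbitrarily rather than taken from the paper:
```python
import numpy as np

def binarize_embeddings(embeddings, n_bits=256, seed=0):
    """Quantize float sentence embeddings into compact binary codes: project
    onto random hyperplanes and keep only the sign of each projection."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((embeddings.shape[1], n_bits))
    bits = (embeddings @ planes) > 0                 # one bit per hyperplane
    return np.packbits(bits, axis=1)                 # pack 8 bits per byte
```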
arXiv Detail & Related papers (2023-03-13T10:53:00Z) - Efficient Fine-Tuning of BERT Models on the Edge [12.768368718187428]
We propose Freeze And Reconfigure (FAR), a memory-efficient training regime for BERT-like models.
FAR reduces fine-tuning time on the DistilBERT model and CoLA dataset by 30%, and time spent on memory operations by 47%.
More broadly, reductions in metric performance on the GLUE and SQuAD datasets are around 1% on average.
arXiv Detail & Related papers (2022-05-03T14:51:53Z) - HET: Scaling out Huge Embedding Model Training via Cache-enabled
Distributed Framework [17.114812060566766]
We propose HET, a new system framework that significantly improves the scalability of huge embedding model training.
HET achieves up to 88% embedding communication reductions and up to 20.68x performance speedup over the state-of-the-art baselines.
arXiv Detail & Related papers (2021-12-14T08:18:10Z) - Mesa: A Memory-saving Training Framework for Transformers [58.78933015299703]
We present Mesa, a memory-saving training framework for Transformers.
Mesa uses exact activations during the forward pass while storing a low-precision version of the activations to reduce memory consumption during training.
Experiments on ImageNet, CIFAR-100 and ADE20K demonstrate that Mesa can halve the memory footprint during training.
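The mechanism can be illustrated with a custom autograd function that keeps the forward pass exact but saves only a compressed copy of the activation for the backward pass; FP16 stands in for Mesa's actual quantized format, and the class name is invented:
```python
import torch

class CompressedSaveLinear(torch.autograd.Function):
    """Sketch of activation-compressed training for a linear layer: the forward
    output is exact, but only a half-precision copy of the input is stored,
    so the weight gradient is computed from the compressed activation."""

    @staticmethod
    def forward(ctx, x, weight):
        ctx.save_for_backward(x.half(), weight)      # store low-precision activation
        return x @ weight.t()                        # exact forward computation

    @staticmethod
    def backward(ctx, grad_out):
        x_lp, weight = ctx.saved_tensors
        grad_x = grad_out @ weight                   # exact input gradient
        grad_w = grad_out.t() @ x_lp.float()         # approximate weight gradient
        return grad_x, grad_w
```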
arXiv Detail & Related papers (2021-11-22T11:23:01Z) - Improving compute efficacy frontiers with SliceOut [31.864949424541344]
We introduce SliceOut -- a dropout-inspired scheme to train deep learning models faster without impacting final test accuracy.
At test time, turning off SliceOut performs an implicit ensembling across a linear number of architectures that preserves test accuracy.
This leads to faster processing of large computational workloads overall, and significantly reduces the resulting energy consumption and CO2 emissions.
arXiv Detail & Related papers (2020-07-21T15:59:09Z)
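As a sketch of the slicing idea (the slice selection and rescaling below are simplifications and assumptions, not the paper's exact scheme):
```python
import numpy as np

def sliceout_linear(x, W, keep_fraction=0.8, seed=0):
    """Instead of masking random units as in dropout, drop a contiguous block of
    hidden units so the surviving weights stay memory-contiguous and the matmul
    genuinely shrinks. Rescaling mirrors inverted dropout."""
    rng = np.random.default_rng(seed)
    hidden = W.shape[1]
    keep = max(1, int(hidden * keep_fraction))
    start = rng.integers(0, hidden - keep + 1)
    y = x @ W[:, start:start + keep]                 # smaller, contiguous computation
    return y / keep_fraction, (start, keep)          # remember the slice for backward
```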