The trade-offs of model size in large recommendation models : A 10000
$\times$ compressed criteo-tb DLRM model (100 GB parameters to mere 10MB)
- URL: http://arxiv.org/abs/2207.10731v1
- Date: Thu, 21 Jul 2022 19:50:34 GMT
- Title: The trade-offs of model size in large recommendation models : A 10000
$\times$ compressed criteo-tb DLRM model (100 GB parameters to mere 10MB)
- Authors: Aditya Desai, Anshumali Shrivastava
- Abstract summary: Embedding tables dominate industrial-scale recommendation model sizes, using up to terabytes of memory.
This paper analyzes and extensively evaluates a generic parameter sharing setup (PSS) for compressing DLRM models.
We show that the scales are tipped towards a smaller DLRM model, leading to faster inference, easier deployment, and similar training times.
- Score: 40.623439224839245
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Embedding tables dominate industrial-scale recommendation model sizes, using
up to terabytes of memory. A popular, and the largest publicly available, MLPerf
machine learning benchmark on recommendation data is a Deep Learning
Recommendation Model (DLRM) trained on a terabyte of click-through data. It
contains 100 GB of embedding memory (25+ billion parameters). Due to their sheer
size and the associated volume of data, DLRMs are difficult to train and deploy
for inference, and their large embedding tables create memory bottlenecks.
This paper analyzes and extensively evaluates a generic parameter sharing setup
(PSS) for compressing DLRM models. We show theoretical upper bounds on the
learnable memory requirements for achieving $(1 \pm \epsilon)$ approximations
to the embedding table. Our bounds indicate that exponentially fewer parameters
suffice for good accuracy. To this end, we demonstrate a PSS DLRM reaching
10000$\times$ compression on criteo-tb without losing quality. Such a
compression, however, comes with a caveat: it requires 4.5$\times$ more
iterations to reach the same saturation quality. The paper argues that this
tradeoff merits further investigation, as it might be significantly favorable.
Leveraging the small size of the compressed model, we show a 4.3$\times$
improvement in training latency, leading to similar overall training times.
Thus, in the tradeoff between the system advantages of a small DLRM model and its
slower convergence, we show that the scales are tipped towards the smaller DLRM
model, which offers faster inference, easier deployment, and similar training
times.
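For intuition on how a parameter sharing setup can shrink an embedding table by orders of magnitude, here is a minimal hashing-based sketch. The abstract does not spell out the exact PSS construction, so the sizes, hash constants, and function names below are illustrative assumptions: every position of a virtually huge embedding table is mapped into a small pool of shared learnable weights. With fp32 values, 2.5 million shared weights occupy about 10 MB, while 25+ billion parameters occupy about 100 GB, which is where the 10000$\times$ figure comes from.

```python
import numpy as np

# Illustrative parameter-sharing embedding lookup (not the paper's exact PSS):
# a small pool of shared weights stands in for an arbitrarily large embedding table.
DIM = 128                  # embedding dimension (assumed for illustration)
SHARED_SIZE = 2_500_000    # ~10 MB of fp32 weights; the only learnable embedding memory

rng = np.random.default_rng(0)
shared = rng.standard_normal(SHARED_SIZE).astype(np.float32)

# Simple universal-style hash constants (illustrative choices).
A, B, P = 0x9E3779B1, 0x85EBCA77, (1 << 61) - 1

def lookup(row_id: int) -> np.ndarray:
    """Materialize one row of the virtual embedding table from the shared pool."""
    pos = [((A * row_id + B * col) % P) % SHARED_SIZE for col in range(DIM)]
    return shared[pos]

vec = lookup(123_456_789)  # any of the 25B+ virtual rows maps into the ~10 MB pool
assert vec.shape == (DIM,)
```

During training, gradients flow into the shared pool at the hashed positions, so many virtual rows update the same underlying weights; the size of this shared learnable memory is the quantity the paper's $(1 \pm \epsilon)$ bounds reason about, and the sharing is also a plausible reason why more iterations are needed to reach saturation quality.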
Related papers
- DQRM: Deep Quantized Recommendation Models [34.73674946187648]
Large-scale recommendation models are the dominant workload for many large Internet companies.
The 1TB+ embedding tables in these models impose a severe memory bottleneck for training and inference.
We propose a novel recommendation framework that is small, powerful, and efficient to run and train, based on the state-of-the-art Deep Learning Recommendation Model (DLRM).
arXiv Detail & Related papers (2024-10-26T02:33:52Z)
- Accelerating Communication in Deep Learning Recommendation Model Training with Dual-Level Adaptive Lossy Compression [10.233937665979694]
DLRM is a state-of-the-art recommendation system model that has gained widespread adoption across various industry applications.
A significant bottleneck in this process is the time-consuming all-to-all communication required to collect embedding data from all devices.
We introduce a method that employs error-bounded lossy compression to reduce the communication data size and accelerate DLRM training.
arXiv Detail & Related papers (2024-07-05T05:55:18Z)
- UpDLRM: Accelerating Personalized Recommendation using Real-World PIM Architecture [6.5386984667643695]
UpDLRM uses real-world processing-in-memory hardware, UPMEM DPU, to boost the memory bandwidth and reduce recommendation latency.
UpDLRM achieves much lower inference time for DLRM compared to both CPU-only and CPU-GPU hybrid counterparts.
arXiv Detail & Related papers (2024-06-20T02:20:21Z)
- HiRE: High Recall Approximate Top-$k$ Estimation for Efficient LLM Inference [68.59839755875252]
HiRE comprises two novel components: (i) a compression scheme to cheaply predict top-$k$ rows/columns with high recall, followed by full computation restricted to the predicted subset, and (ii) DA-TOP-$k$: an efficient multi-device approximate top-$k$ operator.
We demonstrate that on a one billion parameter model, HiRE, applied to both the softmax and feedforward layers, achieves almost matching pretraining and downstream accuracy, and speeds up inference latency by $1.47\times$ on a single TPUv5e device.
arXiv Detail & Related papers (2024-02-14T18:04:36Z)
- FFSplit: Split Feed-Forward Network For Optimizing Accuracy-Efficiency Trade-off in Language Model Inference [57.119047493787185]
In practice, this method can reduce model size by 43.1% and bring a $1.25\sim1.56\times$ wall-clock time speedup on different hardware with negligible accuracy drop.
arXiv Detail & Related papers (2024-01-08T17:29:16Z)
- Compress, Then Prompt: Improving Accuracy-Efficiency Trade-off of LLM Inference with Transferable Prompt [96.24800696597707]
We introduce a new perspective to optimize this trade-off by prompting compressed models.
We propose a soft prompt learning method where we expose the compressed model to the prompt learning process.
Our experimental analysis suggests our soft prompt strategy greatly improves the performance of the 8$\times$ compressed LLaMA-7B model.
arXiv Detail & Related papers (2023-05-17T20:45:13Z)
- MTrainS: Improving DLRM training efficiency using heterogeneous memories [5.195887979684162]
In Deep Learning Recommendation Models (DLRM), sparse features capturing categorical inputs through embedding tables are the major contributors to model size and require high memory bandwidth.
In this paper, we study the bandwidth requirement and locality of embedding tables in real-world deployed models.
We then design MTrainS, which leverages heterogeneous memory, including byte- and block-addressable Storage Class Memory, in a hierarchy for DLRM.
arXiv Detail & Related papers (2023-04-19T06:06:06Z)
- Petals: Collaborative Inference and Fine-tuning of Large Models [78.37798144357977]
Many NLP tasks benefit from using large language models (LLMs) that often have more than 100 billion parameters.
With the release of BLOOM-176B and OPT-175B, everyone can download pretrained models of this scale.
We propose Petals, a system for inference and fine-tuning of large models collaboratively by joining the resources of multiple parties.
arXiv Detail & Related papers (2022-09-02T17:38:03Z)
- M6-10T: A Sharing-Delinking Paradigm for Efficient Multi-Trillion Parameter Pretraining [55.16088793437898]
Training extreme-scale models requires enormous amounts of compute and a large memory footprint.
We propose a simple training strategy called "Pseudo-to-Real" for high-memory-footprint-required large models.
arXiv Detail & Related papers (2021-10-08T04:24:51Z)
- Random Offset Block Embedding Array (ROBE) for CriteoTB Benchmark MLPerf DLRM Model: 1000$\times$ Compression and 2.7$\times$ Faster Inference [33.66462823637363]
State-of-the-art recommendation models are among the largest models, rivalling the likes of GPT-3 and Switch Transformer.
Deep learning recommendation models (DLRM) stem from learning dense embeddings for each of the categorical values.
Model compression for DLRM is gaining traction and the community has recently shown impressive compression results.
arXiv Detail & Related papers (2021-08-04T17:28:45Z)
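The ROBE paper above (prior work by the same authors) rests on a related idea: rather than hashing each coordinate independently, embeddings are read as contiguous blocks from one shared circular array, which cuts the number of hash computations and random memory accesses per lookup. Below is a rough sketch of that block-offset idea; all sizes, hash constants, and names are assumptions for illustration, not the paper's configuration.

```python
import numpy as np

# Illustrative block-offset shared embedding in the spirit of ROBE (sizes and
# hash constants are assumptions, not the paper's configuration).
DIM = 128
BLOCK = 32                # one hash per block of 32 values instead of per value
SHARED_SIZE = 25_000_000  # shared circular array (~100 MB fp32, ~1000x below 100 GB)

rng = np.random.default_rng(0)
shared = rng.standard_normal(SHARED_SIZE).astype(np.float32)

A, B, P = 0x9E3779B1, 0x85EBCA77, (1 << 61) - 1

def lookup(row_id: int) -> np.ndarray:
    """Read DIM // BLOCK contiguous chunks from the shared circular array."""
    chunks = []
    for b in range(DIM // BLOCK):
        offset = ((A * row_id + B * b) % P) % SHARED_SIZE
        idx = (offset + np.arange(BLOCK)) % SHARED_SIZE  # contiguous, wraps around
        chunks.append(shared[idx])
    return np.concatenate(chunks)

vec = lookup(42)
assert vec.shape == (DIM,)
```

Contiguous block reads are more cache-friendly than per-value random access, which is plausibly part of how ROBE reports 2.7$\times$ faster inference alongside 1000$\times$ compression.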