Random Offset Block Embedding Array (ROBE) for CriteoTB Benchmark MLPerf
DLRM Model : 1000$\times$ Compression and 2.7$\times$ Faster Inference
- URL: http://arxiv.org/abs/2108.02191v1
- Date: Wed, 4 Aug 2021 17:28:45 GMT
- Title: Random Offset Block Embedding Array (ROBE) for CriteoTB Benchmark MLPerf
DLRM Model : 1000$\times$ Compression and 2.7$\times$ Faster Inference
- Authors: Aditya Desai, Li Chou, Anshumali Shrivastava
- Abstract summary: State-of-the-art recommendation models are among the largest models, rivalling the likes of GPT-3 and Switch Transformer.
Challenges in deep learning recommendation models (DLRM) stem from learning dense embeddings for each of the categorical values.
Model compression for DLRM is gaining traction and the community has recently shown impressive compression results.
- Score: 33.66462823637363
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep learning for recommendation data is one of the most pervasive and
challenging AI workloads in recent times. State-of-the-art recommendation models
are among the largest models, rivalling the likes of GPT-3 and Switch
Transformer. Challenges in deep learning recommendation models (DLRM) stem from
learning dense embeddings for each of the categorical values. These embedding
tables in industrial scale models can be as large as hundreds of terabytes.
Such large models lead to a plethora of engineering challenges, not to mention
prohibitive communication overheads, and slower training and inference times.
Of these, slower inference time directly impacts user experience. Model
compression for DLRM is gaining traction and the community has recently shown
impressive compression results. In this paper, we present Random Offset Block
Embedding Array (ROBE) as a low memory alternative to embedding tables which
provide orders of magnitude reduction in memory usage while maintaining
accuracy and boosting execution speed. ROBE is a simple, fundamental approach to
improving both cache performance and the variance of randomized hashing, which
could be of independent interest in itself. We demonstrate that we can
successfully train DLRM models with the same accuracy while using $1000\times$
less memory. A $1000\times$ compressed model directly results in faster
inference without any engineering. In particular, we show that we can train the
DLRM model using a ROBE Array of size 100MB on a single GPU to achieve the AUC of
0.8025 or higher required by the official 100GB MLPerf CriteoTB benchmark DLRM
model, while achieving about a $2.7\times$ (170\%) improvement in inference
throughput.
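The core mechanism described in the abstract, in which chunks of every embedding vector are fetched from hashed, contiguous offsets in a single small shared array, can be illustrated with a minimal sketch. The hash family, the initialization, and the block layout below are simplifying assumptions chosen for clarity, not the paper's exact construction:

```python
import numpy as np

class ROBEEmbedding:
    """Sketch of a ROBE-style lookup: all categorical values share one small array."""

    def __init__(self, array_size, emb_dim, block_size=8, seed=0):
        assert emb_dim % block_size == 0
        rng = np.random.default_rng(seed)
        self.memory = rng.normal(0.0, 0.01, size=array_size)  # the shared ROBE array
        self.array_size = array_size
        self.emb_dim = emb_dim
        self.block_size = block_size
        self.num_blocks = emb_dim // block_size
        # Parameters of a simple universal-style hash (an illustrative choice).
        self.prime = (1 << 61) - 1
        self.a = int(rng.integers(1, self.prime)) | 1
        self.b = int(rng.integers(0, self.prime))

    def _offset(self, key):
        return ((self.a * key + self.b) % self.prime) % self.array_size

    def lookup(self, category_id):
        """Assemble one embedding vector from hashed, contiguous blocks."""
        out = np.empty(self.emb_dim)
        for blk in range(self.num_blocks):
            # Each (category, block) pair hashes to a start offset; the whole block
            # is then read contiguously (with wrap-around), which is what makes the
            # scheme cache-friendly compared to hashing every element independently.
            start = self._offset(category_id * self.num_blocks + blk)
            idx = (start + np.arange(self.block_size)) % self.array_size
            out[blk * self.block_size:(blk + 1) * self.block_size] = self.memory[idx]
        return out

# Example: a 100MB array holds ~25M float32 parameters; a tiny array is used here
# just to show that lookups work for arbitrarily large categorical id spaces.
emb = ROBEEmbedding(array_size=10_000, emb_dim=32, block_size=8)
vec = emb.lookup(category_id=987_654_321)
print(vec.shape)  # (32,)
```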
Related papers
- DQRM: Deep Quantized Recommendation Models [34.73674946187648]
Large-scale recommendation models are the dominant workload for many large Internet companies.
The size of these 1TB+ tables imposes a severe memory bottleneck for the training and inference of recommendation models.
We propose a novel recommendation framework that is small, powerful, and efficient to run and train, based on the state-of-the-art Deep Learning Recommendation Model (DLRM).
arXiv Detail & Related papers (2024-10-26T02:33:52Z)
- Accelerating Communication in Deep Learning Recommendation Model Training with Dual-Level Adaptive Lossy Compression [10.233937665979694]
DLRM is a state-of-the-art recommendation system model that has gained widespread adoption across various industry applications.
A significant bottleneck in this process is the time-consuming all-to-all communication required to collect embedding data from all devices.
We introduce a method that employs error-bounded lossy compression to reduce the communication data size and accelerate DLRM training.
arXiv Detail & Related papers (2024-07-05T05:55:18Z)
- HiRE: High Recall Approximate Top-$k$ Estimation for Efficient LLM Inference [68.59839755875252]
HiRE comprises two novel components: (i) a compression scheme to cheaply predict top-$k$ rows/columns with high recall, followed by full computation restricted to the predicted subset, and (ii) DA-TOP-$k$, an efficient multi-device approximate top-$k$ operator.
We demonstrate that on a one billion parameter model, HiRE applied to both the softmax and feedforward layers achieves almost matching pretraining and downstream accuracy, and speeds up inference latency by $1.47\times$ on a single TPUv5e device.
arXiv Detail & Related papers (2024-02-14T18:04:36Z)
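The two-stage pattern described in the HiRE entry above, a cheap approximate pass over a compressed matrix followed by full computation restricted to the predicted candidates, can be sketched as follows. The random-projection sketch, the oversampling factor, and all shapes are illustrative assumptions; HiRE's actual compression scheme and its DA-TOP-$k$ operator are not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, k, rank = 512, 50_000, 32, 32   # hidden dim, rows, final top-k, sketch rank

W = rng.normal(size=(n, d)).astype(np.float32)               # full weight matrix
S = (rng.normal(size=(d, rank)) / np.sqrt(rank)).astype(np.float32)
W_sketch = W @ S                                              # cheap compressed view, (n, rank)

def two_stage_topk(x, oversample=4):
    # Stage 1: approximate scores via the compressed matrix; keep extra candidates
    # (oversampling) so the true top-k rows are recalled with high probability.
    approx = W_sketch @ (S.T @ x)
    m = oversample * k
    candidates = np.argpartition(approx, -m)[-m:]
    # Stage 2: full computation restricted to the predicted subset only.
    exact = W[candidates] @ x
    return candidates[np.argsort(exact)[-k:][::-1]]

x = rng.normal(size=d).astype(np.float32)
print(two_stage_topk(x)[:5])   # indices of the (approximately) highest-scoring rows
```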
- QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models [64.34635279436054]
Mixture-of-Experts (MoE) architectures offer a general solution to the high inference costs of large language models (LLMs) via sparse routing.
We present a solution to this memory problem, in the form of a new compression and execution framework called QMoE.
arXiv Detail & Related papers (2023-10-25T17:24:53Z)
- Petals: Collaborative Inference and Fine-tuning of Large Models [78.37798144357977]
Many NLP tasks benefit from using large language models (LLMs) that often have more than 100 billion parameters.
With the release of BLOOM-176B and OPT-175B, everyone can download pretrained models of this scale.
We propose Petals, a system for inference and fine-tuning of large models collaboratively by joining the resources of multiple parties.
arXiv Detail & Related papers (2022-09-02T17:38:03Z)
- The trade-offs of model size in large recommendation models : A 10000$\times$ compressed criteo-tb DLRM model (100 GB parameters to mere 10MB) [40.623439224839245]
Embedding tables dominate industrial-scale recommendation model sizes, using up to terabytes of memory.
This paper analyzes and extensively evaluates a generic parameter sharing setup (PSS) for compressing DLRM models.
We show that the scales are tipped towards having a smaller DLRM model, leading to faster inference, easier deployment, and similar training times.
arXiv Detail & Related papers (2022-07-21T19:50:34Z)
- Efficient model compression with Random Operation Access Specific Tile (ROAST) hashing [35.67591281350068]
This paper proposes a model-agnostic, cache-friendly model compression approach: Random Operation Access Specific Tile (ROAST) hashing.
With ROAST, we present the first compressed BERT, which is $100\times - 1000\times$ smaller but does not result in quality degradation.
These compression levels on universal architecture like transformers are promising for the future of SOTA model deployment on resource-constrained devices like mobile and edge devices.
arXiv Detail & Related papers (2022-07-21T18:31:17Z)
- NAS-BERT: Task-Agnostic and Adaptive-Size BERT Compression with Neural Architecture Search [100.71365025972258]
We propose NAS-BERT, an efficient method for BERT compression.
NAS-BERT trains a big supernet on a search space and outputs multiple compressed models with adaptive sizes and latency.
Experiments on GLUE and SQuAD benchmark datasets demonstrate that NAS-BERT can find lightweight models with better accuracy than previous approaches.
arXiv Detail & Related papers (2021-05-30T07:20:27Z)
- TT-Rec: Tensor Train Compression for Deep Learning Recommendation Models [5.577715465378262]
Memory capacity of embedding tables in deep learning recommendation models (DLRMs) is increasing dramatically.
We show the potential of Tensor Train decomposition for DLRMs (TT-Rec).
We evaluate TT-Rec across three important design dimensions -- memory capacity, accuracy and timing performance.
arXiv Detail & Related papers (2021-01-25T23:19:03Z)
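The TT-Rec entry above mentions Tensor Train decomposition only in passing; a minimal sketch of the general idea, in which the dense embedding table is replaced by small TT cores and each row is reconstructed on the fly by contracting index-selected slices, is shown below. The factorizations, ranks, and initialization are assumptions chosen for illustration, not TT-Rec's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_factors = (200, 200, 250)   # 10M rows = 200 * 200 * 250
dim_factors   = (4, 4, 8)         # 128 dims = 4 * 4 * 8
ranks         = (1, 16, 16, 1)    # TT-ranks, with r_0 = r_3 = 1

# One 4-D core per factor, shaped (r_{k-1}, vocab_k, dim_k, r_k). The cores hold
# roughly 250K parameters in place of the 1.28B entries of a dense 10M x 128 table.
cores = [rng.normal(0, 0.1, size=(ranks[k], vocab_factors[k], dim_factors[k], ranks[k + 1]))
         for k in range(3)]

def tt_lookup(row):
    # Factor the row index in mixed radix over the vocabulary factors.
    i3 = row % vocab_factors[2]
    rest = row // vocab_factors[2]
    i2, i1 = rest % vocab_factors[1], rest // vocab_factors[1]
    # Select one slice per core and contract them along the TT-ranks.
    g1 = cores[0][:, i1]          # (1,  d1, r1)
    g2 = cores[1][:, i2]          # (r1, d2, r2)
    g3 = cores[2][:, i3]          # (r2, d3, 1)
    out = np.einsum('aib,bjc,ckd->ijk', g1, g2, g3)   # (d1, d2, d3)
    return out.reshape(-1)        # reconstructed 128-dim embedding row

print(tt_lookup(9_876_543).shape)  # (128,)
```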