Random Offset Block Embedding Array (ROBE) for CriteoTB Benchmark MLPerf
DLRM Model : 1000$\times$ Compression and 2.7$\times$ Faster Inference
- URL: http://arxiv.org/abs/2108.02191v1
- Date: Wed, 4 Aug 2021 17:28:45 GMT
- Title: Random Offset Block Embedding Array (ROBE) for CriteoTB Benchmark MLPerf
DLRM Model : 1000$\times$ Compression and 2.7$\times$ Faster Inference
- Authors: Aditya Desai, Li Chou, Anshumali Shrivastava
- Abstract summary: State-of-the-art recommendation models are among the largest models, rivalling the likes of GPT-3 and Switch Transformer.
Challenges in deep learning recommendation models (DLRM) stem from learning dense embeddings for each of the categorical values.
Model compression for DLRM is gaining traction and the community has recently shown impressive compression results.
- Score: 33.66462823637363
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep learning for recommendation data is one of the most pervasive and
challenging AI workloads in recent times. State-of-the-art recommendation models
are among the largest models, rivalling the likes of GPT-3 and Switch
Transformer. Challenges in deep learning recommendation models (DLRM) stem from
learning dense embeddings for each of the categorical values. These embedding
tables in industrial scale models can be as large as hundreds of terabytes.
Such large models lead to a plethora of engineering challenges, not to mention
prohibitive communication overheads, and slower training and inference times.
Of these, slower inference time directly impacts user experience. Model
compression for DLRM is gaining traction and the community has recently shown
impressive compression results. In this paper, we present Random Offset Block
Embedding Array (ROBE) as a low memory alternative to embedding tables which
provide orders of magnitude reduction in memory usage while maintaining
accuracy and boosting execution speed. ROBE is a simple, fundamental approach to
improving both cache performance and the variance of randomized hashing, which
could be of independent interest in itself. We demonstrate that we can
successfully train DLRM models with the same accuracy while using $1000\times$
less memory. A $1000\times$ compressed model directly results in faster
inference without any engineering. In particular, we show that we can train the
DLRM model using a ROBE Array of size 100MB on a single GPU to achieve the AUC of
0.8025 or higher required by the official 100GB MLPerf CriteoTB benchmark DLRM
model, while achieving about a $2.7\times$ (170\%) improvement in inference
throughput.
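The core mechanism described in the abstract, in which chunks of every embedding vector are fetched from hashed, contiguous offsets in a single small shared array, can be illustrated with a minimal sketch. The hash family, the initialization, and the block layout below are simplifying assumptions chosen for clarity, not the paper's exact construction:

```python
import numpy as np

class ROBEEmbedding:
    """Sketch of a ROBE-style lookup: all categorical values share one small array."""

    def __init__(self, array_size, emb_dim, block_size=8, seed=0):
        assert emb_dim % block_size == 0
        rng = np.random.default_rng(seed)
        self.memory = rng.normal(0.0, 0.01, size=array_size)  # the shared ROBE array
        self.array_size = array_size
        self.emb_dim = emb_dim
        self.block_size = block_size
        self.num_blocks = emb_dim // block_size
        # Parameters of a simple universal-style hash (an illustrative choice).
        self.prime = (1 << 61) - 1
        self.a = int(rng.integers(1, self.prime)) | 1
        self.b = int(rng.integers(0, self.prime))

    def _offset(self, key):
        return ((self.a * key + self.b) % self.prime) % self.array_size

    def lookup(self, category_id):
        """Assemble one embedding vector from hashed, contiguous blocks."""
        out = np.empty(self.emb_dim)
        for blk in range(self.num_blocks):
            # Each (category, block) pair hashes to a start offset; the whole block
            # is then read contiguously (with wrap-around), which is what makes the
            # scheme cache-friendly compared to hashing every element independently.
            start = self._offset(category_id * self.num_blocks + blk)
            idx = (start + np.arange(self.block_size)) % self.array_size
            out[blk * self.block_size:(blk + 1) * self.block_size] = self.memory[idx]
        return out

# Example: a 100MB array holds ~25M float32 parameters; a tiny array is used here
# just to show that lookups work for arbitrarily large categorical id spaces.
emb = ROBEEmbedding(array_size=10_000, emb_dim=32, block_size=8)
vec = emb.lookup(category_id=987_654_321)
print(vec.shape)  # (32,)
```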
Related papers
- DQRM: Deep Quantized Recommendation Models [34.73674946187648]
Large-scale recommendation models are the dominant workload for many large Internet companies.
The size of these 1TB+ tables imposes a severe memory bottleneck for the training and inference of recommendation models.
We propose a novel recommendation framework that is small, powerful, and efficient to run and train, based on the state-of-the-art Deep Learning Recommendation Model (DLRM).
arXiv Detail & Related papers (2024-10-26T02:33:52Z)
- Accelerating Communication in Deep Learning Recommendation Model Training with Dual-Level Adaptive Lossy Compression [10.233937665979694]
DLRM is a state-of-the-art recommendation system model that has gained widespread adoption across various industry applications.
A significant bottleneck in this process is the time-consuming all-to-all communication required to collect embedding data from all devices.
We introduce a method that employs error-bounded lossy compression to reduce the communication data size and accelerate DLRM training.
arXiv Detail & Related papers (2024-07-05T05:55:18Z)
- HiRE: High Recall Approximate Top-$k$ Estimation for Efficient LLM Inference [68.59839755875252]
HiRE comprises two novel components: (i) a compression scheme to cheaply predict top-$k$ rows/columns with high recall, followed by full computation restricted to the predicted subset, and (ii) DA-TOP-$k$, an efficient multi-device approximate top-$k$ operator.
We demonstrate that on a one billion parameter model, HiRE applied to both the softmax and feedforward layers achieves almost matching pretraining and downstream accuracy, and speeds up inference latency by $1.47\times$ on a single TPUv5e device.
arXiv Detail & Related papers (2024-02-14T18:04:36Z)
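The two-stage pattern described in the HiRE entry above, a cheap approximate pass over a compressed matrix followed by full computation restricted to the predicted candidates, can be sketched as follows. The random-projection sketch, the oversampling factor, and all shapes are illustrative assumptions; HiRE's actual compression scheme and its DA-TOP-$k$ operator are not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, k, rank = 512, 50_000, 32, 32   # hidden dim, rows, final top-k, sketch rank

W = rng.normal(size=(n, d)).astype(np.float32)               # full weight matrix
S = (rng.normal(size=(d, rank)) / np.sqrt(rank)).astype(np.float32)
W_sketch = W @ S                                              # cheap compressed view, (n, rank)

def two_stage_topk(x, oversample=4):
    # Stage 1: approximate scores via the compressed matrix; keep extra candidates
    # (oversampling) so the true top-k rows are recalled with high probability.
    approx = W_sketch @ (S.T @ x)
    m = oversample * k
    candidates = np.argpartition(approx, -m)[-m:]
    # Stage 2: full computation restricted to the predicted subset only.
    exact = W[candidates] @ x
    return candidates[np.argsort(exact)[-k:][::-1]]

x = rng.normal(size=d).astype(np.float32)
print(two_stage_topk(x)[:5])   # indices of the (approximately) highest-scoring rows
```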
- QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models [64.34635279436054]
Mixture-of-Experts (MoE) architectures offer a general solution to the high inference costs of large language models (LLMs) via sparse routing.
We present a solution to this memory problem, in the form of a new compression and execution framework called QMoE.
arXiv Detail & Related papers (2023-10-25T17:24:53Z)
- Petals: Collaborative Inference and Fine-tuning of Large Models [78.37798144357977]
Many NLP tasks benefit from using large language models (LLMs) that often have more than 100 billion parameters.
With the release of BLOOM-176B and OPT-175B, everyone can download pretrained models of this scale.
We propose Petals, a system for inference and fine-tuning of large models collaboratively by joining the resources of multiple parties.
arXiv Detail & Related papers (2022-09-02T17:38:03Z)
- The trade-offs of model size in large recommendation models : A 10000$\times$ compressed criteo-tb DLRM model (100 GB parameters to mere 10MB) [40.623439224839245]
Embedding tables dominate industrial-scale recommendation model sizes, using up to terabytes of memory.
This paper analyzes and extensively evaluates a generic parameter sharing setup (PSS) for compressing DLRM models.
We show that the scales are tipped towards having a smaller DLRM model, leading to faster inference, easier deployment, and similar training times.
arXiv Detail & Related papers (2022-07-21T19:50:34Z)
- Efficient model compression with Random Operation Access Specific Tile (ROAST) hashing [35.67591281350068]
This paper proposes a model-agnostic, cache-friendly model compression approach: Random Operation Access Specific Tile (ROAST) hashing.
With ROAST, we present the first compressed BERT, which is $100\times - 1000\times$ smaller but does not result in quality degradation.
These compression levels on universal architecture like transformers are promising for the future of SOTA model deployment on resource-constrained devices like mobile and edge devices.
arXiv Detail & Related papers (2022-07-21T18:31:17Z)
- NAS-BERT: Task-Agnostic and Adaptive-Size BERT Compression with Neural Architecture Search [100.71365025972258]
We propose NAS-BERT, an efficient method for BERT compression.
NAS-BERT trains a big supernet on a search space and outputs multiple compressed models with adaptive sizes and latency.
Experiments on GLUE and SQuAD benchmark datasets demonstrate that NAS-BERT can find lightweight models with better accuracy than previous approaches.
arXiv Detail & Related papers (2021-05-30T07:20:27Z)
- TT-Rec: Tensor Train Compression for Deep Learning Recommendation Models [5.577715465378262]
Memory capacity of embedding tables in deep learning recommendation models (DLRMs) is increasing dramatically.
We show the potential of Tensor Train decomposition for DLRMs (TT-Rec).
We evaluate TT-Rec across three important design dimensions -- memory capacity, accuracy and timing performance.
arXiv Detail & Related papers (2021-01-25T23:19:03Z)
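The TT-Rec entry above mentions Tensor Train decomposition only in passing; a minimal sketch of the general idea, in which the dense embedding table is replaced by small TT cores and each row is reconstructed on the fly by contracting index-selected slices, is shown below. The factorizations, ranks, and initialization are assumptions chosen for illustration, not TT-Rec's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_factors = (200, 200, 250)   # 10M rows = 200 * 200 * 250
dim_factors   = (4, 4, 8)         # 128 dims = 4 * 4 * 8
ranks         = (1, 16, 16, 1)    # TT-ranks, with r_0 = r_3 = 1

# One 4-D core per factor, shaped (r_{k-1}, vocab_k, dim_k, r_k). The cores hold
# roughly 250K parameters in place of the 1.28B entries of a dense 10M x 128 table.
cores = [rng.normal(0, 0.1, size=(ranks[k], vocab_factors[k], dim_factors[k], ranks[k + 1]))
         for k in range(3)]

def tt_lookup(row):
    # Factor the row index in mixed radix over the vocabulary factors.
    i3 = row % vocab_factors[2]
    rest = row // vocab_factors[2]
    i2, i1 = rest % vocab_factors[1], rest // vocab_factors[1]
    # Select one slice per core and contract them along the TT-ranks.
    g1 = cores[0][:, i1]          # (1,  d1, r1)
    g2 = cores[1][:, i2]          # (r1, d2, r2)
    g3 = cores[2][:, i3]          # (r2, d3, 1)
    out = np.einsum('aib,bjc,ckd->ijk', g1, g2, g3)   # (d1, d2, d3)
    return out.reshape(-1)        # reconstructed 128-dim embedding row

print(tt_lookup(9_876_543).shape)  # (128,)
```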