TT-Rec: Tensor Train Compression for Deep Learning Recommendation Models
- URL: http://arxiv.org/abs/2101.11714v1
- Date: Mon, 25 Jan 2021 23:19:03 GMT
- Title: TT-Rec: Tensor Train Compression for Deep Learning Recommendation Models
- Authors: Chunxing Yin and Bilge Acun and Xing Liu and Carole-Jean Wu
- Abstract summary: Memory capacity of embedding tables in deep learning recommendation models (DLRMs) is increasing dramatically.
We show the potential of Tensor Train decomposition for DLRMs (TT-Rec).
We evaluate TT-Rec across three important design dimensions -- memory capacity, accuracy and timing performance.
- Score: 5.577715465378262
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The memory capacity of embedding tables in deep learning recommendation
models (DLRMs) is increasing dramatically from tens of GBs to TBs across the
industry. Given the fast growth in DLRMs, novel solutions are urgently needed,
in order to enable fast and efficient DLRM innovations. At the same time, this
must be done without having to exponentially increase infrastructure capacity
demands. In this paper, we demonstrate the promising potential of Tensor Train
decomposition for DLRMs (TT-Rec), an important yet under-investigated context.
We design and implement optimized kernels (TT-EmbeddingBag) to evaluate the
proposed TT-Rec design. TT-EmbeddingBag is 3 times faster than the SOTA TT
implementation. The performance of TT-Rec is further optimized with the batched
matrix multiplication and caching strategies for embedding vector lookup
operations. In addition, we present mathematically and empirically the effect
of weight initialization distribution on DLRM accuracy and propose to
initialize the tensor cores of TT-Rec following the sampled Gaussian
distribution. We evaluate TT-Rec across three important design space dimensions
-- memory capacity, accuracy, and timing performance -- by training MLPerf-DLRM
with Criteo's Kaggle and Terabyte data sets. TT-Rec achieves 117 times and 112
times model size compression, for Kaggle and Terabyte, respectively. This
impressive model size reduction can come with no accuracy loss and no training
time overhead as compared to the uncompressed baseline.
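
To make the idea concrete, here is a minimal sketch of a TT-format embedding lookup in NumPy. The shapes, TT-ranks, and Gaussian std below are illustrative assumptions, not the paper's settings, and this is not the optimized TT-EmbeddingBag kernel.

```python
import numpy as np

row_dims, col_dims, ranks = (8, 16, 8), (4, 4, 4), (1, 8, 8, 1)
# Represents a 1024 x 64 table (8*16*8 = 1024 rows, 4*4*4 = 64 dims).

rng = np.random.default_rng(0)
# TT-Rec proposes Gaussian-initialized TT cores; the std here is a placeholder.
cores = [
    rng.normal(0.0, 0.02, size=(ranks[k], row_dims[k], col_dims[k], ranks[k + 1]))
    for k in range(3)
]

def tt_embedding_lookup(i: int) -> np.ndarray:
    """Reconstruct row i of the never-materialized embedding table."""
    # Mixed-radix decomposition of the row index: i -> (i1, i2, i3).
    digits = []
    for n in reversed(row_dims):
        digits.append(i % n)
        i //= n
    i1, i2, i3 = reversed(digits)
    # Contract the selected core slices over the TT ranks.
    v = np.einsum("aib,bjc,ckd->ijk",
                  cores[0][:, i1], cores[1][:, i2], cores[2][:, i3])
    return v.reshape(-1)  # length 64

print(tt_embedding_lookup(123).shape)            # (64,)
print(sum(c.size for c in cores), 1024 * 64)     # ~4.6k vs ~65k parameters
```

Each lookup touches only one slice per core, which is why batched contraction and caching of hot rows (as in the paper) matter for throughput.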
Related papers
- DQRM: Deep Quantized Recommendation Models [34.73674946187648]
Large-scale recommendation models are the dominant workload for many large Internet companies.
The embedding tables in these models reach 1 TB and beyond, imposing a severe memory bottleneck on the training and inference of recommendation models.
We propose a novel recommendation framework that is small, powerful, and efficient to run and train, based on the state-of-the-art Deep Learning Recommendation Model (DLRM).
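The summary does not specify DQRM's exact quantization scheme; as a generic, hedged illustration of shrinking an embedding table by quantization, here is a symmetric int8 round trip (bit-width and per-row scaling are assumptions, not DQRM's method).

```python
import numpy as np

# Generic symmetric int8 quantize/dequantize of an embedding table; the
# bit-width and per-row scaling are illustrative, not DQRM's actual scheme.
rng = np.random.default_rng(1)
table = rng.normal(0.0, 0.1, size=(1000, 64)).astype(np.float32)

scale = np.abs(table).max(axis=1, keepdims=True) / 127.0  # per-row scale
q = np.clip(np.round(table / scale), -127, 127).astype(np.int8)

dequant = q.astype(np.float32) * scale   # applied at lookup time
print(q.nbytes / table.nbytes)           # 0.25: 4x smaller than float32
print(np.abs(dequant - table).max())     # small per-entry round-off error
```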
arXiv Detail & Related papers (2024-10-26T02:33:52Z)
- LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
The Key-Value (KV) cache is a crucial component in serving transformer-based autoregressive large language models (LLMs), but its memory footprint grows with sequence length.
Existing approaches to mitigate this include (1) efficient attention variants integrated in upcycling stages and (2) KV cache compression at test time.
We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining.
Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
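As a hedged sketch of the low-rank idea (not LoRC's progressive, layer-sensitive strategy), a KV projection weight can be factored once with a truncated SVD and served as two thin matrices; the dimension and rank below are chosen purely for illustration.

```python
import numpy as np

# Truncated-SVD factorization of a (hypothetical) KV projection weight.
rng = np.random.default_rng(2)
d, r = 512, 64                              # model dim and target rank (assumed)
W_k = rng.normal(0.0, 0.02, size=(d, d)).astype(np.float32)

U, S, Vt = np.linalg.svd(W_k, full_matrices=False)
A = U[:, :r] * S[:r]                        # (d, r)
B = Vt[:r]                                  # (r, d)

# At serving time x @ W_k becomes (x @ A) @ B: no retraining, and the
# factors store 2*d*r parameters instead of d*d.
x = rng.normal(size=(1, d)).astype(np.float32)
rel_err = np.linalg.norm((x @ A) @ B - x @ W_k) / np.linalg.norm(x @ W_k)
print(2 * d * r, d * d, rel_err)            # 65,536 vs 262,144 params
```

The error printed here is large because a random matrix has a flat spectrum; trained attention weights are far more compressible, which is what low-rank methods rely on.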
arXiv Detail & Related papers (2024-10-04T03:10:53Z)
- MoDeGPT: Modular Decomposition for Large Language Model Compression [59.361006801465344]
This paper introduces Modular Decomposition (MoDeGPT), a novel structured compression framework.
MoDeGPT partitions the Transformer block into modules comprised of matrix pairs and reduces the hidden dimensions.
Our experiments show MoDeGPT, without backward propagation, matches or surpasses previous structured compression methods.
arXiv Detail & Related papers (2024-08-19T01:30:14Z)
- Efficient GPT Model Pre-training using Tensor Train Matrix Representation [65.96485282393361]
Large-scale transformer models feature billions of parameters, leading to difficulties in their deployment and prohibitive training costs from scratch.
To reduce the number of parameters in the GPT-2 architecture, we replace the matrices of fully-connected layers with the corresponding Tensor Train Matrix (TTM) structure.
The resulting GPT-based model stores up to 40% fewer parameters, showing perplexity comparable to the original model.
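As a rough, hedged sketch of the TTM idea (the factorizations and ranks below are illustrative, not the paper's settings), a 768 x 3072 fully-connected weight can be held as three small 4-d cores.

```python
import numpy as np

# TTM-style factorization of a 768 x 3072 fully-connected weight.
in_dims, out_dims, ranks = (8, 12, 8), (32, 12, 8), (1, 16, 16, 1)
# 8*12*8 = 768 input features, 32*12*8 = 3072 output features.

rng = np.random.default_rng(3)
cores = [
    rng.normal(0.0, 0.02, size=(ranks[k], in_dims[k], out_dims[k], ranks[k + 1]))
    for k in range(3)
]

# Materialize the full matrix once just to check shapes and compression;
# a real TTM layer contracts the cores with the input tensor instead.
W = np.einsum("aijb,bkld,dmne->ikmjln", *cores).reshape(768, 3072)
print(W.shape, sum(c.size for c in cores), 768 * 3072)  # ~42k vs ~2.36M
```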
arXiv Detail & Related papers (2023-06-05T08:38:25Z)
- Towards Memory- and Time-Efficient Backpropagation for Training Spiking Neural Networks [70.75043144299168]
Spiking Neural Networks (SNNs) are promising energy-efficient models for neuromorphic computing.
We propose the Spatial Learning Through Time (SLTT) method that can achieve high performance while greatly improving training efficiency.
Our method achieves state-of-the-art accuracy on ImageNet, while the memory cost and training time are reduced by more than 70% and 50%, respectively, compared with BPTT.
arXiv Detail & Related papers (2023-02-28T05:01:01Z)
- The trade-offs of model size in large recommendation models: A 10000× compressed criteo-tb DLRM model (100 GB parameters to a mere 10 MB) [40.623439224839245]
Embedding tables dominate industrial-scale recommendation model sizes, using up to terabytes of memory.
This paper analyzes and extensively evaluates a generic parameter sharing setup (PSS) for compressing DLRM models.
We show that the scales are tipped towards a smaller DLRM model, leading to faster inference, easier deployment, and similar training times.
arXiv Detail & Related papers (2022-07-21T19:50:34Z)
- Provable Tensor-Train Format Tensor Completion by Riemannian Optimization [22.166436026482984]
We provide the first theoretical guarantees of the convergence of the RGrad algorithm for TT-format tensor completion.
We also propose a novel approach, referred to as the sequential second-order moment method.
arXiv Detail & Related papers (2021-08-27T08:13:58Z)
- Random Offset Block Embedding Array (ROBE) for CriteoTB Benchmark MLPerf DLRM Model: 1000× Compression and 2.7× Faster Inference [33.66462823637363]
State-of-the-art recommendation models are among the largest models, rivalling the likes of GPT-3 and Switch Transformer.
The size of deep learning recommendation models (DLRM) stems from learning dense embeddings for each of the categorical values.
Model compression for DLRM is gaining traction and the community has recently shown impressive compression results.
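As a hedged sketch of the ROBE idea described in the title (the array size and hash constants below are arbitrary choices, not the paper's), every embedding is a block read from one small shared array at a hashed offset.

```python
import numpy as np

# ROBE-style lookup: embeddings are blocks of one shared parameter array,
# addressed by a universal hash. All constants here are illustrative.
Z, D = 2**16, 64                            # shared array size, embedding dim
rng = np.random.default_rng(4)
memory = rng.normal(0.0, 0.02, size=Z).astype(np.float32)

P, a, b = 2**31 - 1, 179426549, 15485863    # arbitrary hash primes

def robe_lookup(value: int) -> np.ndarray:
    # Hash the categorical id to a start offset, then read a contiguous
    # block, wrapping around the shared array.
    start = (a * value + b) % P % Z
    return memory[(start + np.arange(D)) % Z]

print(robe_lookup(12345).shape)             # (64,) served from 65,536 floats
```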
arXiv Detail & Related papers (2021-08-04T17:28:45Z)
- Towards Efficient Tensor Decomposition-Based DNN Model Compression with Optimization Framework [14.27609385208807]
We propose a systematic framework for tensor decomposition-based model compression using the Alternating Direction Method of Multipliers (ADMM).
Our framework is very general, and it works for both CNNs and RNNs.
Experimental results show that our ADMM-based TT-format models demonstrate very high compression performance with high accuracy.
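As a minimal, hedged ADMM sketch: a plain truncated-SVD low-rank projection stands in for the paper's TT-format projection, and the toy problem, sizes, and step sizes are all illustrative.

```python
import numpy as np

# ADMM for decomposition-constrained compression: split the weight as
# W = Z, train W on the loss, and project Z onto the constraint set.
rng = np.random.default_rng(5)
X = rng.normal(size=(256, 64))
Y = X @ rng.normal(size=(64, 4)) @ rng.normal(size=(4, 32))  # rank-4 target

W = np.zeros((64, 32)); Z = W.copy(); U = W.copy()
rho, lr, rank = 1.0, 0.1, 4

def project_rank(M, r):
    u, s, vt = np.linalg.svd(M, full_matrices=False)
    return (u[:, :r] * s[:r]) @ vt[:r]      # truncated-SVD projection

for _ in range(300):
    grad = 2 * X.T @ (X @ W - Y) / len(X) + rho * (W - Z + U)
    W -= lr * grad                          # W-step: gradient on the loss
    Z = project_rank(W + U, rank)           # Z-step: enforce the constraint
    U += W - Z                              # dual update

print(np.linalg.matrix_rank(Z), np.linalg.norm(X @ Z - Y))
```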
arXiv Detail & Related papers (2021-07-26T18:31:33Z)
- Training Recommender Systems at Scale: Communication-Efficient Model and Data Parallelism [56.78673028601739]
We propose a compression framework called Dynamic Communication Thresholding (DCT) for communication-efficient hybrid training.
DCT reduces communication by at least 100× and 20× during DP and MP, respectively.
It improves end-to-end training time for a state-of-the-art industrial recommender model by 37%, without any loss in performance.
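As a hedged sketch of threshold-style gradient compression (the sparsity level and the error-feedback buffer are illustrative choices, not DCT's exact algorithm):

```python
import numpy as np

# Ship only the largest-magnitude gradient entries; keep the rest in a
# local error-feedback buffer so nothing is permanently dropped.
rng = np.random.default_rng(6)
grad = rng.normal(size=100_000).astype(np.float32)
residual = np.zeros_like(grad)

def compress(g, keep=0.01):
    g = g + residual                        # fold in previous leftovers
    k = max(1, int(keep * g.size))
    idx = np.argpartition(np.abs(g), -k)[-k:]
    vals = g[idx]
    residual[:] = g                         # everything not sent stays local
    residual[idx] = 0.0
    return idx, vals

idx, vals = compress(grad)
print(idx.size, grad.size // idx.size)      # 1000 entries sent: ~100x less traffic
```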
arXiv Detail & Related papers (2020-10-18T01:44:42Z)
- A Generic Network Compression Framework for Sequential Recommender Systems [71.81962915192022]
Sequential recommender systems (SRS) have become the key technology in capturing users' dynamic interests and generating high-quality recommendations.
We propose a compressed sequential recommendation framework, termed CpRec, in which two generic model shrinking techniques are employed.
Through extensive ablation studies, we demonstrate that the proposed CpRec can achieve compression rates of 4 to 8 times on real-world SRS datasets.
arXiv Detail & Related papers (2020-04-21T08:40:55Z)