Efficient model compression with Random Operation Access Specific Tile
(ROAST) hashing
- URL: http://arxiv.org/abs/2207.10702v1
- Date: Thu, 21 Jul 2022 18:31:17 GMT
- Title: Efficient model compression with Random Operation Access Specific Tile
(ROAST) hashing
- Authors: Aditya Desai, Keren Zhou, Anshumali Shrivastava
- Abstract summary: This paper proposes a model-agnostic, cache-friendly model compression approach: Random Operation Access Specific Tile (ROAST) hashing.
With ROAST, we present the first compressed BERT, which is $100\times - 1000\times$ smaller but does not result in quality degradation.
These compression levels on a universal architecture like the transformer are promising for the future of SOTA model deployment on resource-constrained devices like mobile and edge devices.
- Score: 35.67591281350068
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Advancements in deep learning are often associated with increasing model
sizes. The model size dramatically affects the deployment cost and latency of
deep models. For instance, models like BERT cannot be deployed on edge devices
and mobiles due to their sheer size. As a result, most advances in Deep
Learning are yet to reach the edge. Model compression has received
much-deserved attention in the literature across natural language processing,
vision, and recommendation domains. This paper proposes a model-agnostic, cache-friendly
recommendation domains. This paper proposes a model-agnostic, cache-friendly
model compression approach: Random Operation Access Specific Tile (ROAST)
hashing. ROAST collapses the parameters by clubbing them through a lightweight
mapping. Notably, while clubbing these parameters, ROAST utilizes cache
hierarchies by aligning the memory access pattern with the parameter access
pattern. ROAST is up to $\sim 25 \times$ faster to train and $\sim 50 \times$
faster to infer than the popular parameter sharing method HashedNet.
Additionally, ROAST introduces global weight sharing, which is empirically and
theoretically superior to local weight sharing in HashedNet, and can be of
independent interest in itself. With ROAST, we present the first compressed
BERT, which is $100\times - 1000\times$ smaller but does not result in quality
degradation. These compression levels on a universal architecture like the
transformer are promising for the future of SOTA model deployment on
resource-constrained devices like mobile and edge devices.
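
The abstract describes ROAST's two ingredients only at a high level: every layer's parameters are recovered from one shared array through a lightweight hash (global weight sharing), and lookups happen in contiguous tiles so that the memory access pattern follows the parameter access pattern. The snippet below is a minimal NumPy sketch of that reading, not the authors' implementation; the multiplicative hash, the tile size, and the memory budget are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# One global parameter store shared by every layer (global weight sharing).
MEMORY_SIZE = 10_000   # compression budget (illustrative, not from the paper)
TILE = 64              # contiguous chunk size (illustrative)
memory = rng.standard_normal(MEMORY_SIZE).astype(np.float32)

def hash_offset(layer_id: int, chunk_id: int) -> int:
    """Lightweight hash mapping (layer, chunk) to a start offset in the store.
    A simple multiplicative hash stands in for whatever hash family ROAST uses."""
    h = (layer_id * 0x9E3779B1 + chunk_id * 0x85EBCA77) & 0xFFFFFFFF
    return h % (MEMORY_SIZE - TILE)

def roast_style_weight(layer_id: int, rows: int, cols: int) -> np.ndarray:
    """Materialize a layer's weight from the shared store, one contiguous
    TILE-sized slice per chunk, so memory reads follow the parameter order."""
    n = rows * cols
    chunks = []
    for c in range((n + TILE - 1) // TILE):
        start = hash_offset(layer_id, c)
        chunks.append(memory[start:start + TILE])
    return np.concatenate(chunks)[:n].reshape(rows, cols)

def hashednet_style_weight(layer_id: int, rows: int, cols: int) -> np.ndarray:
    """Contrast: a HashedNet-style lookup hashes every scalar independently,
    producing scattered, cache-unfriendly reads into the store."""
    n = rows * cols
    idx = np.array([hash_offset(layer_id, i) for i in range(n)])
    return memory[idx].reshape(rows, cols)

# Two layers of different sizes drawn from the same 10k-float store.
W0 = roast_style_weight(0, 256, 128)
W1 = roast_style_weight(1, 512, 256)
print(W0.shape, W1.shape, memory.size)
```

The contiguous-slice lookup versus the per-scalar gather is the cache-friendliness the abstract refers to when it reports ROAST being roughly $25\times$ faster to train and $50\times$ faster to infer than HashedNet.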
Related papers
- A 7K Parameter Model for Underwater Image Enhancement based on Transmission Map Prior [13.453441079833627]
Deep learning models for underwater image enhancement struggle to be both lightweight and effective.
In this paper, a lightweight selective attention network (LSNet) is proposed.
The proposed model achieves 97% of the PSNR of a similar attention-based model with only 7K parameters.
arXiv Detail & Related papers (2024-05-25T11:58:24Z) - TensorGPT: Efficient Compression of Large Language Models based on Tensor-Train Decomposition [19.897367559948336]
We propose a training-free model compression approach based on the Tensor-Train Decomposition (TTD).
We then investigate the low-rank structures extracted by this approach in terms of the compression ratio, the language task performance, and latency on a typical low-end device (i.e., a Raspberry Pi).
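
The summary only names the technique, so here is a hedged NumPy sketch of a TT-SVD decomposition applied to a single hypothetical 1024-dimensional embedding vector reshaped into a small tensor. It is meant to illustrate the kind of Tensor-Train structure being investigated, not TensorGPT's actual pipeline, tensorization, or rank settings.

```python
import numpy as np

def tt_svd(tensor: np.ndarray, max_rank: int) -> list[np.ndarray]:
    """Decompose a tensor into Tensor-Train cores via sequential truncated SVDs."""
    shape = tensor.shape
    cores, r_prev = [], 1
    C = tensor.reshape(shape[0], -1)
    for n_k in shape[:-1]:
        C = C.reshape(r_prev * n_k, -1)
        U, S, Vt = np.linalg.svd(C, full_matrices=False)
        r_k = min(max_rank, len(S))
        cores.append(U[:, :r_k].reshape(r_prev, n_k, r_k))
        C = S[:r_k, None] * Vt[:r_k]     # carry the remainder to the next core
        r_prev = r_k
    cores.append(C.reshape(r_prev, shape[-1], 1))
    return cores

def tt_reconstruct(cores: list[np.ndarray]) -> np.ndarray:
    """Contract the cores back into a dense tensor (for checking the error)."""
    out = cores[0]
    for core in cores[1:]:
        out = np.tensordot(out, core, axes=([-1], [0]))
    return out.squeeze(axis=(0, -1))

# Hypothetical example: one 1024-dim embedding row as a 4x4x4x4x4 tensor.
# (Random data will not compress well; real embeddings are where low-rank
# structure tends to appear.)
rng = np.random.default_rng(0)
row = rng.standard_normal(1024).astype(np.float32)
cores = tt_svd(row.reshape(4, 4, 4, 4, 4), max_rank=8)
approx = tt_reconstruct(cores).reshape(-1)
print(sum(c.size for c in cores), "stored params vs", row.size)
print("relative error:", np.linalg.norm(row - approx) / np.linalg.norm(row))
```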
arXiv Detail & Related papers (2023-07-02T09:33:09Z) - Instant Soup: Cheap Pruning Ensembles in A Single Pass Can Draw Lottery
Tickets from Large Models [106.19385911520652]
The Lottery Ticket Hypothesis (LTH) and its variants have been exploited to prune large pre-trained models, generating subnetworks that preserve quality.
LTH is enormously inhibited by the repetitive full-training-and-pruning routine of iterative magnitude pruning (IMP).
We propose Instant Soup Pruning (ISP) to generate lottery-ticket-quality subnetworks.
arXiv Detail & Related papers (2023-06-18T03:09:52Z) - ZipLM: Inference-Aware Structured Pruning of Language Models [56.52030193434863]
We propose a novel structured compression approach for large language models (LLMs) called ZipLM.
ZipLM achieves state-of-the-art accuracy-vs-speedup, while matching a set of desired target runtime speedups.
ZipLM produces state-of-the-art compressed models across all settings.
arXiv Detail & Related papers (2023-02-07T18:55:28Z) - Learning to Collide: Recommendation System Model Compression with
Learned Hash Functions [4.6994057182972595]
A key characteristic of deep recommendation models is the immense memory requirements of their embedding tables.
A common technique to reduce model size is to hash all of the categorical variable identifiers (ids) into a smaller space.
This hashing reduces the number of unique representations that must be stored in the embedding table; thus decreasing its size.
We introduce an alternative approach, Learned Hash Functions, which instead learns a new mapping function that encourages collisions between semantically similar ids.
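
For context, the common technique described above is the hashing trick: a static hash folds a huge id space into a much smaller embedding table, so unrelated ids collide at random. Below is a minimal sketch of that baseline under illustrative sizes and an illustrative hash; the paper's Learned Hash Functions would replace the static `hash_id` with a learned mapping that encourages semantically similar ids to share a row.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_IDS = 10_000_000   # raw categorical vocabulary, e.g. item ids (illustrative)
TABLE_ROWS = 100_000   # rows actually stored after hashing (illustrative)
DIM = 32

# Only TABLE_ROWS embedding vectors are stored instead of NUM_IDS.
table = rng.standard_normal((TABLE_ROWS, DIM)).astype(np.float32)

def hash_id(raw_id: int) -> int:
    """Static hash used by the baseline hashing trick; collisions are arbitrary."""
    return (raw_id * 2654435761) % TABLE_ROWS

def embed(raw_ids: list[int]) -> np.ndarray:
    """Look up each raw id through the hash into the compressed table."""
    return table[[hash_id(i) for i in raw_ids]]

vecs = embed([17, 42, 9_999_999])
print(vecs.shape)   # (3, 32): ids from a 10M space share a 100k-row table
```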
arXiv Detail & Related papers (2022-03-28T06:07:30Z) - DSEE: Dually Sparsity-embedded Efficient Tuning of Pre-trained Language
Models [152.29364079385635]
As pre-trained models grow bigger, the fine-tuning process can be time-consuming and computationally expensive.
We propose a framework for resource- and parameter-efficient fine-tuning by leveraging the sparsity prior in both weight updates and the final model weights.
Our proposed framework, dubbed Dually Sparsity-Embedded Efficient Tuning (DSEE), aims to achieve two key objectives: (i) parameter efficient fine-tuning and (ii) resource-efficient inference.
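
As a toy illustration of the two sparsity priors this summary mentions (sparse weight updates during fine-tuning and sparse final weights for inference), consider the sketch below. It is not DSEE's actual decomposition or training procedure; the sparsity levels and the magnitude-based pruning rule are placeholder assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
W_pre = rng.standard_normal((768, 768)).astype(np.float32)  # frozen pre-trained weight

# (i) Parameter-efficient fine-tuning: learn only a sparse delta on top of W_pre.
delta = np.zeros_like(W_pre)
idx = rng.choice(W_pre.size, size=W_pre.size // 100, replace=False)  # ~1% of entries
delta.flat[idx] = 0.01 * rng.standard_normal(idx.size)  # stand-in for trained values
W_tuned = W_pre + delta

# (ii) Resource-efficient inference: additionally prune the final weights.
threshold = np.quantile(np.abs(W_tuned), 0.5)            # drop the smallest 50%
W_sparse = np.where(np.abs(W_tuned) >= threshold, W_tuned, 0.0)

print("trainable entries:", np.count_nonzero(delta),
      "of", W_pre.size, "| final nonzeros:", np.count_nonzero(W_sparse))
```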
arXiv Detail & Related papers (2021-10-30T03:29:47Z) - Random Offset Block Embedding Array (ROBE) for CriteoTB Benchmark MLPerf
DLRM Model : 1000$\times$ Compression and 2.7$\times$ Faster Inference [33.66462823637363]
State-of-the-art recommendation models are among the largest models, rivalling the likes of GPT-3 and Switch Transformer.
The size of deep learning recommendation models (DLRM) stems from learning dense embeddings for each of the categorical values.
Model compression for DLRM is gaining traction and the community has recently shown impressive compression results.
arXiv Detail & Related papers (2021-08-04T17:28:45Z) - You Only Compress Once: Towards Effective and Elastic BERT Compression
via Exploit-Explore Stochastic Nature Gradient [88.58536093633167]
Existing model compression approaches require re-compression or fine-tuning across diverse constraints to accommodate various hardware deployments.
We propose a novel approach, YOCO-BERT, to achieve compress once and deploy everywhere.
Compared with state-of-the-art algorithms, YOCO-BERT provides more compact models, yet achieving 2.1%-4.5% average accuracy improvement on the GLUE benchmark.
arXiv Detail & Related papers (2021-06-04T12:17:44Z) - Exploring Sparse Expert Models and Beyond [51.90860155810848]
Mixture-of-Experts (MoE) models can achieve promising results with an outrageously large number of parameters but constant computation cost.
We propose a simple method called expert prototyping that splits experts into different prototypes and applies $k$ top-$1$ routing.
This strategy improves model quality while maintaining constant computational cost, and our further exploration of extremely large-scale models shows that it is more effective for training larger models.
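
As a concrete picture of $k$ top-$1$ routing, the sketch below (with illustrative sizes) splits the experts into $k$ prototype groups, routes each token to exactly one expert inside every group, and sums the $k$ selected outputs, so compute per token stays at $k$ expert calls no matter how many experts exist in total. This is a reading of the summary, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)
D, NUM_EXPERTS, K = 16, 8, 2              # K prototype groups (illustrative sizes)
GROUP = NUM_EXPERTS // K                  # experts per prototype group

experts = rng.standard_normal((NUM_EXPERTS, D, D)).astype(np.float32)  # tiny linear "experts"
routers = rng.standard_normal((K, D, GROUP)).astype(np.float32)        # one router per group

def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def k_top1_routing(x: np.ndarray) -> np.ndarray:
    """Route each token to one expert per prototype group and sum the K outputs.
    Compute per token is K expert calls, independent of NUM_EXPERTS."""
    out = np.zeros_like(x)
    for k in range(K):
        logits = x @ routers[k]                     # (batch, GROUP)
        gates = softmax(logits)
        choice = logits.argmax(axis=-1)             # top-1 within group k
        for b, c in enumerate(choice):
            e = k * GROUP + c                       # global expert index
            out[b] += gates[b, c] * (x[b] @ experts[e])
    return out

x = rng.standard_normal((4, D)).astype(np.float32)
print(k_top1_routing(x).shape)                      # (4, 16)
```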
arXiv Detail & Related papers (2021-05-31T16:12:44Z) - ROSITA: Refined BERT cOmpreSsion with InTegrAted techniques [10.983311133796745]
Pre-trained language models of the BERT family have defined the state of the art in a wide range of NLP tasks.
The performance of BERT-based models is mainly driven by their enormous number of parameters, which hinders their application to resource-limited scenarios.
We introduce three kinds of compression methods (weight pruning, low-rank factorization and knowledge distillation) and explore a range of designs concerning model architecture.
Our best compressed model, dubbed Refined BERT cOmpreSsion with InTegrAted techniques (ROSITA), is $7.5\times$ smaller than BERT.
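
Of the three method families listed, low-rank factorization is the simplest to show concretely. The snippet below is a generic truncated-SVD factorization of one hypothetical BERT-sized weight matrix; it illustrates the general idea only and says nothing about ROSITA's particular factorization choices or where it is applied.

```python
import numpy as np

def low_rank_factorize(W: np.ndarray, rank: int):
    """Replace a dense weight W (out x in) with two thin factors A @ B.
    Storage drops from out*in to rank*(out + in) parameters."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # (out, rank), singular values folded into A
    B = Vt[:rank]                # (rank, in)
    return A, B

rng = np.random.default_rng(0)
W = rng.standard_normal((768, 3072)).astype(np.float32)  # hypothetical FFN weight
A, B = low_rank_factorize(W, rank=64)
print(A.size + B.size, "params vs", W.size)
print("relative error:", np.linalg.norm(W - A @ B) / np.linalg.norm(W))
```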
arXiv Detail & Related papers (2021-03-21T11:33:33Z)