Rank and run-time aware compression of NLP Applications
- URL: http://arxiv.org/abs/2010.03193v1
- Date: Tue, 6 Oct 2020 16:03:15 GMT
- Title: Rank and run-time aware compression of NLP Applications
- Authors: Urmish Thakker, Jesse Beu, Dibakar Gope, Ganesh Dasika, Matthew
Mattina
- Abstract summary: This paper proposes a new compression technique called Hybrid Matrix Factorization.
It improves low-rank matrix factorization techniques by doubling the rank of the matrix.
It can achieve more than 2.32x faster inference run-time than pruning and 16.77% better accuracy than LMF.
- Score: 12.965657113072325
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sequence model based NLP applications can be large. Yet, many applications
that benefit from them run on small devices with very limited compute and
storage capabilities, while still having run-time constraints. As a result,
there is a need for a compression technique that can achieve significant
compression without negatively impacting inference run-time and task accuracy.
This paper proposes a new compression technique called Hybrid Matrix
Factorization that achieves this dual objective. HMF improves low-rank matrix
factorization (LMF) techniques by doubling the rank of the matrix using an
intelligent hybrid-structure leading to better accuracy than LMF. Further, by
preserving dense matrices, it leads to faster inference run-time than pruning
or structure matrix based compression technique. We evaluate the impact of this
technique on 5 NLP benchmarks across multiple tasks (Translation, Intent
Detection, Language Modeling) and show that for similar accuracy values and
compression factors, HMF can achieve more than 2.32x faster inference run-time
than pruning and 16.77% better accuracy than LMF.
Related papers
- In-Context Former: Lightning-fast Compressing Context for Large Language Model [48.831304302467004]
In this paper, we propose a new approach to compress the long input contexts of Transformer-based large language models (LLMs)
We use the cross-attention mechanism and a small number of learnable digest tokens to condense information from the contextual word embeddings.
Experimental results indicate that our method requires only 1/32 of the floating-point operations of the baseline during compression and improves processing speed by 68 to 112 times.
arXiv Detail & Related papers (2024-06-19T15:14:55Z) - Accelerating Matrix Factorization by Dynamic Pruning for Fast Recommendation [0.49399484784577985]
Matrix factorization (MF) is a widely used collaborative filtering algorithm for recommendation systems (RSs)
With the dramatically increased number of users/items in current RSs, the computational complexity for training a MF model largely increases.
We propose algorithmic methods to accelerate MF, without inducing any additional computational resources.
arXiv Detail & Related papers (2024-03-18T16:27:33Z) - Extreme Compression of Large Language Models via Additive Quantization [59.3122859349777]
AQLM is first scheme that is optimal in terms of accuracy-vs-model-size when compressing to less than 3 bits per parameter.
We provide fast GPU and CPU implementations of AQLM for token generation.
arXiv Detail & Related papers (2024-01-11T18:54:44Z) - Low-Rank Prune-And-Factorize for Language Model Compression [18.088550230146247]
Matrix factorization fails to retain satisfactory performance under moderate to high compression rate.
We propose two techniques: sparsity-aware SVD and mixed-rank fine-tuning.
arXiv Detail & Related papers (2023-06-25T07:38:43Z) - LUT-GEMM: Quantized Matrix Multiplication based on LUTs for Efficient Inference in Large-Scale Generative Language Models [9.727062803700264]
We introduce LUT-GEMM, an efficient kernel for quantized matrix multiplication.
LUT-GEMM eliminates the resource-intensive dequantization process and reduces computational costs.
We show experimentally that when applied to the OPT-175B model with 3-bit quantization, LUT-GEMM substantially accelerates token generation latency.
arXiv Detail & Related papers (2022-06-20T03:48:17Z) - Monarch: Expressive Structured Matrices for Efficient and Accurate
Training [64.6871423399431]
Large neural networks excel in many domains, but they are expensive to train and fine-tune.
A popular approach to reduce their compute or memory requirements is to replace dense weight matrices with structured ones.
We propose a class of matrices (Monarch) that is hardware-efficient.
arXiv Detail & Related papers (2022-04-01T17:37:29Z) - Dynamic Probabilistic Pruning: A general framework for
hardware-constrained pruning at different granularities [80.06422693778141]
We propose a flexible new pruning mechanism that facilitates pruning at different granularities (weights, kernels, filters/feature maps)
We refer to this algorithm as Dynamic Probabilistic Pruning (DPP)
We show that DPP achieves competitive compression rates and classification accuracy when pruning common deep learning models trained on different benchmark datasets for image classification.
arXiv Detail & Related papers (2021-05-26T17:01:52Z) - Doping: A technique for efficient compression of LSTM models using
sparse structured additive matrices [14.321761305835972]
We propose the notion of doping -- addition of an extremely sparse matrix to a structured matrix.
Doping facilitates additional degrees of freedom for a small number of parameters, allowing them to independently diverge from the fixed structure.
We show that doped KP compression technique outperforms previous state-of-the art compression results by achieving 1.3 - 2.4x higher compression factor at a similar accuracy.
arXiv Detail & Related papers (2021-02-14T05:14:09Z) - ALF: Autoencoder-based Low-rank Filter-sharing for Efficient
Convolutional Neural Networks [63.91384986073851]
We propose the autoencoder-based low-rank filter-sharing technique technique (ALF)
ALF shows a reduction of 70% in network parameters, 61% in operations and 41% in execution time, with minimal loss in accuracy.
arXiv Detail & Related papers (2020-07-27T09:01:22Z) - A Generic Network Compression Framework for Sequential Recommender
Systems [71.81962915192022]
Sequential recommender systems (SRS) have become the key technology in capturing user's dynamic interests and generating high-quality recommendations.
We propose a compressed sequential recommendation framework, termed as CpRec, where two generic model shrinking techniques are employed.
By the extensive ablation studies, we demonstrate that the proposed CpRec can achieve up to 4$sim$8 times compression rates in real-world SRS datasets.
arXiv Detail & Related papers (2020-04-21T08:40:55Z) - A High-Performance Implementation of Bayesian Matrix Factorization with
Limited Communication [10.639704288188767]
Matrix factorization algorithms can quantify uncertainty in their predictions and avoid over-fitting.
They have not been widely used on large-scale data because of their prohibitive computational cost.
We show that the state-of-the-art of both approaches to scalability can be combined.
arXiv Detail & Related papers (2020-04-06T11:16:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.