Doping: A technique for efficient compression of LSTM models using
sparse structured additive matrices
- URL: http://arxiv.org/abs/2102.07071v1
- Date: Sun, 14 Feb 2021 05:14:09 GMT
- Title: Doping: A technique for efficient compression of LSTM models using
sparse structured additive matrices
- Authors: Urmish Thakker, Paul N. Whatmough, Zhigang Liu, Matthew Mattina, Jesse
Beu
- Abstract summary: We propose the notion of doping -- addition of an extremely sparse matrix to a structured matrix.
Doping facilitates additional degrees of freedom for a small number of parameters, allowing them to independently diverge from the fixed structure.
- We show that the doped KP compression technique outperforms previous state-of-the-art compression results, achieving a 1.3 - 2.4x higher compression factor at similar accuracy.
- Score: 14.321761305835972
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Structured matrices, such as those derived from Kronecker products (KP), are
effective at compressing neural networks, but can lead to unacceptable accuracy
loss when applied to large models. In this paper, we propose the notion of
doping -- addition of an extremely sparse matrix to a structured matrix. Doping
facilitates additional degrees of freedom for a small number of parameters,
allowing them to independently diverge from the fixed structure. To train LSTMs
with doped structured matrices, we introduce the additional parameter matrix
while slowly annealing its sparsity level. However, we find that performance
degrades as we slowly sparsify the doping matrix, due to co-matrix adaptation
(CMA) between the structured and the sparse matrices. We address this
over-dependence on the sparse matrix using a co-matrix dropout regularization (CMR)
scheme. We provide empirical evidence to show that doping, CMA and CMR are
concepts generally applicable to multiple structured matrices (Kronecker
Product, LMF, Hybrid Matrix Decomposition). Additionally, results with doped
Kronecker product matrices demonstrate state-of-the-art accuracy at large
compression factors (10 - 25x) across 4 natural language processing
applications with minor loss in accuracy. The doped KP compression technique
outperforms previous state-of-the-art compression results, achieving a 1.3 -
2.4x higher compression factor at similar accuracy, while also beating strong
alternatives like pruning and low-rank methods by a large margin (8% or more).
Additionally, we show that doped KP can be deployed on commodity hardware using
the current software stack and achieve 2.5 - 5.5x inference run-time speed-up
over baseline.
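The idea described in the abstract can be sketched compactly. The following is a minimal, illustrative PyTorch sketch, not the authors' implementation: the layer weight is parameterized as kron(A, B) plus an extremely sparse "doping" matrix whose support is annealed during training, and CMR is interpreted here simply as elementwise dropout on the doping term. The class and method names (DopedKPLinear, anneal_sparsity, cmr_drop) and the annealing details are our own assumptions.

```python
# Minimal sketch of a doped Kronecker-product linear layer (illustration only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class DopedKPLinear(nn.Module):
    """Weight W = kron(A, B) + S, where S is an extremely sparse additive matrix."""

    def __init__(self, out_a, in_a, out_b, in_b, cmr_drop=0.3):
        super().__init__()
        # Kronecker factors: kron(A, B) has shape (out_a*out_b, in_a*in_b).
        self.A = nn.Parameter(torch.randn(out_a, in_a) * 0.1)
        self.B = nn.Parameter(torch.randn(out_b, in_b) * 0.1)
        # Dense doping parameters; the binary mask is tightened during training.
        self.S = nn.Parameter(torch.zeros(out_a * out_b, in_a * in_b))
        self.register_buffer("mask", torch.ones(out_a * out_b, in_a * in_b))
        self.cmr_drop = cmr_drop  # co-matrix dropout probability (CMR)

    @torch.no_grad()
    def anneal_sparsity(self, keep_fraction):
        """Keep only the largest-magnitude fraction of doping entries."""
        k = max(1, int(keep_fraction * self.S.numel()))
        threshold = self.S.abs().flatten().kthvalue(self.S.numel() - k + 1).values
        self.mask.copy_((self.S.abs() >= threshold).float())

    def forward(self, x):
        structured = torch.kron(self.A, self.B)
        doping = self.S * self.mask
        if self.training and self.cmr_drop > 0:
            # CMR (our interpretation): randomly drop doping entries so the
            # structured part cannot over-rely on the sparse matrix (CMA).
            doping = F.dropout(doping, p=self.cmr_drop)
        return x @ (structured + doping).t()

# During training, one would call layer.anneal_sparsity(f) periodically while
# decaying f from 1.0 toward the target sparsity (e.g. 1-2% non-zeros).
layer = DopedKPLinear(out_a=16, in_a=16, out_b=16, in_b=16)
y = layer(torch.randn(8, 16 * 16))  # output shape (8, 256)
```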
Related papers
- From GaLore to WeLore: How Low-Rank Weights Non-uniformly Emerge from Low-Rank Gradients [86.40635601953446]
- We study the emergence of low-rank structures across different layers of modern Large Language Models.
- We present Weight Low-Rank Projection (WeLore), which unifies weight compression and memory-efficient fine-tuning in a single approach.
arXiv Detail & Related papers (2024-07-15T21:05:20Z)
- Compute Better Spent: Replacing Dense Layers with Structured Matrices [77.61728033234233]
We identify more efficient alternatives to dense matrices, as exemplified by the success of convolutional networks in the image domain.
We show that different structures often require drastically different initialization scales and learning rates, which are crucial to performance.
We propose a novel matrix family containing Monarch matrices, the Block-Train, which we show performs better than dense for the same compute on multiple tasks.
arXiv Detail & Related papers (2024-06-10T13:25:43Z)
- Low-Rank Prune-And-Factorize for Language Model Compression [18.088550230146247]
- Matrix factorization fails to retain satisfactory performance under moderate to high compression rates.
We propose two techniques: sparsity-aware SVD and mixed-rank fine-tuning.
arXiv Detail & Related papers (2023-06-25T07:38:43Z)
- Common Subexpression-based Compression and Multiplication of Sparse Constant Matrices [0.0]
- This paper presents a compression format that extends Compressed Sparse Row (CSR) to include common subexpressions (CSEs).
- It produces an adder tree for a $1000 \times 1000$ matrix in a minute.
- Simulations for a single-core embedded system show that the matrix multiplication execution time can be reduced by 20%.
arXiv Detail & Related papers (2023-03-26T22:14:15Z)
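For reference, the standard CSR format that the entry above extends can be sketched in a few lines of plain Python/NumPy; the common-subexpression extension and adder-tree generation from the paper are not reproduced here.

```python
# Standard CSR (Compressed Sparse Row) storage and sparse matrix-vector multiply.
import numpy as np

def to_csr(dense):
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))  # row i spans values[row_ptr[i]:row_ptr[i+1]]
    return np.array(values), np.array(col_idx), np.array(row_ptr)

def csr_matvec(values, col_idx, row_ptr, x):
    y = np.zeros(len(row_ptr) - 1)
    for i in range(len(y)):
        lo, hi = row_ptr[i], row_ptr[i + 1]
        y[i] = np.dot(values[lo:hi], x[col_idx[lo:hi]])
    return y

M = np.array([[5, 0, 0], [0, 0, 3], [2, 0, 1]], dtype=float)
vals, cols, ptrs = to_csr(M)
print(csr_matvec(vals, cols, ptrs, np.array([1.0, 2.0, 3.0])))  # [5. 9. 5.]
```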
- Monarch: Expressive Structured Matrices for Efficient and Accurate Training [64.6871423399431]
Large neural networks excel in many domains, but they are expensive to train and fine-tune.
A popular approach to reduce their compute or memory requirements is to replace dense weight matrices with structured ones.
We propose a class of matrices (Monarch) that is hardware-efficient.
arXiv Detail & Related papers (2022-04-01T17:37:29Z)
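The general pattern behind such structured replacements for dense weights, products of block-diagonal factors interleaved with a fixed permutation, can be sketched as below. This is a simplified illustration under our own assumptions, not the exact Monarch parameterization from the paper.

```python
# Simplified block-diagonal structured linear map (illustration only).
import torch

def blockdiag_matmul(x, w):
    # x: (batch, nblocks * block_in), w: (nblocks, block_out, block_in)
    nblocks, block_out, block_in = w.shape
    xb = x.reshape(x.shape[0], nblocks, block_in)
    y = torch.einsum("kni,noi->kno", xb, w)
    return y.reshape(x.shape[0], nblocks * block_out)

def structured_linear(x, w1, w2, b):
    # Block-diagonal factor, a reshape-transpose "permutation", then a second
    # block-diagonal factor, instead of one dense n-by-n matrix (n = b*b).
    y = blockdiag_matmul(x, w1)
    y = y.reshape(x.shape[0], b, b).transpose(1, 2).reshape(x.shape[0], b * b)
    return blockdiag_matmul(y, w2)

b = 16                                  # number of blocks = block size, n = 256
x = torch.randn(8, b * b)
w1 = torch.randn(b, b, b) / b ** 0.5
w2 = torch.randn(b, b, b) / b ** 0.5
out = structured_linear(x, w1, w2, b)   # (8, 256); ~2*b^3 = 8K params vs 64K dense
```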
- Exact Decomposition of Joint Low Rankness and Local Smoothness Plus Sparse Matrices [39.47324019377441]
- We propose a new RPCA model based on three-dimensional correlated total variation regularization (3DCTV-RPCA for short).
We prove that under some mild assumptions, the proposed 3DCTV-RPCA model can decompose both components exactly.
arXiv Detail & Related papers (2022-01-29T13:58:03Z)
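As background for the entry above, the classic "low-rank plus sparse" decomposition can be sketched with a basic principal component pursuit loop; the 3DCTV-RPCA model adds a correlated total-variation term for local smoothness, which is not shown here. Parameter defaults below are common heuristics, not the paper's settings.

```python
# Classic RPCA (principal component pursuit) via an inexact augmented Lagrangian.
import numpy as np

def soft_threshold(x, tau):
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def rpca(M, lam=None, mu=None, iters=200):
    m, n = M.shape
    lam = lam or 1.0 / np.sqrt(max(m, n))
    mu = mu or 0.25 * m * n / (np.abs(M).sum() + 1e-12)
    L = np.zeros_like(M); S = np.zeros_like(M); Y = np.zeros_like(M)
    for _ in range(iters):
        # Low-rank update: singular value thresholding.
        U, sig, Vt = np.linalg.svd(M - S + Y / mu, full_matrices=False)
        L = (U * soft_threshold(sig, 1.0 / mu)) @ Vt
        # Sparse update: elementwise soft thresholding.
        S = soft_threshold(M - L + Y / mu, lam / mu)
        Y = Y + mu * (M - L - S)
    return L, S

# Example: a rank-2 matrix corrupted by sparse outliers.
rng = np.random.default_rng(0)
M = rng.standard_normal((60, 2)) @ rng.standard_normal((2, 60))
M[rng.random(M.shape) < 0.05] += 10.0
L, S = rpca(M)
```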
- Robust 1-bit Compressive Sensing with Partial Gaussian Circulant Matrices and Generative Priors [54.936314353063494]
We provide recovery guarantees for a correlation-based optimization algorithm for robust 1-bit compressive sensing.
We make use of a practical iterative algorithm, and perform numerical experiments on image datasets to corroborate our results.
arXiv Detail & Related papers (2021-08-08T05:28:06Z)
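For context on the entry above: 1-bit compressive sensing keeps only the sign of each linear measurement. The sketch below uses a plain dense Gaussian matrix and the simple correlation (back-projection) estimate, rather than the partial circulant construction and generative prior studied in the paper.

```python
# 1-bit compressive sensing toy example with a correlation estimate.
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 200, 500, 10                      # signal dim, measurements, sparsity
x = np.zeros(n)
x[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
x /= np.linalg.norm(x)                      # 1-bit CS recovers x only up to scale

A = rng.standard_normal((m, n))             # dense Gaussian for simplicity
y = np.sign(A @ x)                          # 1-bit (sign-only) measurements

x_hat = A.T @ y / m                         # correlation (back-projection) estimate
x_hat /= np.linalg.norm(x_hat)
print("cosine similarity with ground truth:", float(x_hat @ x))
```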
- Dynamic Probabilistic Pruning: A general framework for hardware-constrained pruning at different granularities [80.06422693778141]
- We propose a flexible new pruning mechanism that facilitates pruning at different granularities (weights, kernels, filters/feature maps).
- We refer to this algorithm as Dynamic Probabilistic Pruning (DPP).
We show that DPP achieves competitive compression rates and classification accuracy when pruning common deep learning models trained on different benchmark datasets for image classification.
arXiv Detail & Related papers (2021-05-26T17:01:52Z)
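To make the granularity distinction in the entry above concrete, the sketch below applies plain magnitude pruning at two granularities (individual weights vs. whole filters). DPP itself learns such masks probabilistically during training, which is not reproduced here.

```python
# Magnitude pruning at two granularities (illustration, not DPP).
import torch

def prune_weights(w, sparsity):
    """Unstructured pruning: zero the smallest-magnitude individual weights."""
    k = int(sparsity * w.numel())
    if k == 0:
        return w.clone()
    threshold = w.abs().flatten().kthvalue(k).values
    return w * (w.abs() > threshold)

def prune_filters(w, sparsity):
    """Structured pruning: zero whole output filters of a conv weight
    (out_channels, in_channels, kH, kW), ranked by L1 norm. The coarser
    granularity maps more easily onto commodity hardware."""
    k = int(sparsity * w.shape[0])
    if k == 0:
        return w.clone()
    norms = w.abs().sum(dim=(1, 2, 3))
    threshold = norms.kthvalue(k).values
    return w * (norms > threshold).float().view(-1, 1, 1, 1)

w = torch.randn(64, 32, 3, 3)
print((prune_weights(w, 0.5) == 0).float().mean())  # ~0.5 of weights zeroed
print((prune_filters(w, 0.5) == 0).float().mean())  # ~0.5 of filters zeroed
```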
- Rank and run-time aware compression of NLP Applications [12.965657113072325]
- This paper proposes a new compression technique called Hybrid Matrix Factorization.
It improves low-rank matrix factorization techniques by doubling the rank of the matrix.
It can achieve more than 2.32x faster inference run-time than pruning and 16.77% better accuracy than LMF.
arXiv Detail & Related papers (2020-10-06T16:03:15Z)
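For context on the Hybrid Matrix Factorization entry above: the plain low-rank matrix factorization (LMF) baseline it improves on replaces a weight matrix with two thin factors, e.g. obtained from a truncated SVD as sketched below. The hybrid scheme itself, which raises the effective rank at a similar parameter budget, is not reproduced here.

```python
# Plain low-rank factorization of a weight matrix via truncated SVD.
import numpy as np

def low_rank_factors(W, rank):
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    # W is approximated by (m x r) @ (r x n); at inference the dense product
    # x @ W.T is replaced by two smaller matmuls (x @ V.T) @ U_s.T.
    return U[:, :rank] * s[:rank], Vt[:rank, :]

m, n, r = 1024, 1024, 64
rng = np.random.default_rng(0)
W = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))  # exactly rank r
U_s, V = low_rank_factors(W, r)
print("relative error:", np.linalg.norm(W - U_s @ V) / np.linalg.norm(W))  # ~0
print("compression factor:", (m * n) / (U_s.size + V.size))               # 8.0
```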
- A Generic Network Compression Framework for Sequential Recommender Systems [71.81962915192022]
- Sequential recommender systems (SRS) have become the key technology for capturing users' dynamic interests and generating high-quality recommendations.
We propose a compressed sequential recommendation framework, termed as CpRec, where two generic model shrinking techniques are employed.
- Through extensive ablation studies, we demonstrate that the proposed CpRec can achieve up to 4-8x compression rates on real-world SRS datasets.
arXiv Detail & Related papers (2020-04-21T08:40:55Z)
- Compressing Language Models using Doped Kronecker Products [16.64452087806598]
This paper proposes a way to recover accuracy otherwise lost when applying KP to large NLP tasks.
- We call this compression method doped Kronecker product compression.
We present experimental results that demonstrate compression of a large language model with LSTM layers of size 25 MB by 25x with 1.4% loss in perplexity score.
arXiv Detail & Related papers (2020-01-24T06:07:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.