Rank and run-time aware compression of NLP Applications
- URL: http://arxiv.org/abs/2010.03193v1
- Date: Tue, 6 Oct 2020 16:03:15 GMT
- Title: Rank and run-time aware compression of NLP Applications
- Authors: Urmish Thakker, Jesse Beu, Dibakar Gope, Ganesh Dasika, Matthew
Mattina
- Abstract summary: This paper proposes a new compression technique called Hybrid Matrix Factorization.
It improves low-rank matrix factorization techniques by doubling the rank of the matrix.
It can achieve more than 2.32x faster inference run-time than pruning and 16.77% better accuracy than LMF.
- Score: 12.965657113072325
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sequence model based NLP applications can be large. Yet, many applications
that benefit from them run on small devices with very limited compute and
storage capabilities, while still having run-time constraints. As a result,
there is a need for a compression technique that can achieve significant
compression without negatively impacting inference run-time and task accuracy.
This paper proposes a new compression technique called Hybrid Matrix
Factorization (HMF) that achieves this dual objective. HMF improves low-rank matrix
factorization (LMF) techniques by doubling the rank of the matrix using an
intelligent hybrid structure, leading to better accuracy than LMF. Further, by
preserving dense matrices, it leads to faster inference run-time than pruning
or structured-matrix-based compression techniques. We evaluate the impact of this
technique on 5 NLP benchmarks across multiple tasks (Translation, Intent
Detection, Language Modeling) and show that for similar accuracy values and
compression factors, HMF can achieve more than 2.32x faster inference run-time
than pruning and 16.77% better accuracy than LMF.
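The abstract describes the hybrid structure only at a high level. Below is a minimal NumPy sketch of one such dense-plus-low-rank layout: the first k rows of a weight matrix are kept dense and the remaining rows are factorized at rank k, so the reconstructed matrix can reach rank up to 2k while inference stays two dense matrix multiplies. The split point k, the SVD-based fitting, and all shapes are illustrative assumptions, not the paper's exact HMF parameterization or training procedure.

```python
import numpy as np

def hybrid_factorize(W, k):
    """Illustrative hybrid dense + low-rank decomposition (not the paper's exact
    HMF parameterization): keep the first k rows of W dense and approximate the
    remaining rows with a rank-k factorization obtained via truncated SVD."""
    dense_part = W[:k, :]                            # kept exactly (dense block)
    U, s, Vt = np.linalg.svd(W[k:, :], full_matrices=False)
    A = U[:, :k] * s[:k]                             # (m - k) x k
    B = Vt[:k, :]                                    # k x n
    return dense_part, A, B

def hybrid_matvec(x, dense_part, A, B):
    """Inference-time matvec: two dense GEMMs and no sparse indexing, which is
    why run-time stays closer to dense/LMF than to unstructured pruning."""
    top = dense_part @ x              # first k outputs
    bottom = A @ (B @ x)              # remaining outputs via the low-rank factors
    return np.concatenate([top, bottom])

# Example: a 256 x 512 weight matrix compressed with k = 32
W = np.random.randn(256, 512).astype(np.float32)
dense_part, A, B = hybrid_factorize(W, k=32)
x = np.random.randn(512).astype(np.float32)
y_approx = hybrid_matvec(x, dense_part, A, B)
```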
Related papers
- Expanding Sparse Tuning for Low Memory Usage [103.43560327427647]
We propose a method named SNELL (Sparse tuning with kerNELized LoRA) for sparse tuning with low memory usage.
To achieve low memory usage, SNELL decomposes the tunable matrix for sparsification into two learnable low-rank matrices.
A competition-based sparsification mechanism is further proposed to avoid the storage of tunable weight indexes.
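A minimal sketch of the idea as summarized above, assuming a plain low-rank update and simple magnitude-based "competition"; the kernelized-LoRA formulation and SNELL's actual sparsification mechanism are not reproduced here:

```python
import torch

# Illustrative sketch (not the exact SNELL algorithm): the tunable update is
# stored only as two low-rank factors, and sparsity is imposed on the fly by
# magnitude "competition", so no indexes of surviving weights are stored.
def sparse_lowrank_update(A: torch.Tensor, B: torch.Tensor, density: float) -> torch.Tensor:
    delta = B @ A                                   # dense update reconstructed from the factors
    k = max(1, int(density * delta.numel()))
    thresh = delta.abs().flatten().topk(k).values[-1]
    return delta * (delta.abs() >= thresh)          # keep only the top-k entries

m, n, r = 768, 768, 8                               # hypothetical layer size and rank
A, B = torch.randn(r, n) * 0.01, torch.randn(m, r) * 0.01
delta_W = sparse_lowrank_update(A, B, density=0.05) # sparse update, stored as r*(m + n) parameters
```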
arXiv Detail & Related papers (2024-11-04T04:58:20Z)
- LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
The Key-Value (KV) cache is a crucial component in serving transformer-based autoregressive large language models (LLMs).
Existing approaches to mitigate this issue include (1) efficient attention variants integrated in upcycling stages and (2) KV cache compression at test time.
We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining.
Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
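A rough sketch of plug-in low-rank factorization of a KV projection weight via truncated SVD; the fixed rank and shapes below are assumptions for illustration, and LoRC's progressive, per-layer compression strategy is not shown:

```python
import torch

def low_rank_factorize(W: torch.Tensor, rank: int):
    """Generic truncated-SVD factorization of a KV projection weight
    (illustrative only; not LoRC's progressive rank-selection scheme)."""
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]          # d_out x rank
    B = Vh[:rank, :]                    # rank x d_in
    return A, B                         # W @ x is replaced by A @ (B @ x)

# Hypothetical use on an attention layer's key projection, no retraining:
W_k = torch.randn(1024, 1024)
A, B = low_rank_factorize(W_k, rank=256)   # 2 * 1024 * 256 params instead of 1024^2
```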
arXiv Detail & Related papers (2024-10-04T03:10:53Z)
- Quantization-free Lossy Image Compression Using Integer Matrix Factorization [8.009813033356478]
We introduce a variant of integer matrix factorization (IMF) to develop a novel quantization-free lossy image compression method.
IMF provides a low-rank representation of the image data as a product of two smaller factor matrices with bounded integer elements.
Our method consistently outperforms JPEG at low bit rates below 0.25 bits per pixel (bpp) and remains comparable at higher bit rates.
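As a rough illustration of the representation only (the paper fits the bounded-integer factors with a dedicated algorithm; rounding a scaled truncated SVD, as below, is just the simplest way to produce such factors):

```python
import numpy as np

def naive_integer_factorization(img: np.ndarray, rank: int, bound: int = 127):
    """Represent an image channel as a product of two small bounded-integer
    factor matrices. Naive SVD-plus-rounding sketch, not the paper's IMF solver."""
    U, s, Vt = np.linalg.svd(img.astype(np.float64), full_matrices=False)
    L = U[:, :rank] * np.sqrt(s[:rank])            # m x rank
    R = np.sqrt(s[:rank])[:, None] * Vt[:rank]     # rank x n
    scale = bound / max(np.abs(L).max(), np.abs(R).max())
    Lq = np.clip(np.round(L * scale), -bound, bound).astype(np.int8)
    Rq = np.clip(np.round(R * scale), -bound, bound).astype(np.int8)
    return Lq, Rq, scale

def reconstruct(Lq, Rq, scale):
    return (Lq.astype(np.float64) @ Rq.astype(np.float64)) / (scale * scale)

gray = np.random.randint(0, 256, size=(256, 256))  # stand-in for an image channel
Lq, Rq, s = naive_integer_factorization(gray, rank=32)
approx = np.clip(reconstruct(Lq, Rq, s), 0, 255)
```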
arXiv Detail & Related papers (2024-08-22T19:08:08Z)
- Accelerating Matrix Factorization by Dynamic Pruning for Fast Recommendation [0.49399484784577985]
Matrix factorization (MF) is a widely used collaborative filtering algorithm for recommendation systems (RSs).
With the dramatically increased number of users/items in current RSs, the computational complexity of training an MF model increases substantially.
We propose algorithmic methods to accelerate MF, without inducing any additional computational resources.
arXiv Detail & Related papers (2024-03-18T16:27:33Z)
- Extreme Compression of Large Language Models via Additive Quantization [59.3122859349777]
Our algorithm, called AQLM, generalizes the classic Additive Quantization (AQ) approach for information retrieval.
We provide fast GPU and CPU implementations of AQLM for token generation, which enable us to match or outperform optimized FP16 implementations for speed.
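A sketch of the additive-quantization representation that AQLM builds on, decode side only; the number of codebooks, group size, and code width below are illustrative, and AQLM's codebook learning and fine-tuning are not shown:

```python
import torch

# Each group of g consecutive weights is stored as M small indices; the group
# is reconstructed as the sum of the M selected codewords.
M, bits, g = 2, 8, 8                                  # hypothetical: 2 codebooks, 256 entries each, groups of 8
codebooks = torch.randn(M, 2 ** bits, g)              # learned codebooks (random stand-ins here)
codes = torch.randint(0, 2 ** bits, (1024, M))        # per-group indices for 1024 weight groups

def decode_groups(codebooks: torch.Tensor, codes: torch.Tensor) -> torch.Tensor:
    # Gather one codeword per codebook for every group and sum them.
    parts = [codebooks[m][codes[:, m]] for m in range(codebooks.shape[0])]
    return torch.stack(parts, dim=0).sum(dim=0)       # (num_groups, g)

weights = decode_groups(codebooks, codes).reshape(-1) # flattened weight vector
```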
arXiv Detail & Related papers (2024-01-11T18:54:44Z)
- Low-Rank Prune-And-Factorize for Language Model Compression [18.088550230146247]
Matrix factorization fails to retain satisfactory performance at moderate to high compression rates.
We propose two techniques: sparsity-aware SVD and mixed-rank fine-tuning.
arXiv Detail & Related papers (2023-06-25T07:38:43Z)
- Monarch: Expressive Structured Matrices for Efficient and Accurate Training [64.6871423399431]
Large neural networks excel in many domains, but they are expensive to train and fine-tune.
A popular approach to reduce their compute or memory requirements is to replace dense weight matrices with structured ones.
We propose a class of matrices (Monarch) that is hardware-efficient.
arXiv Detail & Related papers (2022-04-01T17:37:29Z)
- Dynamic Probabilistic Pruning: A general framework for hardware-constrained pruning at different granularities [80.06422693778141]
We propose a flexible new pruning mechanism that facilitates pruning at different granularities (weights, kernels, filters/feature maps).
We refer to this algorithm as Dynamic Probabilistic Pruning (DPP).
We show that DPP achieves competitive compression rates and classification accuracy when pruning common deep learning models trained on different benchmark datasets for image classification.
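To make the granularities concrete, the sketch below builds weight-level and filter-level masks from plain magnitude scores; this only illustrates the granularities, not DPP's learned probabilistic masking:

```python
import torch

def weight_level_mask(W: torch.Tensor, keep: float) -> torch.Tensor:
    # Prune individual weights by magnitude.
    k = max(1, int(keep * W.numel()))
    thresh = W.abs().flatten().topk(k).values[-1]
    return (W.abs() >= thresh).float()

def filter_level_mask(W: torch.Tensor, keep: float) -> torch.Tensor:
    # W: (out_channels, in_channels, kh, kw); score whole filters by their L2 norm.
    scores = W.flatten(1).norm(dim=1)
    k = max(1, int(keep * scores.numel()))
    kept = scores >= scores.topk(k).values[-1]
    return kept.float().view(-1, 1, 1, 1).expand_as(W)  # prunes entire feature maps

W = torch.randn(64, 32, 3, 3)
m_weight = weight_level_mask(W, keep=0.3)
m_filter = filter_level_mask(W, keep=0.3)
```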
arXiv Detail & Related papers (2021-05-26T17:01:52Z)
- Doping: A technique for efficient compression of LSTM models using sparse structured additive matrices [14.321761305835972]
We propose the notion of doping -- addition of an extremely sparse matrix to a structured matrix.
Doping facilitates additional degrees of freedom for a small number of parameters, allowing them to independently diverge from the fixed structure.
We show that the doped KP compression technique outperforms previous state-of-the-art compression results by achieving a 1.3 - 2.4x higher compression factor at similar accuracy.
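A small sketch of the doped structure described above, using a Kronecker-product (KP) matrix plus a randomly placed sparse additive term; how the sparse entries are selected and trained in the paper is not shown:

```python
import numpy as np

m1, n1, m2, n2 = 16, 16, 16, 32             # hypothetical factor shapes
A = np.random.randn(m1, n1)
B = np.random.randn(m2, n2)
structured = np.kron(A, B)                  # (m1*m2) x (n1*n2), very few parameters

density = 0.01                              # ~1% of entries are "doped"
mask = np.random.rand(*structured.shape) < density
sparse_dope = np.where(mask, np.random.randn(*structured.shape) * 0.1, 0.0)

W_doped = structured + sparse_dope          # KP structure + sparse corrections
params_stored = A.size + B.size + mask.sum()  # parameters actually kept after compression
```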
arXiv Detail & Related papers (2021-02-14T05:14:09Z)
- ALF: Autoencoder-based Low-rank Filter-sharing for Efficient Convolutional Neural Networks [63.91384986073851]
We propose the autoencoder-based low-rank filter-sharing technique (ALF).
ALF shows a reduction of 70% in network parameters, 61% in operations and 41% in execution time, with minimal loss in accuracy.
arXiv Detail & Related papers (2020-07-27T09:01:22Z)
- A High-Performance Implementation of Bayesian Matrix Factorization with Limited Communication [10.639704288188767]
Bayesian matrix factorization algorithms can quantify uncertainty in their predictions and avoid over-fitting.
They have not been widely used on large-scale data because of their prohibitive computational cost.
We show that the state-of-the-art of both approaches to scalability can be combined.
arXiv Detail & Related papers (2020-04-06T11:16:30Z)