Deep learning model compression using network sensitivity and gradients
- URL: http://arxiv.org/abs/2210.05111v1
- Date: Tue, 11 Oct 2022 03:02:40 GMT
- Title: Deep learning model compression using network sensitivity and gradients
- Authors: Madhumitha Sakthi, Niranjan Yadla, Raj Pawate
- Abstract summary: We present model compression algorithms for both non-retraining and retraining conditions.
In the first case, we propose the Bin & Quant algorithm for compression of the deep learning models using the sensitivity of the network parameters.
In the second case, we propose our novel gradient-weighted k-means clustering algorithm (GWK).
- Score: 3.52359746858894
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep learning model compression is an important and rapidly evolving field for the
edge deployment of deep learning models. Given the increasing size of the
models and their corresponding power consumption, it is vital to decrease the
model size and compute requirement without a significant drop in the model's
performance. In this paper, we present model compression algorithms for both
non-retraining and retraining conditions. In the first case, where retraining of
the model is not feasible due to lack of access to the original data or to the
necessary compute resources, and only off-the-shelf models are available, we
propose the Bin & Quant algorithm for compression of the deep
learning models using the sensitivity of the network parameters. This results
in 13x compression of the speech command and control model and 7x compression
of the DeepSpeech2 models. In the second case, where the models can be retrained
and maximum compression is required with negligible loss in accuracy, we
propose our novel gradient-weighted k-means clustering algorithm (GWK). This
method uses the gradients in identifying the important weight values in a given
cluster and nudges the centroid towards those values, thereby giving importance
to sensitive weights. Our method effectively combines product quantization with
the EWGS[1] algorithm for sub-1-bit representation of the quantized models. We
test our GWK algorithm on the CIFAR10 dataset across a range of models such as
ResNet20, ResNet56, and MobileNetv2, and show 35x compression on quantized models
with less than 2% absolute loss in accuracy compared to the floating-point
models.
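The abstract describes GWK only at a high level, so the following is a minimal, hypothetical Python sketch of a gradient-weighted k-means step, under the assumption that "gradient weighting" means scaling each weight's contribution to its cluster centroid by the magnitude of its gradient; the function name, weighting scheme, and initialization are illustrative, not the authors' implementation.

```python
import numpy as np

def gradient_weighted_kmeans(weights, grads, n_clusters=16, n_iters=25, seed=0):
    """Cluster a weight tensor in 1-D, weighting each value by |gradient|.

    Hypothetical sketch: the paper's GWK algorithm may differ in its
    initialization, weighting scheme, and combination with product quantization.
    """
    rng = np.random.default_rng(seed)
    w = weights.ravel().astype(np.float64)
    # Importance of each weight: gradient magnitude (small floor avoids all-zero weights).
    imp = np.abs(grads.ravel()).astype(np.float64) + 1e-12

    # Initialize centroids from randomly chosen weight values.
    centroids = rng.choice(w, size=n_clusters, replace=False)

    for _ in range(n_iters):
        # Assign each weight to its nearest centroid (1-D distance).
        assign = np.argmin(np.abs(w[:, None] - centroids[None, :]), axis=1)
        for k in range(n_clusters):
            mask = assign == k
            if mask.any():
                # Importance-weighted mean nudges the centroid toward
                # weight values with large gradient magnitude.
                centroids[k] = np.average(w[mask], weights=imp[mask])

    # Quantized layer: every weight is replaced by its cluster centroid.
    return centroids[assign].reshape(weights.shape), centroids, assign
```

Storing only the per-weight cluster indices plus the small centroid codebook is what yields the compression; with 16 clusters, for example, each weight costs a 4-bit index, and the codebook adds a negligible 16 floats per layer.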
Related papers
- Activations and Gradients Compression for Model-Parallel Training [85.99744701008802]
We study how simultaneous compression of activations and gradients in model-parallel distributed training setup affects convergence.
We find that gradients require milder compression rates than activations.
Experiments also show that models trained with TopK perform well only when compression is also applied during inference.
arXiv Detail & Related papers (2024-01-15T15:54:54Z)
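For context, the "TopK" compression mentioned in the entry above is the standard magnitude-based top-k sparsification of a tensor; a generic Python sketch (not tied to that paper's model-parallel training setup) looks like this:

```python
import numpy as np

def topk_compress(x, ratio=0.01):
    """Keep only the fraction `ratio` of entries with the largest magnitude.

    Generic top-k sparsification sketch: the kept values and their flat indices
    are what would be stored or communicated instead of the dense tensor.
    """
    flat = x.ravel()
    k = max(1, int(ratio * flat.size))
    idx = np.argpartition(np.abs(flat), -k)[-k:]  # indices of the k largest |x|
    return flat[idx], idx, x.shape

def topk_decompress(values, idx, shape):
    """Rebuild a dense tensor that is zero everywhere except the kept entries."""
    out = np.zeros(int(np.prod(shape)), dtype=values.dtype)
    out[idx] = values
    return out.reshape(shape)
```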
- Rotation Invariant Quantization for Model Compression [7.633595230914364]
Post-training Neural Network (NN) model compression is an attractive approach for deploying large, memory-consuming models on devices with limited memory resources.
We suggest a Rotation-Invariant Quantization (RIQ) technique that utilizes a single parameter to quantize the entire NN model.
arXiv Detail & Related papers (2023-03-03T10:53:30Z)
- CrAM: A Compression-Aware Minimizer [103.29159003723815]
We propose a new compression-aware minimizer dubbed CrAM that modifies the optimization step in a principled way.
CrAM produces dense models that can be more accurate than the standard SGD/Adam-based baselines, but which are stable under weight pruning.
CrAM can produce sparse models which perform well for transfer learning, and it also works for semi-structured 2:4 pruning patterns supported by GPU hardware.
arXiv Detail & Related papers (2022-07-28T16:13:28Z)
- Online Model Compression for Federated Learning with Large Models [8.48327410170884]
Online Model Compression (OMC) is a framework that stores model parameters in a compressed format and decompresses them only when needed.
OMC can reduce memory usage and communication cost of model parameters by up to 59% while attaining comparable accuracy and training speed when compared with full-precision training.
arXiv Detail & Related papers (2022-05-06T22:43:03Z)
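As a rough illustration of the "store compressed, decompress only when needed" idea described in the OMC entry above (not the actual OMC implementation, which defines its own formats and update path), assuming PyTorch is available:

```python
import torch

class CompressedParam:
    """Hold a parameter in low precision; materialize full precision on demand.

    Illustrative sketch only: float16 storage stands in for whatever compressed
    format a framework like OMC would actually use.
    """
    def __init__(self, tensor):
        # Persist the parameter in float16 (roughly half the memory of float32).
        self._stored = tensor.detach().to(torch.float16)

    def materialize(self):
        # Decompress to float32 just before the parameter is used by a layer.
        return self._stored.to(torch.float32)
```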
- LCS: Learning Compressible Subspaces for Adaptive Network Compression at Inference Time [57.52251547365967]
We propose a method for training a "compressible subspace" of neural networks that contains a fine-grained spectrum of models.
We present results for achieving arbitrarily fine-grained accuracy-efficiency trade-offs at inference time for structured and unstructured sparsity.
Our algorithm extends to quantization at variable bit widths, achieving accuracy on par with individually trained networks.
arXiv Detail & Related papers (2021-10-08T17:03:34Z)
- Investigating the Relationship Between Dropout Regularization and Model Complexity in Neural Networks [0.0]
Dropout Regularization serves to reduce variance in Deep Learning models.
We explore the relationship between the dropout rate and model complexity by training 2,000 neural networks.
We build neural networks that predict the optimal dropout rate given the number of hidden units in each dense layer.
arXiv Detail & Related papers (2021-08-14T23:49:33Z)
- Effective Model Sparsification by Scheduled Grow-and-Prune Methods [73.03533268740605]
We propose a novel scheduled grow-and-prune (GaP) methodology without pre-training the dense models.
Experiments have shown that such models can match or beat the quality of highly optimized dense models at 80% sparsity on a variety of tasks.
arXiv Detail & Related papers (2021-06-18T01:03:13Z)
- Dynamic Model Pruning with Feedback [64.019079257231]
We propose a novel model compression method that generates a sparse trained model without additional overhead.
We evaluate our method on CIFAR-10 and ImageNet, and show that the obtained sparse models can reach the state-of-the-art performance of dense models.
arXiv Detail & Related papers (2020-06-12T15:07:08Z)
- Training with Quantization Noise for Extreme Model Compression [57.51832088938618]
We tackle the problem of producing compact models, maximizing their accuracy for a given model size.
A standard solution is to train networks with Quantization Aware Training, where the weights are quantized during training and the gradients approximated with the Straight-Through Estimator.
In this paper, we extend this approach to work beyond int8 fixed-point quantization with extreme compression methods.
arXiv Detail & Related papers (2020-04-15T20:10:53Z)
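The Quantization Aware Training recipe named in the entry above, where weights are quantized in the forward pass and the gradient of the rounding step is approximated as identity by the Straight-Through Estimator, can be sketched generically (assuming PyTorch; this is not the paper's Quant-Noise method itself):

```python
import torch

def fake_quantize_ste(w, num_bits=8):
    """Uniform symmetric fake quantization with a straight-through estimator.

    Forward: w is scaled, rounded to a num_bits integer grid, and rescaled.
    Backward: the rounding is treated as identity, so gradients pass through.
    """
    qmax = 2 ** (num_bits - 1) - 1
    scale = w.detach().abs().max().clamp(min=1e-8) / qmax
    w_q = torch.clamp(torch.round(w / scale), -qmax, qmax) * scale
    # (w_q - w).detach() blocks the gradient of the rounding error, so the
    # backward pass sees d(output)/d(w) = 1 (the straight-through estimator).
    return w + (w_q - w).detach()
```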
- Compression of descriptor models for mobile applications [26.498907514590165]
We evaluate the computational cost, model size, and matching accuracy tradeoffs for deep neural networks.
We observe a significant redundancy in the learned weights, which we exploit through the use of depthwise separable layers.
We propose the Convolution-Depthwise-Pointwise (CDP) layer, which provides a means of interpolating between the standard and depthwise separable convolutions.
arXiv Detail & Related papers (2020-01-09T17:00:21Z)
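The CDP layer itself is not described in the snippet above, but the depthwise separable factorization it interpolates toward is standard; a minimal PyTorch sketch of a depthwise-then-pointwise block (illustrative background, not the paper's CDP layer) is:

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise separable convolution: a per-channel (depthwise) spatial
    convolution followed by a 1x1 (pointwise) convolution across channels."""

    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size, stride=stride,
                                   padding=kernel_size // 2, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)

    def forward(self, x):
        # e.g. DepthwiseSeparableConv(32, 64) stands in for nn.Conv2d(32, 64, 3, padding=1)
        return self.pointwise(self.depthwise(x))
```

Relative to a standard k x k convolution, the parameter count drops from k^2 * C_in * C_out to k^2 * C_in + C_in * C_out, which is the kind of redundancy such compression methods exploit.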
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.