DKM: Differentiable K-Means Clustering Layer for Neural Network
Compression
- URL: http://arxiv.org/abs/2108.12659v1
- Date: Sat, 28 Aug 2021 14:35:41 GMT
- Title: DKM: Differentiable K-Means Clustering Layer for Neural Network
Compression
- Authors: Minsik Cho, Keivan A. Vahid, Saurabh Adya, Mohammad Rastegari
- Abstract summary: We propose a differentiable k-means clustering layer (DKM) for train-time weight clustering-based model compression.
DKM casts k-means clustering as an attention problem and enables joint optimization of the parameters and clustering centroids.
We show that DKM delivers superior compression and accuracy trade-off on ImageNet1k and GLUE benchmarks.
- Score: 20.73169804006698
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Deep neural network (DNN) model compression for efficient on-device inference
is becoming increasingly important to reduce memory requirements and keep user
data on-device. To this end, we propose a novel differentiable k-means
clustering layer (DKM) and its application to train-time weight
clustering-based DNN model compression. DKM casts k-means clustering as an
attention problem and enables joint optimization of the parameters and
clustering centroids. Unlike prior works that rely on additional regularizers
and parameters, DKM-based compression keeps the original loss function and
model architecture fixed. We evaluated DKM-based compression on various DNN
models for computer vision and natural language processing (NLP) tasks. Our
results demonstrate that DKM delivers superior compression and accuracy
trade-off on ImageNet1k and GLUE benchmarks. For example, DKM-based compression
can offer 74.5% top-1 ImageNet1k accuracy on ResNet50 DNN model with 3.3MB
model size (29.4x model compression factor). For MobileNet-v1, which is a
challenging DNN to compress, DKM delivers 62.8% top-1 ImageNet1k accuracy with
0.74 MB model size (22.4x model compression factor). This is 6.8% higher
top-1 accuracy with a 33% smaller model size than the current
state-of-the-art DNN compression algorithms. Additionally, DKM enables
compression of the DistilBERT model by 11.8x with minimal (1.1%) accuracy loss on
GLUE NLP benchmarks.
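To make the idea concrete, below is a minimal, hypothetical PyTorch sketch of attention-based soft weight clustering in the spirit of DKM; the function name, temperature `tau`, iteration count, and initialization are illustrative assumptions rather than the authors' implementation.

```python
import torch

def dkm_soft_cluster(weights, centroids, tau=0.01, iters=5):
    """Differentiable k-means in the spirit of DKM (illustrative sketch).

    weights:   (N, 1) flattened layer weights
    centroids: (K, 1) initial cluster centers
    Returns soft-clustered weights for the forward pass; gradients flow to
    both `weights` and `centroids` through the softmax attention.
    """
    for _ in range(iters):
        # Distance between every weight and every centroid -> (N, K)
        dist = torch.cdist(weights, centroids)
        # Attention: closer centroids get a higher (soft) assignment
        attn = torch.softmax(-dist / tau, dim=1)
        # Centroid update as the attention-weighted mean of the weights
        centroids = (attn.t() @ weights) / attn.sum(dim=0, keepdim=True).t()
    # Final attention w.r.t. the updated centroids, then reconstruct
    attn = torch.softmax(-torch.cdist(weights, centroids) / tau, dim=1)
    return attn @ centroids

# Usage: cluster a layer's weights into 16 clusters (4-bit palettization)
w = torch.randn(4096, 1, requires_grad=True)
c = w.detach()[torch.randperm(w.shape[0])[:16]].clone().requires_grad_(True)
w_clustered = dkm_soft_cluster(w, c)
loss = w_clustered.pow(2).mean()   # stand-in for the task loss
loss.backward()                    # gradients reach both w and c
```

Because the assignments are soft, the original task loss can be optimized end to end without extra regularizers, which is the property the abstract emphasizes.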
Related papers
- Activations and Gradients Compression for Model-Parallel Training [85.99744701008802]
We study how simultaneous compression of activations and gradients in model-parallel distributed training setup affects convergence.
We find that gradients require milder compression rates than activations.
Experiments also show that models trained with TopK perform well only when compression is also applied during inference.
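For reference, TopK compression of a gradient or activation tensor simply keeps its largest-magnitude entries and zeroes the rest; a minimal sketch (the function name and `keep_ratio` are illustrative assumptions):

```python
import torch

def topk_compress(tensor, keep_ratio=0.1):
    """Keep the top-k entries of `tensor` by magnitude, zero out the rest."""
    flat = tensor.flatten()
    k = max(1, int(keep_ratio * flat.numel()))
    # Indices of the k largest-magnitude values
    _, idx = torch.topk(flat.abs(), k)
    out = torch.zeros_like(flat)
    out[idx] = flat[idx]
    return out.view_as(tensor)

# E.g., compress a gradient tensor before communicating it
grad = torch.randn(256, 512)
sparse_grad = topk_compress(grad, keep_ratio=0.05)   # 95% of entries zeroed
```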
arXiv Detail & Related papers (2024-01-15T15:54:54Z) - QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models [64.34635279436054]
Mixture-of-Experts (MoE) architectures offer a general solution to the high inference costs of large language models (LLMs) via sparse routing, but their enormous parameter counts make them very costly to store in memory.
We present a solution to this memory problem in the form of a new compression and execution framework called QMoE.
arXiv Detail & Related papers (2023-10-25T17:24:53Z) - Rotation Invariant Quantization for Model Compression [7.633595230914364]
Post-training Neural Network (NN) model compression is an attractive approach for deploying large, memory-consuming models on devices with limited memory resources.
We suggest a Rotation-Invariant Quantization (RIQ) technique that utilizes a single parameter to quantize the entire NN model.
arXiv Detail & Related papers (2023-03-03T10:53:30Z) - CrAM: A Compression-Aware Minimizer [103.29159003723815]
We propose a new compression-aware minimizer dubbed CrAM that modifies the optimization step in a principled way.
CrAM produces dense models that can be more accurate than the standard SGD/Adam-based baselines, while remaining stable under weight pruning.
CrAM can produce sparse models which perform well for transfer learning, and it also works for semi-structured 2:4 pruning patterns supported by GPU hardware.
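A rough way to picture a compression-aware step is to evaluate the gradient at a compressed (here, magnitude-pruned) copy of the weights and apply it to the dense weights; the sketch below is an assumption-laden simplification, not CrAM's exact update rule.

```python
import torch

def magnitude_prune(w, sparsity=0.5):
    """Zero out the smallest-magnitude fraction of the weights."""
    k = int(sparsity * w.numel())
    if k == 0:
        return w.clone()
    threshold = w.abs().flatten().kthvalue(k).values
    return torch.where(w.abs() > threshold, w, torch.zeros_like(w))

def compression_aware_step(w, loss_fn, lr=0.1, sparsity=0.5):
    """One illustrative compression-aware update: evaluate the loss at a
    pruned copy of `w`, then apply that gradient to the dense `w`."""
    w_pruned = magnitude_prune(w.detach(), sparsity).requires_grad_(True)
    loss = loss_fn(w_pruned)
    loss.backward()
    with torch.no_grad():
        w -= lr * w_pruned.grad
    return loss.item()

# Toy usage: quadratic "loss" in the weights
w = torch.randn(1000)
loss_fn = lambda p: (p ** 2).sum()
compression_aware_step(w, loss_fn)
```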
arXiv Detail & Related papers (2022-07-28T16:13:28Z) - Toward Compact Parameter Representations for Architecture-Agnostic
Neural Network Compression [26.501979992447605]
This paper investigates compression from the perspective of compactly representing and storing trained parameters.
We leverage additive quantization, an extreme lossy compression method invented for image descriptors, to compactly represent the parameters.
We conduct experiments on MobileNet-v2, VGG-11, ResNet-50, Feature Pyramid Networks, and pruned DNNs trained for classification, detection, and segmentation tasks.
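To illustrate the representation, additive quantization approximates each parameter sub-vector by a sum of codewords, one taken from each of several codebooks; the greedy numpy encoder below is an illustrative sketch, with codebook sizes and the encoding strategy chosen for clarity rather than taken from the paper.

```python
import numpy as np

def additive_quantize(vectors, codebooks):
    """Greedy additive-quantization encoder (illustrative).

    vectors:   (N, D) parameter sub-vectors to encode
    codebooks: list of M arrays, each (K, D)
    Returns (N, M) integer codes and the (N, D) reconstruction.
    """
    residual = vectors.copy()
    codes = np.zeros((vectors.shape[0], len(codebooks)), dtype=np.int64)
    recon = np.zeros_like(vectors)
    for m, book in enumerate(codebooks):
        # Pick, for every vector, the codeword closest to its residual
        dists = ((residual[:, None, :] - book[None, :, :]) ** 2).sum(-1)
        codes[:, m] = dists.argmin(axis=1)
        chosen = book[codes[:, m]]
        recon += chosen
        residual -= chosen
    return codes, recon

# Usage: encode 1000 8-dim sub-vectors with 2 codebooks of 256 codewords
rng = np.random.default_rng(0)
vecs = rng.standard_normal((1000, 8)).astype(np.float32)
books = [rng.standard_normal((256, 8)).astype(np.float32) for _ in range(2)]
codes, approx = additive_quantize(vecs, books)
# Each vector is now stored as 2 bytes of codes instead of 32 bytes of floats
```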
arXiv Detail & Related papers (2021-11-19T17:03:11Z) - Towards Efficient Tensor Decomposition-Based DNN Model Compression with
Optimization Framework [14.27609385208807]
We propose a systematic framework for tensor decomposition-based model compression using the Alternating Direction Method of Multipliers (ADMM).
Our framework is very general, and it works for both CNNs and RNNs.
Experimental results show that our ADMM-based TT-format models demonstrate very high compression performance with high accuracy.
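For context, the Tensor-Train (TT) format itself can be computed with sequential truncated SVDs; the numpy sketch below illustrates only this representation, not the paper's ADMM-based optimization, and the rank choice is an assumption.

```python
import numpy as np

def tt_svd(tensor, max_rank):
    """Decompose a d-way tensor into Tensor-Train cores via truncated SVDs."""
    shape, cores, r_prev = tensor.shape, [], 1
    mat = tensor.reshape(shape[0], -1)
    for k in range(len(shape) - 1):
        U, S, Vt = np.linalg.svd(mat, full_matrices=False)
        r = min(max_rank, len(S))
        cores.append(U[:, :r].reshape(r_prev, shape[k], r))
        mat = (S[:r, None] * Vt[:r]).reshape(r * shape[k + 1], -1)
        r_prev = r
    cores.append(mat.reshape(r_prev, shape[-1], 1))
    return cores

# A 4-way weight tensor compressed into four small TT cores
w = np.random.randn(8, 8, 8, 8)
cores = tt_svd(w, max_rank=4)
print([c.shape for c in cores])   # [(1, 8, 4), (4, 8, 4), (4, 8, 4), (4, 8, 1)]
```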
arXiv Detail & Related papers (2021-07-26T18:31:33Z) - Towards Compact CNNs via Collaborative Compression [166.86915086497433]
We propose a Collaborative Compression scheme, which combines channel pruning and tensor decomposition to compress CNN models.
We achieve a 52.9% FLOPs reduction by removing 48.4% of the parameters on ResNet-50 with only a Top-1 accuracy drop of 0.56% on ImageNet 2012.
arXiv Detail & Related papers (2021-05-24T12:07:38Z) - Training with Quantization Noise for Extreme Model Compression [57.51832088938618]
We tackle the problem of producing compact models, maximizing their accuracy for a given model size.
A standard solution is to train networks with Quantization Aware Training, where the weights are quantized during training and the gradients approximated with the Straight-Through Estimator.
In this paper, we extend this approach to work beyond int8 fixed-point quantization with extreme compression methods.
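The baseline is easy to sketch: quantize the weights in the forward pass and pass gradients through the rounding unchanged (the straight-through estimator); the minimal PyTorch example below uses an assumed 8-bit symmetric scheme for illustration.

```python
import torch

class STEQuantize(torch.autograd.Function):
    """Uniform quantizer with a straight-through estimator for the backward."""

    @staticmethod
    def forward(ctx, w, num_bits=8):
        qmax = 2 ** (num_bits - 1) - 1
        scale = w.abs().max() / qmax
        # Round weights to the nearest quantization level
        return torch.round(w / scale).clamp(-qmax, qmax) * scale

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through: pretend the rounding was the identity
        return grad_output, None

# Usage in a training step: quantized weights in the forward pass,
# full-precision weights receive the (straight-through) gradients.
w = torch.randn(128, 64, requires_grad=True)
x = torch.randn(32, 64)
w_q = STEQuantize.apply(w, 8)
loss = (x @ w_q.t()).pow(2).mean()
loss.backward()        # w.grad is populated despite the rounding
```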
arXiv Detail & Related papers (2020-04-15T20:10:53Z) - Kernel Quantization for Efficient Network Compression [59.55192551370948]
Kernel Quantization (KQ) aims to efficiently convert any pre-trained full-precision convolutional neural network (CNN) model into a low-precision version without significant performance loss.
Inspired by the evolution from weight pruning to filter pruning, we propose to quantize at both the kernel and weight levels.
Experiments on the ImageNet classification task show that KQ needs 1.05 and 1.62 bits on average in VGG and ResNet18, respectively, to represent each parameter in the convolution layers.
arXiv Detail & Related papers (2020-03-11T08:00:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.