Related papers: CrAM: A Compression-Aware Minimizer

CrAM: A Compression-Aware Minimizer

URL: http://arxiv.org/abs/2207.14200v4
Date: Thu, 4 May 2023 13:55:21 GMT
Title: CrAM: A Compression-Aware Minimizer
Authors: Alexandra Peste, Adrian Vladu, Eldar Kurtic, Christoph H. Lampert, Dan Alistarh
Abstract summary: We propose a new compression-aware minimizer dubbed CrAM that modifies the optimization step in a principled way. CrAM produces dense models that can be more accurate than the standard SGD/Adam-based baselines, but which are stable under weight pruning. CrAM can produce sparse models which perform well for transfer learning, and it also works for semi-structured 2:4 pruning patterns supported by GPU hardware.
Score: 103.29159003723815
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Deep neural networks (DNNs) often have to be compressed, via pruning and/or quantization, before they can be deployed in practical settings. In this work we propose a new compression-aware minimizer dubbed CrAM that modifies the optimization step in a principled way, in order to produce models whose local loss behavior is stable under compression operations such as pruning. Thus, dense models trained via CrAM should be compressible post-training, in a single step, without significant accuracy loss. Experimental results on standard benchmarks, such as residual networks for ImageNet classification and BERT models for language modelling, show that CrAM produces dense models that can be more accurate than the standard SGD/Adam-based baselines, but which are stable under weight pruning: specifically, we can prune models in one-shot to 70-80% sparsity with almost no accuracy loss, and to 90% with reasonable ($\sim 1\%$) accuracy loss, which is competitive with gradual compression methods. Additionally, CrAM can produce sparse models which perform well for transfer learning, and it also works for semi-structured 2:4 pruning patterns supported by GPU hardware. The code for reproducing the results is available at https://github.com/IST-DASLab/CrAM .

Related papers

Smooth Model Compression without Fine-Tuning [14.381101636079872]
We explore the impact of smooth regularization on neural network training and model compression.<n>We find that standard pruning methods often perform better when applied to smooth models.<n>Our approach enables state-of-the-art compression without any fine-tuning.
arXiv Detail & Related papers (2025-05-30T11:13:48Z)
Choose Your Model Size: Any Compression by a Single Gradient Descent [9.074689052563878]
We present Any Compression via Iterative Pruning (ACIP) ACIP is an algorithmic approach to determine a compression-performance trade-off from a single gradient descent run. We show that ACIP seamlessly complements common quantization-based compression techniques.
arXiv Detail & Related papers (2025-02-03T18:40:58Z)
Activations and Gradients Compression for Model-Parallel Training [85.99744701008802]
We study how simultaneous compression of activations and gradients in model-parallel distributed training setup affects convergence. We find that gradients require milder compression rates than activations. Experiments also show that models trained with TopK perform well only when compression is also applied during inference.
arXiv Detail & Related papers (2024-01-15T15:54:54Z)
Gradient-Free Structured Pruning with Unlabeled Data [57.999191898036706]
We propose a gradient-free structured pruning framework that uses only unlabeled data. Up to 40% of the original FLOP count can be reduced with less than a 4% accuracy loss across all tasks considered.
arXiv Detail & Related papers (2023-03-07T19:12:31Z)
Deep learning model compression using network sensitivity and gradients [3.52359746858894]
We present model compression algorithms for both non-retraining and retraining conditions. In the first case, we propose the Bin & Quant algorithm for compression of the deep learning models using the sensitivity of the network parameters. In the second case, we propose our novel gradient-weighted k-means clustering algorithm (GWK)
arXiv Detail & Related papers (2022-10-11T03:02:40Z)
Train Flat, Then Compress: Sharpness-Aware Minimization Learns More Compressible Models [7.6356407698088]
Pruning unnecessary parameters has emerged as a simple and effective method for compressing large models. We show that optimizing for flat minima consistently leads to greater compressibility of parameters compared to standard Adam optimization.
arXiv Detail & Related papers (2022-05-25T11:54:37Z)
Structured Pruning is All You Need for Pruning CNNs at Initialization [38.88730369884401]
Pruning is a popular technique for reducing the model size and computational cost of convolutional neural networks (CNNs) We propose PreCropping, a structured hardware-efficient model compression scheme. Compared to weight pruning, the proposed scheme is regular and dense in both storage and computation without sacrificing accuracy.
arXiv Detail & Related papers (2022-03-04T19:54:31Z)
Prune Once for All: Sparse Pre-Trained Language Models [0.6063525456640462]
We present a new method for training sparse pre-trained Transformer language models by integrating weight pruning and model distillation. These sparse pre-trained models can be used to transfer learning for a wide range of tasks while maintaining their sparsity pattern. We show how the compressed sparse pre-trained models we trained transfer their knowledge to five different downstream natural language tasks with minimal accuracy loss.
arXiv Detail & Related papers (2021-11-10T15:52:40Z)
Effective Model Sparsification by Scheduled Grow-and-Prune Methods [73.03533268740605]
We propose a novel scheduled grow-and-prune (GaP) methodology without pre-training the dense models. Experiments have shown that such models can match or beat the quality of highly optimized dense models at 80% sparsity on a variety of tasks.
arXiv Detail & Related papers (2021-06-18T01:03:13Z)
Single-path Bit Sharing for Automatic Loss-aware Model Compression [126.98903867768732]
Single-path Bit Sharing (SBS) is able to significantly reduce computational cost while achieving promising performance. Our SBS compressed MobileNetV2 achieves 22.6x Bit-Operation (BOP) reduction with only 0.1% drop in the Top-1 accuracy.
arXiv Detail & Related papers (2021-01-13T08:28:21Z)
Training with Quantization Noise for Extreme Model Compression [57.51832088938618]
We tackle the problem of producing compact models, maximizing their accuracy for a given model size. A standard solution is to train networks with Quantization Aware Training, where the weights are quantized during training and the gradients approximated with the Straight-Through Estimator. In this paper, we extend this approach to work beyond int8 fixed-point quantization with extreme compression methods.
arXiv Detail & Related papers (2020-04-15T20:10:53Z)

This list is automatically generated from the titles and abstracts of the papers in this site.