End-to-End Neural Network Compression via $\frac{\ell_1}{\ell_2}$
Regularized Latency Surrogates
- URL: http://arxiv.org/abs/2306.05785v2
- Date: Tue, 13 Jun 2023 06:41:47 GMT
- Title: End-to-End Neural Network Compression via $\frac{\ell_1}{\ell_2}$
Regularized Latency Surrogates
- Authors: Anshul Nasery, Hardik Shah, Arun Sai Suggala, Prateek Jain
- Abstract summary: Our algorithm is versatile and can be used with many popular compression methods including pruning, low-rank factorization, and quantization.
It is fast and runs in almost the same amount of time as single model training.
- Score: 20.31383698391339
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Neural network (NN) compression via techniques such as pruning and
quantization requires setting compression hyperparameters (e.g., the number of
channels to be pruned, bitwidths for quantization) for each layer, either
manually or via neural architecture search (NAS), which can be computationally
expensive. We address this problem by providing an end-to-end technique that
optimizes for a model's Floating Point Operations (FLOPs) or for on-device
latency via a novel $\frac{\ell_1}{\ell_2}$ latency surrogate. Our algorithm is
versatile and can be used with many popular compression methods including
pruning, low-rank factorization, and quantization. Crucially, it is fast and
runs in almost the same amount of time as single model training, which is a
significant training speed-up over standard NAS methods. For BERT compression
on GLUE fine-tuning tasks, we achieve a $50\%$ reduction in FLOPs with only a
$1\%$ drop in performance. For compressing MobileNetV3 on ImageNet-1K, we
achieve a $15\%$ reduction in FLOPs and an $11\%$ reduction in on-device
latency without a drop in accuracy, while still requiring $3\times$ less
training compute than SOTA compression techniques. Finally, for transfer
learning on smaller datasets, our technique identifies $1.2\times$-$1.4\times$
cheaper architectures than the standard MobileNetV3 and EfficientNet suites of
architectures at almost the same training cost and accuracy.
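To make the general idea concrete, below is a minimal, hypothetical PyTorch sketch: learnable per-channel gates are attached to each layer, and their $\frac{\ell_1}{\ell_2}$ ratio is penalized alongside the task loss so that training itself drives some channels toward zero. The module name `GatedConv`, the uniform per-layer weighting, and the strength `lambda_lat` are illustrative assumptions; the paper's actual surrogate targets FLOPs or measured on-device latency and also covers low-rank factorization and quantization variables, so its exact form differs.

```python
import torch
import torch.nn as nn

class GatedConv(nn.Module):
    """Conv layer whose output channels are scaled by learnable gates.

    Gates driven toward zero mark channels that can be pruned after
    training, so the gate vector acts as a continuous, differentiable
    compression hyperparameter for this layer.
    """
    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size,
                              padding=kernel_size // 2)
        self.alpha = nn.Parameter(torch.ones(out_ch))

    def forward(self, x):
        return self.conv(x) * self.alpha.view(1, -1, 1, 1)


def l1_over_l2(v, eps=1e-8):
    """Scale-invariant l1/l2 ratio of a gate vector; smaller means sparser."""
    return v.abs().sum() / (v.norm(p=2) + eps)


def latency_surrogate(model):
    """Sum the l1/l2 terms over all gated layers (uniform layer weights here,
    purely for illustration; a FLOPs- or latency-aware surrogate would weight
    each layer's term by its cost)."""
    return sum(l1_over_l2(m.alpha) for m in model.modules()
               if isinstance(m, GatedConv))


# One hypothetical training step: task loss plus the regularized surrogate.
model = nn.Sequential(GatedConv(3, 16), nn.ReLU(), GatedConv(16, 32))
x, y = torch.randn(4, 3, 32, 32), torch.randn(4, 32, 32, 32)
lambda_lat = 1e-2  # illustrative regularization strength
loss = nn.functional.mse_loss(model(x), y) + lambda_lat * latency_surrogate(model)
loss.backward()
```

After such training, channels whose gates remain near zero can be removed, which is how the FLOPs or latency reduction would be realized in this sketch.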
Related papers
- Instant Complexity Reduction in CNNs using Locality-Sensitive Hashing [50.79602839359522]
We propose HASTE (Hashing for Tractable Efficiency), a parameter-free and data-free module that acts as a plug-and-play replacement for any regular convolution module.
We are able to drastically compress latent feature maps without sacrificing much accuracy by using locality-sensitive hashing (LSH).
In particular, we are able to instantly drop 46.72% of FLOPs while only losing 1.25% accuracy by just swapping the convolution modules in a ResNet34 on CIFAR-10 for our HASTE module.
arXiv Detail & Related papers (2023-09-29T13:09:40Z)
- Run, Don't Walk: Chasing Higher FLOPS for Faster Neural Networks [15.519170283930276]
We propose a novel partial convolution (PConv) that extracts spatial features more efficiently, by cutting down redundant computation and memory access simultaneously.
Building upon our PConv, we further propose FasterNet, a new family of neural networks, which attains substantially higher running speed than others on a wide range of devices.
Our large FasterNet-L achieves an impressive $83.5\%$ top-1 accuracy, on par with the emerging Swin-B, while having $36\%$ higher inference throughput on GPU.
arXiv Detail & Related papers (2023-03-07T06:05:30Z)
- Efficient Latency-Aware CNN Depth Compression via Two-Stage Dynamic Programming [15.458305667190256]
We propose a novel depth compression algorithm which targets general convolution operations.
We achieve a $1.41\times$ speed-up with a $0.11\%$p accuracy gain in MobileNetV2-1.0 on ImageNet.
arXiv Detail & Related papers (2023-01-28T13:08:54Z)
- Communication-Efficient Adam-Type Algorithms for Distributed Data Mining [93.50424502011626]
We propose a class of novel distributed Adam-type algorithms (i.e., SketchedAMSGrad) utilizing sketching.
Our new algorithm achieves a fast convergence rate of $O(\frac{1}{\sqrt{nT}} + \frac{1}{(k/d)^2 T})$ with a communication cost of $O(k \log(d))$ at each iteration.
arXiv Detail & Related papers (2022-10-14T01:42:05Z)
- An Information Theory-inspired Strategy for Automatic Network Pruning [88.51235160841377]
Deep convolutional neural networks typically have to be compressed before they can be deployed on devices with resource constraints.
Most existing network pruning methods require laborious human effort and prohibitive computational resources.
We propose an information theory-inspired strategy for automatic model compression.
arXiv Detail & Related papers (2021-08-19T07:03:22Z)
- NAS-BERT: Task-Agnostic and Adaptive-Size BERT Compression with Neural Architecture Search [100.71365025972258]
We propose NAS-BERT, an efficient method for BERT compression.
NAS-BERT trains a big supernet on a search space and outputs multiple compressed models with adaptive sizes and latency.
Experiments on GLUE and SQuAD benchmark datasets demonstrate that NAS-BERT can find lightweight models with better accuracy than previous approaches.
arXiv Detail & Related papers (2021-05-30T07:20:27Z)
- Single-path Bit Sharing for Automatic Loss-aware Model Compression [126.98903867768732]
Single-path Bit Sharing (SBS) is able to significantly reduce computational cost while achieving promising performance.
Our SBS-compressed MobileNetV2 achieves a 22.6x Bit-Operation (BOP) reduction with only a 0.1% drop in Top-1 accuracy.
arXiv Detail & Related papers (2021-01-13T08:28:21Z)
- Layer-Wise Data-Free CNN Compression [49.73757297936685]
We show how to generate layer-wise training data using only a pretrained network.
We present results for layer-wise compression using quantization and pruning.
arXiv Detail & Related papers (2020-11-18T03:00:05Z)
- Learned Threshold Pruning [15.394473766381518]
Our method learns per-layer thresholds via gradient descent, unlike conventional methods where they are set as input.
It takes $30$ epochs for our method to prune ResNet50 on ImageNet by a factor of $9.1$.
We also show that our method effectively prunes modern compact architectures such as EfficientNet, MobileNetV2 and MixNet (a hedged sketch of the learnable-threshold idea follows after this list).
arXiv Detail & Related papers (2020-02-28T21:32:39Z)
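As a companion to the Learned Threshold Pruning entry above, here is a minimal, hypothetical sketch of the underlying idea: a per-layer threshold learned by gradient descent through a sigmoid-smoothed magnitude mask. The class name `SoftThresholdLinear`, the temperature value, and the sparsity penalty are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class SoftThresholdLinear(nn.Module):
    """Linear layer whose weights are gated by a differentiable magnitude mask.

    A learnable per-layer threshold tau is compared against |w|; a sigmoid
    with a small temperature smooths the comparison so gradients can flow
    into tau. Weights whose soft mask stays near zero can be pruned later.
    """
    def __init__(self, in_features, out_features, temperature=1e-3):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.bias = nn.Parameter(torch.zeros(out_features))
        self.tau = nn.Parameter(torch.tensor(0.01))  # learnable per-layer threshold
        self.temperature = temperature

    def soft_mask(self):
        # Approximately 0 where |w| < tau and approximately 1 where |w| > tau.
        return torch.sigmoid((self.weight.abs() - self.tau) / self.temperature)

    def forward(self, x):
        return nn.functional.linear(x, self.weight * self.soft_mask(), self.bias)


# Hypothetical usage: the task loss plus a penalty on kept weights pushes
# each layer's tau upward, pruning more aggressively where it is cheap.
layer = SoftThresholdLinear(128, 64)
x, y = torch.randn(8, 128), torch.randn(8, 64)
keep_penalty = 1e-4 * layer.soft_mask().sum()   # illustrative sparsity pressure
loss = nn.functional.mse_loss(layer(x), y) + keep_penalty
loss.backward()
```

After training, weights whose soft mask remains near zero can be removed outright, yielding per-layer pruning ratios without hand-set thresholds.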