End-to-End Neural Network Compression via $\frac{\ell_1}{\ell_2}$
Regularized Latency Surrogates
- URL: http://arxiv.org/abs/2306.05785v2
- Date: Tue, 13 Jun 2023 06:41:47 GMT
- Title: End-to-End Neural Network Compression via $\frac{\ell_1}{\ell_2}$
Regularized Latency Surrogates
- Authors: Anshul Nasery, Hardik Shah, Arun Sai Suggala, Prateek Jain
- Abstract summary: Our algorithm is versatile and can be used with many popular compression methods including pruning, low-rank factorization, and quantization.
It is fast and runs in almost the same amount of time as single model training.
- Score: 20.31383698391339
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Neural network (NN) compression via techniques such as pruning and
quantization requires setting compression hyperparameters (e.g., the number of
channels to be pruned, bitwidths for quantization) for each layer, either
manually or via neural architecture search (NAS), which can be computationally
expensive. We address this problem by providing an end-to-end technique that
optimizes for a model's Floating Point Operations (FLOPs) or for on-device
latency via a novel $\frac{\ell_1}{\ell_2}$ latency surrogate. Our algorithm is
versatile and can be used with many popular compression methods including
pruning, low-rank factorization, and quantization. Crucially, it is fast and
runs in almost the same amount of time as single model training, which is a
significant training speed-up over standard NAS methods. For BERT compression
on GLUE fine-tuning tasks, we achieve a $50\%$ reduction in FLOPs with only a
$1\%$ drop in performance. For compressing MobileNetV3 on ImageNet-1K, we
achieve a $15\%$ reduction in FLOPs and an $11\%$ reduction in on-device
latency without a drop in accuracy, while still requiring $3\times$ less
training compute than SOTA compression techniques. Finally, for transfer
learning on smaller datasets, our technique identifies $1.2\times$-$1.4\times$
cheaper architectures than the standard MobileNetV3 and EfficientNet suites of
architectures at almost the same training cost and accuracy.
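To make the general idea concrete, below is a minimal, hypothetical PyTorch sketch: learnable per-channel gates are attached to each layer, and their $\frac{\ell_1}{\ell_2}$ ratio is penalized alongside the task loss so that training itself drives some channels toward zero. The module name `GatedConv`, the uniform per-layer weighting, and the strength `lambda_lat` are illustrative assumptions; the paper's actual surrogate targets FLOPs or measured on-device latency and also covers low-rank factorization and quantization variables, so its exact form differs.

```python
import torch
import torch.nn as nn

class GatedConv(nn.Module):
    """Conv layer whose output channels are scaled by learnable gates.

    Gates driven toward zero mark channels that can be pruned after
    training, so the gate vector acts as a continuous, differentiable
    compression hyperparameter for this layer.
    """
    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size,
                              padding=kernel_size // 2)
        self.alpha = nn.Parameter(torch.ones(out_ch))

    def forward(self, x):
        return self.conv(x) * self.alpha.view(1, -1, 1, 1)


def l1_over_l2(v, eps=1e-8):
    """Scale-invariant l1/l2 ratio of a gate vector; smaller means sparser."""
    return v.abs().sum() / (v.norm(p=2) + eps)


def latency_surrogate(model):
    """Sum the l1/l2 terms over all gated layers (uniform layer weights here,
    purely for illustration; a FLOPs- or latency-aware surrogate would weight
    each layer's term by its cost)."""
    return sum(l1_over_l2(m.alpha) for m in model.modules()
               if isinstance(m, GatedConv))


# One hypothetical training step: task loss plus the regularized surrogate.
model = nn.Sequential(GatedConv(3, 16), nn.ReLU(), GatedConv(16, 32))
x, y = torch.randn(4, 3, 32, 32), torch.randn(4, 32, 32, 32)
lambda_lat = 1e-2  # illustrative regularization strength
loss = nn.functional.mse_loss(model(x), y) + lambda_lat * latency_surrogate(model)
loss.backward()
```

After such training, channels whose gates remain near zero can be removed, which is how the FLOPs or latency reduction would be realized in this sketch.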
Related papers
- Instant Complexity Reduction in CNNs using Locality-Sensitive Hashing [50.79602839359522]
We propose HASTE (Hashing for Tractable Efficiency), a parameter-free and data-free module that acts as a plug-and-play replacement for any regular convolution module.
We are able to drastically compress latent feature maps without sacrificing much accuracy by using locality-sensitive hashing (LSH).
In particular, we are able to instantly drop 46.72% of FLOPs while only losing 1.25% accuracy by just swapping the convolution modules in a ResNet34 on CIFAR-10 for our HASTE module.
arXiv Detail & Related papers (2023-09-29T13:09:40Z)
- Run, Don't Walk: Chasing Higher FLOPS for Faster Neural Networks [15.519170283930276]
We propose a novel partial convolution (PConv) that extracts spatial features more efficiently, by cutting down redundant computation and memory access simultaneously.
Building upon our PConv, we further propose FasterNet, a new family of neural networks, which attains substantially higher running speed than others on a wide range of devices.
Our large FasterNet-L achieves an impressive $83.5\%$ top-1 accuracy, on par with the emerging Swin-B, while having $36\%$ higher inference throughput on GPU.
arXiv Detail & Related papers (2023-03-07T06:05:30Z)
- Efficient Latency-Aware CNN Depth Compression via Two-Stage Dynamic Programming [15.458305667190256]
We propose a novel depth compression algorithm which targets general convolution operations.
We achieve a $1.41\times$ speed-up with a $0.11\%$p accuracy gain in MobileNetV2-1.0 on ImageNet.
arXiv Detail & Related papers (2023-01-28T13:08:54Z)
- Communication-Efficient Adam-Type Algorithms for Distributed Data Mining [93.50424502011626]
We propose a class of novel distributed Adam-type algorithms (i.e., SketchedAMSGrad) utilizing sketching.
Our new algorithm achieves a fast convergence rate of $O(\frac{1}{\sqrt{nT}} + \frac{1}{(k/d)^2 T})$ with a communication cost of $O(k \log(d))$ at each iteration.
arXiv Detail & Related papers (2022-10-14T01:42:05Z)
- An Information Theory-inspired Strategy for Automatic Network Pruning [88.51235160841377]
Deep convolutional neural networks typically have to be compressed before they can be deployed on devices with resource constraints.
Most existing network pruning methods require laborious human effort and prohibitive computational resources.
We propose an information theory-inspired strategy for automatic model compression.
arXiv Detail & Related papers (2021-08-19T07:03:22Z)
- NAS-BERT: Task-Agnostic and Adaptive-Size BERT Compression with Neural Architecture Search [100.71365025972258]
We propose NAS-BERT, an efficient method for BERT compression.
NAS-BERT trains a big supernet on a search space and outputs multiple compressed models with adaptive sizes and latency.
Experiments on GLUE and SQuAD benchmark datasets demonstrate that NAS-BERT can find lightweight models with better accuracy than previous approaches.
arXiv Detail & Related papers (2021-05-30T07:20:27Z)
- Single-path Bit Sharing for Automatic Loss-aware Model Compression [126.98903867768732]
Single-path Bit Sharing (SBS) is able to significantly reduce computational cost while achieving promising performance.
Our SBS-compressed MobileNetV2 achieves a 22.6x Bit-Operation (BOP) reduction with only a 0.1% drop in Top-1 accuracy.
arXiv Detail & Related papers (2021-01-13T08:28:21Z)
- Layer-Wise Data-Free CNN Compression [49.73757297936685]
We show how to generate layer-wise training data using only a pretrained network.
We present results for layer-wise compression using quantization and pruning.
arXiv Detail & Related papers (2020-11-18T03:00:05Z)
- Learned Threshold Pruning [15.394473766381518]
Our method learns per-layer thresholds via gradient descent, unlike conventional methods where they are set as input.
It takes $30$ epochs for our method to prune ResNet50 on ImageNet by a factor of $9.1$.
We also show that our method effectively prunes modern compact architectures such as EfficientNet, MobileNetV2 and MixNet (a hedged sketch of the learnable-threshold idea follows after this list).
arXiv Detail & Related papers (2020-02-28T21:32:39Z)
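As a companion to the Learned Threshold Pruning entry above, here is a minimal, hypothetical sketch of the underlying idea: a per-layer threshold learned by gradient descent through a sigmoid-smoothed magnitude mask. The class name `SoftThresholdLinear`, the temperature value, and the sparsity penalty are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class SoftThresholdLinear(nn.Module):
    """Linear layer whose weights are gated by a differentiable magnitude mask.

    A learnable per-layer threshold tau is compared against |w|; a sigmoid
    with a small temperature smooths the comparison so gradients can flow
    into tau. Weights whose soft mask stays near zero can be pruned later.
    """
    def __init__(self, in_features, out_features, temperature=1e-3):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.bias = nn.Parameter(torch.zeros(out_features))
        self.tau = nn.Parameter(torch.tensor(0.01))  # learnable per-layer threshold
        self.temperature = temperature

    def soft_mask(self):
        # Approximately 0 where |w| < tau and approximately 1 where |w| > tau.
        return torch.sigmoid((self.weight.abs() - self.tau) / self.temperature)

    def forward(self, x):
        return nn.functional.linear(x, self.weight * self.soft_mask(), self.bias)


# Hypothetical usage: the task loss plus a penalty on kept weights pushes
# each layer's tau upward, pruning more aggressively where it is cheap.
layer = SoftThresholdLinear(128, 64)
x, y = torch.randn(8, 128), torch.randn(8, 64)
keep_penalty = 1e-4 * layer.soft_mask().sum()   # illustrative sparsity pressure
loss = nn.functional.mse_loss(layer(x), y) + keep_penalty
loss.backward()
```

After training, weights whose soft mask remains near zero can be removed outright, yielding per-layer pruning ratios without hand-set thresholds.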