One Weight Bitwidth to Rule Them All
- URL: http://arxiv.org/abs/2008.09916v2
- Date: Fri, 28 Aug 2020 18:49:48 GMT
- Title: One Weight Bitwidth to Rule Them All
- Authors: Ting-Wu Chin, Pierce I-Jen Chuang, Vikas Chandra, Diana Marculescu
- Abstract summary: We show that using a single bitwidth for the whole network can achieve better accuracy compared to mixed-precision quantization.
Our results suggest that when the number of channels becomes a target hyperparameter, a single weight bitwidth throughout the network shows superior results for model compression.
- Score: 24.373061354080825
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Weight quantization for deep ConvNets has shown promising results for
applications such as image classification and semantic segmentation and is
especially important for applications where memory storage is limited. However,
when aiming for quantization without accuracy degradation, different tasks may
end up with different bitwidths. This creates complexity for software and
hardware support and the complexity accumulates when one considers
mixed-precision quantization, in which case each layer's weights use a
different bitwidth. Our key insight is that optimizing for the least bitwidth
subject to no accuracy degradation is not necessarily an optimal strategy. This
is because one cannot decide optimality between two bitwidths if one has a
smaller model size while the other has better accuracy. In this work, we take
the first step to understand if some weight bitwidth is better than others by
aligning all to the same model size using a width-multiplier. Under this
setting, somewhat surprisingly, we show that using a single bitwidth for the
whole network can achieve better accuracy compared to mixed-precision
quantization targeting zero accuracy degradation when both have the same model
size. In particular, our results suggest that when the number of channels
becomes a target hyperparameter, a single weight bitwidth throughout the
network shows superior results for model compression.
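To make the abstract's bookkeeping concrete, the following is a minimal sketch of how a width-multiplier can equalize model size across weight bitwidths, assuming (as a rough approximation) that a ConvNet's parameter count grows quadratically with the width-multiplier; the 4.2M-parameter baseline and the bitwidths are purely illustrative.

```python
import math

def width_multiplier_for_equal_size(ref_bits: int, target_bits: int) -> float:
    """Width-multiplier that keeps total weight storage (in bits) fixed when
    moving from ref_bits to target_bits, assuming the parameter count grows
    roughly quadratically with the width-multiplier."""
    return math.sqrt(ref_bits / target_bits)

def weight_storage_mb(num_params: float, bits: int, width_mult: float = 1.0) -> float:
    """Approximate weight storage in megabytes under the same assumption."""
    return num_params * (width_mult ** 2) * bits / 8 / 1e6

# Hypothetical 4.2M-parameter reference network stored with 8-bit weights;
# lower bitwidths get proportionally wider layers so all variants match in size.
base_params = 4.2e6
for bits in (8, 4, 2):
    mult = width_multiplier_for_equal_size(ref_bits=8, target_bits=bits)
    print(f"{bits}-bit, width x{mult:.2f}: ~{weight_storage_mb(base_params, bits, mult):.2f} MB")
```

Under this accounting, halving the weight bitwidth buys a roughly 1.41x wider network at equal storage, which is the kind of trade-off the abstract describes.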
Related papers
- SqueezeLLM: Dense-and-Sparse Quantization [80.32162537942138]
The main bottleneck for generative inference with LLMs is memory bandwidth, rather than compute, for single-batch inference.
We introduce SqueezeLLM, a post-training quantization framework that enables lossless compression to ultra-low precisions of up to 3-bit.
Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format.
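As a rough illustration of the Dense-and-Sparse idea (a sketch under simplified assumptions, not SqueezeLLM's actual procedure, which also uses sensitivity-based non-uniform grids chosen from second-order information): keep the largest-magnitude weights exact in a sparse matrix and quantize the remaining dense part to a few bits.

```python
import numpy as np

def dense_and_sparse_split(W: np.ndarray, outlier_frac: float = 0.005, bits: int = 3):
    """Split W into a low-bit dense part plus a tiny full-precision sparse part.
    Outliers (largest |w|) stay exact in `sparse`; the rest is quantized to a
    uniform symmetric grid. Illustrative only; threshold and grid are assumptions."""
    k = max(1, int(outlier_frac * W.size))
    thresh = np.partition(np.abs(W).ravel(), -k)[-k]
    outlier_mask = np.abs(W) >= thresh

    sparse = np.where(outlier_mask, W, 0.0)   # would be stored in CSR/COO in practice
    dense = np.where(outlier_mask, 0.0, W)

    qmax = 2 ** (bits - 1) - 1
    scale = max(np.abs(dense).max(), 1e-8) / qmax
    dense_q = np.clip(np.round(dense / scale), -qmax, qmax) * scale
    return dense_q, sparse

W = np.random.randn(256, 256).astype(np.float32)
W[0, 0] = 25.0                                # plant an outlier
dense_q, sparse = dense_and_sparse_split(W)
print("max reconstruction error:", float(np.abs(W - (dense_q + sparse)).max()))
```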
arXiv Detail & Related papers (2023-06-13T08:57:54Z)
- A Practical Mixed Precision Algorithm for Post-Training Quantization [15.391257986051249]
Mixed-precision quantization is a promising solution to find a better performance-efficiency trade-off than homogeneous quantization.
We present a simple post-training mixed precision algorithm that only requires a small unlabeled calibration dataset.
We show that we can find mixed precision networks that provide a better trade-off between accuracy and efficiency than their homogeneous bit-width equivalents.
arXiv Detail & Related papers (2023-02-10T17:47:54Z)
- Vertical Layering of Quantized Neural Networks for Heterogeneous Inference [57.42762335081385]
We study a new vertical-layered representation of neural network weights for encapsulating all quantized models into a single one.
We can theoretically achieve any precision network for on-demand service while only needing to train and maintain one model.
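One way to picture such a vertical-layered representation (an illustrative sketch, not necessarily the paper's exact encoding) is to store 8-bit weight codes once and carve out any lower-precision model by keeping only the most significant bits.

```python
import numpy as np

def to_uint8_codes(W: np.ndarray):
    """Affine-quantize W to unsigned 8-bit codes; returns (codes, scale, offset)."""
    w_min, w_max = float(W.min()), float(W.max())
    scale = (w_max - w_min) / 255.0
    codes = np.round((W - w_min) / scale).astype(np.uint8)
    return codes, scale, w_min

def slice_k_bit(codes: np.ndarray, scale: float, offset: float, k: int) -> np.ndarray:
    """Recover a k-bit model from the same stored codes by keeping the k MSBs."""
    top = (codes >> (8 - k)).astype(np.float32)          # k-bit codes in [0, 2^k - 1]
    return top * (scale * 2 ** (8 - k)) + offset         # rescale back to float weights

W = np.random.randn(64, 64).astype(np.float32)
codes, scale, offset = to_uint8_codes(W)
for k in (8, 4, 2):
    print(k, "bits -> max error", float(np.abs(W - slice_k_bit(codes, scale, offset, k)).max()))
```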
arXiv Detail & Related papers (2022-12-10T15:57:38Z)
- Channel-wise Mixed-precision Assignment for DNN Inference on Constrained Edge Nodes [22.40937602825472]
State-of-the-art mixed-precision quantization works layer-wise, i.e., it uses different bit-widths for the weight and activation tensors of each network layer.
We propose a novel NAS that selects the bit-width of each weight tensor channel independently.
Our networks reduce the memory and energy for inference by up to 63% and 27% respectively.
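For intuition, here is a minimal sketch of channel-wise weight quantization for a single conv layer; the per-channel bit-widths below are hypothetical, whereas the paper selects them with a NAS.

```python
import numpy as np

def quantize_per_channel(W: np.ndarray, bits_per_channel: list) -> np.ndarray:
    """Symmetric uniform quantization of a conv weight of shape
    (out_channels, in_channels, kH, kW), with a separate bit-width per output channel."""
    W_q = np.empty_like(W)
    for c, bits in enumerate(bits_per_channel):
        qmax = 2 ** (bits - 1) - 1
        scale = max(np.abs(W[c]).max(), 1e-8) / qmax
        W_q[c] = np.clip(np.round(W[c] / scale), -qmax, qmax) * scale
    return W_q

W = np.random.randn(4, 16, 3, 3).astype(np.float32)
bits = [8, 4, 2, 4]                           # hypothetical per-channel assignment
W_q = quantize_per_channel(W, bits)
print([float(np.abs(W[c] - W_q[c]).max()) for c in range(4)])
```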
arXiv Detail & Related papers (2022-06-17T15:51:49Z)
- OMPQ: Orthogonal Mixed Precision Quantization [64.59700856607017]
Mixed precision quantization takes advantage of hardware's multiple bit-width arithmetic operations to unleash the full potential of network quantization.
We propose to optimize a proxy metric, network orthogonality, which is highly correlated with the loss of the integer programming problem.
This approach reduces the search time and required data amount by orders of magnitude, with little compromise on quantization accuracy.
arXiv Detail & Related papers (2021-09-16T10:59:33Z)
- Pruning Ternary Quantization [32.32812780843498]
Inference time, model size, and accuracy are three key factors in deep model compression.
We propose pruning ternary quantization (PTQ): a simple, effective, symmetric ternary quantization method.
Our method is verified on image classification and object detection/segmentation tasks with different network structures.
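As a sketch of the ternary part of the story, the snippet below shows generic threshold-based ternarization to {-alpha, 0, +alpha}; the pruning stage and training details of PTQ are omitted, and the 0.7 threshold ratio is a common heuristic rather than the paper's setting.

```python
import numpy as np

def ternarize(W: np.ndarray, delta_ratio: float = 0.7) -> np.ndarray:
    """Threshold-based ternary quantization: weights become {-alpha, 0, +alpha}.
    delta = delta_ratio * mean(|W|) is a common heuristic; alpha is the mean
    magnitude of the surviving weights."""
    delta = delta_ratio * np.abs(W).mean()
    mask = np.abs(W) > delta                  # weights kept at +/- alpha
    alpha = np.abs(W[mask]).mean() if mask.any() else 0.0
    return alpha * np.sign(W) * mask

W = np.random.randn(128, 128).astype(np.float32)
W_t = ternarize(W)
print("unique values:", np.unique(W_t).size, "sparsity:", float((W_t == 0).mean()))
```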
arXiv Detail & Related papers (2021-07-23T02:18:00Z)
- Single-path Bit Sharing for Automatic Loss-aware Model Compression [126.98903867768732]
Single-path Bit Sharing (SBS) is able to significantly reduce computational cost while achieving promising performance.
Our SBS-compressed MobileNetV2 achieves a 22.6x Bit-Operation (BOP) reduction with only a 0.1% drop in Top-1 accuracy.
arXiv Detail & Related papers (2021-01-13T08:28:21Z)
- FracBits: Mixed Precision Quantization via Fractional Bit-Widths [29.72454879490227]
Mixed precision quantization is favorable with customized hardware supporting arithmetic operations at multiple bit-widths.
We propose a novel learning-based algorithm to derive mixed precision models end-to-end under target computation constraints.
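A sketch of how a fractional bit-width can be made differentiable, assuming the common formulation that interpolates between the two adjacent integer bit-widths; the symmetric quantizer and the example values are illustrative, not the paper's full end-to-end algorithm.

```python
import numpy as np

def quantize(W: np.ndarray, bits: int) -> np.ndarray:
    """Symmetric uniform quantization at an integer bit-width."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(np.abs(W).max(), 1e-8) / qmax
    return np.clip(np.round(W / scale), -qmax, qmax) * scale

def quantize_fractional(W: np.ndarray, bits: float) -> np.ndarray:
    """Fractional bit-width via interpolation between the neighbouring integer
    bit-widths, weighted by the fractional part; this is what lets the
    bit-width itself be searched with gradients."""
    lo, hi = int(np.floor(bits)), int(np.ceil(bits))
    if lo == hi:
        return quantize(W, lo)
    frac = bits - lo
    return (1.0 - frac) * quantize(W, lo) + frac * quantize(W, hi)

W = np.random.randn(64, 64).astype(np.float32)
for b in (2.0, 2.5, 3.0):
    print(b, "bits -> max error", float(np.abs(W - quantize_fractional(W, b)).max()))
```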
arXiv Detail & Related papers (2020-07-04T06:09:09Z)
- Training with Quantization Noise for Extreme Model Compression [57.51832088938618]
We tackle the problem of producing compact models, maximizing their accuracy for a given model size.
A standard solution is to train networks with Quantization Aware Training, where the weights are quantized during training and the gradients approximated with the Straight-Through Estimator.
In this paper, we extend this approach to work beyond int8 fixed-point quantization with extreme compression methods.
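A minimal PyTorch sketch of the fake-quantization step with a straight-through estimator that the summary refers to; the optional noise_prob mask gestures at the quantization-noise idea of quantizing only a random subset of weights each step, and is a simplification rather than a faithful reproduction of the paper's method.

```python
import torch

def fake_quantize_ste(w: torch.Tensor, bits: int = 8, noise_prob: float = 1.0) -> torch.Tensor:
    """Quantize weights in the forward pass while letting gradients pass through
    unchanged (straight-through estimator). With noise_prob < 1, only a random
    subset of weights is quantized on each call."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.detach().abs().max().clamp(min=1e-8) / qmax
    w_q = torch.clamp(torch.round(w / scale), -qmax, qmax) * scale
    if noise_prob < 1.0:
        mask = (torch.rand_like(w) < noise_prob).float()
        w_q = mask * w_q + (1.0 - mask) * w
    # forward: w_q ; backward: identity with respect to w
    return w + (w_q - w).detach()

w = torch.randn(32, 32, requires_grad=True)
loss = fake_quantize_ste(w, bits=4, noise_prob=0.5).pow(2).sum()
loss.backward()
print(w.grad.abs().mean())
```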
arXiv Detail & Related papers (2020-04-15T20:10:53Z)
- Rethinking Differentiable Search for Mixed-Precision Neural Networks [83.55785779504868]
Low-precision networks with weights and activations quantized to low bit-width are widely used to accelerate inference on edge devices.
Current solutions are uniform, using an identical bit-width for all filters.
This fails to account for the different sensitivities of different filters and is suboptimal.
Mixed-precision networks address this problem, by tuning the bit-width to individual filter requirements.
arXiv Detail & Related papers (2020-04-13T07:02:23Z)
- Efficient Bitwidth Search for Practical Mixed Precision Neural Network [33.80117489791902]
Network quantization has rapidly become one of the most widely used methods to compress and accelerate deep neural networks.
Recent efforts propose to quantize weights and activations from different layers with different precision to improve the overall performance.
It is challenging to find the optimal bitwidth (i.e., precision) for weights and activations of each layer efficiently.
It is also unclear how to perform convolution efficiently on generic hardware platforms when weights and activations use different precisions.
arXiv Detail & Related papers (2020-03-17T08:27:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.