Neural Network Compression Framework for fast model inference
- URL: http://arxiv.org/abs/2002.08679v4
- Date: Wed, 30 Dec 2020 08:17:23 GMT
- Title: Neural Network Compression Framework for fast model inference
- Authors: Alexander Kozlov and Ivan Lazarevich and Vasily Shamporov and Nikolay
Lyalyushkin and Yury Gorbachev
- Abstract summary: We present a new framework for neural network compression with fine-tuning, which we call the Neural Network Compression Framework (NNCF).
It leverages recent advances in various network compression methods and implements some of them, such as sparsity, quantization, and binarization.
The framework can be used with the training samples supplied with it, or as a standalone package that can be seamlessly integrated into existing training code.
- Score: 59.65531492759006
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this work we present a new framework for neural network compression with
fine-tuning, which we call the Neural Network Compression Framework (NNCF). It
leverages recent advances in various network compression methods and implements
some of them, such as sparsity, quantization, and binarization. These methods
allow obtaining more hardware-friendly models that can be run efficiently on
general-purpose hardware computation units (CPU, GPU) or specialized Deep Learning
accelerators. We show that the developed methods can be successfully applied to
a wide range of models to accelerate inference while keeping the original
accuracy. The framework can be used with the training samples supplied with it,
or as a standalone package that can be seamlessly integrated into existing
training code with minimal adaptations. Currently, a PyTorch version of NNCF is
available as part of OpenVINO Training Extensions at
https://github.com/openvinotoolkit/nncf.
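To make the compression-with-fine-tuning idea above concrete, below is a minimal, hypothetical PyTorch sketch (not the NNCF API) that fine-tunes a toy model under a magnitude-based sparsity mask and then simulates 8-bit weight quantization; the model, data, sparsity level, and bit width are placeholder assumptions.

```python
# Minimal, hypothetical sketch of compression-aware fine-tuning (NOT the NNCF API).
# Assumes PyTorch; the model, data, sparsity level, and bit width are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

def magnitude_mask(weight, sparsity=0.5):
    """Binary mask that zeroes the smallest-magnitude entries of a weight tensor."""
    threshold = torch.quantile(weight.abs().flatten(), sparsity)
    return (weight.abs() > threshold).float()

def fake_quantize(weight, num_bits=8):
    """Uniform symmetric quantize-dequantize, simulating low-bit weights."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = weight.abs().max().clamp(min=1e-8) / qmax
    return torch.round(weight / scale).clamp(-qmax, qmax) * scale

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
masks = {name: magnitude_mask(m.weight.detach())
         for name, m in model.named_modules() if isinstance(m, nn.Linear)}
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

x, y = torch.randn(16, 32), torch.randint(0, 10, (16,))   # dummy batch
for step in range(10):                                     # fine-tuning loop
    loss = F.cross_entropy(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    with torch.no_grad():                                  # keep pruned weights at zero
        for name, m in model.named_modules():
            if name in masks:
                m.weight.mul_(masks[name])

with torch.no_grad():                                      # simulate 8-bit weights for export
    for name, m in model.named_modules():
        if name in masks:
            m.weight.copy_(fake_quantize(m.weight))
```

In the framework itself, such transformations are applied to the model automatically so that existing training code needs only minimal adaptations, as stated in the abstract.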
Related papers
- Tiled Bit Networks: Sub-Bit Neural Network Compression Through Reuse of Learnable Binary Vectors [4.95475852994362]
We propose a new form of quantization to tile neural network layers with sequences of bits to achieve sub-bit compression of binary-weighted neural networks.
We apply the approach to both fully connected and convolutional layers, which account for the bulk of the parameters in most neural architectures.
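As a rough, hypothetical illustration of the tiling idea (not the paper's exact scheme), the sketch below rebuilds a binary weight matrix by repeating one small learnable {-1, +1} tile, so the number of stored bits per weight drops below one.

```python
# Toy illustration of sub-bit compression by tiling one shared binary vector.
# Sizes are arbitrary; training this would additionally need a straight-through estimator.
import torch

out_features, in_features, tile_len = 64, 256, 512
tile = torch.nn.Parameter(torch.randn(tile_len))   # one learnable real-valued tile

def tiled_binary_weight():
    n = out_features * in_features
    repeats = -(-n // tile_len)                     # ceil division
    bits = torch.sign(tile).repeat(repeats)[:n]     # reuse the same {-1, +1} pattern
    return bits.view(out_features, in_features)

w = tiled_binary_weight()
# Storage cost: tile_len bits for 64 * 256 weights = 512 / 16384 = 0.03125 bits per weight.
print(w.shape, tile_len / (out_features * in_features))
```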
arXiv Detail & Related papers (2024-07-16T15:55:38Z)
- LCS: Learning Compressible Subspaces for Adaptive Network Compression at Inference Time [57.52251547365967]
We propose a method for training a "compressible subspace" of neural networks that contains a fine-grained spectrum of models.
We present results for achieving arbitrarily fine-grained accuracy-efficiency trade-offs at inference time for structured and unstructured sparsity.
Our algorithm extends to quantization at variable bit widths, achieving accuracy on par with individually trained networks.
arXiv Detail & Related papers (2021-10-08T17:03:34Z)
- Compact representations of convolutional neural networks via weight pruning and quantization [63.417651529192014]
We propose a novel storage format for convolutional neural networks (CNNs) based on source coding and leveraging both weight pruning and quantization.
We achieve a reduction of space occupancy up to 0.6% on fully connected layers and 5.44% on the whole network, while performing at least as competitively as the baseline.
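For orientation, here is a minimal, generic prune-then-quantize-then-encode sketch (the paper's actual source-coding-based format is more elaborate); the matrix size, sparsity level, and bit width below are arbitrary assumptions.

```python
# Generic prune + quantize + sparse-store sketch (illustrative; not the paper's format).
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
w = rng.standard_normal((128, 256)).astype(np.float32)    # hypothetical layer weights

# 1) Magnitude pruning: drop the 90% smallest-magnitude weights.
threshold = np.quantile(np.abs(w), 0.9)
w_pruned = np.where(np.abs(w) > threshold, w, 0.0).astype(np.float32)

# 2) Uniform 8-bit quantization of the surviving weights (scale stored separately).
scale = np.abs(w_pruned).max() / 127.0
w_int8 = np.clip(np.round(w_pruned / scale), -127, 127).astype(np.int8)

# 3) Compressed sparse row storage: only nonzero int8 values plus indices are kept.
w_csr = sparse.csr_matrix(w_int8)
dense_bytes = w.size * 4
sparse_bytes = w_csr.data.nbytes + w_csr.indices.nbytes + w_csr.indptr.nbytes
print(f"dense fp32: {dense_bytes} B, pruned + quantized CSR: {sparse_bytes} B")
```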
arXiv Detail & Related papers (2021-08-28T20:39:54Z)
- Quantized Neural Networks via {-1, +1} Encoding Decomposition and Acceleration [83.84684675841167]
We propose a novel encoding scheme using {-1, +1} to decompose quantized neural networks (QNNs) into multi-branch binary networks.
We validate the effectiveness of our method on large-scale image classification, object detection, and semantic segmentation tasks.
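As a hedged sketch of the basic identity behind such decompositions (not the paper's full method or its acceleration scheme), the snippet below rewrites an unsigned M-bit quantized tensor exactly as a weighted sum of M tensors whose entries are in {-1, +1}.

```python
# Toy check: an M-bit quantized tensor equals a weighted sum of {-1, +1} "branches".
# This is the generic binary expansion only; shapes and bit width are hypothetical.
import torch

num_bits = 4
q = torch.randint(0, 2 ** num_bits, (3, 5))              # hypothetical quantized weights

branches = []
for i in range(num_bits):
    bit = (q >> i) & 1                                    # i-th bit, in {0, 1}
    branches.append(2 * bit - 1)                          # map to {-1, +1}

# q = sum_i 2^(i-1) * t_i + (2^num_bits - 1) / 2, with t_i in {-1, +1}
recon = sum(2.0 ** (i - 1) * t.float() for i, t in enumerate(branches))
recon = recon + (2 ** num_bits - 1) / 2
assert torch.equal(recon, q.float())
print("exact reconstruction from", num_bits, "binary branches")
```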
arXiv Detail & Related papers (2021-06-18T03:11:15Z)
- Compact CNN Structure Learning by Knowledge Distillation [34.36242082055978]
We propose a framework that leverages knowledge distillation along with customizable block-wise optimization to learn a lightweight CNN structure.
Our method results in state-of-the-art network compression while achieving better inference accuracy.
In particular, for the already compact network MobileNet_v2, our method offers up to 2x and 5.2x better model compression.
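Below is a minimal, generic knowledge-distillation loss for reference (not the paper's customizable block-wise optimization); the temperature and weighting are arbitrary assumptions.

```python
# Generic knowledge-distillation loss sketch (not the paper's block-wise scheme).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, T=4.0, alpha=0.5):
    """Blend soft-target KL divergence (teacher -> student) with standard cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, targets)
    return alpha * soft + (1.0 - alpha) * hard

# Hypothetical usage with random logits for a 10-class problem.
student = torch.randn(8, 10, requires_grad=True)
teacher = torch.randn(8, 10)
loss = distillation_loss(student, teacher, torch.randint(0, 10, (8,)))
loss.backward()
```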
arXiv Detail & Related papers (2021-04-19T10:34:22Z)
- SparseDNN: Fast Sparse Deep Learning Inference on CPUs [1.6244541005112747]
We present SparseDNN, a sparse deep learning inference engine targeting CPUs.
We show that our sparse code generator can achieve significant speedups over state-of-the-art sparse and dense libraries.
arXiv Detail & Related papers (2021-01-20T03:27:35Z)
- Structured Sparsification with Joint Optimization of Group Convolution and Channel Shuffle [117.95823660228537]
We propose a novel structured sparsification method for efficient network compression.
The proposed method automatically induces structured sparsity on the convolutional weights.
We also address the problem of inter-group communication with a learnable channel shuffle mechanism.
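As a rough sketch of how structured sparsity can be encouraged on convolutional weights (a generic group-lasso penalty, not the paper's regularizer or its learnable channel shuffle), see below; the layer shape and number of groups are assumptions.

```python
# Generic group-lasso penalty pushing a convolution toward a group-convolution structure.
# Illustrative only; layer shape and group count are hypothetical.
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3)
num_groups = 4

def group_sparsity_penalty(weight, num_groups):
    """Sum of L2 norms over (output-group, input-group) blocks; drives whole blocks to zero."""
    out_c, in_c, kh, kw = weight.shape
    blocks = weight.view(num_groups, out_c // num_groups,
                         num_groups, in_c // num_groups, kh, kw)
    return blocks.pow(2).sum(dim=(1, 3, 4, 5)).sqrt().sum()

# Added to the task loss during training, e.g. loss = task_loss + 1e-4 * penalty.
penalty = group_sparsity_penalty(conv.weight, num_groups)
penalty.backward()
```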
arXiv Detail & Related papers (2020-02-19T12:03:10Z)
- Model Fusion via Optimal Transport [64.13185244219353]
We present a layer-wise model fusion algorithm for neural networks.
We show that this can successfully yield "one-shot" knowledge transfer between neural networks trained on heterogeneous non-i.i.d. data.
arXiv Detail & Related papers (2019-10-12T22:07:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the generated content (including all information) and is not responsible for any consequences of its use.