Automatic Joint Structured Pruning and Quantization for Efficient Neural Network Training and Compression
- URL: http://arxiv.org/abs/2502.16638v1
- Date: Sun, 23 Feb 2025 16:28:18 GMT
- Title: Automatic Joint Structured Pruning and Quantization for Efficient Neural Network Training and Compression
- Authors: Xiaoyi Qu, David Aponte, Colby Banbury, Daniel P. Robinson, Tianyu Ding, Kazuhito Koishida, Ilya Zharkov, Tianyi Chen
- Abstract summary: Structured pruning and quantization are fundamental techniques used to reduce the size of deep neural networks (DNNs). Applying these techniques jointly via co-optimization has the potential to produce smaller, high-quality models. We present the framework GETA, which automatically and efficiently performs joint structured pruning and quantization-aware training on any DNNs.
- Score: 44.35542987414442
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Structured pruning and quantization are fundamental techniques used to reduce the size of deep neural networks (DNNs) and typically are applied independently. Applying these techniques jointly via co-optimization has the potential to produce smaller, high-quality models. However, existing joint schemes are not widely used because of (1) engineering difficulties (complicated multi-stage processes), (2) black-box optimization (extensive hyperparameter tuning to control the overall compression), and (3) insufficient architecture generalization. To address these limitations, we present the framework GETA, which automatically and efficiently performs joint structured pruning and quantization-aware training on any DNNs. GETA introduces three key innovations: (i) a quantization-aware dependency graph (QADG) that constructs a pruning search space for generic quantization-aware DNN, (ii) a partially projected stochastic gradient method that guarantees layerwise bit constraints are satisfied, and (iii) a new joint learning strategy that incorporates interpretable relationships between pruning and quantization. We present numerical experiments on both convolutional neural networks and transformer architectures that show that our approach achieves competitive (often superior) performance compared to existing joint pruning and quantization methods.
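The abstract's second innovation, a partially projected stochastic gradient method, can be pictured with a minimal sketch. Assume the layerwise bit constraint is a simple box [b_min, b_max] on continuous per-layer bitwidth variables and that only those variables are projected after each step, while the weights are updated without projection. The function name, bounds, and variable layout below are illustrative assumptions, not GETA's actual interface.

```python
import torch

def partially_projected_sgd_step(weight_params, bitwidth_params, lr,
                                 b_min=2.0, b_max=8.0):
    # Hypothetical sketch: the weights take an ordinary SGD step, while the
    # per-layer bitwidth variables take a gradient step followed by a
    # projection (here a clamp) onto the assumed bit constraint [b_min, b_max].
    with torch.no_grad():
        for w in weight_params:          # unconstrained block of variables
            w -= lr * w.grad
        for b in bitwidth_params:        # constrained block of variables
            b -= lr * b.grad
            b.clamp_(b_min, b_max)       # projection keeps the bit constraint satisfied
```

Because the projection is applied only to the bitwidth block, every iterate satisfies the layerwise bit constraints while the rest of the optimization proceeds as plain SGD.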
Related papers
- Unified Stochastic Framework for Neural Network Quantization and Pruning [11.721939479875271]
This paper introduces a unified framework for post-training quantization and pruning using path-following algorithms. Our approach builds on the Stochastic Path Following Quantization (SPFQ) method, extending its applicability to pruning and low-bit quantization regimes.
arXiv Detail & Related papers (2024-12-24T05:38:01Z)
- Quantized Approximately Orthogonal Recurrent Neural Networks [6.524758376347808]
We explore the quantization of the weight matrices in ORNNs, leading to Quantized approximately Orthogonal RNNs (QORNNs).
We propose and investigate two strategies to learn QORNN by combining quantization-aware training (QAT) and computation projections.
The most efficient models achieve results similar to state-of-the-art full-precision ORNN, LSTM and FastRNN on a variety of standard benchmarks, even with 4-bit quantization.
arXiv Detail & Related papers (2024-02-05T09:59:57Z)
- OTOv3: Automatic Architecture-Agnostic Neural Network Training and Compression from Structured Pruning to Erasing Operators [57.145175475579315]
This topic spans various techniques, from structured pruning to neural architecture search, encompassing the perspectives of both pruning and erasing operators.
We introduce the third-generation Only-Train-Once (OTOv3), which first automatically trains and compresses a general DNN through pruning and erasing operations.
Our empirical results demonstrate the efficacy of OTOv3 across various benchmarks in structured pruning and neural architecture search.
arXiv Detail & Related papers (2023-12-15T00:22:55Z)
- HKNAS: Classification of Hyperspectral Imagery Based on Hyper Kernel Neural Architecture Search [104.45426861115972]
We propose to directly generate structural parameters by utilizing the specifically designed hyper kernels.
We obtain three kinds of networks to separately conduct pixel-level or image-level classifications with 1-D or 3-D convolutions.
A series of experiments on six public datasets demonstrate that the proposed methods achieve state-of-the-art results.
arXiv Detail & Related papers (2023-04-23T17:27:40Z)
- AutoQNN: An End-to-End Framework for Automatically Quantizing Neural Networks [6.495218751128902]
We propose an end-to-end framework, AutoQNN, for automatically quantizing different layers using different schemes and bitwidths without any human labor.
QPL is the first method to learn mixed-precision policies by reparameterizing the bitwidths of quantizing schemes.
QAG is designed to convert arbitrary architectures into corresponding quantized ones without manual intervention.
arXiv Detail & Related papers (2023-04-07T11:14:21Z)
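The QPL idea above, learning mixed-precision policies by reparameterizing bitwidths, can be illustrated with a common generic relaxation: a softmax over candidate bitwidths turns the discrete choice into a differentiable mixture of fake-quantized weights. The candidate set, scaling, and function below are assumptions for illustration, not AutoQNN's actual parameterization.

```python
import torch
import torch.nn.functional as F

def soft_mixed_precision_quant(w, logits, candidate_bits=(2, 4, 8)):
    # Relax the discrete bitwidth choice into a softmax mixture so the
    # selection logits receive gradients during training.
    probs = F.softmax(logits, dim=0)            # one logit per candidate bitwidth
    scale = w.abs().max().clamp(min=1e-8)
    mix = torch.zeros_like(w)
    for p, bits in zip(probs, candidate_bits):
        q_levels = 2 ** (bits - 1) - 1          # symmetric signed grid
        q = torch.round(w / scale * q_levels) / q_levels * scale
        q = w + (q - w).detach()                # straight-through estimator
        mix = mix + p * q
    return mix
```

During training the logits are learned jointly with the weights; at the end, the highest-probability bitwidth per layer would be kept as the mixed-precision policy.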
- Training Multi-bit Quantized and Binarized Networks with A Learnable Symmetric Quantizer [1.9659095632676098]
Quantizing weights and activations of deep neural networks is essential for deploying them in resource-constrained devices or cloud platforms.
While binarization is a special case of quantization, this extreme case often leads to several training difficulties.
We develop a unified quantization framework, denoted as UniQ, to overcome binarization difficulties.
arXiv Detail & Related papers (2021-04-01T02:33:31Z)
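A minimal sketch of a learnable symmetric quantizer in the spirit of the summary above: a single trainable step size defines a uniform signed grid, binarization falls out as the 1-bit case, and a straight-through estimator handles the rounding. The exact formulation in UniQ may differ.

```python
import torch

def learnable_symmetric_quantize(x, step, bits):
    # `step` is a trainable scalar tensor; gradients reach it through the
    # scaling below thanks to the straight-through estimator (STE).
    if bits == 1:
        # Binary special case: outputs are {-step, +step}; STE on the sign.
        s = (torch.sign(x) - x / step).detach() + x / step
        return step * s
    qmax = 2 ** (bits - 1) - 1
    v = torch.clamp(x / step, -qmax, qmax)
    v_bar = (torch.round(v) - v).detach() + v   # round in forward, identity in backward
    return step * v_bar
```

In quantization-aware training this would be applied to the weight (and optionally activation) tensors in each forward pass.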
- Ps and Qs: Quantization-aware pruning for efficient low latency neural network inference [56.24109486973292]
We study the interplay between pruning and quantization during the training of neural networks for ultra low latency applications.
We find that quantization-aware pruning yields more computationally efficient models than either pruning or quantization alone for our task.
arXiv Detail & Related papers (2021-02-22T19:00:05Z)
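To make the interplay studied above concrete, here is a generic sketch of quantization-aware pruning: the weight is masked and fake-quantized in the same forward pass, so training adapts to both compressions at once. The magnitude-based mask and the 8-bit symmetric quantizer are illustrative choices, not the paper's exact recipe.

```python
import torch

def prune_and_fake_quantize(w, mask, bits=8):
    # Apply a precomputed pruning mask, then symmetric fake quantization,
    # with a straight-through estimator so gradients reach the dense weights.
    w_pruned = w * mask
    qmax = 2 ** (bits - 1) - 1
    scale = w_pruned.abs().max().clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(w_pruned / scale), -qmax, qmax) * scale
    return w_pruned + (q - w_pruned).detach()

# One way to build a magnitude mask keeping roughly the largest half of the weights:
# k = w.numel() // 2
# threshold = w.abs().flatten().kthvalue(k).values
# mask = (w.abs() > threshold).float()
```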
- Once Quantization-Aware Training: High Performance Extremely Low-bit Architecture Search [112.05977301976613]
We propose to combine Network Architecture Search methods with quantization to enjoy the merits of both sides.
We first propose the joint training of architecture and quantization with a shared step size to acquire a large number of quantized models.
A bit-inheritance scheme is then introduced to transfer the quantized models to lower bit-widths, which further reduces the time cost and improves quantization accuracy.
arXiv Detail & Related papers (2020-10-09T03:52:16Z)
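The two-stage idea in the summary above can be sketched generically: candidate bitwidths share one quantization step size during joint training, and a lower-bit model then "inherits" the higher-bit model's weights and step sizes as its starting point. The helper names and the `bits` attribute below are hypothetical.

```python
import copy
import torch

def quantize_with_shared_step(w, shared_step, bits):
    # All candidate bitwidths reuse the same step size, so switching the
    # bitwidth only changes the clamping range of the signed grid.
    qmax = 2 ** (bits - 1) - 1
    return shared_step * torch.clamp(torch.round(w / shared_step), -qmax, qmax)

def inherit_to_lower_bit(model_high, lower_bits):
    # Hypothetical bit-inheritance step: copy the higher-bit model (weights
    # and shared step sizes) and retarget its bitwidth before fine-tuning.
    model_low = copy.deepcopy(model_high)
    model_low.bits = lower_bits      # assumed attribute storing the target bitwidth
    return model_low
```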
- Optimal Gradient Quantization Condition for Communication-Efficient Distributed Training [99.42912552638168]
Communication of gradients is costly for training deep neural networks with multiple devices in computer vision applications.
In this work, we deduce the optimal condition of both the binary and multi-level gradient quantization for any gradient distribution.
Based on the optimal condition, we develop two novel quantization schemes: biased BinGrad and unbiased ORQ for binary and multi-level gradient quantization respectively.
arXiv Detail & Related papers (2020-02-25T18:28:39Z)
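As a generic illustration of the multi-level gradient quantization analyzed above (not the paper's BinGrad or ORQ constructions), the following QSGD-style quantizer stochastically rounds each gradient entry onto a small uniform grid so that the result is unbiased.

```python
import torch

def stochastic_multilevel_quantize(grad, levels=4):
    # Each entry's magnitude is stochastically rounded to one of `levels`
    # uniformly spaced values, so the quantized gradient equals the original
    # gradient in expectation (unbiased).
    scale = grad.abs().max().clamp(min=1e-12)
    x = grad.abs() / scale * (levels - 1)       # map magnitudes to [0, levels-1]
    lower = torch.floor(x)
    prob_up = x - lower                         # round up with this probability
    q = lower + torch.bernoulli(prob_up)
    return torch.sign(grad) * q / (levels - 1) * scale
```

With levels=2 each entry is stochastically snapped to 0 or plus/minus the scale, the coarsest case; more levels trade communication cost for lower quantization variance.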
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.