Automatic Joint Structured Pruning and Quantization for Efficient Neural Network Training and Compression
- URL: http://arxiv.org/abs/2502.16638v1
- Date: Sun, 23 Feb 2025 16:28:18 GMT
- Title: Automatic Joint Structured Pruning and Quantization for Efficient Neural Network Training and Compression
- Authors: Xiaoyi Qu, David Aponte, Colby Banbury, Daniel P. Robinson, Tianyu Ding, Kazuhito Koishida, Ilya Zharkov, Tianyi Chen
- Abstract summary: Structured pruning and quantization are fundamental techniques used to reduce the size of deep neural networks (DNNs). Applying these techniques jointly via co-optimization has the potential to produce smaller, high-quality models. We present the framework GETA, which automatically and efficiently performs joint structured pruning and quantization-aware training on any DNNs.
- Score: 44.35542987414442
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Structured pruning and quantization are fundamental techniques used to reduce the size of deep neural networks (DNNs) and typically are applied independently. Applying these techniques jointly via co-optimization has the potential to produce smaller, high-quality models. However, existing joint schemes are not widely used because of (1) engineering difficulties (complicated multi-stage processes), (2) black-box optimization (extensive hyperparameter tuning to control the overall compression), and (3) insufficient architecture generalization. To address these limitations, we present the framework GETA, which automatically and efficiently performs joint structured pruning and quantization-aware training on any DNNs. GETA introduces three key innovations: (i) a quantization-aware dependency graph (QADG) that constructs a pruning search space for generic quantization-aware DNN, (ii) a partially projected stochastic gradient method that guarantees layerwise bit constraints are satisfied, and (iii) a new joint learning strategy that incorporates interpretable relationships between pruning and quantization. We present numerical experiments on both convolutional neural networks and transformer architectures that show that our approach achieves competitive (often superior) performance compared to existing joint pruning and quantization methods.
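The abstract's second innovation, a partially projected stochastic gradient method, can be pictured with a minimal sketch. Assume the layerwise bit constraint is a simple box [b_min, b_max] on continuous per-layer bitwidth variables and that only those variables are projected after each step, while the weights are updated without projection. The function name, bounds, and variable layout below are illustrative assumptions, not GETA's actual interface.

```python
import torch

def partially_projected_sgd_step(weight_params, bitwidth_params, lr,
                                 b_min=2.0, b_max=8.0):
    # Hypothetical sketch: the weights take an ordinary SGD step, while the
    # per-layer bitwidth variables take a gradient step followed by a
    # projection (here a clamp) onto the assumed bit constraint [b_min, b_max].
    with torch.no_grad():
        for w in weight_params:          # unconstrained block of variables
            w -= lr * w.grad
        for b in bitwidth_params:        # constrained block of variables
            b -= lr * b.grad
            b.clamp_(b_min, b_max)       # projection keeps the bit constraint satisfied
```

Because the projection is applied only to the bitwidth block, every iterate satisfies the layerwise bit constraints while the rest of the optimization proceeds as plain SGD.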
Related papers
- Unified Stochastic Framework for Neural Network Quantization and Pruning [11.721939479875271]
This paper introduces a unified framework for post-training quantization and pruning using path-following algorithms. Our approach builds on the Stochastic Path Following Quantization (SPFQ) method, extending its applicability to pruning and low-bit quantization regimes.
arXiv Detail & Related papers (2024-12-24T05:38:01Z)
- Quantized Approximately Orthogonal Recurrent Neural Networks [6.524758376347808]
We explore the quantization of the weight matrices in ORNNs, leading to Quantized approximately Orthogonal RNNs (QORNNs).
We propose and investigate two strategies to learn QORNN by combining quantization-aware training (QAT) and computation projections.
The most efficient models achieve results similar to state-of-the-art full-precision ORNN, LSTM and FastRNN on a variety of standard benchmarks, even with 4-bit quantization.
arXiv Detail & Related papers (2024-02-05T09:59:57Z)
- OTOv3: Automatic Architecture-Agnostic Neural Network Training and Compression from Structured Pruning to Erasing Operators [57.145175475579315]
This topic spans various techniques, from structured pruning to neural architecture search, encompassing the perspectives of both pruning and erasing operators.
We introduce the third-generation Only-Train-Once (OTOv3), which first automatically trains and compresses a general DNN through pruning and erasing operations.
Our empirical results demonstrate the efficacy of OTOv3 across various benchmarks in structured pruning and neural architecture search.
arXiv Detail & Related papers (2023-12-15T00:22:55Z)
- HKNAS: Classification of Hyperspectral Imagery Based on Hyper Kernel Neural Architecture Search [104.45426861115972]
We propose to directly generate structural parameters by utilizing the specifically designed hyper kernels.
We obtain three kinds of networks to separately conduct pixel-level or image-level classifications with 1-D or 3-D convolutions.
A series of experiments on six public datasets demonstrate that the proposed methods achieve state-of-the-art results.
arXiv Detail & Related papers (2023-04-23T17:27:40Z)
- AutoQNN: An End-to-End Framework for Automatically Quantizing Neural Networks [6.495218751128902]
We propose an end-to-end framework, AutoQNN, for automatically quantizing different layers using different schemes and bitwidths without any human labor.
QPL is the first method to learn mixed-precision policies by reparameterizing the bitwidths of quantizing schemes.
QAG is designed to convert arbitrary architectures into corresponding quantized ones without manual intervention.
arXiv Detail & Related papers (2023-04-07T11:14:21Z)
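The QPL idea above, learning mixed-precision policies by reparameterizing bitwidths, can be illustrated with a common generic relaxation: a softmax over candidate bitwidths turns the discrete choice into a differentiable mixture of fake-quantized weights. The candidate set, scaling, and function below are assumptions for illustration, not AutoQNN's actual parameterization.

```python
import torch
import torch.nn.functional as F

def soft_mixed_precision_quant(w, logits, candidate_bits=(2, 4, 8)):
    # Relax the discrete bitwidth choice into a softmax mixture so the
    # selection logits receive gradients during training.
    probs = F.softmax(logits, dim=0)            # one logit per candidate bitwidth
    scale = w.abs().max().clamp(min=1e-8)
    mix = torch.zeros_like(w)
    for p, bits in zip(probs, candidate_bits):
        q_levels = 2 ** (bits - 1) - 1          # symmetric signed grid
        q = torch.round(w / scale * q_levels) / q_levels * scale
        q = w + (q - w).detach()                # straight-through estimator
        mix = mix + p * q
    return mix
```

During training the logits are learned jointly with the weights; at the end, the highest-probability bitwidth per layer would be kept as the mixed-precision policy.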
- Training Multi-bit Quantized and Binarized Networks with A Learnable Symmetric Quantizer [1.9659095632676098]
Quantizing weights and activations of deep neural networks is essential for deploying them in resource-constrained devices or cloud platforms.
While binarization is a special case of quantization, this extreme case often leads to several training difficulties.
We develop a unified quantization framework, denoted as UniQ, to overcome binarization difficulties.
arXiv Detail & Related papers (2021-04-01T02:33:31Z)
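A minimal sketch of a learnable symmetric quantizer in the spirit of the summary above: a single trainable step size defines a uniform signed grid, binarization falls out as the 1-bit case, and a straight-through estimator handles the rounding. The exact formulation in UniQ may differ.

```python
import torch

def learnable_symmetric_quantize(x, step, bits):
    # `step` is a trainable scalar tensor; gradients reach it through the
    # scaling below thanks to the straight-through estimator (STE).
    if bits == 1:
        # Binary special case: outputs are {-step, +step}; STE on the sign.
        s = (torch.sign(x) - x / step).detach() + x / step
        return step * s
    qmax = 2 ** (bits - 1) - 1
    v = torch.clamp(x / step, -qmax, qmax)
    v_bar = (torch.round(v) - v).detach() + v   # round in forward, identity in backward
    return step * v_bar
```

In quantization-aware training this would be applied to the weight (and optionally activation) tensors in each forward pass.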
- Ps and Qs: Quantization-aware pruning for efficient low latency neural network inference [56.24109486973292]
We study the interplay between pruning and quantization during the training of neural networks for ultra low latency applications.
We find that quantization-aware pruning yields more computationally efficient models than either pruning or quantization alone for our task.
arXiv Detail & Related papers (2021-02-22T19:00:05Z)
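To make the interplay studied above concrete, here is a generic sketch of quantization-aware pruning: the weight is masked and fake-quantized in the same forward pass, so training adapts to both compressions at once. The magnitude-based mask and the 8-bit symmetric quantizer are illustrative choices, not the paper's exact recipe.

```python
import torch

def prune_and_fake_quantize(w, mask, bits=8):
    # Apply a precomputed pruning mask, then symmetric fake quantization,
    # with a straight-through estimator so gradients reach the dense weights.
    w_pruned = w * mask
    qmax = 2 ** (bits - 1) - 1
    scale = w_pruned.abs().max().clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(w_pruned / scale), -qmax, qmax) * scale
    return w_pruned + (q - w_pruned).detach()

# One way to build a magnitude mask keeping roughly the largest half of the weights:
# k = w.numel() // 2
# threshold = w.abs().flatten().kthvalue(k).values
# mask = (w.abs() > threshold).float()
```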
- Once Quantization-Aware Training: High Performance Extremely Low-bit Architecture Search [112.05977301976613]
We propose to combine Network Architecture Search methods with quantization to enjoy the merits of both sides.
We first propose the joint training of architecture and quantization with a shared step size to acquire a large number of quantized models.
A bit-inheritance scheme is then introduced to transfer the quantized models to lower bit-widths, which further reduces the time cost and improves quantization accuracy.
arXiv Detail & Related papers (2020-10-09T03:52:16Z)
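The two-stage idea in the summary above can be sketched generically: candidate bitwidths share one quantization step size during joint training, and a lower-bit model then "inherits" the higher-bit model's weights and step sizes as its starting point. The helper names and the `bits` attribute below are hypothetical.

```python
import copy
import torch

def quantize_with_shared_step(w, shared_step, bits):
    # All candidate bitwidths reuse the same step size, so switching the
    # bitwidth only changes the clamping range of the signed grid.
    qmax = 2 ** (bits - 1) - 1
    return shared_step * torch.clamp(torch.round(w / shared_step), -qmax, qmax)

def inherit_to_lower_bit(model_high, lower_bits):
    # Hypothetical bit-inheritance step: copy the higher-bit model (weights
    # and shared step sizes) and retarget its bitwidth before fine-tuning.
    model_low = copy.deepcopy(model_high)
    model_low.bits = lower_bits      # assumed attribute storing the target bitwidth
    return model_low
```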
- Optimal Gradient Quantization Condition for Communication-Efficient Distributed Training [99.42912552638168]
Communication of gradients is costly for training deep neural networks with multiple devices in computer vision applications.
In this work, we deduce the optimal condition of both the binary and multi-level gradient quantization for any gradient distribution.
Based on the optimal condition, we develop two novel quantization schemes: biased BinGrad and unbiased ORQ for binary and multi-level gradient quantization respectively.
arXiv Detail & Related papers (2020-02-25T18:28:39Z)
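As a generic illustration of the multi-level gradient quantization analyzed above (not the paper's BinGrad or ORQ constructions), the following QSGD-style quantizer stochastically rounds each gradient entry onto a small uniform grid so that the result is unbiased.

```python
import torch

def stochastic_multilevel_quantize(grad, levels=4):
    # Each entry's magnitude is stochastically rounded to one of `levels`
    # uniformly spaced values, so the quantized gradient equals the original
    # gradient in expectation (unbiased).
    scale = grad.abs().max().clamp(min=1e-12)
    x = grad.abs() / scale * (levels - 1)       # map magnitudes to [0, levels-1]
    lower = torch.floor(x)
    prob_up = x - lower                         # round up with this probability
    q = lower + torch.bernoulli(prob_up)
    return torch.sign(grad) * q / (levels - 1) * scale
```

With levels=2 each entry is stochastically snapped to 0 or plus/minus the scale, the coarsest case; more levels trade communication cost for lower quantization variance.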
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.