Extreme Model Compression with Structured Sparsity at Low Precision
- URL: http://arxiv.org/abs/2511.08360v1
- Date: Wed, 12 Nov 2025 01:55:22 GMT
- Title: Extreme Model Compression with Structured Sparsity at Low Precision
- Authors: Dan Liu, Nikita Dvornik, Xue Liu
- Abstract summary: Deep neural networks (DNNs) are used in many applications, but their large size and high computational cost make them hard to run on devices with limited resources. Two widely used techniques to address this challenge are weight quantization, which lowers the precision of all weights, and structured sparsity, which removes unimportant weights while retaining the important ones at full precision. We introduce SLOPE (Structured Sparsity at Low Precision), a unified framework that combines structured sparsity and low-bit quantization in a principled way.
- Score: 10.976782748075067
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep neural networks (DNNs) are used in many applications, but their large size and high computational cost make them hard to run on devices with limited resources. Two widely used techniques to address this challenge are weight quantization, which lowers the precision of all weights, and structured sparsity, which removes unimportant weights while retaining the important ones at full precision. Although both are effective individually, they are typically studied in isolation due to their compounded negative impact on model accuracy when combined. In this work, we introduce SLOPE (Structured Sparsity at Low Precision), a unified framework that effectively combines structured sparsity and low-bit quantization in a principled way. We show that naively combining sparsity and quantization severely harms performance due to the compounded impact of both techniques. To address this, we propose a training-time regularization strategy that minimizes the discrepancy between full-precision weights and their sparse, quantized counterparts by promoting angular alignment rather than direct matching. On ResNet-18, SLOPE achieves $\sim20\times$ model size reduction while retaining $\sim$99% of the original accuracy. It consistently outperforms state-of-the-art quantization and structured sparsity methods across classification, detection, and segmentation tasks on models such as ResNet-18, ViT-Small, and Mask R-CNN.
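The angular-alignment idea lends itself to a compact sketch. Below is a minimal PyTorch illustration of a regularizer of this kind: a (1 - cosine similarity) penalty between the full-precision weights and a sparse, quantized copy. The uniform quantizer, row-wise channel mask, and loss weight `lam` are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def fake_quantize(w: torch.Tensor, bits: int = 2) -> torch.Tensor:
    # Uniform symmetric fake-quantization (an illustrative choice,
    # not necessarily SLOPE's exact quantizer).
    qmax = 2 ** (bits - 1) - 1
    scale = (w.abs().max() / qmax).clamp_min(1e-8)
    return torch.round(w / scale).clamp(-qmax, qmax) * scale

def structured_mask(w: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    # Structured sparsity: keep whole output channels (rows) with the
    # largest L1 norm, zero the rest.
    k = max(1, int(keep_ratio * w.shape[0]))
    idx = w.abs().sum(dim=1).topk(k).indices
    mask = torch.zeros_like(w)
    mask[idx] = 1.0
    return mask

def angular_alignment_loss(w: torch.Tensor) -> torch.Tensor:
    # Penalize the angle between the full-precision weights and their
    # sparse, quantized counterpart instead of matching values directly.
    target = (fake_quantize(w) * structured_mask(w)).detach()
    cos = F.cosine_similarity(w.flatten(), target.flatten(), dim=0)
    return 1.0 - cos

# In a training loop (hypothetical usage):
#   total_loss = task_loss + lam * angular_alignment_loss(layer.weight)
```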
Related papers
- Astro: Activation-guided Structured Regularization for Outlier-Robust LLM Post-Training Quantization [56.5199302532159]
We propose an Activation-guided Structured Regularization framework to suppress the negative effects of outliers. Astro actively reconstructs intrinsically robust weights, aggressively suppressing weight outliers corresponding to high-magnitude activations. Astro achieves highly competitive performance; notably, on LLaMA-2-7B, it outperforms complex learning-based rotation methods in roughly one third of the quantization time.
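As a rough illustration of the activation-guided idea, the sketch below penalizes weight columns in proportion to the magnitude of the activations they multiply; `act_scale`, the calibration statistic, and the quadratic penalty are assumptions for illustration, not Astro's actual objective.

```python
import torch

def activation_guided_penalty(weight: torch.Tensor,
                              act_scale: torch.Tensor) -> torch.Tensor:
    # weight:    (out_features, in_features)
    # act_scale: (in_features,) per-channel activation magnitude, e.g. a
    #            running mean of |x| collected on calibration data (assumed).
    # Columns that meet high-magnitude activations are penalized more,
    # discouraging weight outliers exactly where they hurt quantization most.
    return (weight.abs() * act_scale.unsqueeze(0)).pow(2).mean()
```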
arXiv Detail & Related papers (2026-02-07T15:50:18Z)
- PT$^2$-LLM: Post-Training Ternarization for Large Language Models [52.4629647715623]
Large Language Models (LLMs) have shown impressive capabilities across diverse tasks, but their large memory and compute demands hinder deployment. We propose PT$^2$-LLM, a post-training ternarization framework tailored for LLMs. At its core is an Asymmetric Ternary Quantizer equipped with a two-stage refinement pipeline.
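A minimal sketch of what an asymmetric ternary quantizer can look like, with independent scales for positive and negative weights; the threshold heuristic and scale fitting below are illustrative stand-ins for the paper's two-stage refinement, not its actual quantizer.

```python
import torch

def asymmetric_ternarize(w: torch.Tensor, thresh_ratio: float = 0.7) -> torch.Tensor:
    # Map weights to {-s_neg, 0, +s_pos}; the separate positive and
    # negative scales are the "asymmetric" part.
    delta = thresh_ratio * w.abs().mean()          # pruning threshold (heuristic)
    pos, neg = w > delta, w < -delta
    s_pos = w[pos].mean() if pos.any() else w.new_zeros(())
    s_neg = w[neg].mean().abs() if neg.any() else w.new_zeros(())
    return torch.where(pos, s_pos, torch.where(neg, -s_neg, torch.zeros_like(w)))
```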
arXiv Detail & Related papers (2025-09-27T03:01:48Z)
- Progressive Element-wise Gradient Estimation for Neural Network Quantization [2.1413624861650358]
Quantization-Aware Training (QAT) methods rely on the Straight-Through Estimator (STE) to address the non-differentiability of discretization functions. We propose Progressive Element-wise Gradient Estimation (PEGE) to reduce discretization errors between continuous and quantized values. PEGE consistently outperforms existing backpropagation methods and enables low-precision models to match or even outperform the accuracy of their full-precision counterparts.
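The progressive idea can be sketched as a schedule that blends continuous and quantized weights, so discretization error is introduced gradually rather than all at once; the blend rule and toy quantizer below are assumptions, not PEGE's exact estimator.

```python
import torch

def progressive_quant(w: torch.Tensor, alpha: float) -> torch.Tensor:
    # Blend the continuous weights with their quantized version;
    # `alpha` anneals from 0 (no quantization) to 1 (fully quantized)
    # over training, so gradients face a progressively harder function.
    wq = torch.round(w.clamp(-1, 1) * 127) / 127   # toy 8-bit quantizer
    wq = w + (wq - w).detach()                     # straight-through gradient
    return (1 - alpha) * w + alpha * wq
```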
arXiv Detail & Related papers (2025-08-27T15:59:36Z)
- Column-wise Quantization of Weights and Partial Sums for Accurate and Efficient Compute-In-Memory Accelerators [7.728820930581886]
Compute-in-memory (CIM) is an efficient method for implementing deep neural networks (DNNs) but suffers from substantial analog-to-digital converter (ADC) overhead. Low-precision ADCs can reduce this overhead but introduce partial-sum quantization errors that degrade accuracy. This work addresses these challenges by aligning weight and partial-sum quantization granularities at the column-wise level.
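Column-wise weight quantization, the granularity the paper aligns partial sums to, can be sketched as one scale per weight-matrix column; the ADC and partial-sum modeling in the paper is not reproduced here.

```python
import torch

def columnwise_quantize(w: torch.Tensor, bits: int = 4):
    # One scale per column, matching the granularity of column-wise
    # partial sums in a CIM array; the ADC itself is not modeled.
    qmax = 2 ** (bits - 1) - 1
    scale = (w.abs().amax(dim=0, keepdim=True) / qmax).clamp_min(1e-8)
    q = torch.round(w / scale).clamp(-qmax, qmax)
    return q, scale                  # reconstruct with q * scale
```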
arXiv Detail & Related papers (2025-02-11T05:32:14Z)
- Effective Interplay between Sparsity and Quantization: From Theory to Practice [33.697590845745815]
We show how sparsity and quantization interact when combined. Even if applied in the correct order, the compounded errors from sparsity and quantization can significantly harm accuracy. Our findings extend to the efficient deployment of large models on resource-constrained compute platforms.
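A toy numerical check of the ordering question the paper studies: compare the reconstruction error of sparsify-then-quantize against quantize-then-sparsify. Magnitude pruning and a uniform quantizer are illustrative choices.

```python
import torch

def quantize(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    qmax = 2 ** (bits - 1) - 1
    s = (w.abs().max() / qmax).clamp_min(1e-8)
    return torch.round(w / s).clamp(-qmax, qmax) * s

def prune(w: torch.Tensor, sparsity: float = 0.5) -> torch.Tensor:
    k = int(sparsity * w.numel())
    thresh = w.abs().flatten().kthvalue(k).values
    return w * (w.abs() > thresh)

w = torch.randn(256, 256)
err_sq = (quantize(prune(w)) - w).norm()  # sparsify, then quantize
err_qs = (prune(quantize(w)) - w).norm()  # quantize, then sparsify
print(f"S->Q error: {err_sq:.3f}   Q->S error: {err_qs:.3f}")
```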
arXiv Detail & Related papers (2024-05-31T15:34:13Z)
- SqueezeLLM: Dense-and-Sparse Quantization [80.32162537942138]
The main bottleneck for generative inference with LLMs is memory bandwidth, rather than compute, for single-batch inference.
We introduce SqueezeLLM, a post-training quantization framework that enables lossless compression at ultra-low precision, down to 3 bits.
Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format.
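The Dense-and-Sparse decomposition can be sketched directly: pull the largest-magnitude outliers into an exact sparse tensor and quantize the dense remainder. The uniform quantizer below stands in for SqueezeLLM's sensitivity-based non-uniform codebooks, and `outlier_frac` is an illustrative knob.

```python
import torch

def dense_and_sparse(w: torch.Tensor, outlier_frac: float = 0.005, bits: int = 3):
    # Decompose W ~= dense_q + sparse, keeping outliers at full precision.
    k = max(1, int(outlier_frac * w.numel()))
    cutoff = w.abs().flatten().topk(k).values.min()
    outlier_mask = w.abs() >= cutoff
    sparse = (w * outlier_mask).to_sparse()        # outliers stored exactly
    dense = w * ~outlier_mask
    qmax = 2 ** (bits - 1) - 1
    s = (dense.abs().max() / qmax).clamp_min(1e-8)
    dense_q = torch.round(dense / s).clamp(-qmax, qmax) * s
    return dense_q, sparse                         # W ~= dense_q + sparse
```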
arXiv Detail & Related papers (2023-06-13T08:57:54Z)
- BiTAT: Neural Network Binarization with Task-dependent Aggregated Transformation [116.26521375592759]
Quantization aims to transform high-precision weights and activations of a given neural network into low-precision weights/activations for reduced memory usage and computation.
Extreme quantization (1-bit weights / 1-bit activations) of compactly designed backbone architectures results in severe performance degradation.
This paper proposes a novel Quantization-Aware Training (QAT) method that effectively alleviates this degradation.
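For context, a generic 1-bit weight binarization with a straight-through estimator, the baseline that QAT methods like BiTAT build on, looks roughly like this; BiTAT's task-dependent aggregated transformation itself is not reproduced here.

```python
import torch

def binarize_ste(w: torch.Tensor) -> torch.Tensor:
    # 1-bit weights: sign(w) times a per-tensor scale, with a
    # straight-through estimator so QAT gradients can flow.
    scale = w.abs().mean()
    wb = torch.sign(w) * scale
    return w + (wb - w).detach()   # forward: binary, backward: identity
```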
arXiv Detail & Related papers (2022-07-04T13:25:49Z)
- Load-balanced Gather-scatter Patterns for Sparse Deep Neural Networks [20.374784902476318]
Pruning, which introduces zeros into model weights, has been shown to provide a good trade-off between model accuracy and computational efficiency.
Some modern processors are equipped with fast on-chip scratchpad memories and gather/scatter engines that perform indirect load and store operations on such memories.
In this work, we propose a set of novel sparse patterns, named gather-scatter (GS) patterns, that use the scratchpad memories and gather/scatter engines to speed up neural network inference.
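A generic load-balanced pattern in this spirit keeps a fixed number of nonzeros per fixed-size group, so every indirect gather fetches the same number of elements; the `group`/`keep` parameters below are illustrative, and the paper's exact GS patterns differ.

```python
import torch

def balanced_prune(w: torch.Tensor, group: int = 4, keep: int = 2) -> torch.Tensor:
    # Keep exactly `keep` of every `group` consecutive weights in each row,
    # so every gather has the same length (load balance across partitions).
    out_ch, in_ch = w.shape
    assert in_ch % group == 0
    g = w.reshape(out_ch, in_ch // group, group)
    idx = g.abs().topk(keep, dim=-1).indices
    mask = torch.zeros_like(g).scatter_(-1, idx, 1.0)
    return (g * mask).reshape(out_ch, in_ch)
```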
arXiv Detail & Related papers (2021-12-20T22:55:45Z)
- Sharpness-aware Quantization for Deep Neural Networks [45.150346855368]
Sharpness-Aware Quantization (SAQ) is a novel method to explore the effect of Sharpness-Aware Minimization (SAM) on model compression.
We show that SAQ improves the generalization performance of quantized models, yielding state-of-the-art results in uniform quantization.
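The generic SAM recipe that SAQ builds on, applied to a model whose forward pass fake-quantizes its weights, can be sketched as a two-pass step; SAQ's exact interaction with the quantizer is more involved than this sketch.

```python
import torch

def saq_step(model, loss_fn, batch, opt, rho: float = 0.05):
    # One sharpness-aware training step (generic SAM recipe). Assumes
    # the model's forward pass already fake-quantizes its weights.
    x, y = batch
    loss_fn(model(x), y).backward()            # gradient at the current point
    grads = [p.grad for p in model.parameters() if p.grad is not None]
    norm = torch.norm(torch.stack([g.norm() for g in grads]))
    perturbs = []
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                continue
            e = rho * p.grad / (norm + 1e-12)
            p.add_(e)                          # climb to the local worst case
            perturbs.append((p, e))
    model.zero_grad()
    loss_fn(model(x), y).backward()            # gradient at the perturbed point
    with torch.no_grad():
        for p, e in perturbs:
            p.sub_(e)                          # undo the perturbation
    opt.step()                                 # descend with the SAM gradient
    opt.zero_grad()
```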
arXiv Detail & Related papers (2021-11-24T05:16:41Z)
- Compact representations of convolutional neural networks via weight pruning and quantization [63.417651529192014]
We propose a novel storage format for convolutional neural networks (CNNs) based on source coding and leveraging both weight pruning and quantization.
We achieve a reduction of space occupancy, down to 0.6% on fully connected layers and 5.44% on the whole network, while performing at least on par with the baseline.
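A toy stand-in for such a storage format: prune, quantize to a small codebook, and store only the surviving codes and their positions; the paper's actual source-coding stage is not reproduced.

```python
import torch

def compact_store(w: torch.Tensor, sparsity: float = 0.8, bits: int = 4):
    # Prune, quantize to a small symmetric codebook, then keep only the
    # nonzero codes and their flat positions.
    k = int(sparsity * w.numel())
    thresh = w.abs().flatten().kthvalue(k).values
    w = w * (w.abs() > thresh)
    qmax = 2 ** (bits - 1) - 1
    s = (w.abs().max() / qmax).clamp_min(1e-8)
    q = torch.round(w / s).clamp(-qmax, qmax).to(torch.int8)
    idx = q.flatten().nonzero().squeeze(1)   # positions of surviving weights
    vals = q.flatten()[idx]                  # int8 codes, `bits` significant bits
    return idx, vals, s, w.shape             # enough to reconstruct q * s
```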
arXiv Detail & Related papers (2021-08-28T20:39:54Z)
- Sparse Uncertainty Representation in Deep Learning with Inducing Weights [22.912675044223302]
We extend Matheron's conditional Gaussian sampling rule to enable fast weight sampling, which lets our inference method maintain a reasonable run-time compared with ensembles.
Our approach achieves performance competitive with the state-of-the-art in prediction and uncertainty estimation tasks with fully connected neural networks and ResNets.
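Matheron's rule itself is compact: draw a sample from the joint prior, then shift it by the conditional correction. A generic statement of the rule follows (not the paper's inducing-weight extension); covariance blocks are assumed to form a valid positive-definite joint covariance.

```python
import torch

def matheron_sample(mu_x, mu_y, S_xx, S_xy, S_yy, y_star):
    # Sample x | y = y_star for a joint Gaussian: draw (x0, y0) from the
    # joint prior, then apply the correction
    #   x = x0 + S_xy S_yy^{-1} (y_star - y0).
    n = mu_x.numel()
    joint_mu = torch.cat([mu_x, mu_y])
    joint_S = torch.cat([torch.cat([S_xx, S_xy], dim=1),
                         torch.cat([S_xy.T, S_yy], dim=1)], dim=0)
    z = torch.distributions.MultivariateNormal(joint_mu, joint_S).sample()
    x0, y0 = z[:n], z[n:]
    return x0 + S_xy @ torch.linalg.solve(S_yy, y_star - y0)
```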
arXiv Detail & Related papers (2021-05-30T18:17:47Z)
- Fully Quantized Image Super-Resolution Networks [81.75002888152159]
We propose a Fully Quantized image Super-Resolution framework (FQSR) to jointly optimize efficiency and accuracy.
We apply our quantization scheme on multiple mainstream super-resolution architectures, including SRResNet, SRGAN and EDSR.
With low-bit quantization, our FQSR achieves performance on par with full-precision counterparts on five benchmark datasets.
arXiv Detail & Related papers (2020-11-29T03:53:49Z)
- Widening and Squeezing: Towards Accurate and Efficient QNNs [125.172220129257]
Quantized neural networks (QNNs) are attractive to industry because of their extremely cheap computation and storage overhead, but their performance still lags behind networks with full-precision parameters.
Most existing methods aim to enhance the performance of QNNs, especially binary neural networks, by exploiting more effective training techniques.
We address this problem by projecting features of the original full-precision network into high-dimensional quantization features.
arXiv Detail & Related papers (2020-02-03T04:11:13Z)