Trainable Bitwise Soft Quantization for Input Feature Compression
- URL: http://arxiv.org/abs/2603.05172v1
- Date: Thu, 05 Mar 2026 13:40:55 GMT
- Title: Trainable Bitwise Soft Quantization for Input Feature Compression
- Authors: Karsten Schrödter, Jan Stenkamp, Nina Herrmann, Fabian Gieseke
- Abstract summary: We propose a task-specific, trainable feature quantization layer that compresses the input features of a neural network. This can significantly reduce the amount of data that needs to be transferred from the device to a remote server.
- Score: 0.7559720049837458
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The growing demand for machine learning applications in the context of the Internet of Things calls for new approaches to optimize the use of limited compute and memory resources. Despite significant progress in reducing model sizes and improving efficiency, many applications still depend on remote servers to provide the necessary resources. However, such approaches rely on transmitting data from edge devices to remote servers, which may not always be feasible due to bandwidth, latency, or energy constraints. We propose a task-specific, trainable feature quantization layer that compresses the input features of a neural network. This can significantly reduce the amount of data that needs to be transferred from the device to a remote server. In particular, the layer allows each input feature to be quantized to a user-defined number of bits, enabling simple on-device compression at the time of data collection. The layer is designed to approximate step functions with sigmoids, enabling trainable quantization thresholds. By concatenating outputs from multiple sigmoids, introduced as bitwise soft quantization, it achieves trainable quantized values when integrated with a neural network. We compare our method to full-precision inference as well as to several quantization baselines. Experiments show that our approach outperforms standard quantization methods while maintaining accuracy levels close to those of full-precision models. In particular, depending on the dataset, compression factors of $5\times$ to $16\times$ can be achieved compared to $32$-bit inputs without significant performance loss.
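The abstract only sketches the mechanism at a high level. The PyTorch snippet below is a minimal illustration of that idea, not the authors' implementation: each input feature is passed through several sigmoids with trainable thresholds, and the concatenated soft bits replace the raw feature as the input to the downstream network. The single-threshold (thermometer-style) bit encoding, the fixed temperature, and all names are assumptions for the sake of a runnable sketch.

```python
import torch
import torch.nn as nn


class BitwiseSoftQuantization(nn.Module):
    """Illustrative sketch of a trainable bitwise soft quantization layer.

    Each input feature is mapped to `num_bits` soft bits; every bit is produced
    by a sigmoid that approximates a step function with a trainable threshold.
    The rounded bits are what an edge device would transmit; the soft bits keep
    the layer differentiable during training. The exact bit construction in the
    paper may differ; this is an assumption-laden simplification.
    """

    def __init__(self, num_features: int, num_bits: int = 4, temperature: float = 10.0):
        super().__init__()
        # One trainable threshold per feature and per bit.
        self.thresholds = nn.Parameter(
            torch.linspace(-1.0, 1.0, num_bits).repeat(num_features, 1)
        )
        self.temperature = temperature

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_features) -> soft bits: (batch, num_features, num_bits)
        soft_bits = torch.sigmoid(self.temperature * (x.unsqueeze(-1) - self.thresholds))
        if not self.training:
            # At inference the sigmoids collapse to (approximate) step functions.
            soft_bits = (soft_bits > 0.5).float()
        # Concatenate the bits along the feature dimension for the downstream network.
        return soft_bits.flatten(start_dim=1)


# Usage: compress 8 input features to 4 bits each (32 bits per sample instead
# of 8 * 32 = 256 bits at full precision).
layer = BitwiseSoftQuantization(num_features=8, num_bits=4)
x = torch.randn(2, 8)
print(layer(x).shape)  # torch.Size([2, 32])
```

During training the soft bits keep gradients flowing into the thresholds; at inference the hard, rounded bits are what would actually be collected and sent from the device, which is where the compression comes from.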
Related papers
- Learning Grouped Lattice Vector Quantizers for Low-Bit LLM Compression [57.54335545892155]
We introduce a Grouped Lattice Vector Quantization (GLVQ) framework that assigns each group of weights a customized lattice codebook. Our approach achieves a better trade-off between model size and accuracy compared to existing post-training quantization baselines.
arXiv Detail & Related papers (2025-10-23T20:19:48Z) - An Efficient Quantum Classifier Based on Hamiltonian Representations [50.467930253994155]
Quantum machine learning (QML) is a discipline that seeks to transfer the advantages of quantum computing to data-driven tasks. We propose an efficient approach that circumvents the costs associated with data encoding by mapping inputs to a finite set of Pauli strings. We evaluate our approach on text and image classification tasks against well-established classical and quantum models.
arXiv Detail & Related papers (2025-04-13T11:49:53Z) - Quantization without Tears [26.5790668319932]
Quantization without Tears (QwT) is a method that simultaneously achieves quantization speed, accuracy, simplicity, and generality. QwT incorporates a lightweight additional structure into the quantized network to mitigate information loss during quantization. Extensive experiments across various vision, language, and multimodal tasks demonstrate that QwT is both highly effective and versatile.
arXiv Detail & Related papers (2024-11-21T08:13:24Z) - Post-Training Quantization for Re-parameterization via Coarse & Fine Weight Splitting [13.270381125055275]
We propose a coarse & fine weight splitting (CFWS) method to reduce the quantization error of weights.
We develop an improved KL metric to determine optimal quantization scales for activations.
For example, the quantized RepVGG-A1 model exhibits a mere 0.3% accuracy loss.
arXiv Detail & Related papers (2023-12-17T02:31:20Z) - On-Chip Hardware-Aware Quantization for Mixed Precision Neural Networks [52.97107229149988]
We propose an On-Chip Hardware-Aware Quantization framework, performing hardware-aware mixed-precision quantization on deployed edge devices.
For efficiency metrics, we build an On-Chip Quantization Aware pipeline, which allows the quantization process to perceive the actual hardware efficiency of the quantization operator.
For accuracy metrics, we propose Mask-Guided Quantization Estimation technology to effectively estimate the accuracy impact of operators in the on-chip scenario.
arXiv Detail & Related papers (2023-09-05T04:39:34Z) - Scaled Quantization for the Vision Transformer [0.0]
Quantization using a small number of bits shows promise for reducing latency and memory usage in deep neural networks.
This paper proposes a robust method for the full integer quantization of vision transformer networks without requiring any intermediate floating-point computations.
arXiv Detail & Related papers (2023-03-23T18:31:21Z) - A Practical Mixed Precision Algorithm for Post-Training Quantization [15.391257986051249]
Mixed-precision quantization is a promising solution to find a better performance-efficiency trade-off than homogeneous quantization.
We present a simple post-training mixed precision algorithm that only requires a small unlabeled calibration dataset.
We show that we can find mixed precision networks that provide a better trade-off between accuracy and efficiency than their homogeneous bit-width equivalents.
arXiv Detail & Related papers (2023-02-10T17:47:54Z) - OMPQ: Orthogonal Mixed Precision Quantization [72.63889596498004]
Mixed precision quantization takes advantage of hardware's multiple bit-width arithmetic operations to unleash the full potential of network quantization. We propose to optimize a proxy metric, the concept of network orthogonality, which is highly correlated with the loss of the integer programming. This approach reduces the search time and required data amount by orders of magnitude, with little compromise on quantization accuracy.
arXiv Detail & Related papers (2021-09-16T10:59:33Z) - Cluster-Promoting Quantization with Bit-Drop for Minimizing Network Quantization Loss [61.26793005355441]
Cluster-Promoting Quantization (CPQ) finds the optimal quantization grids for neural networks.
DropBits is a new bit-drop technique that revises the standard dropout regularization to randomly drop bits instead of neurons.
We experimentally validate our method on various benchmark datasets and network architectures.
arXiv Detail & Related papers (2021-09-05T15:15:07Z) - Quantized Neural Networks via {-1, +1} Encoding Decomposition and Acceleration [83.84684675841167]
We propose a novel encoding scheme using {-1, +1} to decompose quantized neural networks (QNNs) into multi-branch binary networks (a simplified sketch of such a decomposition appears after this list).
We validate the effectiveness of our method on large-scale image classification, object detection, and semantic segmentation tasks.
arXiv Detail & Related papers (2021-06-18T03:11:15Z) - A White Paper on Neural Network Quantization [20.542729144379223]
We introduce state-of-the-art algorithms for mitigating the impact of quantization noise on the network's performance.
We consider two main classes of algorithms: Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT).
arXiv Detail & Related papers (2021-06-15T17:12:42Z) - Ps and Qs: Quantization-aware pruning for efficient low latency neural network inference [56.24109486973292]
We study the interplay between pruning and quantization during the training of neural networks for ultra low latency applications.
We find that quantization-aware pruning yields more computationally efficient models than either pruning or quantization alone for our task.
arXiv Detail & Related papers (2021-02-22T19:00:05Z)
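The {-1, +1} encoding decomposition referenced in the list above is only summarized there. The sketch below shows one plausible reading of it, with all names invented for illustration: a k-bit non-negative integer weight tensor is split into its binary digits, and each digit plane is remapped from {0, 1} to {-1, +1} so that every branch can, in principle, use cheap binary arithmetic. The paper's exact scheme may differ.

```python
import torch


def decompose_pm1(w_int: torch.Tensor, num_bits: int):
    """Illustrative sketch (not the paper's exact scheme): decompose a
    non-negative integer tensor with values in [0, 2**num_bits - 1] into
    binary branches taking values in {-1, +1}.

    Each branch B_i satisfies bit_i = (B_i + 1) / 2, so the original tensor
    can be reconstructed as sum_i 2**i * (B_i + 1) / 2.
    """
    branches = []
    for i in range(num_bits):
        bit = (w_int >> i) & 1          # i-th binary digit, in {0, 1}
        branches.append(2 * bit - 1)    # map {0, 1} -> {-1, +1}
    return branches


def reconstruct(branches):
    """Recombine the {-1, +1} branches into the original integer tensor."""
    return sum((2 ** i) * ((b + 1) // 2) for i, b in enumerate(branches))


# Usage: a 4-bit quantized weight tensor round-trips exactly.
w = torch.randint(0, 16, (3, 3))
assert torch.equal(reconstruct(decompose_pm1(w, num_bits=4)), w)
```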