MLoRQ: Bridging Low-Rank and Quantization for Transformer Compression
- URL: http://arxiv.org/abs/2507.09616v1
- Date: Sun, 13 Jul 2025 12:48:46 GMT
- Title: MLoRQ: Bridging Low-Rank and Quantization for Transformer Compression
- Authors: Ofir Gordon, Ariel Lapid, Elad Cohen, Yarden Yagil, Arnon Netzer, Hai Victor Habi,
- Abstract summary: Mixed Low-Rank and Quantization (MLoRQ) is a novel method that integrates low-rank approximation and mixed-precision quantization.<n>MLoRQ shows state-of-the-art results with up to 15% performance improvement.
- Score: 2.9907287985468924
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deploying transformer-based neural networks on resource-constrained edge devices presents a significant challenge. This challenge is often addressed through various techniques, such as low-rank approximation and mixed-precision quantization. In this work, we introduce Mixed Low-Rank and Quantization (MLoRQ), a novel method that integrates both techniques. MLoRQ employs a two-stage optimization process to determine optimal bit-width and rank assignments for each layer, adhering to predefined memory constraints. This process includes: (i) an intra-layer optimization that identifies potentially optimal compression solutions out of all low-rank and quantization combinations; (ii) an inter-layer optimization that assigns bit-width precision and rank to each layer while ensuring the memory constraint is met. An optional final step applies a sequential optimization process using a modified adaptive rounding technique to mitigate compression-induced errors in joint low-rank approximation and quantization. The method is compatible and can be seamlessly integrated with most existing quantization algorithms. MLoRQ shows state-of-the-art results with up to 15\% performance improvement, evaluated on Vision Transformers for image classification, object detection, and instance segmentation tasks.
Related papers
- A Gradient Meta-Learning Joint Optimization for Beamforming and Antenna Position in Pinching-Antenna Systems [63.213207442368294]
We consider a novel optimization design for multi-waveguide pinching-antenna systems.<n>The proposed GML-JO algorithm is robust to different choices and better performance compared with the existing optimization methods.
arXiv Detail & Related papers (2025-06-14T17:35:27Z) - Scalable Min-Max Optimization via Primal-Dual Exact Pareto Optimization [66.51747366239299]
We propose a smooth variant of the min-max problem based on the augmented Lagrangian.<n>The proposed algorithm scales better with the number of objectives than subgradient-based strategies.
arXiv Detail & Related papers (2025-03-16T11:05:51Z) - Investigating layer-selective transfer learning of QAOA parameters for Max-Cut problem [1.515687944002438]
We numerically explore the role of individual QAOA layers in improving the approximate solution of the Max-Cut problem after parameter transfer.<n>These studies show that optimizing a subset of layers can be more effective at a lower time-cost compared to optimizing all layers.
arXiv Detail & Related papers (2024-12-30T16:41:16Z) - Variational quantum algorithm for enhanced continuous variable optical
phase sensing [0.0]
Variational quantum algorithms (VQAs) are hybrid quantum-classical approaches used for tackling a wide range of problems on noisy quantum devices.
We implement a variational algorithm designed for optimized parameter estimation on a continuous variable platform based on squeezed light.
arXiv Detail & Related papers (2023-12-21T14:11:05Z) - Federated Conditional Stochastic Optimization [110.513884892319]
Conditional optimization has found in a wide range of machine learning tasks, such as in-variant learning tasks, AUPRC, andAML.
This paper proposes algorithms for distributed federated learning.
arXiv Detail & Related papers (2023-10-04T01:47:37Z) - QuantEase: Optimization-based Quantization for Language Models [17.333778751252392]
This work introduces Quantization (PTQ) of various quantization layers from recent advances of Large Language Models (LLMs)
Our CD-based approach features straightforward updates, relying solely on vector operations.
We also explore an outlier approach, allowing for retaining significant weights (outoutliers) with complete precision.
arXiv Detail & Related papers (2023-09-05T01:39:09Z) - Mixed-Precision Quantization for Deep Vision Models with Integer Quadratic Programming [7.0146264551420066]
Quantization is a widely used technique to compress neural networks.<n>MPQ addresses this by assigning varied bit-widths to layers, optimizing the accuracy-efficiency trade-off.<n>We introduce CLADO, a practical sensitivity-based MPQ algorithm that captures crosslayer dependency of quantization error.
arXiv Detail & Related papers (2023-07-11T15:56:00Z) - Faster One-Sample Stochastic Conditional Gradient Method for Composite
Convex Minimization [61.26619639722804]
We propose a conditional gradient method (CGM) for minimizing convex finite-sum objectives formed as a sum of smooth and non-smooth terms.
The proposed method, equipped with an average gradient (SAG) estimator, requires only one sample per iteration. Nevertheless, it guarantees fast convergence rates on par with more sophisticated variance reduction techniques.
arXiv Detail & Related papers (2022-02-26T19:10:48Z) - Towards Mixed-Precision Quantization of Neural Networks via Constrained
Optimization [28.76708310896311]
We present a principled framework to solve the mixed-precision quantization problem.
We show that our method is derived in a principled way and much more computationally efficient.
arXiv Detail & Related papers (2021-10-13T08:09:26Z) - BSQ: Exploring Bit-Level Sparsity for Mixed-Precision Neural Network
Quantization [32.770842274996774]
Mixed-precision quantization can potentially achieve the optimal tradeoff between performance and compression rate of deep neural networks.
Previous methods either examine only a small manually-designed search space or utilize a cumbersome neural architecture search to explore the vast search space.
This work proposes bit-level sparsity quantization (BSQ) to tackle the mixed-precision quantization from a new angle of inducing bit-level sparsity.
arXiv Detail & Related papers (2021-02-20T22:37:41Z) - Fully Quantized Image Super-Resolution Networks [81.75002888152159]
We propose a Fully Quantized image Super-Resolution framework (FQSR) to jointly optimize efficiency and accuracy.
We apply our quantization scheme on multiple mainstream super-resolution architectures, including SRResNet, SRGAN and EDSR.
Our FQSR using low bits quantization can achieve on par performance compared with the full-precision counterparts on five benchmark datasets.
arXiv Detail & Related papers (2020-11-29T03:53:49Z) - Adaptive pruning-based optimization of parameterized quantum circuits [62.997667081978825]
Variisy hybrid quantum-classical algorithms are powerful tools to maximize the use of Noisy Intermediate Scale Quantum devices.
We propose a strategy for such ansatze used in variational quantum algorithms, which we call "Efficient Circuit Training" (PECT)
Instead of optimizing all of the ansatz parameters at once, PECT launches a sequence of variational algorithms.
arXiv Detail & Related papers (2020-10-01T18:14:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.