Effect of Weight Quantization on Learning Models by Typical Case Analysis
- URL: http://arxiv.org/abs/2401.17269v1
- Date: Tue, 30 Jan 2024 18:58:46 GMT
- Title: Effect of Weight Quantization on Learning Models by Typical Case Analysis
- Authors: Shuhei Kashiwamura, Ayaka Sakata, Masaaki Imaizumi
- Abstract summary: The recent surge in data analysis scale has significantly increased computational resource requirements.
Quantization is vital for deploying large models on devices with limited computational resources.
- Score: 6.9060054915724
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper examines the quantization methods used in large-scale data
analysis models and their hyperparameter choices. The recent surge in data
analysis scale has significantly increased computational resource requirements.
To address this, quantizing model weights has become a prevalent practice in
data analysis applications such as deep learning. Quantization is particularly
vital for deploying large models on devices with limited computational
resources. However, the selection of quantization hyperparameters, like the
number of bits and value range for weight quantization, remains an
underexplored area. In this study, we employ the typical case analysis from
statistical physics, specifically the replica method, to explore the impact of
hyperparameters on the quantization of simple learning models. Our analysis
yields three key findings: (i) an unstable hyperparameter phase, known as
replica symmetry breaking, occurs with a small number of bits and a large
quantization width; (ii) there is an optimal quantization width that minimizes
error; and (iii) quantization delays the onset of overparameterization, helping
to mitigate overfitting as indicated by the double descent phenomenon. We also
discover that non-uniform quantization can enhance stability. Additionally, we
develop an approximate message-passing algorithm to validate our theoretical
results.
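For intuition only, here is a minimal sketch (not the paper's exact formulation) of uniform weight quantization with the two hyperparameters discussed in the abstract: the number of bits and the quantization width, i.e. the value range to which weights are clipped before being rounded onto a grid of 2^bits levels.

```python
import numpy as np

def uniform_quantize(w, num_bits=4, width=1.0):
    """Clip weights to [-width, width] and round onto a uniform grid of 2**num_bits levels."""
    levels = 2 ** num_bits
    step = 2 * width / (levels - 1)               # spacing between adjacent grid points
    w_clipped = np.clip(w, -width, width)         # the quantization width bounds the value range
    codes = np.round((w_clipped + width) / step)  # integer code in {0, ..., levels - 1}
    return codes * step - width                   # map codes back to real-valued grid points

# Example: quantize Gaussian weights and measure the distortion for several bit counts.
rng = np.random.default_rng(0)
w = rng.normal(size=10_000)
for b in (2, 4, 8):
    wq = uniform_quantize(w, num_bits=b, width=3.0)
    print(b, np.mean((w - wq) ** 2))
```

Shrinking the width clips more of the weight distribution, while widening it coarsens the grid at a fixed bit count; this trade-off is consistent with the existence of an error-minimizing quantization width reported in the abstract.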
Related papers
- Scaling Laws for Mixed quantization in Large Language Models [10.912306313183972]
Post-training quantization of Large Language Models (LLMs) has proven effective in reducing the computational requirements for running inference on these models.
In this study, we focus on a straightforward question: When aiming for a specific accuracy or perplexity target for low-precision quantization, how many high-precision numbers or calculations must be preserved as we scale LLMs to larger sizes?
arXiv Detail & Related papers (2024-10-09T09:45:01Z)
- How to Parameterize Asymmetric Quantization Ranges for Quantization-Aware Training [1.721868124457512]
This paper investigates three different parameterizations of asymmetric uniform quantization for quantization-aware training.
We propose best practices to stabilize and accelerate quantization-aware training with learnable asymmetric quantization ranges.
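The paper compares three parameterizations that this summary does not spell out. Purely as a reference point, the sketch below shows one standard scale/zero-point parameterization of an asymmetric range [qmin, qmax]; in quantization-aware training these bounds (or the derived scale and zero point) would be learnable.

```python
import numpy as np

def fake_quantize_asymmetric(x, qmin, qmax, num_bits=8):
    """Quantize-dequantize x with an asymmetric range [qmin, qmax] (scale/zero-point form)."""
    levels = 2 ** num_bits - 1
    scale = (qmax - qmin) / levels
    zero_point = np.round(-qmin / scale)
    codes = np.clip(np.round(x / scale) + zero_point, 0, levels)  # unsigned integer codes
    return scale * (codes - zero_point)                           # dequantized ("fake-quantized") values

x = np.array([-0.2, 0.0, 0.7, 1.3])
print(fake_quantize_asymmetric(x, qmin=-0.5, qmax=1.5, num_bits=4))
```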
arXiv Detail & Related papers (2024-04-25T06:58:16Z)
- Data-free Weight Compress and Denoise for Large Language Models [101.53420111286952]
We propose a novel approach termed Data-free Joint Rank-k Approximation for compressing the parameter matrices.
We achieve a model pruning of 80% parameters while retaining 93.43% of the original performance without any calibration data.
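The joint, data-free procedure proposed in the paper is more elaborate than this, but the building block it relies on, a rank-k approximation of a parameter matrix, can be sketched as follows (the matrix here is random and purely illustrative).

```python
import numpy as np

def rank_k_approximation(W, k):
    """Best rank-k approximation of a weight matrix via truncated SVD (Eckart-Young)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 64))
W_k = rank_k_approximation(W, k=16)
print(np.linalg.norm(W - W_k) / np.linalg.norm(W))  # relative approximation error
```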
arXiv Detail & Related papers (2024-02-26T05:51:47Z)
- WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More [55.0856305773081]
Large Language Models (LLMs) face significant deployment challenges due to their substantial memory requirements and the computational demands of the auto-regressive text generation process.
This paper addresses these challenges by focusing on the quantization of LLMs, a technique that reduces memory consumption by converting model parameters and activations into low-bit integers.
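This is not WKVQuant's specific scheme, but as a reminder of the basic mechanism such methods refine, here is a minimal sketch of symmetric per-tensor quantization to signed low-bit integers, applied to a toy key/value-cache-shaped tensor.

```python
import numpy as np

def quantize_to_int(x, num_bits=8):
    """Symmetric per-tensor quantization to signed low-bit integers.

    Storing the integer codes (e.g. int8 instead of float32) plus one scale
    is what saves memory for weights and cached key/value tensors.
    """
    qmax = 2 ** (num_bits - 1) - 1
    scale = np.max(np.abs(x)) / qmax
    codes = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return codes, scale

kv_cache = np.random.default_rng(0).normal(size=(4, 128, 64)).astype(np.float32)
codes, scale = quantize_to_int(kv_cache, num_bits=8)
dequantized = codes.astype(np.float32) * scale
print(np.abs(kv_cache - dequantized).max())  # worst-case round-trip error
```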
arXiv Detail & Related papers (2024-02-19T11:33:21Z)
- PreQuant: A Task-agnostic Quantization Approach for Pre-trained Language Models [52.09865918265002]
We propose a novel "quantize before fine-tuning" framework, PreQuant.
PreQuant is compatible with various quantization strategies, with outlier-aware fine-tuning incorporated to correct the induced quantization error.
We demonstrate the effectiveness of PreQuant on the GLUE benchmark using BERT, RoBERTa, and T5.
arXiv Detail & Related papers (2023-05-30T08:41:33Z)
- ZeroQuant-V2: Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation [24.34969722921442]
Post-training quantization (PTQ) has emerged as a promising technique for mitigating memory consumption and computational costs in large language models (LLMs).
We conduct a comprehensive analysis of these factors by investigating the effects of PTQ on weight-only, activation-only, and weight-and-activation quantization.
We propose an optimized method called Low-Rank Compensation (LoRC) to enhance model quality recovery with a minimal increase in model size.
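As a rough illustration of the low-rank-compensation idea (the actual LoRC method may differ in its details), the sketch below quantizes a weight matrix and then approximates the residual quantization error with a small rank-k factorization that would be stored alongside the quantized weights.

```python
import numpy as np

def quantize(W, num_bits=4):
    """Simple symmetric uniform quantization (a stand-in for any PTQ scheme)."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = np.max(np.abs(W)) / qmax
    return np.round(W / scale).clip(-qmax - 1, qmax) * scale

def low_rank_compensation(W, W_q, k=8):
    """Approximate the quantization error W - W_q with a rank-k factorization U @ V."""
    U, s, Vt = np.linalg.svd(W - W_q, full_matrices=False)
    return U[:, :k] * s[:k], Vt[:k]

rng = np.random.default_rng(0)
W = rng.normal(size=(512, 128))
W_q = quantize(W, num_bits=3)
U, V = low_rank_compensation(W, W_q, k=16)
print(np.linalg.norm(W - W_q), np.linalg.norm(W - (W_q + U @ V)))  # error before vs. after compensation
```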
arXiv Detail & Related papers (2023-03-15T01:27:15Z)
- Ternary Quantization: A Survey [12.90416661059601]
Inference time, model size, and accuracy are critical for deploying deep neural network models.
We review the evolution of ternary quantization and investigate the relationships among existing ternary quantization methods.
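The survey covers many variants; as a single illustrative example, one common threshold-based scheme (in the spirit of ternary weight networks) maps each weight to {-alpha, 0, +alpha}.

```python
import numpy as np

def ternary_quantize(w, delta_ratio=0.7):
    """Threshold-based ternary quantization: weights become {-alpha, 0, +alpha}.

    delta_ratio * mean(|w|) is a commonly used threshold heuristic; alpha is
    chosen to match the average magnitude of the weights that stay nonzero.
    """
    delta = delta_ratio * np.mean(np.abs(w))
    mask = np.abs(w) > delta                      # which weights remain nonzero
    alpha = np.abs(w[mask]).mean() if mask.any() else 0.0
    return alpha * np.sign(w) * mask

w = np.random.default_rng(0).normal(size=1000)
print(np.unique(ternary_quantize(w)))            # three distinct values
```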
arXiv Detail & Related papers (2023-03-02T03:38:51Z)
- End-to-end resource analysis for quantum interior point methods and portfolio optimization [63.4863637315163]
We provide a complete quantum circuit-level description of the algorithm from problem input to problem output.
We report the number of logical qubits and the quantity/depth of non-Clifford T-gates needed to run the algorithm.
arXiv Detail & Related papers (2022-11-22T18:54:48Z)
- Mixed-Precision Inference Quantization: Radically Towards Faster inference speed, Lower Storage requirement, and Lower Loss [4.877532217193618]
Existing quantization techniques rely heavily on experience and "fine-tuning" skills.
This study provides a methodology for acquiring a mixed-precision quantization model with a lower loss than the full-precision model.
In particular, we will demonstrate that neural networks with massive identity mappings are resistant to the quantization method.
arXiv Detail & Related papers (2022-07-20T10:55:34Z)
- Quantum Algorithms for Data Representation and Analysis [68.754953879193]
We provide quantum procedures that speed up the solution of eigenproblems for data representation in machine learning.
The power and practical use of these subroutines are shown through new quantum algorithms, sublinear in the input matrix's size, for principal component analysis, correspondence analysis, and latent semantic analysis.
Results show that the run-time parameters that do not depend on the input's size are reasonable and that the error on the computed model is small, allowing for competitive classification performances.
arXiv Detail & Related papers (2021-04-19T00:41:43Z)
- Multiplicative noise and heavy tails in stochastic optimization [62.993432503309485]
Stochastic optimization is central to modern machine learning, but the precise role of stochasticity in its success is still unclear.
We show that multiplicative noise, which commonly arises from variance in minibatch gradients, leads to heavy-tailed behaviour in the parameters.
A detailed analysis describes the dependence on key factors, including step size and data, with similar results exhibited on state-of-the-art neural network models.
arXiv Detail & Related papers (2020-06-11T09:58:01Z)