Related papers: Towards Next-Level Post-Training Quantization of Hyper-Scale Transformers

Towards Next-Level Post-Training Quantization of Hyper-Scale Transformers

URL: http://arxiv.org/abs/2402.08958v1
Date: Wed, 14 Feb 2024 05:58:43 GMT
Title: Towards Next-Level Post-Training Quantization of Hyper-Scale Transformers
Authors: Junhan Kim, Kyungphil Park, Chungman Lee, Ho-young Kim, Joonyoung Kim, Yongkweon Jeon
Abstract summary: Post-training quantization (PTQ) has emerged as a promising solution for deploying hyper-scale models on edge devices such as mobile devices and TVs. In this paper, we propose a novel PTQ algorithm that balances accuracy and efficiency.
Score: 10.883809442514135
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: With the increasing complexity of generative AI models, post-training quantization (PTQ) has emerged as a promising solution for deploying hyper-scale models on edge devices such as mobile devices and TVs. Existing PTQ schemes, however, consume considerable time and resources, which could be a bottleneck in real situations where frequent model updates and multiple hyper-parameter tunings are required. As a cost-effective alternative, one-shot PTQ schemes have been proposed. Still, the performance is somewhat limited because they cannot consider the inter-layer dependency within the attention module, which is a very important feature of Transformers. In this paper, we thus propose a novel PTQ algorithm that balances accuracy and efficiency. The key idea of the proposed algorithm called aespa is to perform quantization layer-wise for efficiency while considering cross-layer dependency to preserve the attention score. Through extensive experiments on various language models and complexity analysis, we demonstrate that aespa is accurate and efficient in quantizing Transformer models.

Related papers

QuantVSR: Low-Bit Post-Training Quantization for Real-World Video Super-Resolution [53.13952833016505]
We propose a low-bit quantization model for real-world video super-resolution (VSR)<n>We use a calibration dataset to measure both spatial and temporal complexity for each layer.<n>We refine the FP and low-bit branches to achieve simultaneous optimization.
arXiv Detail & Related papers (2025-08-06T14:35:59Z)
SegQuant: A Semantics-Aware and Generalizable Quantization Framework for Diffusion Models [12.716956318428652]
SegQuant is a unified quantization framework that adaptively combines complementary techniques to enhance cross-model versatility.<n>SegQuant is broadly applicable beyond Transformer-based diffusion models, achieving strong performance while ensuring seamless compatibility with mainstream deployment tools.
arXiv Detail & Related papers (2025-07-20T04:00:53Z)
RoSTE: An Efficient Quantization-Aware Supervised Fine-Tuning Approach for Large Language Models [53.571195477043496]
We propose an algorithm named Rotated Straight-Through-Estimator (RoSTE) RoSTE combines quantization-aware supervised fine-tuning (QA-SFT) with an adaptive rotation strategy to reduce activation outliers. Our findings reveal that the prediction error is directly proportional to the quantization error of the converged weights, which can be effectively managed through an optimized rotation configuration.
arXiv Detail & Related papers (2025-02-13T06:44:33Z)
TQ-DiT: Efficient Time-Aware Quantization for Diffusion Transformers [3.389132862174821]
We introduce model quantization, which represents the weights and activation values with lower precision. Time-grouping quantization (TGQ) is proposed to reduce quantization error caused by temporal variation in activations. The proposed algorithm achieves performance comparable to the original full-precision model with only a 0.29 increase in FID at W8A8.
arXiv Detail & Related papers (2025-02-06T13:14:52Z)
AdaLog: Post-Training Quantization for Vision Transformers with Adaptive Logarithm Quantizer [54.713778961605115]
Vision Transformer (ViT) has become one of the most prevailing fundamental backbone networks in the computer vision community. We propose a novel non-uniform quantizer, dubbed the Adaptive Logarithm AdaLog (AdaLog) quantizer.
arXiv Detail & Related papers (2024-07-17T18:38:48Z)
Attention-aware Post-training Quantization without Backpropagation [11.096116957844014]
Quantization is a promising solution for deploying large-scale language models on resource-constrained devices. Existing quantization approaches rely on gradient-based optimization. We propose a novel PTQ algorithm that considers inter-layer dependencies without relying on backpropagation.
arXiv Detail & Related papers (2024-06-19T11:53:21Z)
Entropy-Regularized Token-Level Policy Optimization for Language Agent Reinforcement [67.1393112206885]
Large Language Models (LLMs) have shown promise as intelligent agents in interactive decision-making tasks. We introduce Entropy-Regularized Token-level Policy Optimization (ETPO), an entropy-augmented RL method tailored for optimizing LLMs at the token level. We assess the effectiveness of ETPO within a simulated environment that models data science code generation as a series of multi-step interactive tasks.
arXiv Detail & Related papers (2024-02-09T07:45:26Z)
Efficient Quantization Strategies for Latent Diffusion Models [20.942161659019554]
Latent Diffusion Models (LDMs) capture the dynamic evolution of latent variables over time. Post Training Quantization (PTQ) is a method to compress the operational size of deep learning models. This study proposes a quantization strategy that efficiently quantize LDMs.
arXiv Detail & Related papers (2023-12-09T01:47:16Z)
Towards Long-Term Time-Series Forecasting: Feature, Pattern, and Distribution [57.71199089609161]
Long-term time-series forecasting (LTTF) has become a pressing demand in many applications, such as wind power supply planning. Transformer models have been adopted to deliver high prediction capacity because of the high computational self-attention mechanism. We propose an efficient Transformerbased model, named Conformer, which differentiates itself from existing methods for LTTF in three aspects.
arXiv Detail & Related papers (2023-01-05T13:59:29Z)
RepQ-ViT: Scale Reparameterization for Post-Training Quantization of Vision Transformers [2.114921680609289]
We propose RepQ-ViT, a novel PTQ framework for vision transformers (ViTs) RepQ-ViT decouples the quantization and inference processes. It can outperform existing strong baselines and encouragingly improve the accuracy of 4-bit PTQ of ViTs to a usable level.
arXiv Detail & Related papers (2022-12-16T02:52:37Z)
Performance Optimization for Variable Bitwidth Federated Learning in Wireless Networks [103.22651843174471]
This paper considers improving wireless communication and computation efficiency in federated learning (FL) via model quantization. In the proposed bitwidth FL scheme, edge devices train and transmit quantized versions of their local FL model parameters to a coordinating server, which aggregates them into a quantized global model and synchronizes the devices. We show that the FL training process can be described as a Markov decision process and propose a model-based reinforcement learning (RL) method to optimize action selection over iterations.
arXiv Detail & Related papers (2022-09-21T08:52:51Z)
Parameter-Parallel Distributed Variational Quantum Algorithm [7.255056332088222]
Variational quantum algorithms (VQAs) have emerged as a promising near-term technique to explore practical quantum advantage on noisy devices. Here, we propose a parameter-parallel distributed variational quantum algorithm (PPD-VQA) to accelerate the training process by parameter-parallel training with multiple quantum processors. The achieved results suggest that the PPD-VQA could provide a practical solution for coordinating multiple quantum processors to handle large-scale real-word applications.
arXiv Detail & Related papers (2022-07-31T15:09:12Z)
PnP-DETR: Towards Efficient Visual Analysis with Transformers [146.55679348493587]
Recently, DETR pioneered the solution vision tasks with transformers, it directly translates the image feature map into the object result. Recent transformer-based image recognition model andTT show consistent efficiency gain.
arXiv Detail & Related papers (2021-09-15T01:10:30Z)
Adaptive Subcarrier, Parameter, and Power Allocation for Partitioned Edge Learning Over Broadband Channels [69.18343801164741]
partitioned edge learning (PARTEL) implements parameter-server training, a well known distributed learning method, in wireless network. We consider the case of deep neural network (DNN) models which can be trained using PARTEL by introducing some auxiliary variables.
arXiv Detail & Related papers (2020-10-08T15:27:50Z)

This list is automatically generated from the titles and abstracts of the papers in this site.