Post-Training Quantization of OpenPangu Models for Efficient Deployment on Atlas A2
- URL: http://arxiv.org/abs/2512.23367v1
- Date: Mon, 29 Dec 2025 10:50:23 GMT
- Title: Post-Training Quantization of OpenPangu Models for Efficient Deployment on Atlas A2
- Authors: Yilun Luo, HuaQing Zheng, Haoqian Meng, Wenyuan Liu, Peng Zhang,
- Abstract summary: Huawei's openPangu-Embedded-1B and openPangu-Embedded-7B integrate three distinct Chain-of-Thought (CoT) reasoning paradigms. These CoT modes introduce substantial memory and latency overheads, posing challenges for practical deployment on Ascend NPUs. This paper addresses these computational constraints by leveraging low-bit quantization, which transforms FP16 computations into more efficient integer arithmetic.
- Score: 3.309291427648113
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Huawei's openPangu-Embedded-1B and openPangu-Embedded-7B, variants of the openPangu large language model, integrate three distinct Chain-of-Thought (CoT) reasoning paradigms, namely slow_think, auto_think, and no_think. While these CoT modes enhance reasoning capabilities, their generation of extended reasoning traces introduces substantial memory and latency overheads, posing challenges for practical deployment on Ascend NPUs. This paper addresses these computational constraints by leveraging low-bit quantization, which transforms FP16 computations into more efficient integer arithmetic. We introduce a unified low-bit inference framework, supporting INT8 (W8A8) and W4A8 quantization, specifically optimized for openPangu-Embedded models on the Atlas A2. Our comprehensive evaluation, conducted across all three CoT modes on code generation benchmarks (HumanEval and MBPP), demonstrates the efficacy of this approach. INT8 quantization consistently preserves over 90% of the FP16 baseline accuracy and achieves a 1.5x prefill speedup on the Atlas A2. Furthermore, W4A8 quantization significantly reduces memory consumption, albeit with a moderate trade-off in accuracy. These findings collectively indicate that low-bit quantization effectively facilitates efficient CoT reasoning on Ascend NPUs, maintaining high model fidelity.
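To make the W8A8 scheme concrete, here is a minimal NumPy sketch of symmetric max-abs INT8 quantization with integer accumulation and a final rescale. The function names and the per-token/per-channel scaling choices are illustrative assumptions, not details taken from the paper's Atlas A2 framework.

```python
import numpy as np

def quantize_int8(x: np.ndarray, axis: int = -1):
    """Symmetric max-abs quantization: x ~= scale * q with q in [-127, 127]."""
    scale = np.abs(x).max(axis=axis, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)  # guard all-zero channels
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def w8a8_matmul(a_fp: np.ndarray, w_fp: np.ndarray) -> np.ndarray:
    """W8A8 GEMM: quantize both operands, accumulate in int32, rescale."""
    qa, sa = quantize_int8(a_fp, axis=-1)  # per-token activation scales
    qw, sw = quantize_int8(w_fp, axis=0)   # per-output-channel weight scales
    acc = qa.astype(np.int32) @ qw.astype(np.int32)  # integer accumulation
    return acc * (sa * sw)                 # dequantize back to float

# Toy check: the quantized matmul tracks the FP32 reference closely.
a = np.random.randn(4, 64).astype(np.float32)
w = np.random.randn(64, 32).astype(np.float32)
print(np.abs(w8a8_matmul(a, w) - a @ w).max())
```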
Related papers
- CoT-X: An Adaptive Framework for Cross-Model Chain-of-Thought Transfer and Optimization [5.857877898558651]
Chain-of-Thought (CoT) reasoning enhances the problem-solving ability of large language models (LLMs) but leads to substantial inference overhead. This paper investigates efficient CoT transfer across models of different scales and architectures through an adaptive reasoning summarization framework.
arXiv Detail & Related papers (2025-11-07T22:35:31Z)
- INT v.s. FP: A Comprehensive Study of Fine-Grained Low-bit Quantization Formats [51.72056104795248]
Modern AI hardware, such as Nvidia's Blackwell architecture, is increasingly embracing low-precision floating-point (FP) formats. This paper systematically investigates the trade-offs between FP and integer (INT) formats. We reveal a critical performance crossover: while FP excels in coarse-grained quantization, the comparison at fine-grained (block-wise) levels is more nuanced.
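The block-wise trade-off can be illustrated with a small experiment: quantize the same tensor block-by-block onto an INT4 grid and onto an FP4 (E2M1) grid, then compare round-to-nearest error. This is a simplified sketch, not the paper's methodology; the E2M1 value set and max-abs scaling are assumptions.

```python
import numpy as np

# Representable magnitudes of FP4 E2M1, a common 4-bit floating-point grid.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def blockwise_error(x: np.ndarray, block: int = 32, fmt: str = "int4") -> float:
    """Mean abs error of block-wise 4-bit quantization with max-abs scaling."""
    x = x.reshape(-1, block)
    if fmt == "int4":
        scale = np.maximum(np.abs(x).max(axis=1, keepdims=True) / 7.0, 1e-12)
        q = np.clip(np.round(x / scale), -7, 7) * scale
    else:  # "fp4": snap each value to the nearest E2M1 grid point after scaling
        scale = np.maximum(np.abs(x).max(axis=1, keepdims=True) / 6.0, 1e-12)
        mags = np.abs(x / scale)
        idx = np.abs(mags[..., None] - FP4_GRID).argmin(axis=-1)
        q = np.sign(x) * FP4_GRID[idx] * scale
    return float(np.abs(q - x).mean())

x = np.random.randn(1024, 32).astype(np.float32)
print("int4:", blockwise_error(x, fmt="int4"), "fp4:", blockwise_error(x, fmt="fp4"))
```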
arXiv Detail & Related papers (2025-10-29T15:11:53Z)
- Binary Quantization For LLMs Through Dynamic Grouping [13.578307208515819]
Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of Natural Language Processing (NLP) tasks. Binary quantization, which compresses model weights from 16-bit Brain Float to 1-bit representations in {-1, 1}, offers significant reductions in storage and inference costs. We propose a novel optimization objective tailored for binary quantization, along with three algorithms designed to realize it effectively.
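As background for the grouping idea, here is a minimal sketch of the classical per-group binarization baseline: each group is approximated as alpha * sign(w), where alpha = mean(|w|) minimizes the L2 error for fixed signs. The paper's own objective and dynamic grouping go beyond this fixed-group scheme; the group size below is an arbitrary assumption.

```python
import numpy as np

def binarize_grouped(w: np.ndarray, group: int = 128) -> np.ndarray:
    """1-bit quantization with a per-group scale: each group of weights is
    approximated as alpha * sign(w), with alpha = mean(|w|), which is the
    L2-optimal scale once the signs are fixed."""
    flat = w.reshape(-1, group)
    alpha = np.abs(flat).mean(axis=1, keepdims=True)  # one scale per group
    return (alpha * np.sign(flat)).reshape(w.shape)

w = np.random.randn(256, 128).astype(np.float32)
w_bin = binarize_grouped(w)
print("relative error:", np.linalg.norm(w - w_bin) / np.linalg.norm(w))
```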
arXiv Detail & Related papers (2025-09-03T06:36:21Z)
- Optimization of embeddings storage for RAG systems using quantization and dimensionality reduction techniques [0.0]
We show that float8 quantization achieves a 4x storage reduction with minimal performance degradation. PCA emerges as the most effective dimensionality reduction technique. We propose a methodology based on visualizing the performance-storage trade-off space to identify the optimal configuration.
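A rough sketch of the pipeline this suggests: fit PCA on the embedding matrix, project to fewer dimensions, and store the result at reduced precision. NumPy lacks a float8 dtype, so float16 stands in here for the paper's float8 storage step (float8 would halve storage again); all names are illustrative.

```python
import numpy as np

def pca_compress(emb: np.ndarray, k: int = 256):
    """Project embeddings onto their top-k principal components and store
    the reduced matrix at half precision."""
    mean = emb.mean(axis=0)
    # SVD of the centered matrix yields the principal directions in vt.
    _, _, vt = np.linalg.svd(emb - mean, full_matrices=False)
    comps = vt[:k]
    reduced = (emb - mean) @ comps.T
    return reduced.astype(np.float16), comps, mean  # keep comps/mean for queries

emb = np.random.randn(1000, 768).astype(np.float32)
reduced, comps, mean = pca_compress(emb)
print("storage ratio:", reduced.nbytes / emb.nbytes)  # ~1/6 for these shapes
```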
arXiv Detail & Related papers (2025-04-30T18:20:16Z)
- "Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization [67.3213104337679]
Quantization is a powerful tool for accelerating large language model (LLM) inference, but the accuracy-performance trade-offs across different formats remain unclear. We conduct the most comprehensive empirical study to date, evaluating FP8, INT8, and INT4 quantization across academic benchmarks and real-world tasks.
arXiv Detail & Related papers (2024-11-04T18:21:59Z)
- QUIK: Towards End-to-End 4-Bit Inference on Generative Large Language Models [57.04178959678024]
We show that the majority of inference computations for large generative models can be performed with both weights and activations being cast to 4 bits.
We achieve this via a hybrid quantization strategy called QUIK, which compresses most of the weights and activations to 4-bit.
We provide GPU kernels matching the QUIK format with highly-efficient layer-wise runtimes, which lead to practical end-to-end throughput improvements of up to 3.4x.
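A toy version of the hybrid idea, under a simple assumed outlier criterion: keep the few columns with the largest magnitudes in FP16 and quantize the rest to INT4 with per-column max-abs scales. QUIK's actual outlier selection and its GPU kernels are considerably more involved.

```python
import numpy as np

def hybrid_w4(w: np.ndarray, n_outlier_cols: int = 8):
    """Hybrid quantization sketch: columns with the largest max-abs values
    stay in FP16; all remaining columns are quantized to 4-bit integers."""
    col_norm = np.abs(w).max(axis=0)
    outliers = np.argsort(col_norm)[-n_outlier_cols:]  # keep full precision
    dense = np.ones(w.shape[1], dtype=bool)
    dense[outliers] = False
    scale = np.abs(w[:, dense]).max(axis=0, keepdims=True) / 7.0
    q4 = np.clip(np.round(w[:, dense] / scale), -7, 7).astype(np.int8)
    return q4, scale, w[:, outliers].astype(np.float16), outliers

w = np.random.randn(512, 512).astype(np.float32)
q4, scale, fp_cols, idx = hybrid_w4(w)
print("4-bit columns:", q4.shape[1], "FP16 outlier columns:", fp_cols.shape[1])
```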
arXiv Detail & Related papers (2023-10-13T17:15:05Z)
- SqueezeLLM: Dense-and-Sparse Quantization [80.32162537942138]
The main bottleneck for generative inference with LLMs is memory bandwidth, rather than compute, particularly for single-batch inference.
We introduce SqueezeLLM, a post-training quantization framework that enables lossless compression to ultra-low precisions of up to 3-bit.
Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format.
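A compact sketch of the Dense-and-Sparse decomposition: extract the largest-magnitude entries into a sparse FP structure and quantize the dense remainder to a low bit-width. Uniform quantization is used here as a stand-in for SqueezeLLM's sensitivity-based non-uniform codebook; the outlier percentage is an assumed parameter.

```python
import numpy as np

def dense_and_sparse(w: np.ndarray, outlier_pct: float = 0.5, bits: int = 3):
    """Split W into a low-bit dense part plus a sparse FP outlier part."""
    thresh = np.percentile(np.abs(w), 100 - outlier_pct)
    mask = np.abs(w) > thresh                 # top ~0.5% become outliers
    rows, cols = np.nonzero(mask)
    outliers = w[mask].copy()                 # stored sparsely at full precision
    dense = np.where(mask, 0.0, w)
    qmax = 2 ** (bits - 1) - 1                # e.g. 3 for 3-bit
    scale = np.maximum(np.abs(dense).max(axis=1, keepdims=True) / qmax, 1e-12)
    q = np.clip(np.round(dense / scale), -qmax, qmax).astype(np.int8)
    return q, scale, (rows, cols, outliers)

def reconstruct(q, scale, sparse, shape):
    w_hat = q.astype(np.float32) * scale
    rows, cols, vals = sparse
    w_hat[rows, cols] = vals                  # re-inject exact outlier values
    return w_hat.reshape(shape)

w = np.random.randn(256, 256).astype(np.float32)
q, s, sp = dense_and_sparse(w)
print("max abs error:", np.abs(reconstruct(q, s, sp, w.shape) - w).max())
```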
arXiv Detail & Related papers (2023-06-13T08:57:54Z)
- DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures using Lookup Tables [49.965024476651706]
DeepGEMM is a lookup table based approach for the execution of ultra low-precision convolutional neural networks on SIMD hardware.
Our implementation outperforms corresponding 8-bit integer kernels by up to 1.74x on x86 platforms.
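The lookup-table trick is easy to demonstrate: with 2-bit codes there are only 4x4 possible level products, so a dot product reduces to table indexing plus accumulation. The level sets below are illustrative assumptions, not DeepGEMM's actual codebooks, and a SIMD kernel would vectorize the lookups.

```python
import numpy as np

# Assumed 2-bit codebooks: 4 weight levels and 4 activation levels.
W_LEVELS = np.array([-1.5, -0.5, 0.5, 1.5], dtype=np.float32)
A_LEVELS = np.array([0.0, 1.0, 2.0, 3.0], dtype=np.float32)

# 4x4 table of every possible level product, computed once up front.
LUT = np.outer(W_LEVELS, A_LEVELS)

def lut_dot(w_codes: np.ndarray, a_codes: np.ndarray) -> float:
    """Dot product where every multiply is replaced by a table lookup."""
    return float(LUT[w_codes, a_codes].sum())

w_codes = np.random.randint(0, 4, size=256)
a_codes = np.random.randint(0, 4, size=256)
ref = float((W_LEVELS[w_codes] * A_LEVELS[a_codes]).sum())
assert np.isclose(lut_dot(w_codes, a_codes), ref)
```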
arXiv Detail & Related papers (2023-04-18T15:13:10Z)
- ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers [29.566132632781848]
We present an efficient and affordable post-training quantization approach, termed ZeroQuant, to compress large Transformer-based models.
ZeroQuant is an end-to-end quantization and inference pipeline with three main components.
arXiv Detail & Related papers (2022-06-04T00:28:21Z)
- 8-bit Optimizers via Block-wise Quantization [57.25800395197516]
Stateful optimizers maintain gradient statistics over time, e.g., the exponentially smoothed sum (SGD with momentum) or squared sum (Adam) of past gradient values.
This state can be used to accelerate optimization compared to plain gradient descent but uses memory that might otherwise be allocated to model parameters.
In this paper, we develop the first optimizers that use 8-bit statistics while maintaining the performance levels of using 32-bit optimizer states.
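A minimal sketch of block-wise quantization of an optimizer state tensor: each block carries its own max-abs scale, so a single outlier degrades the precision of only its own block rather than the whole tensor. Linear quantization is used here for brevity; the paper's method uses a dynamic (non-linear) 8-bit format.

```python
import numpy as np

def quantize_state(state: np.ndarray, block: int = 2048):
    """Block-wise 8-bit quantization of an optimizer state tensor,
    with an independent max-abs scale per block."""
    flat = state.reshape(-1, block)
    scale = np.maximum(np.abs(flat).max(axis=1, keepdims=True) / 127.0, 1e-12)
    q = np.round(flat / scale).astype(np.int8)
    return q, scale

def dequantize_state(q: np.ndarray, scale: np.ndarray, shape) -> np.ndarray:
    return (q.astype(np.float32) * scale).reshape(shape)

m = np.random.randn(4096, 1024).astype(np.float32)  # e.g. an Adam moment
q, s = quantize_state(m)
m_hat = dequantize_state(q, s, m.shape)
print("max abs error:", np.abs(m - m_hat).max())
```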
arXiv Detail & Related papers (2021-10-06T15:43:20Z)