PT$^2$-LLM: Post-Training Ternarization for Large Language Models
- URL: http://arxiv.org/abs/2510.03267v1
- Date: Sat, 27 Sep 2025 03:01:48 GMT
- Title: PT$^2$-LLM: Post-Training Ternarization for Large Language Models
- Authors: Xianglong Yan, Chengzhu Bao, Zhiteng Li, Tianao Zhang, Kaicheng Yang, Haotong Qin, Ruobing Xie, Xingwu Sun, Yulun Zhang,
- Abstract summary: Large Language Models (LLMs) have shown impressive capabilities across diverse tasks, but their large memory and compute demands hinder deployment.<n>We propose PT$2$-LLM, a post-training ternarization framework tailored for LLMs.<n>At its core is an Asymmetric Ternary Quantizer equipped with a two-stage refinement pipeline.
- Score: 52.4629647715623
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) have shown impressive capabilities across diverse tasks, but their large memory and compute demands hinder deployment. Ternarization has gained attention as a promising compression technique, delivering substantial size reduction and high computational efficiency. However, its potential in the post-training quantization (PTQ) setting remains underexplored, due to the challenge of training-free parameter optimization and the quantization difficulty posed by outliers and dispersed weights. To address these issues, we propose PT$^2$-LLM, a post-training ternarization framework tailored for LLMs. At its core is an Asymmetric Ternary Quantizer equipped with a two-stage refinement pipeline: (1) Iterative Ternary Fitting (ITF), which alternates between optimal ternary grid construction and flexible rounding to minimize quantization error, and (2) Activation-aware Grid Alignment (AGA), which further refines the ternary grid to better match full-precision outputs. In addition, we propose a plug-and-play Structural Similarity-based Reordering (SSR) strategy that leverages inter-column structural similarity to ease quantization and mitigate outlier effects, further enhancing overall performance. Extensive experiments demonstrate that PT$^2$-LLM delivers competitive performance against state-of-the-art (SOTA) 2-bit PTQ methods with lower memory cost, while also accelerating both prefill and decoding to achieve end-to-end speedup. The code and models will be available at https://github.com/XIANGLONGYAN/PT2-LLM.
Related papers
- RaBiT: Residual-Aware Binarization Training for Accurate and Efficient LLMs [5.782015253162346]
Residual binarization enables matmul-free inference by stacking binary layers.<n>We propose RaBiT, a novel quantization framework that resolves coadaptation by algorithmically enforcing a residual hierarchy.<n>RaBiT achieves state-of-the-art performance, rivals even hardware-intensive Vector Quantization (VQ) methods, and delivers a $4.49times$ inference speed-up over full-precision models.
arXiv Detail & Related papers (2026-02-05T06:41:11Z) - DiRL: An Efficient Post-Training Framework for Diffusion Language Models [54.405206032785706]
Diffusion Language Models (dLLMs) have emerged as promising alternatives to Auto-Regressive (AR) models.<n>Existing methods suffer from computational inefficiency and objective mismatches between training and inference.<n>We introduce DiRL, an efficient post-training framework that tightly integrates FlexAttention-accelerated blockwise training with LMDeploy-optimized inference.
arXiv Detail & Related papers (2025-12-23T08:33:19Z) - TNT: Improving Chunkwise Training for Test-Time Memorization [62.78875147721906]
Recurrent neural networks (RNNs) with deep test-time memorization modules, such as Titans and TTT, represent a promising, linearly-scaling paradigm distinct from Transformers.<n>We introduce TNT, a novel training paradigm that decouples training efficiency from inference performance through a two-stage process.<n>TNT achieves a substantial acceleration in training speed-up to 17 times faster than the most accurate baseline configuration.
arXiv Detail & Related papers (2025-11-10T17:45:09Z) - LRQ-DiT: Log-Rotation Post-Training Quantization of Diffusion Transformers for Image and Video Generation [41.66473889057111]
Diffusion Transformers (DiTs) have achieved impressive performance in text-to-image and text-to-video generation.<n>DiTs' high computational cost and large parameter sizes pose significant challenges for usage in resource-constrained scenarios.<n>We propose LRQ-DiT, an efficient and accurate post-training quantization framework for image and video generation.
arXiv Detail & Related papers (2025-08-05T14:16:11Z) - MPQ-DMv2: Flexible Residual Mixed Precision Quantization for Low-Bit Diffusion Models with Temporal Distillation [74.34220141721231]
We present MPQ-DMv2, an improved textbfMixed textbfPrecision textbfQuantization framework for extremely low-bit textbfDiffusion textbfModels.
arXiv Detail & Related papers (2025-07-06T08:16:50Z) - Two-dimensional Parallel Tempering for Constrained Optimization [0.3068068202044424]
We introduce a two-dimensional extension of the powerful parallel tempering algorithm (PT)<n>The resulting two-dimensional parallel tempering algorithm (2D-PT) improves mixing in heavily constrained replicas.<n>The method applies broadly to constrained Ising problems and can be deployed on existing Ising machines.
arXiv Detail & Related papers (2025-05-24T20:41:45Z) - RoSTE: An Efficient Quantization-Aware Supervised Fine-Tuning Approach for Large Language Models [53.571195477043496]
We propose an algorithm named Rotated Straight-Through-Estimator (RoSTE)<n>RoSTE combines quantization-aware supervised fine-tuning (QA-SFT) with an adaptive rotation strategy to reduce activation outliers.<n>Our findings reveal that the prediction error is directly proportional to the quantization error of the converged weights, which can be effectively managed through an optimized rotation configuration.
arXiv Detail & Related papers (2025-02-13T06:44:33Z) - Progressive Binarization with Semi-Structured Pruning for LLMs [36.91249209658632]
We propose Progressive Binarization with Semi-Structured Pruning (PBS$2$P), a novel post-training framework that seamlessly integrates binarization and semi-structured pruning.<n>We show that PBS$2$P consistently outperforms state-of-the-art (SOTA) binary post-training quantization methods in both perplexity and downstream accuracy.
arXiv Detail & Related papers (2025-02-03T13:30:29Z) - OutlierTune: Efficient Channel-Wise Quantization for Large Language Models [24.645237670811476]
OutlierTune is an efficient per-channel post-training quantization method for the activations of large language models.
The proposed framework is easy to implement and hardware-efficient, introducing almost no additional computational overheads during the inference.
arXiv Detail & Related papers (2024-06-27T02:02:26Z) - CBQ: Cross-Block Quantization for Large Language Models [66.82132832702895]
Post-training quantization (PTQ) has played a key role in compressing large language models (LLMs) with ultra-low costs.<n>We propose CBQ, a cross-block reconstruction-based PTQ method for LLMs.<n> CBQ employs a cross-block dependency using a reconstruction scheme, establishing long-range dependencies across multiple blocks to minimize error accumulation.
arXiv Detail & Related papers (2023-12-13T07:56:27Z) - SqueezeLLM: Dense-and-Sparse Quantization [80.32162537942138]
Main bottleneck for generative inference with LLMs is memory bandwidth, rather than compute, for single batch inference.
We introduce SqueezeLLM, a post-training quantization framework that enables lossless compression to ultra-low precisions of up to 3-bit.
Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format.
arXiv Detail & Related papers (2023-06-13T08:57:54Z) - Easy and Efficient Transformer : Scalable Inference Solution For large
NLP mode [14.321889138798072]
This paper introduces a series of ultra-large-scale pre-training model optimization methods.
An inference engine -- Easy and Efficient Transformer (EET) is proposed.
EET achieves a 1.5-15x state-of-art speedup varying with context length.
arXiv Detail & Related papers (2021-04-26T11:00:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.