Related papers: From Bits to Chips: An LLM-based Hardware-Aware Quantization Agent for Streamlined Deployment of LLMs

From Bits to Chips: An LLM-based Hardware-Aware Quantization Agent for Streamlined Deployment of LLMs

URL: http://arxiv.org/abs/2601.03484v1
Date: Wed, 07 Jan 2026 00:39:09 GMT
Title: From Bits to Chips: An LLM-based Hardware-Aware Quantization Agent for Streamlined Deployment of LLMs
Authors: Kaiyuan Deng, Hangyu Zheng, Minghai Qing, Kunxiong Zhu, Gen Li, Yang Xiao, Lan Emily Zhang, Linke Guo, Bo Hui, Yanzhi Wang, Geng Yuan, Gagan Agrawal, Wei Niu, Xiaolong Ma,
Abstract summary: We introduce the Hardware-Aware Quantization Agent (HAQA) to streamline the quantization and deployment process.<n>Results demonstrate up to a 2.3x speedup in inference, along with increased throughput and improved accuracy.
Score: 43.33830246397275
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Deploying models, especially large language models (LLMs), is becoming increasingly attractive to a broader user base, including those without specialized expertise. However, due to the resource constraints of certain hardware, maintaining high accuracy with larger model while meeting the hardware requirements remains a significant challenge. Model quantization technique helps mitigate memory and compute bottlenecks, yet the added complexities of tuning and deploying quantized models further exacerbates these challenges, making the process unfriendly to most of the users. We introduce the Hardware-Aware Quantization Agent (HAQA), an automated framework that leverages LLMs to streamline the entire quantization and deployment process by enabling efficient hyperparameter tuning and hardware configuration, thereby simultaneously improving deployment quality and ease of use for a broad range of users. Our results demonstrate up to a 2.3x speedup in inference, along with increased throughput and improved accuracy compared to unoptimized models on Llama. Additionally, HAQA is designed to implement adaptive quantization strategies across diverse hardware platforms, as it automatically finds optimal settings even when they appear counterintuitive, thereby reducing extensive manual effort and demonstrating superior adaptability. Code will be released.

Related papers

LQA: A Lightweight Quantized-Adaptive Framework for Vision-Language Models on the Edge [12.772499009055194]
We propose a lightweight, quantized-adaptive framework for Vision-Language Models (VLMs)<n>We introduce Selective Hybrid Quantization (SHQ) and a quantized, gradient-free adaptation mechanism to enable robust and efficient VLM deployment on resource-constrained hardware.<n> Experiments show that LQA improves overall adaptation performance by 4.5%, uses less memory, and significantly outperforms gradient-based TTA methods.
arXiv Detail & Related papers (2026-02-08T07:37:37Z)
Advancing Model Refinement: Muon-Optimized Distillation and Quantization for LLM Deployment [1.0858059444801136]
Large Language Models (LLMs) enable advanced natural language processing but face deployment challenges on resource-constrained edge devices.<n>We propose an integrated framework combining GPTQ-based quantization, low-rank adaptation (LoRA), and a specialized data distillation process.
arXiv Detail & Related papers (2026-01-14T20:50:30Z)
Scaling LLM Speculative Decoding: Non-Autoregressive Forecasting in Large-Batch Scenarios [76.85739138203014]
We present SpecFormer, a novel architecture that accelerates unidirectional and attention mechanisms.<n>We demonstrate that SpecFormer achieves lower training demands and reduced computational costs.
arXiv Detail & Related papers (2025-11-25T14:20:08Z)
IMPROVE: Iterative Model Pipeline Refinement and Optimization Leveraging LLM Experts [28.9807389592324]
Large language model (LLM) agents have emerged as a promising solution to automate the workflow of machine learning.<n>We introduce Iterative Refinement, a novel strategy for LLM-driven ML pipeline design inspired by how human ML experts iteratively refine models.<n>By systematically updating individual components based on real training feedback, Iterative Refinement improves overall model performance.
arXiv Detail & Related papers (2025-02-25T01:52:37Z)
LSAQ: Layer-Specific Adaptive Quantization for Large Language Model Deployment [12.80921403367322]
Large Language Models (LLMs) demonstrate exceptional performance across various domains.<n> Quantization techniques, which reduce the size and memory requirements of LLMs, are effective for deploying LLMs on resource-limited edge devices.<n>We propose Layer-Specific Adaptive Quantization (LSAQ), a system for adaptive quantization and dynamic deployment of LLMs based on layer importance.
arXiv Detail & Related papers (2024-12-24T03:43:15Z)
QSpec: Speculative Decoding with Complementary Quantization Schemes [53.960146187821685]
Quantization is widely adopted to accelerate inference and reduce memory consumption in large language models (LLMs)<n>We propose QSpec, a novel quantization paradigm that decouples efficiency from quality.<n>QSpec reuses both weights and KV cache across stages, enabling near-zero-cost switching without retraining or auxiliary models.
arXiv Detail & Related papers (2024-10-15T05:57:51Z)
AdaLog: Post-Training Quantization for Vision Transformers with Adaptive Logarithm Quantizer [54.713778961605115]
Vision Transformer (ViT) has become one of the most prevailing fundamental backbone networks in the computer vision community. We propose a novel non-uniform quantizer, dubbed the Adaptive Logarithm AdaLog (AdaLog) quantizer.
arXiv Detail & Related papers (2024-07-17T18:38:48Z)
LeanQuant: Accurate and Scalable Large Language Model Quantization with Loss-error-aware Grid [36.33062038680275]
Large language models (LLMs) have shown immense potential across various domains. Post-training quantization has emerged as a promising technique to reduce memory requirements and decoding latency. We propose LeanQuant, a novel quantization method that is accurate, versatile, and scalable.
arXiv Detail & Related papers (2024-07-14T00:23:51Z)
LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit [55.73370804397226]
Quantization, a key compression technique, can effectively mitigate these demands by compressing and accelerating large language models. We present LLMC, a plug-and-play compression toolkit, to fairly and systematically explore the impact of quantization. Powered by this versatile toolkit, our benchmark covers three key aspects: calibration data, algorithms (three strategies), and data formats.
arXiv Detail & Related papers (2024-05-09T11:49:05Z)
QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models [85.02796681773447]
We propose a quantization-aware low-rank adaptation (QA-LoRA) algorithm. The motivation lies in the imbalanced degrees of freedom of quantization and adaptation. QA-LoRA is easily implemented with a few lines of code.
arXiv Detail & Related papers (2023-09-26T07:22:23Z)
On-Chip Hardware-Aware Quantization for Mixed Precision Neural Networks [52.97107229149988]
We propose an On-Chip Hardware-Aware Quantization framework, performing hardware-aware mixed-precision quantization on deployed edge devices. For efficiency metrics, we built an On-Chip Quantization Aware pipeline, which allows the quantization process to perceive the actual hardware efficiency of the quantization operator. For accuracy metrics, we propose Mask-Guided Quantization Estimation technology to effectively estimate the accuracy impact of operators in the on-chip scenario.
arXiv Detail & Related papers (2023-09-05T04:39:34Z)
Low-Precision Hardware Architectures Meet Recommendation Model Inference at Scale [11.121380180647769]
We share in this paper our search strategies to adapt reference recommendation models to low-precision hardware. We also discuss the design and development of tool chain so as to maintain our models' accuracy throughout their lifespan. We believe these lessons from the trenches promote better co-design between hardware architecture and software engineering.
arXiv Detail & Related papers (2021-05-26T16:42:33Z)

This list is automatically generated from the titles and abstracts of the papers in this site.