Fugu-MT Paper Translation (Abstract): ELUTQ: Efficient LUT-Aware Quantization for Deploying Large Language Models on Edge Devices

Paper overview: ELUTQ: Efficient LUT-Aware Quantization for Deploying Large Language Models on Edge Devices

arxiv url: http://arxiv.org/abs/2510.19482v1
Date: Wed, 22 Oct 2025 11:20:47 GMT
Status: Translation complete
Last updated in system: 2025-10-25 03:08:15.734164
Title: ELUTQ: Efficient LUT-Aware Quantization for Deploying Large Language Models on Edge Devices
Title (reference translation): ELUTQ: Efficient LUT-Aware Quantization for Deploying Large Language Models on Edge Devices
Authors: Xin Nie, Liang Dong, HaiCheng Zhang, JiaWang Xiao, G. Sun,
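Abstract summary: Deploying large language models (LLMs) on CPU-based edge devices is crucial for enabling on-device intelligence and expanding AI accessibility. We propose ELUTQ, an efficient quantization framework that introduces a novel quantization format, Hierarchical Linear Quantization (HLQ). HLQ better captures the statistical characteristics of weights without increasing computational cost. For LLaMA3-8B, HLQ reduces perplexity by about 8% at 3-bit and about 85% at 2-bit precision.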
Reference score (independently computed attention score): 3.465218658690795
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The deployment of Large Language Models (LLMs) on CPU-based edge devices is crucial for enabling on-device intelligence and expanding AI accessibility. However, it remains challenging due to limited memory and computational resources. During edge inference, memory usage and latency are the primary bottlenecks. Although weight quantization can effectively reduce memory consumption, existing hardware-friendly approaches often rely on uniform quantization, which poorly fits weight distributions and incurs high dequantization overhead at low bit widths. To address these limitations, we propose ELUTQ, an efficient quantization framework introducing a novel quantization format, Hierarchical Linear Quantization (HLQ). HLQ better captures the statistical characteristics of weights without increasing the computational cost of Bit-serial LUT-based GEMM operations, thereby eliminating dequantization overhead. It is orthogonal to existing quantization algorithms and can be seamlessly integrated into various quantization pipelines. For efficient on-device deployment, ELUTQ provides optimized CPU kernels for end-to-end inference. Experiments show that for LLaMA3-8B, HLQ reduces perplexity by about 8% at 3-bit and 85% at 2-bit precision under post-training quantization, completing quantization within one hour. With efficient finetuning, HLQ further improves 2-bit performance within two hours. In terms of inference efficiency, our 2-bit LLaMA2-7B achieves over 25 tokens/s on an Apple M2 chip (4 threads, batch size = 1).
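Abstract (reference translation): Deploying Large Language Models (LLMs) on CPU-based edge devices is crucial for enabling on-device intelligence and expanding AI accessibility. However, it remains difficult because memory and computational resources are limited. In edge inference, memory usage and latency are the main bottlenecks. Weight quantization can effectively reduce memory consumption, but existing hardware-friendly approaches often rely on uniform quantization, which fits weight distributions poorly and incurs high dequantization overhead at low bit widths. To address these limitations, we propose ELUTQ, an efficient quantization framework that introduces a novel quantization format, Hierarchical Linear Quantization (HLQ). HLQ better captures the statistical characteristics of weights without increasing the computational cost of bit-serial LUT-based GEMM operations, thereby eliminating dequantization overhead. It is orthogonal to existing quantization algorithms and can be integrated seamlessly into various quantization pipelines. For efficient on-device deployment, ELUTQ provides optimized CPU kernels for end-to-end inference. For LLaMA3-8B, HLQ reduces perplexity by about 8% at 3-bit and about 85% at 2-bit precision under post-training quantization, and quantization completes within one hour. With efficient finetuning, HLQ further improves 2-bit performance within two hours. In terms of inference efficiency, the 2-bit LLaMA2-7B achieves over 25 tokens/s on an Apple M2 chip (4 threads, batch size = 1).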
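
The abstract refers to bit-serial LUT-based GEMM without describing it. As a rough illustration of that general idea only, the Python sketch below computes a dot product against uniformly quantized weights by splitting the integer codes into bit planes and looking up precomputed activation subset sums, so the weights are never dequantized one by one. This is not the ELUTQ kernel and not the HLQ format, neither of which is specified on this page. The function names (`uniform_quantize`, `build_lut`, `lut_dot`), the group size of 4, and the use of plain uniform quantization are all assumptions made for illustration.

```python
def uniform_quantize(weights, bits):
    """Asymmetric uniform quantization (illustrative stand-in, not HLQ).

    Returns integer codes q in [0, 2**bits - 1] plus (scale, zero)
    such that w is approximated by scale * q + zero.
    """
    lo, hi = min(weights), max(weights)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [min(levels, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo


def build_lut(acts):
    """Table of all 2**g subset sums for a group of g activations.

    lut[idx] is the sum of acts[j] over every j whose bit is set in idx.
    """
    lut = [0.0] * (1 << len(acts))
    for idx in range(1, len(lut)):
        low = idx & -idx                      # lowest set bit of idx
        lut[idx] = lut[idx ^ low] + acts[low.bit_length() - 1]
    return lut


def lut_dot(codes, scale, zero, acts, bits, group=4):
    """Compute sum_i (scale * codes[i] + zero) * acts[i] via table lookups.

    Uses sum_i q_i * a_i = sum_b 2**b * sum_i bit_b(q_i) * a_i, where the
    inner sum over one group is a single lookup into that group's table.
    """
    total = 0.0
    for start in range(0, len(acts), group):
        group_acts = acts[start:start + group]
        group_codes = codes[start:start + group]
        lut = build_lut(group_acts)
        for b in range(bits):                 # one pass per bit plane
            idx = 0
            for j, q in enumerate(group_codes):
                idx |= ((q >> b) & 1) << j    # pack bit b of each code
            total += (1 << b) * lut[idx]
    # Scale and zero point are applied once per dot product, not per weight.
    return scale * total + zero * sum(acts)
```

In a real kernel the activation table would be built once per activation group and reused across every weight row, which is where the savings over per-weight dequantization would come from; the sketch rebuilds it on each call only to stay short.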
