Fugu-MT Paper Translation (Abstract): QuBLAST: A Framework for Quantizing Large Language Models with Block-Level Compression Approach and Activation Scaling Strategy

Paper overview: QuBLAST: A Framework for Quantizing Large Language Models with Block-Level Compression Approach and Activation Scaling Strategy

arxiv url: http://arxiv.org/abs/2606.04620v1
Date: Wed, 03 Jun 2026 08:55:59 GMT
Status: Translation complete
Last updated in system: 2026-06-04 20:44:18.641893
Title: QuBLAST: A Framework for Quantizing Large Language Models with Block-Level Compression Approach and Activation Scaling Strategy
Title (reference translation): QuBLAST: A Quantization Framework for Large Language Models Using a Block-Level Compression Approach and an Activation Scaling Strategy
Authors: Pasindu Wickramasinghe, Achyuta Muthuvelan, Rachmad Vidya Wicaksana Putra, Minghao Shao, Muhammad Shafique
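Abstract summary: QuBLAST is a post-training quantization method for LLMs that combines a block-level compression approach with an activation scaling strategy. It reduces model size by 40% to 45.2% across different model architectures, while keeping the perplexity increase on the WikiText-2 and WikiText-103 datasets within 5%.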
Reference score (independently calculated interest level): 4.215434651178227
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: LLMs have become the state-of-the-art algorithms for solving NLP tasks. However, they typically come at huge computational and memory costs, making them difficult to deploy on embedded systems. To address this, state-of-the-art methods typically employ uniform post-training quantization (PTQ) across the attention blocks of the network, hence overlooking the potential of applying different quantization levels within the same network. They also employ complex operations to mitigate the negative impact of activation outliers, hence incurring high computational overheads. Moreover, they have not been evaluated on emerging LLMs with non-conventional attention architectures (e.g., state-space models), which pose different challenges in applying quantization. To address these limitations, we propose QuBLAST, a novel PTQ methodology that employs a block-level compression approach with an activation scaling strategy for LLMs. The block-level compression approach enables mixed-precision quantization across blocks of the network, while the activation scaling strategy efficiently mitigates the negative impact of activation outliers. Specifically, QuBLAST first analyzes the sensitivity of different attention blocks in the pre-trained model through cross-entropy loss analysis. QuBLAST leverages this sensitivity analysis to determine the weight quantization level for each attention block in the model. Furthermore, QuBLAST employs an activation scaling map for each block to control the range of activation values and mitigate the negative impact of activation outliers, thereby enabling better quantization results. Experimental results show that QuBLAST reduces model sizes by 40%-45.2% across different model architectures (i.e., Qwen3-8B, Llama3-8B, Mistral v0.1-8B, and Falcon H1R-7B), while maintaining the performance within a 5% perplexity increase for the WikiText-2 and WikiText-103 datasets.
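Abstract (reference translation): LLMs have become the state-of-the-art algorithms for solving NLP tasks. However, their computational and memory costs are generally large, which makes them difficult to deploy on embedded systems. In response, state-of-the-art methods usually apply uniform post-training quantization (PTQ) across the attention blocks of the network, and therefore overlook the possibility of applying different quantization levels within the same network. They also use complex operations to reduce the negative impact of activation outliers, which incurs high computational overhead. In addition, they have not been evaluated on emerging LLMs with non-conventional attention architectures (e.g., state-space models), which raise different challenges for quantization. To address these limitations, the authors propose QuBLAST, a new PTQ method for LLMs that uses a block-level compression approach together with an activation scaling strategy. The block-level compression approach enables mixed-precision quantization across the blocks of the network, and the activation scaling strategy efficiently reduces the negative impact of activation outliers. Specifically, QuBLAST first analyzes the sensitivity of the different attention blocks in the pre-trained model through cross-entropy loss analysis. It then uses this sensitivity analysis to determine the weight quantization level of each attention block in the model. Furthermore, QuBLAST uses an activation scaling map for each block to control the range of activation values and reduce the negative impact of activation outliers, which leads to better quantization results. Experiments show that QuBLAST reduces model size by 40%-45.2% across different model architectures (Qwen3-8B, Llama3-8B, Mistral v0.1-8B, and Falcon H1R-7B), while keeping performance within a 5% perplexity increase on the WikiText-2 and WikiText-103 datasets.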
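
The abstract gives only an outline of the method, so the following is a minimal Python sketch under stated assumptions, not the authors' implementation. It illustrates the two ideas named above on a toy NumPy model: scoring each block by the cross-entropy loss increase when only that block is quantized, then giving more bits to the more sensitive blocks. All function names, the residual tanh "blocks", the 8-bit/4-bit choice, and the half-and-half split are hypothetical. The activation scaling step uses a SmoothQuant-style per-channel scale as a stand-in; the paper's activation scaling map may be defined differently.

```python
import numpy as np

def quantize(w, bits):
    """Symmetric uniform quantization; returns the dequantized weights."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    if scale == 0:
        return w.copy()
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

def cross_entropy(logits, labels):
    """Mean cross-entropy of integer labels under softmax(logits)."""
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

def forward(blocks, x):
    """Toy stand-in for a stack of attention blocks (assumption)."""
    for w in blocks:
        x = x + np.tanh(x @ w)
    return x

def block_sensitivity(blocks, x, labels, probe_bits=4):
    """Loss increase when only block i is quantized to probe_bits."""
    base = cross_entropy(forward(blocks, x), labels)
    scores = []
    for i, w in enumerate(blocks):
        trial = list(blocks)
        trial[i] = quantize(w, probe_bits)
        scores.append(cross_entropy(forward(trial, x), labels) - base)
    return scores

def assign_bits(scores, high=8, low=4, keep_ratio=0.5):
    """Hypothetical rule: most sensitive blocks get `high` bits, the rest `low`."""
    n_high = int(round(len(scores) * keep_ratio))
    order = np.argsort(scores)[::-1]  # most sensitive first
    bits = [low] * len(scores)
    for i in order[:n_high]:
        bits[i] = high
    return bits

def scale_activations(x, w, alpha=0.5):
    """SmoothQuant-style stand-in: divide x per channel, fold the scale into w."""
    x_max = np.maximum(np.abs(x).max(axis=0), 1e-8)
    w_max = np.maximum(np.abs(w).max(axis=1), 1e-8)
    s = x_max ** alpha / w_max ** (1 - alpha)
    return x / s, w * s[:, None]  # (x / s) @ (s * w) equals x @ w

# Toy data: 4 blocks of width 8, 32 calibration samples (all synthetic).
rng = np.random.default_rng(0)
blocks = [rng.normal(scale=0.3, size=(8, 8)) for _ in range(4)]
x = rng.normal(size=(32, 8))
labels = rng.integers(0, 8, size=32)

scores = block_sensitivity(blocks, x, labels)
bits = assign_bits(scores)
mixed = [quantize(w, b) for w, b in zip(blocks, bits)]
print("bits per block:", bits)
print("loss, full precision:", cross_entropy(forward(blocks, x), labels))
print("loss, mixed precision:", cross_entropy(forward(mixed, x), labels))
```

In this sketch the scaling step leaves the product `x @ w` unchanged while moving part of the activation range into the weights, which is the usual motivation for this family of techniques. How QuBLAST builds its per-block scaling map and how it maps sensitivity to bit-widths would need to be taken from the paper itself.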

Related papers