Fugu-MT 論文翻訳(概要): Agile-Quant: Activation-Guided Quantization for Faster Inference of LLMs on the Edge

論文の概要: Agile-Quant: Activation-Guided Quantization for Faster Inference of LLMs on the Edge

arxiv url: http://arxiv.org/abs/2312.05693v1
Date: Sat, 9 Dec 2023 22:12:52 GMT
ステータス: 翻訳完了
システム内更新日: 2023-12-12 19:26:45.514184
Title: Agile-Quant: Activation-Guided Quantization for Faster Inference of LLMs on the Edge
Title（参考訳）: Agile-Quant: エッジ上のLCMの高速推論のためのアクティベーションガイド付き量子化
Authors: Xuan Shen, Peiyan Dong, Lei Lu, Zhenglun Kong, Zhengang Li, Ming Lin, Chao Wu, Yanzhi Wang
Abstract要約: 大きな言語モデル(LLM)は、複雑な言語モデリングタスクにおける印象的なパフォーマンスで際立っている。近年の研究では、エンド・ツー・エンドのタスク性能に最小限の影響を伴って、8ビット以下のウェイト量子化が可能であることが示されている。我々は、人気のある大規模言語モデルのためのアクティベーション誘導量子化フレームワークであるAgile-Quantを提案する。
参考スコア（独自算出の注目度）: 45.690907522226794
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models (LLMs) stand out for their impressive performance in intricate language modeling tasks. However, their demanding computational and memory needs pose obstacles for broad use on edge devices. Quantization is then introduced to boost LLMs' on-device efficiency. Recent works show that 8-bit or lower weight quantization is feasible with minimal impact on end-to-end task performance, while the activation is still not quantized. On the other hand, mainstream commodity edge devices still struggle to execute these sub-8-bit quantized networks effectively. In this paper, we propose Agile-Quant, an activation-guided quantization framework for popular Large Language Models (LLMs), and implement an end-to-end accelerator on multiple edge devices for faster inference. Considering the hardware profiling and activation analysis, we first introduce a basic activation quantization strategy to balance the trade-off of task performance and real inference speed. Then we leverage the activation-aware token pruning technique to reduce the outliers and the adverse impact on attentivity. Ultimately, we utilize the SIMD-based 4-bit multiplier and our efficient TRIP matrix multiplication to implement the accelerator for LLMs on the edge. We apply our framework on different scales of LLMs including LLaMA, OPT, and BLOOM with 4-bit or 8-bit for the activation and 4-bit for the weight quantization. Experiments show that Agile-Quant achieves simultaneous quantization of model weights and activations while maintaining task performance comparable to existing weight-only quantization methods. Moreover, in the 8- and 4-bit scenario, Agile-Quant achieves an on-device speedup of up to 2.55x compared to its FP16 counterparts across multiple edge devices, marking a pioneering advancement in this domain.
Abstract（参考訳）: 大きな言語モデル(LLM)は、複雑な言語モデリングタスクにおける印象的なパフォーマンスで際立っている。しかし、それらの要求する計算とメモリは、エッジデバイスで広く使用するための障害となる。その後、LCMのデバイス上での効率を高めるために量子化が導入される。近年の研究では、8ビット以下の量子化が可能であり、エンド・ツー・エンドのタスク性能への影響は最小限であるが、アクティベーションは定量化されていない。一方、一般的なエッジデバイスは、これらのサブ8ビット量子化ネットワークを効果的に実行するのに苦労している。本稿では,人気のある大規模言語モデル(llms)のためのアクティベーション誘導量子化フレームワークであるagile-quantを提案する。ハードウェアのプロファイリングとアクティベーション分析を考慮し,タスク性能のトレードオフと実際の推論速度のバランスをとるための基本的なアクティベーション量子化戦略を導入する。次に,アクティベーション・アウェア・トークン・プルーニング技術を利用して,アウトリアーとアテンティビティへの悪影響を低減した。最終的に、SIMDベースの4ビット乗算器と効率的なTRIP行列乗算を用いて、エッジ上のLCMのアクセラレータを実装する。 llama, opt, bloom, 4ビットまたは8ビットのアクティベーションと4ビットの重み量子化を含む,さまざまなスケールのllmに適用した。実験によると、agile-quantは、既存のウェイトのみの量子化法に匹敵するタスクパフォーマンスを維持しながら、モデルウェイトとアクティベーションの同時量子化を達成している。さらに、8ビットと4ビットのシナリオでは、Agile-Quantは複数のエッジデバイスにまたがるFP16と比較して、デバイス上でのスピードアップを最大2.55倍に達成している。

関連論文リスト

Quantizing Large Language Models for Code Generation: A Differentiated Replication [51.85505914274633]
大規模言語モデル(LLM)は、コード生成において印象的な能力を示しており、特に自然言語で記述された要求を自動的に実装する。 LLMはメモリ(そして結果として炭素)のフットプリントに重大な課題をもたらす。 LLM量子化の新しいフロンティアは4ビット精度であり、平均メモリフットプリントが70%減少する。
論文参考訳（メタデータ） (2025-03-10T09:26:08Z)
"Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization [67.3213104337679]
我々は,学術ベンチマークや実世界のタスクにまたがる一般的な量子化形式を評価する。 W4A16は同期デプロイメントと中間層アーキテクチャの非同期デプロイメントに最適なコスト効率を提供する。
論文参考訳（メタデータ） (2024-11-04T18:21:59Z)
MobileQuant: Mobile-friendly Quantization for On-device Language Models [31.75012542498791]
大規模言語モデル(LLM)は言語処理に革命をもたらし、複数のアプリケーションにまたがって優れた結果をもたらしている。エッジデバイスにLSMをデプロイすることは、メモリ、エネルギ、計算コストに関していくつかの課題をもたらす。我々は、従来の重み等価変換作業を拡張する、MobileQuantと呼ばれる単純な後学習量子化手法を導入する。
論文参考訳（メタデータ） (2024-08-25T20:41:22Z)
OneBit: Towards Extremely Low-bit Large Language Models [66.29839811207617]
本稿では, LLMの重量行列を1ビットに大胆に定量化し, LLMの極低ビット幅展開への道を開く。実験によると、OneBitは(LLaMAモデルの非量子化性能の少なくとも81%)優れたパフォーマンスを、堅牢なトレーニングプロセスで達成している。
論文参考訳（メタデータ） (2024-02-17T14:26:57Z)
EdgeQAT: Entropy and Distribution Guided Quantization-Aware Training for the Acceleration of Lightweight LLMs on the Edge [40.85258685379659]
トレーニング後の量子化(PTQ)メソッドは、ウェイト、アクティベーション、KVキャッシュを同時に8ビット以下に定量化する際に品質が低下する。多くのQAT(Quantization-Aware Training)は、モデルウェイトを定量化し、アクティベーションを未修正のまま残し、エッジ上の推論加速度の量子化の可能性を完全に活用しない。 We propose EdgeQAT, the Entropy and Distribution Guided QAT for the optimization of light LLMs to achieve inference acceleration on Edge devices。
論文参考訳（メタデータ） (2024-02-16T16:10:38Z)
Enhancing Computation Efficiency in Large Language Models through Weight and Activation Quantization [12.655230451207956]
本稿では,Large Language Models(LLMs)における後学習量子化(PTQ)に焦点を当てる。本稿では,アクティベーション量子化対応スケーリング(AQAS)とシーケンス長対応キャリブレーション(SLAC)の2つの革新的な手法を提案する。我々の技術はタスクの精度を大幅に向上させ、完全精度モデルに匹敵するレベルまで向上することを示した。
論文参考訳（メタデータ） (2023-11-09T06:19:51Z)
QUIK: Towards End-to-End 4-Bit Inference on Generative Large Language Models [57.04178959678024]
重み付けとアクティベーションの両方を4ビットにキャストすることで、大きな生成モデルに対する推論計算の大部分が実行可能であることを示す。これをQUIKと呼ばれるハイブリッド量子化戦略により実現し、重みとアクティベーションの大部分を4ビットに圧縮する。我々は、QUIKフォーマットを高効率なレイヤワイドランタイムに適合させるGPUカーネルを提供し、これにより、エンドツーエンドのスループットが3.4倍に向上する。
論文参考訳（メタデータ） (2023-10-13T17:15:05Z)
Dual Grained Quantization: Efficient Fine-Grained Quantization for LLM [6.85331857224501]
LLM(Large Language Models)は、メモリ要件と計算能力に関する重要なハードウェア上の課題を提起する。 LLMには2つの主要な量子化スキームがある: 粗粒(textite.g.$ channel-wise)量子化と細粒(textite.g.$ group-wise)量子化である。我々は、高速な推論速度を確保しつつ優れた性能を維持するLLMのための新しいA8W4量子化であるDual Grained Quantization (DGQ)を紹介する。
論文参考訳（メタデータ） (2023-10-07T14:50:28Z)
FPTQ: Fine-grained Post-Training Quantization for Large Language Models [28.11564378745513]
利用可能なオープンソースLLMのための新しいW4A8ポストトレーニング量子化法を提案する。我々は,BLOOM,LLaMA,LLaMA-2における最先端のW4A8量子化性能を標準ベンチマークで取得する。
論文参考訳（メタデータ） (2023-08-30T12:18:18Z)
OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models [57.27101446992148]
大規模言語モデル(LLM)は自然言語処理タスクに革命をもたらした。近年のPTQ法はメモリフットプリントの削減とLLMの計算効率の向上に有効である。多様な量子化設定において優れた性能を実現するLLMのOmnidirectly calibrated Quantization手法を提案する。
論文参考訳（メタデータ） (2023-08-25T02:28:35Z)
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration [54.692405042065815]
LLM低ビット量のみの量子化のためのハードウェアフレンドリーなアプローチであるActivation-Aware Weight Quantization (AWQ)を提案する。 AWQ は 1% の正重みしか保護せず,命令調整型 LM とマルチモーダル LM の量子化性能に優れる。また,4ビットオンデバイスLLM/VLMに適した,効率的なフレキシブルな推論フレームワークであるTinyChatを実装した。
論文参考訳（メタデータ） (2023-06-01T17:59:10Z)
DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures using Lookup Tables [49.965024476651706]
DeepGEMMはSIMDハードウェア上で超高精度畳み込みニューラルネットワークを実行するためのルックアップテーブルベースのアプローチである。実装は、x86プラットフォーム上で、対応する8ビット整数カーネルを最大1.74倍の性能で上回る。
論文参考訳（メタデータ） (2023-04-18T15:13:10Z)
SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models [14.929695160346276]
大規模言語モデル(LLM)は優れた性能を示すが、計算とメモリ集約性がある。 SmoothQuant, トレーニング不要, 精度保存, 汎用的なポストトレーニング量子化ソリューションを提案する。最大1.56倍の高速化と2倍のメモリ削減を実現した。
論文参考訳（メタデータ） (2022-11-18T18:59:33Z)

関連論文リストは本サイト内にある論文のタイトル・アブストラクトから自動的に作成しています。

指定された論文の情報です。
本サイトの運営者は本サイト（すべての情報・翻訳含む）の品質を保証せず、本サイト（すべての情報・翻訳含む）を使用して発生したあらゆる結果について一切の責任を負いません。