Fugu-MT Paper Translation (Abstract): NoMAD-Attention: Efficient LLM Inference on CPUs Through Multiply-add-free Attention

Paper overview: NoMAD-Attention: Efficient LLM Inference on CPUs Through Multiply-add-free Attention

arxiv url: http://arxiv.org/abs/2403.01273v1
Date: Sat, 2 Mar 2024 17:29:22 GMT
Status: Translation complete
Last updated in system: 2024-03-05 14:29:55.198248
Title: NoMAD-Attention: Efficient LLM Inference on CPUs Through Multiply-add-free Attention
Authors: Tianyi Zhang, Jonah Wonkyu Yi, Bowen Yao, Zhaozhuo Xu, Anshumali Shrivastava
Abstract summary: NoMAD-Attention is an efficient attention algorithm that replaces MAD operations with in-register lookups. It computes attention scores through repeated fast accesses to SIMD registers. Evaluations show that NoMAD-Attention maintains the quality of the original LLMs well and speeds up a 4-bit quantized LLaMA-7B-based model by up to 2$\times$ at 16k context length.
Reference score (interest level computed independently by this site): 35.76200005898016
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language model inference on Central Processing Units (CPU) is challenging due to the vast quantities of expensive Multiply-Add (MAD) matrix operations in the attention computations. In this paper, we argue that there is a rare gem in modern CPUs, Single-Instruction-Multiple-Data (SIMD) registers, which allow for ultra-low-latency lookups in batch. We leverage this unique capability of CPUs to propose NoMAD-Attention, an efficient attention algorithm that replaces MAD operations with in-register lookups. Through hardware-aware algorithmic designs, NoMAD-Attention achieves the computation of attention scores using repeated fast accesses to SIMD registers despite their highly limited sizes. Moreover, NoMAD-Attention works with pre-trained attention-based LLMs without model finetuning. Empirical evaluations demonstrate that NoMAD-Attention maintains the quality of the original LLMs well, and speeds up the 4-bit quantized LLaMA-7B-based model by up to 2$\times$ at 16k context length. Our results are reproducible at https://github.com/tonyzhang617/nomad-dist.
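As a rough illustration of what "replacing MAD operations with lookups" can look like, the sketch below approximates query-key dot products with a product-quantization-style lookup table. This is an assumption made for illustration and not the authors' implementation: the abstract does not describe the codebook design, the 16-entry codebooks (4-bit codes) are a hypothetical choice, all function names are invented here, and real in-register SIMD lookups are not modeled. Building the table still multiplies once per query; only the per-key scoring step is reduced to lookups and additions.

```python
def dot(a, b):
    """Ordinary multiply-add dot product, used only to build the table."""
    return sum(x * y for x, y in zip(a, b))


def split(vec, n_sub):
    """Split a vector into n_sub equal-length sub-vectors."""
    step = len(vec) // n_sub
    return [vec[i * step:(i + 1) * step] for i in range(n_sub)]


def nearest(sub, centroids):
    """Index of the centroid closest to sub in squared L2 distance."""
    return min(
        range(len(centroids)),
        key=lambda c: sum((s - t) ** 2 for s, t in zip(sub, centroids[c])),
    )


def encode_keys(keys, codebooks):
    """Replace each key sub-vector by the index of its nearest centroid."""
    n_sub = len(codebooks)
    return [
        [nearest(sub, codebooks[j]) for j, sub in enumerate(split(key, n_sub))]
        for key in keys
    ]


def build_lut(query, codebooks):
    """lut[j][c] = dot(query sub-vector j, centroid c of sub-space j)."""
    n_sub = len(codebooks)
    return [
        [dot(q_sub, centroid) for centroid in codebooks[j]]
        for j, q_sub in enumerate(split(query, n_sub))
    ]


def lookup_scores(codes, lut):
    """Approximate q.k for every key using table lookups and additions only."""
    return [sum(lut[j][c] for j, c in enumerate(code)) for code in codes]
```

In use, `encode_keys` would run once when keys are cached, `build_lut` once per query, and `lookup_scores` once per query over all cached keys. The scores are exact only when every key sub-vector coincides with a centroid; otherwise they are approximations whose error depends on the codebooks.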

Related papers

Skipping Computations in Multimodal LLMs [63.29737699997859]
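This study examines inference-time redundancy in multimodal large language models (MLLMs). We propose several ways to skip computation, such as skipping entire blocks, FFN layers, or self-attention layers. We show that a large amount of computation can be avoided at inference time.
Reference translation (metadata) (2024-10-12T09:21:45Z)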
Tender: Accelerating Large Language Models via Tensor Decomposition and Runtime Requantization [0.6445087473595953]
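Large language models (LLMs) show excellent performance on a variety of machine learning tasks. Deploying LLM inference is difficult because of its high compute and memory requirements. We propose Tender, an algorithm-hardware co-design solution that enables efficient deployment of LLM inference at low precision.
Reference translation (metadata) (2024-06-16T09:51:55Z)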
VeLoRA: Memory Efficient Training using Rank-1 Sub-Token Projections [35.133698935322634]
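Large language models (LLMs) have recently emerged as powerful tools for many language-processing tasks. We identify and characterize the key components needed for efficient model convergence using gradient descent. The result is a cheap, memory-efficient algorithm for both fine-tuning and pre-training.
Reference translation (metadata) (2024-05-28T09:23:14Z)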
Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment [56.44025052765861]
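Large language models (LLMs) have revolutionized natural language processing (NLP), but their size creates a computational bottleneck. We propose a new approach for creating accurate, sparse foundational versions of high-performing LLMs. We show a total speedup on CPUs of up to 8.6x for sparse-quantized LLaMA models.
Reference translation (metadata) (2024-05-06T16:03:32Z)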
Self-Selected Attention Span for Accelerating Large Language Model Inference [10.305434265471938]
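Large language models (LLMs) can solve challenging tasks. Their inference computation is highly inefficient, however, because the number of tokens they must attend to grows as new tokens are generated. We use the LLM's own problem-solving ability to optimize its inference-time efficiency.
Reference translation (metadata) (2024-04-14T19:36:04Z)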
Not All Attention is Needed: Parameter and Computation Efficient Transfer Learning for Multi-modal Large Language Models [73.48675708831328]
We propose a new parameter- and computation-efficient tuning method for Multi-modal Large Language Models (MLLMs). The Efficient Attention Skipping (EAS) method evaluates attention redundancy and skips the less important MHAs to speed up inference. Experiments show that EAS not only retains high performance and parameter efficiency but also greatly accelerates inference.
Reference translation (metadata) (2024-03-22T14:20:34Z)
An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models [65.37846460916042]
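We find that attention computation over visual tokens is extremely inefficient in the deep layers of LVLMs. We introduce FastV, a versatile plug-and-play method designed to optimize computational efficiency.
Reference translation (metadata) (2024-03-11T14:35:32Z)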
BiLLM: Pushing the Limit of Post-Training Quantization for LLMs [53.31402059062365]
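BiLLM is a 1-bit post-training quantization scheme tailored for pretrained large language models. It achieves a perplexity of 8.41 on LLaMA2-70B with only 1.08-bit weights across various LLM families.
Reference translation (metadata) (2024-02-06T09:26:34Z)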
Efficient LLM Inference on CPUs [8.802223672775844]
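Large language models (LLMs) have demonstrated remarkable performance and great potential across a wide range of tasks. Deploying these models has been difficult, however, because of their astronomical number of parameters. We propose an effective approach for making LLM deployment more efficient.
Reference translation (metadata) (2023-11-01T13:08:50Z)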

The related-paper list is generated automatically from the titles and abstracts of papers on this site.

The operator of this site does not guarantee the quality of this site (including all information and translations) and accepts no responsibility for any results arising from its use.