Fugu-MT paper translation (abstract): F-BFQ: Flexible Block Floating-Point Quantization Accelerator for LLMs

Paper overview: F-BFQ: Flexible Block Floating-Point Quantization Accelerator for LLMs

arxiv url: http://arxiv.org/abs/2510.13401v1
Date: Wed, 15 Oct 2025 10:56:37 GMT
Status: translation complete
Last updated in system: 2025-10-16 20:13:28.630096
Title: F-BFQ: Flexible Block Floating-Point Quantization Accelerator for LLMs
Authors: Jude Haris, José Cano
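Abstract summary: Large language models (LLMs) are becoming increasingly prominent in daily tasks. LLMs can be run on resource-constrained edge devices. LLMs are typically quantized with mixed BFP quantization across the model layers.
Reference score (independently computed attention metric): 0.6302369456012739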
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Large Language Models (LLMs) have become increasingly prominent for daily tasks, from improving sound-to-text translation to generating additional frames for the latest video games. With the help of LLM inference frameworks, such as llama.cpp, which support optimizations such as KV-caching and quantization, it is now easier than ever to deploy LLMs on edge devices. Quantization is fundamental to enable LLMs on resource-constrained edge devices, and llama.cpp utilizes block floating point (BFP) quantization to drastically reduce the bit width of weights and input tensors, the memory footprint, and the computational power required to run LLMs. LLMs are typically quantized with mixed BFP quantization across the model layers to reduce the loss of model accuracy due to quantization. Therefore, to efficiently accelerate across the layers of BFP-quantized LLMs, specialized accelerators need to support different BFP variants without reconfiguration. To address this issue, we propose a Flexible Block Floating-Point Quantization (F-BFQ) accelerator, which can dynamically switch between two BFP quantization variants and perform matrix multiplication (MatMul) operations. Our initial F-BFQ accelerator design, deployed on the AMD Kria board, reduces inference time by 1.4x on average over the Arm NEON-based CPU execution across three BFP quantized LLMs while achieving 5.2 tokens per second (~3.9 words per second).
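
The abstract describes block-wise quantization and MatMul over quantized blocks without giving the details. The following is a minimal Python sketch of the general idea, not the paper's accelerator design or llama.cpp's actual formats: each block of values shares one floating-point scale and stores small signed integers, and a dot product accumulates integer products per block before applying the two scales. The block size of 32, the `bits` parameter, and all function names are assumptions chosen for illustration; switching `bits` between calls only loosely mimics supporting more than one BFP variant.

```python
import math

BLOCK_SIZE = 32  # assumed block length, for illustration only


def quantize_block(values, bits=8):
    """Quantize one block to signed integers that share a single scale."""
    qmax = (1 << (bits - 1)) - 1          # 127 for 8 bits, 7 for 4 bits
    amax = max(abs(v) for v in values)    # largest magnitude in the block
    scale = amax / qmax if amax > 0 else 1.0
    # Round to the nearest integer step and clamp to the symmetric range.
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return scale, q


def quantize(vector, bits=8, block_size=BLOCK_SIZE):
    """Split a vector into fixed-size blocks and quantize each block."""
    if len(vector) % block_size != 0:
        raise ValueError("vector length must be a multiple of block_size")
    return [
        quantize_block(vector[i:i + block_size], bits)
        for i in range(0, len(vector), block_size)
    ]


def bfp_dot(a_blocks, b_blocks):
    """Dot product of two block-quantized vectors.

    Integer multiply-accumulate runs inside each block; the two block
    scales are applied once per block, which is where the savings in
    floating-point work come from.
    """
    total = 0.0
    for (scale_a, qa), (scale_b, qb) in zip(a_blocks, b_blocks):
        acc = sum(x * y for x, y in zip(qa, qb))  # integer accumulation
        total += scale_a * scale_b * acc
    return total


def bfp_matvec(weight_rows, x_blocks):
    """Matrix-vector product where each weight row is block-quantized."""
    return [bfp_dot(row, x_blocks) for row in weight_rows]


# Example data: two vectors of 64 values, i.e. two blocks each.
a = [math.sin(i) for i in range(64)]
b = [math.cos(i) for i in range(64)]
qa = quantize(a, bits=8)
qb = quantize(b, bits=8)
exact = sum(x * y for x, y in zip(a, b))
approx = bfp_dot(qa, qb)
```

Comparing `approx` with `exact` shows the quantization error for a given bit width; calling `quantize(a, bits=4)` gives a coarser variant with a larger error and a smaller footprint.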
