Fugu-MT Paper Translation (Abstract): APT-LLM: Exploiting Arbitrary-Precision Tensor Core Computing for LLM Acceleration

Paper overview: APT-LLM: Exploiting Arbitrary-Precision Tensor Core Computing for LLM Acceleration

arxiv url: http://arxiv.org/abs/2508.19087v1
Date: Tue, 26 Aug 2025 14:48:29 GMT
Status: Translation complete
Last updated in system: 2025-08-27 17:42:38.888867
Title: APT-LLM: Exploiting Arbitrary-Precision Tensor Core Computing for LLM Acceleration
Title (reference translation): APT-LLM: Exploiting Arbitrary-Precision Tensor Core Computing for LLM Acceleration
Authors: Shaobo Ma, Chao Fang, Haikuo Shao, Zhongfeng Wang
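Abstract summary: Large language models (LLMs) have revolutionized AI applications, yet their enormous computational demands severely limit deployment and real-time performance. Running ultra-low-bit quantized LLMs efficiently at arbitrary precision is difficult on GPUs, primarily because of the limited support for GPU Tensor Cores, inefficient memory management, and inflexible kernel optimizations. This paper proposes APT-LLM, a comprehensive acceleration scheme for arbitrary-precision LLMs.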
Reference score (independently computed attention score): 5.075697428779204
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs) have revolutionized AI applications, yet their enormous computational demands severely limit deployment and real-time performance. Quantization methods can help reduce computational costs; however, attaining the extreme efficiency associated with ultra-low-bit quantized LLMs at arbitrary precision presents challenges on GPUs. This is primarily due to the limited support for GPU Tensor Cores, inefficient memory management, and inflexible kernel optimizations. To tackle these challenges, we propose a comprehensive acceleration scheme for arbitrary-precision LLMs, namely APT-LLM. First, we introduce a novel data format, bipolar-INT, which allows for efficient and lossless conversion with signed INT, while also being more conducive to parallel computation. We also develop a matrix multiplication (MatMul) method allowing for arbitrary precision by dismantling and reassembling matrices at the bit level. This method provides flexible precision and optimizes the utilization of GPU Tensor Cores. In addition, we propose a memory management system focused on data recovery, which strategically employs fast shared memory to substantially increase kernel execution speed and reduce memory access latency. Finally, we develop a kernel mapping method that dynamically selects the optimal configurable hyperparameters of kernels for varying matrix sizes, enabling optimal performance across different LLM architectures and precision settings. In LLM inference, APT-LLM achieves up to a 3.99$\times$ speedup compared to FP16 baselines and a 2.16$\times$ speedup over NVIDIA CUTLASS INT4 acceleration on RTX 3090. On RTX 4090 and H800, APT-LLM achieves up to 2.44$\times$ speedup over FP16 and 1.65$\times$ speedup over CUTLASS integer baselines.
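The abstract describes a MatMul method that dismantles matrices at the bit level and reassembles the results. The sketch below illustrates only that general idea in plain Python, under stated assumptions: it uses ordinary unsigned integers rather than the paper's bipolar-INT format (whose encoding the abstract does not define), and the helper names `bit_planes`, `matmul`, and `bitwise_matmul` are hypothetical, not taken from the paper. On a GPU, each 1-bit product would presumably map to a Tensor Core operation; here a plain integer product stands in for it.

```python
def bit_planes(matrix, bits):
    """Split a matrix of unsigned ints into `bits` binary matrices (LSB first)."""
    return [[[(v >> k) & 1 for v in row] for row in matrix] for k in range(bits)]


def matmul(a, b):
    """Plain integer matrix product; a stand-in for a 1-bit hardware MMA."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def bitwise_matmul(a, b, a_bits, b_bits):
    """Compute a @ b by summing 1-bit products weighted by 2**(i + j)."""
    rows, cols = len(a), len(b[0])
    out = [[0] * cols for _ in range(rows)]
    a_planes = bit_planes(a, a_bits)
    b_planes = bit_planes(b, b_bits)
    for i, ap in enumerate(a_planes):
        for j, bp in enumerate(b_planes):
            partial = matmul(ap, bp)  # product of two binary matrices
            for r in range(rows):
                for c in range(cols):
                    # Bit i of a times bit j of b carries weight 2**(i + j).
                    out[r][c] += partial[r][c] << (i + j)
    return out


A = [[3, 5], [7, 1]]  # values fit in 3 bits (assumed activation precision)
W = [[2, 1], [0, 3]]  # values fit in 2 bits (assumed weight precision)
assert bitwise_matmul(A, W, 3, 2) == matmul(A, W)
```

Because the two bit widths are independent loop bounds, any precision pair (for example 3-bit by 2-bit, as here) reuses the same 1-bit primitive, which is one plausible reading of how bit-level decomposition yields "arbitrary precision".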
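Abstract (reference translation): Large language models (LLMs) have revolutionized AI applications, but their enormous computational demands severely limit deployment and real-time performance. Quantization methods help reduce computational cost, but achieving the extreme efficiency associated with ultra-low-bit quantized LLMs at arbitrary precision is challenging on GPUs. This is mainly due to the limited support for GPU Tensor Cores, inefficient memory management, and inflexible kernel optimizations. To address these challenges, we propose APT-LLM, a comprehensive acceleration scheme for arbitrary-precision LLMs. First, we introduce a new data format, bipolar-INT, which enables efficient, lossless conversion with signed INT and is better suited to parallel computation. We also develop a matrix multiplication (MatMul) method that supports arbitrary precision by decomposing and reassembling matrices at the bit level. This method provides flexible precision and optimizes the utilization of GPU Tensor Cores. In addition, we propose a memory management system focused on data recovery, which strategically uses fast shared memory to substantially increase kernel execution speed and reduce memory access latency. Finally, we develop a kernel mapping method that dynamically selects the optimal configurable kernel hyperparameters for varying matrix sizes, achieving optimal performance across different LLM architectures and precision settings. In LLM inference, APT-LLM achieves up to a 3.99$\times$ speedup compared to FP16 baselines and a 2.16$\times$ speedup over NVIDIA CUTLASS INT4 acceleration on RTX 3090. On RTX 4090 and H800, APT-LLM achieves up to 2.44$\times$ speedup over FP16 and 1.65$\times$ speedup over CUTLASS integer baselines.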
