Fugu-MT Paper Translation (Abstract): Pushing the Envelope of LLM Inference on AI-PC

Paper overview: Pushing the Envelope of LLM Inference on AI-PC

arxiv url: http://arxiv.org/abs/2508.06753v1
Date: Fri, 08 Aug 2025 23:33:38 GMT
Status: Translation complete
Last updated in system: 2025-08-12 21:23:28.531314
Title: Pushing the Envelope of LLM Inference on AI-PC
Title (reference translation): Advancing LLM Inference on AI-PC
Authors: Evangelos Georganas, Dhiraj Kalamkar, Alexander Heinecke
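Abstract summary: Ultra-low-bit models (1/1.58/2-bit) match the perplexity and end-task performance of their full-precision counterparts at the same model size. The computational efficiency of state-of-the-art inference runtimes (e.g., bitnet.cpp) remains underexplored. We first design and implement 1-bit and 2-bit microkernels that achieve peak computational efficiency. We then present end-to-end inference results with 2-bit models that outperform the current SOTA runtime, bitnet.cpp.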
Reference score (independently computed attention metric): 45.081663877447816
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The advent of ultra-low-bit LLM models (1/1.58/2-bit), which match the perplexity and end-task performance of their full-precision counterparts using the same model size, is ushering in a new era of LLM inference for resource-constrained environments such as edge devices and AI PCs. While these quantization advances promise models that are more cost-effective in terms of latency, memory, throughput, and energy consumption, the computational efficiency of state-of-the-art (SOTA) inference runtimes (e.g., bitnet.cpp) used to deploy them remains underexplored. In this work, we take a bottom-up approach: we first design and implement 1-bit and 2-bit microkernels optimized for modern CPUs, achieving peak computational efficiency across a variety of CPU platforms. We integrate these microkernels into a state-of-the-art LLM inference framework, namely PyTorch-TPP, and present end-to-end inference results with 2-bit models that outperform the current SOTA runtime bitnet.cpp by up to 2.2x, and deliver up to 7x speedup compared to the 16-bit model inference. Our optimized runtime advances the state of LLM inference on AI PCs and edge devices, paving the way for efficient deployment of ultra-low-bit LLM models.
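Abstract (reference translation): The advent of ultra-low-bit LLM models (1/1.58/2-bit) is ushering in a new era of LLM inference for resource-constrained environments such as edge devices and AI PCs. These quantization advances promise models that are more cost-effective in terms of latency, memory, throughput, and energy consumption, but the computational efficiency of the state-of-the-art (SOTA) inference runtimes (e.g., bitnet.cpp) used to deploy them remains underexplored. We first design and implement 1-bit and 2-bit microkernels optimized for modern CPUs, achieving peak computational efficiency across a variety of CPU platforms. We integrate these microkernels into PyTorch-TPP, a state-of-the-art LLM inference framework, and present end-to-end inference results with 2-bit models that outperform the current SOTA runtime bitnet.cpp by up to 2.2x and deliver up to 7x speedup over 16-bit model inference. Our optimized runtime advances the state of LLM inference on AI PCs and edge devices, paving the way for efficient deployment of ultra-low-bit LLM models.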

Related papers