Fugu-MT Paper Translation (Abstract): Efficient, VRAM-Constrained xLM Inference on Clients

Paper abstract: Efficient, VRAM-Constrained xLM Inference on Clients

arxiv url: http://arxiv.org/abs/2604.26334v1
Date: Wed, 29 Apr 2026 06:35:35 GMT
Status: Translation complete
Last updated in system: 2026-04-30 15:59:36.273997
Title: Efficient, VRAM-Constrained xLM Inference on Clients
Title (reference translation): Efficient, VRAM-Constrained xLM Inference on Clients
Authors: Aditya Ukarande, Deep Shekhar, Marc Blackstein, Ram Rangan,
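Abstract summary: This paper presents pipelined sharding, a novel, benchmark-profile-guided CPU-GPU hybrid scheduling technique that achieves efficient, VRAM-constrained inference for dense and mixture-of-experts (MoE) large language models (LLMs) on client systems. The paper is accepted at the 9th MLSys Conference (Industry Track), 2026.
Reference score (independently computed attention score): 0.0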
License: http://creativecommons.org/licenses/by/4.0/
Abstract: To usher in the next round of client AI innovation, there is an urgent need to enable efficient, lossless inference of high-accuracy large language models (LLMs) and vision language models (VLMs), jointly referred to as xLMs, on client systems. To address this, we present pipelined sharding, a novel, benchmark-profile-guided CPU-GPU hybrid scheduling technique to achieve efficient, VRAM-constrained inference for both dense and mixture-of-experts (MoE) LLMs. Using a combination of model sharding at the sub-layer level, CPU offloading, pipelined copy-compute, and prioritized tensor placement in VRAM, it optimizes both time-to-first-token (TTFT) and tokens per second (TPS) metrics, while flexibly adapting to system and inference conditions. For efficient, high-accuracy VLM inference, we combine pipelined sharding with a llama.cpp implementation of three well-understood prior ideas (jointly called VLMOpt), namely, vision tensor CPU offloading, flash attention, and vision and language model VRAM overlap avoidance. These enhancements are targeted at improving client xLM inference in future releases of two important NVIDIA products - the In-Game Inferencing software development kit (IGI SDK) and the Cosmos-Reason1 (CR1) physical AI reasoning VLM. Highlights from our rigorous evaluation spanning multiple models and client systems include: for interactive use, TTFT improves by up to 6.7x and TPS by up to 30x for LLMs, and CR1 inference's VRAM demand is down by 10x, while in batched mode, throughput improves by up to 8.2x, all compared to their respective aggressive baselines. This paper is accepted at the 9th MLSys Conference (Industry Track), 2026. Code and artifact available at: https://github.com/deepshnv/pipeshard-mlsys26-ae
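Abstract (reference translation): To support the next round of client AI innovation, efficient, lossless inference of high-accuracy large language models (LLMs) and vision language models (VLMs) must be made possible on client systems. To this end, this work proposes pipelined sharding, a novel, benchmark-profile-guided CPU-GPU hybrid scheduling technique that achieves VRAM-constrained inference for both dense and mixture-of-experts (MoE) LLMs. It combines model sharding at the sub-layer level, CPU offloading, pipelined copy-compute, and prioritized tensor placement in VRAM to optimize both the time-to-first-token (TTFT) and tokens-per-second (TPS) metrics while adapting flexibly to system and inference conditions. For efficient, high-accuracy VLM inference, pipelined sharding is combined with a llama.cpp implementation of three well-understood prior ideas, jointly called VLMOpt. These enhancements aim to improve client xLM inference in future releases of two NVIDIA products: the In-Game Inferencing software development kit (IGI SDK) and the Cosmos-Reason1 (CR1) physical AI reasoning VLM. Highlights of a rigorous evaluation spanning multiple models and client systems: in interactive use, TTFT improves by up to 6.7x and TPS by up to 30x for LLMs, and the VRAM demand of CR1 inference falls by 10x; in batched mode, throughput improves by up to 8.2x, each compared with its respective aggressive baseline. This paper is accepted at the 9th MLSys Conference (Industry Track), 2026. Code and artifact: https://github.com/deepshnv/pipeshard-mlsys26-ae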
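
The abstract names pipelined copy-compute as one ingredient of pipelined sharding but does not describe a concrete schedule. As a rough illustration only, the Python sketch below models a generic two-stage pipeline in which the host-to-GPU copy of one shard overlaps the compute of the previous one, limited by a fixed number of staging buffers. The function names (`serial_time`, `pipelined_time`), the `buffers` parameter, and the per-shard timings are all hypothetical; none of them come from the paper or its released code, and the model ignores real-world effects such as transfer contention.

```python
from typing import List, Tuple

# Each shard is (copy_ms, compute_ms): time to copy its weights from CPU
# memory into VRAM, and time to run it on the GPU. Values are made up.
Shard = Tuple[float, float]


def serial_time(shards: List[Shard]) -> float:
    """Total time when every copy must finish before its compute starts
    and nothing overlaps."""
    return sum(copy_ms + compute_ms for copy_ms, compute_ms in shards)


def pipelined_time(shards: List[Shard], buffers: int = 2) -> float:
    """Total time for an idealized copy/compute pipeline.

    Assumptions (illustrative, not taken from the paper):
    - one copy engine and one compute engine, each handling shards in order;
    - `buffers` VRAM staging slots, so the copy of shard i cannot start
      until the compute of shard i - buffers has released its slot.
    """
    copy_done: List[float] = []
    compute_done: List[float] = []
    for i, (copy_ms, compute_ms) in enumerate(shards):
        prev_copy = copy_done[i - 1] if i > 0 else 0.0
        slot_free = compute_done[i - buffers] if i >= buffers else 0.0
        copy_start = max(prev_copy, slot_free)
        copy_done.append(copy_start + copy_ms)

        prev_compute = compute_done[i - 1] if i > 0 else 0.0
        compute_start = max(copy_done[i], prev_compute)
        compute_done.append(compute_start + compute_ms)
    return compute_done[-1] if compute_done else 0.0


if __name__ == "__main__":
    shards = [(4.0, 2.0), (4.0, 2.0), (4.0, 2.0)]  # copy-bound example
    print("serial   :", serial_time(shards))
    print("pipelined:", pipelined_time(shards, buffers=2))
    print("1 buffer :", pipelined_time(shards, buffers=1))
```

With a single staging buffer the model degenerates to the serial schedule, which is why some form of double buffering is the usual prerequisite for overlapping copies with compute.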

Related papers