Fugu-MT Paper Translation (Abstract): Memory-Bound but Not Bandwidth-Limited: The Physical AI Inference Gap in Batch-1 LLM Decode

Paper overview: Memory-Bound but Not Bandwidth-Limited: The Physical AI Inference Gap in Batch-1 LLM Decode

arxiv url: http://arxiv.org/abs/2605.30571v1
Date: Thu, 28 May 2026 21:03:14 GMT
Status: Translation complete
Last updated in system: 2026-06-01 20:56:50.232739
Title: Memory-Bound but Not Bandwidth-Limited: The Physical AI Inference Gap in Batch-1 LLM Decode
Authors: Josef Chen
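Abstract summary: Physical AI systems run a different workload from cloud LLM serving. The paper measures batch-1 decode for 7 to 8B-class GQA transformers across four NVIDIA GPUs. The achieved fraction of peak HBM bandwidth falls as peak bandwidth rises.
Reference score (site-computed attention metric): 0.0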
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Physical AI systems, including robots, autonomous vehicles, embodied agents and edge copilots, often run a different inference workload from cloud LLM serving: single-stream, batch-1 autoregressive decode, where one robot, camera feed or user session waits on the next token. This workload is usually described as memory-bandwidth-bound. Each decode step streams model weights and the active KV cache, so latency should scale with peak HBM bandwidth. We show that this account is true but incomplete. We measure batch-1 decode for three 7 to 8B-class GQA transformers across four NVIDIA GPUs: H100 SXM5, A100-80GB SXM4, L40S and L4. We evaluate context lengths from 2048 to 16384, producing 44 valid cells under a controlled bf16 SDPA setup. The achieved fraction of peak HBM bandwidth falls as peak bandwidth rises. On the headline Qwen-2.5-7B ctx=2048 cell, an L4 reaches roughly 81 percent of its analytic memory floor, while an H100 reaches only 27 percent. Physical-AI decode is memory-dominated, but faster memory does not translate into proportional latency gains. We test the missing term with a CUDA Graphs A/B experiment. On H100 at ctx=2048, CUDA Graphs improves decode latency by 1.259x across N=10 fresh sessions, with a 95 percent bootstrap confidence interval of 1.253 to 1.267. On L4, the same intervention gives only 1.028x. This isolates a launch-side overhead that becomes visible on fast GPUs but remains mostly hidden on slower, bandwidth-bound GPUs. The deployment implication is that memory savings matter only when the runtime realises them. On L4, bf16 decode sits close to the memory floor, but common quantised paths do not recover the expected 4x weight-traffic reduction: bnb-nf4 reaches 59.36 ms/step and AutoAWQ+Marlin reaches 45.24 ms/step from a 62.32 ms bf16 baseline. GPTQ+ExLlamaV2, with Ada-tuned int4 kernels, reaches 17.36 ms/step.
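The arithmetic behind the abstract's "memory floor" and quantisation figures can be sketched as follows. This is a minimal illustration, not the paper's method: the helper names (`kv_cache_bytes`, `memory_floor_ms`, `speedup`) are hypothetical, and the model shape and peak bandwidth in the example are assumed values chosen for illustration. Only the millisecond latencies are taken from the abstract.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem=2):
    """Bytes of K and V read per decode step for a GQA model (bf16 = 2 bytes/element)."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem


def memory_floor_ms(weight_bytes, kv_bytes, peak_bw_gb_per_s):
    """Lower bound on per-step latency if all bytes stream once at peak bandwidth."""
    return (weight_bytes + kv_bytes) / (peak_bw_gb_per_s * 1e9) * 1e3


def speedup(baseline_ms, variant_ms):
    """Latency ratio; values above 1.0 mean the variant is faster."""
    return baseline_ms / variant_ms


# Latencies quoted in the abstract (L4, milliseconds per decode step).
BF16_BASELINE_MS = 62.32
QUANT_PATHS_MS = {
    "bnb-nf4": 59.36,
    "AutoAWQ+Marlin": 45.24,
    "GPTQ+ExLlamaV2": 17.36,
}

if __name__ == "__main__":
    # ASSUMPTION: illustrative 7B-class GQA shape and a nominal 300 GB/s device;
    # none of these four numbers come from the abstract.
    weights = 7.6e9 * 2  # parameters * 2 bytes (bf16)
    kv = kv_cache_bytes(n_layers=28, n_kv_heads=4, head_dim=128, ctx_len=2048)
    floor = memory_floor_ms(weights, kv, peak_bw_gb_per_s=300)
    print(f"analytic floor: {floor:.2f} ms/step")

    # Realised speedup of each quantised path against the bf16 baseline.
    for name, ms in QUANT_PATHS_MS.items():
        print(f"{name}: {speedup(BF16_BASELINE_MS, ms):.2f}x")
```

Against a nominal 4x reduction in weight traffic, the quoted latencies correspond to speedups of about 1.05x (bnb-nf4), 1.38x (AutoAWQ+Marlin) and 3.59x (GPTQ+ExLlamaV2), which is the gap the abstract attributes to the runtime rather than to memory bandwidth.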
