Fugu-MT Paper Translation (Abstract): TaxBreak: Unmasking the Hidden Costs of LLM Inference Through Overhead Decomposition

Paper abstract: TaxBreak: Unmasking the Hidden Costs of LLM Inference Through Overhead Decomposition

arxiv url: http://arxiv.org/abs/2603.12465v1
Date: Thu, 12 Mar 2026 21:30:07 GMT
Status: Translation complete
Last updated in system: 2026-03-16 17:38:11.77152
Title: TaxBreak: Unmasking the Hidden Costs of LLM Inference Through Overhead Decomposition
Title (reference translation): TaxBreak: Unmasking the Hidden Costs of LLM Inference Through Overhead Decomposition
Authors: Prabhu Vellaisamy, Shreesh Tripathi, Vignesh Natarajan, Surya Santhan Thenarasu, Shawn Blanton, John P. Shen
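Abstract summary: This work introduces TaxBreak, a trace-driven methodology for decomposing host-visible orchestration overhead. The authors validate TaxBreak on NVIDIA H100 and H200 systems and derive the proposed Host-Device Balance Index (HDBI). They show that MoE models dispatch 8-11x more kernels per output token than dense models, and that for host-bound workloads, CPU single-thread performance is a first-order parameter.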
Reference score (independently computed attention score): 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Model (LLM) inference is widely used in interactive assistants and agentic systems. In latency-sensitive deployments, inference time can become dominated by host-side overheads. Existing approaches typically expose this cost only as an aggregate residual or a launch/queue metric, which is often insufficient to identify which execution layer should be optimized. This work presents TaxBreak, a trace-driven methodology for decomposing host-visible orchestration overhead into three components: framework translation time, CUDA library translation time, and kernel launch-path time. We validate TaxBreak on NVIDIA H100 and H200 systems and use it to derive our proposed Host-Device Balance Index (HDBI), a boundedness summary index that relates device-active execution to host-visible orchestration. Across representative dense and mixture-of-experts workloads in both prefill and decode, we show that aggregate latency, GPU inactivity, or boundedness ratios alone can obscure the dominant optimization target. TaxBreak instead distinguishes cases where optimization should reduce software-stack overhead from cases where the primary win comes from reducing device-side work. We further show that MoE models dispatch 8-11x more kernels per output token than dense models, and that for such host-bound workloads, CPU single-thread performance is a first-order parameter: a faster host CPU reduces orchestration overhead by 10-29% and improves end-to-end latency by up to 14%, even when paired with a slower-clocked GPU. These results position TaxBreak as a diagnostic tool for assessing whether optimization effort should target the software stack or the device-side workload execution.
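Abstract (reference translation): Large language model (LLM) inference is widely used in interactive assistants and agentic systems. In latency-sensitive deployments, inference time can become dominated by host-side overheads. Existing approaches expose this cost only as an aggregate residual or a launch/queue metric, which is often insufficient to identify which execution layer should be optimized. This work presents TaxBreak, a trace-driven methodology that decomposes host-visible orchestration overhead into three components: framework translation time, CUDA library translation time, and kernel launch-path time. The authors validate TaxBreak on NVIDIA H100 and H200 systems and derive the Host-Device Balance Index (HDBI), a boundedness summary index that relates device-active execution to host-visible orchestration. Across representative dense and mixture-of-experts workloads, in both prefill and decode, they show that aggregate latency, GPU inactivity, or boundedness ratios alone can obscure the dominant optimization target. TaxBreak instead distinguishes cases where optimization should reduce software-stack overhead from cases where the primary win comes from reducing device-side work. MoE models dispatch 8-11x more kernels per output token than dense models, and for such host-bound workloads a faster host CPU reduces orchestration overhead by 10-29% and improves end-to-end latency by up to 14%, even when paired with a slower-clocked GPU. These results position TaxBreak as a diagnostic tool for assessing whether optimization effort should target the software stack or device-side workload execution.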
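
The three-way decomposition named in the abstract can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the `StepTrace` type, its field names, and the sample timings are hypothetical, and the abstract does not give the HDBI formula, so `balance_index` is a stand-in bounded ratio (device-active time over device-active plus host-visible overhead), not the paper's definition.

```python
from dataclasses import dataclass


@dataclass
class StepTrace:
    """Per-step timings in microseconds (hypothetical fields, not from the paper)."""
    framework_us: float      # framework translation time
    cuda_library_us: float   # CUDA library translation time
    launch_path_us: float    # kernel launch-path time
    device_active_us: float  # time the GPU spent executing kernels

    @property
    def host_overhead_us(self) -> float:
        # Host-visible orchestration overhead as the sum of the three components.
        return self.framework_us + self.cuda_library_us + self.launch_path_us


def overhead_shares(trace: StepTrace) -> dict[str, float]:
    """Fraction of host-visible overhead attributed to each component."""
    parts = {
        "framework": trace.framework_us,
        "cuda_library": trace.cuda_library_us,
        "launch_path": trace.launch_path_us,
    }
    total = trace.host_overhead_us
    if total == 0:
        return {name: 0.0 for name in parts}
    return {name: value / total for name, value in parts.items()}


def dominant_component(trace: StepTrace) -> str:
    """Name of the component with the largest share of host overhead."""
    shares = overhead_shares(trace)
    return max(shares, key=shares.get)


def balance_index(trace: StepTrace) -> float:
    """ASSUMED stand-in for a bounded host/device index in [0, 1].

    Values near 1 suggest device-bound execution, values near 0 suggest
    host-bound execution. This is NOT the paper's HDBI formula.
    """
    denom = trace.device_active_us + trace.host_overhead_us
    return trace.device_active_us / denom if denom > 0 else 0.0


# Made-up timings for one decode step, purely for demonstration.
example = StepTrace(framework_us=30.0, cuda_library_us=10.0,
                    launch_path_us=60.0, device_active_us=100.0)
print(overhead_shares(example))
print(dominant_component(example))
print(balance_index(example))
```

Reporting per-component shares alongside a single summary ratio reflects the abstract's point that an aggregate number alone can hide which layer of the stack is worth optimizing.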

