Fugu-MT Paper Translation (Abstract): ProfInfer: An eBPF-based Fine-Grained LLM Inference Profiler

Paper overview: ProfInfer: An eBPF-based Fine-Grained LLM Inference Profiler

arxiv url: http://arxiv.org/abs/2601.20755v1
Date: Wed, 28 Jan 2026 16:39:38 GMT
Status: Translation complete
Last updated in system: 2026-01-29 15:46:07.021309
Title: ProfInfer: An eBPF-based Fine-Grained LLM Inference Profiler
Title (reference translation): ProfInfer: An eBPF-based fine-grained LLM inference profiler
Authors: Bohua Zou, Debayan Roy, Dhimankumar Yogesh Airao, Weihao Xu, Binqi Sun, Yutao Liu, Haibo Chen
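Abstract summary: The authors develop a fine-grained, non-intrusive profiling framework for modern LLM inference engines. The system attaches probes to runtime functions across multiple layers without modifying or recompiling the source. It transforms the collected traces into rich visualizations of operators, graphs, timelines, and hardware counter trends.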
Reference score (independently computed attention score): 4.191309912359899
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: As large language models (LLMs) move from research to production, understanding how inference engines behave in real time has become both essential and elusive. Unlike general-purpose engines such as ONNX Runtime, today's LLM inference systems offer little operator-level visibility, leaving developers blind to where time and resources go. Even basic questions -- is this workload memory-bound or compute-bound? -- often remain unanswered. To close this gap, we develop a fine-grained, non-intrusive profiling framework for modern LLM inference engines, exemplified by llama.cpp but applicable to similar runtime architectures. Built on extended Berkeley Packet Filter (eBPF) technology, our system dynamically attaches probes to runtime functions across multiple layers -- without modifying or recompiling the source. It transforms collected traces into rich visualizations of operators, graphs, timelines, and hardware counter trends, exposing how dense inference, Mixture-of-Experts routing, and operator offloading behave in practice. With less than 4% runtime overhead and high profiling fidelity, our framework makes LLM inference both transparent and diagnosable, turning performance profiling into a practical tool for optimization, scheduling, and resource-aware deployment.
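Abstract (reference translation): As large language models (LLMs) move from research into production, understanding how inference engines behave in real time has become both essential and hard to achieve. Unlike general-purpose engines such as ONNX Runtime, today's LLM inference systems provide almost no operator-level visibility, so developers lose track of where time and resources go. Even a basic question, such as whether a workload is memory-bound or compute-bound, often goes unanswered. To close this gap, the authors develop a fine-grained, non-intrusive profiling framework for modern LLM inference engines, exemplified by llama.cpp but applicable to similar runtime architectures. Built on extended Berkeley Packet Filter (eBPF) technology, the system dynamically attaches probes to runtime functions across multiple layers without modifying or recompiling the source. By transforming the collected traces into rich visualizations of operators, graphs, timelines, and hardware counter trends, it reveals how dense inference, Mixture-of-Experts routing, and operator offloading behave in practice. With less than 4% runtime overhead and high profiling fidelity, the framework makes LLM inference transparent and diagnosable, and turns performance profiling into a practical tool for optimization, scheduling, and resource-aware deployment.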
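The abstract describes probes on runtime functions whose traces are turned into operator-level views. As a rough illustration of that idea only, the sketch below pairs function entry and exit events into per-operator elapsed time. The event layout `(timestamp_ns, thread_id, op_name, kind)`, the function name `aggregate_operator_time`, and the operator names `mul_mat` and `softmax` are all hypothetical; none of them is taken from the paper or from llama.cpp's actual trace output.

```python
from collections import defaultdict


def aggregate_operator_time(events):
    """Pair entry/exit events per thread and sum elapsed time per operator.

    `events` is an iterable of (timestamp_ns, thread_id, op_name, kind)
    tuples, where kind is "entry" or "exit". This layout is an assumption
    made for illustration, not the paper's trace format.
    """
    stacks = defaultdict(list)  # thread_id -> stack of (op_name, entry_ts)
    totals = defaultdict(int)   # op_name -> total inclusive nanoseconds

    # Process events in timestamp order; sorted() is stable for ties.
    for ts, tid, op, kind in sorted(events, key=lambda e: e[0]):
        if kind == "entry":
            stacks[tid].append((op, ts))
        elif kind == "exit" and stacks[tid] and stacks[tid][-1][0] == op:
            _, start = stacks[tid].pop()
            totals[op] += ts - start
        # Unmatched exits (e.g. a probe attached mid-call) are ignored.
    return dict(totals)


# Hypothetical trace from a single thread.
events = [
    (100, 1, "mul_mat", "entry"),
    (400, 1, "mul_mat", "exit"),
    (450, 1, "softmax", "entry"),
    (500, 1, "softmax", "exit"),
    (520, 1, "mul_mat", "entry"),
    (720, 1, "mul_mat", "exit"),
]
totals = aggregate_operator_time(events)
print(totals)
```

Keeping one stack per thread lets nested calls on the same thread pair up correctly, and the recorded time is inclusive of any nested operators.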

