Fugu-MT Paper Translation (Abstract): The Silent Hyperparameter: Quantifying the Impact of Inference Backends on LLM Reproducibility

Paper summary: The Silent Hyperparameter: Quantifying the Impact of Inference Backends on LLM Reproducibility

arxiv url: http://arxiv.org/abs/2605.19537v2
Date: Wed, 20 May 2026 07:11:41 GMT
Status: Translation complete
Last updated in system: 2026-05-21 14:55:44.401187
Title: The Silent Hyperparameter: Quantifying the Impact of Inference Backends on LLM Reproducibility
Title (reference translation): Silent Hyperparameter: Quantifying the Impact of Inference Backends on LLM Reproducibility
Authors: David Pape, Jonathan Evertz, Lea Schönherr
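Abstract summary: The choice of backend alone is shown to shift benchmark scores by up to 16.6 percentage points. This divergence is driven by system-level optimizations such as prefix caching and CUDA graphs, custom kernels, and engine-specific defaults in logit processing.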
Reference score (independently computed attention score): 4.514361164656055
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Progress in LLMs is increasingly measured through standardized benchmarks, where state-of-the-art improvements are often separated by fractions of a percentage point. At the same time, the computational cost of evaluating modern LLMs has driven widespread adoption of specialized inference backends, software systems that execute trained models efficiently at inference time. While critical for scalability, system-level optimizations, such as custom CUDA kernels and reduced-precision arithmetic, can alter token probabilities and introduce non-determinism, possibly cascading into divergent generation. In this work, we first survey the inference landscape, identifying 200 distinct engines, and analyze 35,000 ML publications, finding that the specific inference stack is rarely reported despite this widespread diversity. We then present a systematic empirical study of how inference backends affect LLM benchmark results. Holding model weights, decoding parameters, and hardware constant, we evaluate five widely used inference engines, including vLLM, SGLang, and llama.cpp, across multiple open-weight models and established benchmarks. We show that the choice of backend alone can shift benchmark scores by up to 16.6 percentage points and induce high rates of output disagreement. By isolating backend optimizations and tracing the execution pipeline, we find this divergence is driven by system-level optimizations like prefix caching and CUDA graphs, custom kernels, and engine-specific defaults in logit processing. Our findings identify the inference backend as a previously unreported but consequential hyperparameter in the evaluation of LLMs and advocate standardized reporting of inference stacks to improve the reproducibility and interpretability of benchmark comparisons.
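Abstract (reference translation): Progress in LLMs is increasingly measured with standardized benchmarks, where state-of-the-art improvements are often separated by fractions of a percentage point. At the same time, the computational cost of evaluating modern LLMs has led to widespread adoption of specialized inference backends, software systems that run trained models efficiently at inference time. Although essential for scalability, system-level optimizations such as custom CUDA kernels and reduced-precision arithmetic can change token probabilities and introduce non-determinism, which may cascade into divergent generations. This work first surveys the inference landscape, identifying 200 distinct engines, and analyzes 35,000 ML publications, showing that the specific inference stack is rarely reported despite this diversity. It then presents a systematic empirical study of how inference backends affect LLM benchmark results. With model weights, decoding parameters, and hardware held constant, five widely used inference engines, including vLLM, SGLang, and llama.cpp, are evaluated across multiple open-weight models and established benchmarks. The choice of backend alone is shown to shift benchmark scores by up to 16.6 percentage points and to induce high rates of output disagreement. Isolating backend optimizations and tracing the execution pipeline shows that this divergence is driven by system-level optimizations such as prefix caching and CUDA graphs, custom kernels, and engine-specific defaults in logit processing. The study identifies the inference backend as a previously unreported but consequential hyperparameter in LLM evaluation and advocates standardized reporting of inference stacks to improve the reproducibility and interpretability of benchmark comparisons.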
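
Illustration: the abstract reports two quantities, the shift in benchmark score between backends (in percentage points) and the rate of output disagreement. The sketch below shows one plausible way to compute both from per-prompt outputs. It is not the authors' evaluation code: the function names, the exact-match scoring rule, the backend labels, and the toy data are all assumptions made for illustration.

```python
from itertools import combinations


def accuracy(outputs, references):
    """Fraction of outputs that exactly match the reference answer (assumed scoring rule)."""
    return sum(o == r for o, r in zip(outputs, references)) / len(references)


def disagreement_rate(outputs_a, outputs_b):
    """Fraction of prompts on which two backends returned different outputs."""
    return sum(a != b for a, b in zip(outputs_a, outputs_b)) / len(outputs_a)


def compare_backends(results, references):
    """Return per-backend accuracy, the max-min score spread in percentage
    points, and the pairwise disagreement rate for every pair of backends."""
    scores = {name: accuracy(out, references) for name, out in results.items()}
    spread_pp = (max(scores.values()) - min(scores.values())) * 100
    pairwise = {
        (a, b): disagreement_rate(results[a], results[b])
        for a, b in combinations(results, 2)
    }
    return scores, spread_pp, pairwise


# Hypothetical toy data: same model, same prompts, three made-up backends.
references = ["A", "B", "C", "D"]
results = {
    "backend_x": ["A", "B", "C", "A"],
    "backend_y": ["A", "B", "D", "D"],
    "backend_z": ["A", "C", "D", "A"],
}

scores, spread_pp, pairwise = compare_backends(results, references)
print(scores)
print(f"score spread: {spread_pp:.1f} percentage points")
print(pairwise)
```

In the toy data, backend_x and backend_y reach the same accuracy while disagreeing on half of the prompts, which is why the two quantities are worth reporting separately.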
