Fugu-MT paper translation (abstract): LLM-42: Enabling Determinism in LLM Inference with Verified Speculation

Paper abstract: LLM-42: Enabling Determinism in LLM Inference with Verified Speculation

arxiv url: http://arxiv.org/abs/2601.17768v1
Date: Sun, 25 Jan 2026 09:58:57 GMT
Status: translation complete
Last updated in system: 2026-01-27 15:23:08.326504
Title: LLM-42: Enabling Determinism in LLM Inference with Verified Speculation
Title (reference translation): LLM-42: Enabling Determinism in LLM Inference with Verified Speculation
Authors: Raja Gond, Aditya K Kamath, Arkaprava Basu, Ramachandran Ramjee, Ashish Panwar
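Abstract summary: In LLM inference, the same prompt may yield different outputs across runs. This non-determinism arises from floating-point non-associativity combined with dynamic batching. The paper presents LLM-42, a scheduling-based approach to enable determinism in inference.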
Reference score (independently computed attention score): 9.210733890540814
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In LLM inference, the same prompt may yield different outputs across different runs. At the system level, this non-determinism arises from floating-point non-associativity combined with dynamic batching and GPU kernels whose reduction orders vary with batch size. A straightforward way to eliminate non-determinism is to disable dynamic batching during inference, but doing so severely degrades throughput. Another approach is to make kernels batch-invariant; however, this tightly couples determinism to kernel design, requiring new implementations. This coupling also imposes fixed runtime overheads, regardless of how much of the workload actually requires determinism. Inspired by ideas from speculative decoding, we present LLM-42, a scheduling-based approach to enable determinism in LLM inference. Our key observation is that if a sequence is in a consistent state, the next emitted token is likely to be consistent even with dynamic batching. Moreover, most GPU kernels use shape-consistent reductions. Leveraging these insights, LLM-42 decodes tokens using a non-deterministic fast path and enforces determinism via a lightweight verify-rollback loop. The verifier replays candidate tokens under a fixed-shape reduction schedule, commits those that are guaranteed to be consistent across runs, and rolls back those violating determinism. LLM-42 mostly re-uses existing kernels unchanged and incurs overhead only in proportion to the traffic that requires determinism.
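The abstract attributes non-determinism to floating-point non-associativity combined with reduction orders that vary with batch size. The following is a minimal Python sketch of that effect only, not code from the paper: `sequential_sum` and `chunked_sum` are hypothetical helpers, and the chunk size stands in for a kernel's batch-dependent reduction schedule. An explicit loop is used instead of the built-in `sum` so that the summation order is exactly what the code shows.

```python
def sequential_sum(values):
    """Add values strictly left to right in double precision."""
    total = 0.0
    for v in values:
        total += v
    return total


def chunked_sum(values, chunk_size):
    """Reduce fixed-size chunks first, then add the partial sums.

    The chunk size is a stand-in (an assumption of this sketch) for a
    reduction schedule that changes with batch size.
    """
    partials = [
        sequential_sum(values[i:i + chunk_size])
        for i in range(0, len(values), chunk_size)
    ]
    return sequential_sum(partials)


# Same numbers, different grouping, different rounding.
values = [1e16, 1.0, -1e16, 1.0]
result_small_chunks = chunked_sum(values, 2)
result_one_chunk = chunked_sum(values, 4)

print(result_small_chunks == result_one_chunk)   # False: the grouping changed the result
print(chunked_sum(values, 4) == result_one_chunk)  # True: a fixed grouping is repeatable
```

The second comparison is the point the fixed-shape verifier relies on: once the reduction schedule is held constant, repeated runs agree bit for bit.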
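Abstract (reference translation): In LLM inference, the same prompt may yield different outputs across runs. At the system level, this non-determinism arises from floating-point non-associativity combined with dynamic batching and GPU kernels whose reduction orders vary with batch size. A simple way to eliminate non-determinism is to disable dynamic batching during inference, but this severely degrades throughput. Another approach is to make kernels batch-invariant, but this tightly couples determinism to kernel design and requires new implementations. The coupling also imposes fixed runtime overheads regardless of how much of the workload actually requires determinism. Inspired by speculative decoding, the authors present LLM-42, a scheduling-based approach to enable determinism in LLM inference. Their key observation is that if a sequence is in a consistent state, the next emitted token is likely to be consistent even under dynamic batching. Moreover, most GPU kernels use shape-consistent reductions. Leveraging these insights, LLM-42 decodes tokens on a non-deterministic fast path and enforces determinism through a lightweight verify-rollback loop. The verifier replays candidate tokens under a fixed-shape reduction schedule, commits those guaranteed to be consistent across runs, and rolls back those that violate determinism. LLM-42 mostly reuses existing kernels unchanged and incurs overhead only in proportion to the traffic that requires determinism.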
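The verify-rollback loop described in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: `reference_next`, `fast_next`, `decode`, the window size, and the flip probability are all hypothetical stand-ins, and using the verifier's token at the first mismatch is an assumption borrowed from speculative decoding. The sketch shows only the control flow: speculate on a non-deterministic fast path, replay under a deterministic reference, commit what agrees, and discard the rest.

```python
import random

VOCAB_SIZE = 50  # toy vocabulary size (assumption)


def reference_next(prefix):
    """Deterministic stand-in for decoding under a fixed reduction schedule."""
    state = 17
    for tok in prefix:
        state = (state * 31 + tok) % 1_000_003
    return state % VOCAB_SIZE


def fast_next(prefix, rng, flip_prob=0.1):
    """Stand-in for the fast path: usually agrees with the reference, sometimes drifts."""
    tok = reference_next(prefix)
    if rng.random() < flip_prob:
        return (tok + 1) % VOCAB_SIZE
    return tok


def decode(prompt, num_tokens, rng, window=4):
    """Return (generated tokens, rollback count) using speculate-then-verify."""
    committed = list(prompt)
    target_len = len(prompt) + num_tokens
    rollbacks = 0
    while len(committed) < target_len:
        # Fast path: speculate up to `window` candidate tokens.
        budget = min(window, target_len - len(committed))
        candidates = []
        for _ in range(budget):
            candidates.append(fast_next(committed + candidates, rng))
        # Verify: replay each candidate against the deterministic reference.
        for cand in candidates:
            expected = reference_next(committed)
            committed.append(expected)  # equals cand whenever they agree
            if cand != expected:
                rollbacks += 1
                break  # roll back: discard the remaining candidates
    return committed[len(prompt):], rollbacks


out_a, rollbacks_a = decode([1, 2, 3], 32, random.Random(0))
out_b, rollbacks_b = decode([1, 2, 3], 32, random.Random(12345))
print(out_a == out_b)  # True: committed output does not depend on fast-path noise
```

In this toy, every committed token comes from the deterministic reference, so the output is identical across seeds while the rollback count may differ. The cost of verification scales with how often the fast path is checked, which mirrors the abstract's claim that overhead is proportional to the traffic requiring determinism.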

