Fugu-MT Paper Translation (Abstract): Give Me FP32 or Give Me Death? Challenges and Solutions for Reproducible Reasoning

Paper overview: Give Me FP32 or Give Me Death? Challenges and Solutions for Reproducible Reasoning

arxiv url: http://arxiv.org/abs/2506.09501v1
Date: Wed, 11 Jun 2025 08:23:53 GMT
Status: Translation complete
Last updated in system: 2025-06-13 06:35:02.743672
Title: Give Me FP32 or Give Me Death? Challenges and Solutions for Reproducible Reasoning
Title (reference translation): Give Me FP32 or Give Me Death? Challenges and Solutions for Reproducible Reasoning
Authors: Jiayi Yuan, Hao Li, Xinheng Ding, Wenya Xie, Yu-Jhe Li, Wentian Zhao, Kun Wan, Jing Shi, Xia Hu, Zirui Liu
Abstract summary: This work presents the first systematic investigation into how numerical precision affects reproducibility in LLM inference. We develop LayerCast, a lightweight inference pipeline that stores weights in 16-bit precision but performs all computations in FP32.
Reference score (independently computed attention score): 54.970571745690634
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) are now integral across various domains and have demonstrated impressive performance. Progress, however, rests on the premise that benchmark scores are both accurate and reproducible. We demonstrate that the reproducibility of LLM performance is fragile: changing system configuration, such as evaluation batch size, GPU count, and GPU version, can introduce significant differences in the generated responses. This issue is especially pronounced in reasoning models, where minor rounding differences in early tokens can cascade into divergent chains of thought, ultimately affecting accuracy. For instance, under bfloat16 precision with greedy decoding, a reasoning model like DeepSeek-R1-Distill-Qwen-7B can exhibit up to 9% variation in accuracy and a 9,000-token difference in response length due to differences in GPU count, type, and evaluation batch size. We trace the root cause of this variability to the non-associative nature of floating-point arithmetic under limited numerical precision. This work presents the first systematic investigation into how numerical precision affects reproducibility in LLM inference. Through carefully controlled experiments across various hardware, software, and precision settings, we quantify when and how model outputs diverge. Our analysis reveals that floating-point precision -- while critical for reproducibility -- is often neglected in evaluation practices. Inspired by this, we develop a lightweight inference pipeline, dubbed LayerCast, that stores weights in 16-bit precision but performs all computations in FP32, balancing memory efficiency with numerical stability. Code is available at https://github.com/nanomaoli/llm_reproducibility.
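The non-associativity named in the abstract as the root cause can be seen in a few lines of Python. The sketch below is an illustration only, not code from the paper: `to_bf16` and `bf16_sum` are hypothetical helpers that emulate bfloat16 rounding (round-to-nearest-even) using the standard library, and the inputs are chosen by hand. It assumes that a change in batch size or GPU count can change the order in which partial sums are accumulated, which is one plausible way the same model produces different values.

```python
import struct


def to_bf16(x: float) -> float:
    """Round x to the nearest bfloat16 value (ties to even); hypothetical helper."""
    # Reinterpret the float32 encoding of x as a 32-bit unsigned integer.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # bfloat16 keeps the top 16 bits; add a rounding bias before dropping the rest.
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def bf16_sum(values):
    """Left-to-right sum whose accumulator is rounded to bfloat16 after every add."""
    acc = 0.0
    for v in values:
        acc = to_bf16(acc + to_bf16(v))
    return acc


# Even in float64, regrouping the same three numbers changes the result.
print((0.1 + 0.2) + 0.3)  # 0.6000000000000001
print(0.1 + (0.2 + 0.3))  # 0.6

# Under bfloat16 the effect is much larger: near 256 the spacing between
# representable values is 2, so adding 1.0 to 256.0 rounds back to 256.0.
print(bf16_sum([256.0, 1.0, 1.0]))  # 256.0
print(bf16_sum([1.0, 1.0, 256.0]))  # 258.0
```

The same three values sum to 256.0 or 258.0 depending only on order. In a reasoning model, a discrepancy of this kind in an early logit could in principle flip a greedy token choice, after which the generated chains of thought diverge.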
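Abstract (reference translation): Large language models (LLMs) are now integrated across a wide range of domains and have shown impressive performance. Progress, however, rests on the premise that benchmark scores are both accurate and reproducible. Changes in system configuration, such as evaluation batch size, GPU count, and GPU version, can cause significant differences in the generated responses. This problem is especially pronounced in reasoning models, where small rounding differences in early tokens can cascade into divergent chains of thought and ultimately affect accuracy. For example, under bfloat16 precision with greedy decoding, a reasoning model such as DeepSeek-R1-Distill-Qwen-7B can show up to 9% variation in accuracy and a difference of 9,000 tokens in response length, depending on GPU count, GPU type, and evaluation batch size. The root cause of this variability is traced to the non-associativity of floating-point arithmetic under limited numerical precision. This work is the first systematic investigation of how numerical precision affects reproducibility in LLM inference. Carefully controlled experiments across a range of hardware, software, and precision settings quantify when and how model outputs diverge. The analysis shows that floating-point precision, although critical for reproducibility, is often neglected in evaluation practice. LayerCast, the resulting lightweight inference pipeline, stores weights in 16-bit precision but performs all computations in FP32, balancing memory efficiency with numerical stability. Code is available at https://github.com/nanomaoli/llm_reproducibility.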
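The LayerCast idea described in the abstract (16-bit weight storage, FP32 computation) can be sketched in plain Python. This is a toy under stated assumptions and not the authors' implementation, which is in the linked repository and presumably operates on GPU tensors. The class `CastLinear` and the helpers `f32`, `bf16_bits`, and `bf16_to_f32` are hypothetical names introduced here; float32 arithmetic is emulated by rounding each intermediate result with `struct`.

```python
import struct


def f32(x: float) -> float:
    """Round a Python float to the nearest float32 value (emulates FP32 math)."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def bf16_bits(x: float) -> int:
    """Encode x as a 16-bit bfloat16 pattern (round-to-nearest-even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return ((bits + 0x7FFF + ((bits >> 16) & 1)) >> 16) & 0xFFFF


def bf16_to_f32(b: int) -> float:
    """Upcast a stored bfloat16 pattern to float32; this conversion is exact."""
    return struct.unpack("<f", struct.pack("<I", b << 16))[0]


class CastLinear:
    """Hypothetical linear layer: 16-bit weight storage, FP32 computation."""

    def __init__(self, weights):
        # Only the 16-bit patterns are kept, which is where the memory saving comes from.
        self.w_bits = [[bf16_bits(w) for w in row] for row in weights]

    def forward(self, x):
        out = []
        for row in self.w_bits:
            acc = 0.0
            for b, xi in zip(row, x):
                w = bf16_to_f32(b)  # just-in-time upcast of one weight
                acc = f32(acc + f32(w * f32(xi)))  # multiply and accumulate in FP32
            out.append(acc)
        return out


layer = CastLinear([[256.0, 1.0, 1.0]])
print(layer.forward([1.0, 1.0, 1.0]))  # [258.0]
```

Because the accumulator is float32, the dot product above returns 258.0 in this order; a bfloat16 accumulator summing the same terms in the same order would stall at 256.0. The weights still occupy 16 bits each, which is the memory-versus-stability trade-off the abstract describes.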
