Give Me FP32 or Give Me Death? Challenges and Solutions for Reproducible Reasoning
- URL: http://arxiv.org/abs/2506.09501v1
- Date: Wed, 11 Jun 2025 08:23:53 GMT
- Title: Give Me FP32 or Give Me Death? Challenges and Solutions for Reproducible Reasoning
- Authors: Jiayi Yuan, Hao Li, Xinheng Ding, Wenya Xie, Yu-Jhe Li, Wentian Zhao, Kun Wan, Jing Shi, Xia Hu, Zirui Liu
- Abstract summary: This work presents the first systematic investigation into how numerical precision affects Large Language Model inference. Inspired by this, we develop a lightweight inference pipeline, dubbed LayerCast, that stores weights in 16-bit precision but performs all computations in FP32.
- Score: 54.970571745690634
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) are now integral across various domains and have demonstrated impressive performance. Progress, however, rests on the premise that benchmark scores are both accurate and reproducible. We demonstrate that the reproducibility of LLM performance is fragile: changing system configuration such as evaluation batch size, GPU count, and GPU version can introduce significant differences in the generated responses. This issue is especially pronounced in reasoning models, where minor rounding differences in early tokens can cascade into divergent chains of thought, ultimately affecting accuracy. For instance, under bfloat16 precision with greedy decoding, a reasoning model like DeepSeek-R1-Distill-Qwen-7B can exhibit up to 9% variation in accuracy and 9,000 tokens difference in response length due to differences in GPU count, type, and evaluation batch size. We trace the root cause of this variability to the non-associative nature of floating-point arithmetic under limited numerical precision. This work presents the first systematic investigation into how numerical precision affects reproducibility in LLM inference. Through carefully controlled experiments across various hardware, software, and precision settings, we quantify when and how model outputs diverge. Our analysis reveals that floating-point precision -- while critical for reproducibility -- is often neglected in evaluation practices. Inspired by this, we develop a lightweight inference pipeline, dubbed LayerCast, that stores weights in 16-bit precision but performs all computations in FP32, balancing memory efficiency with numerical stability. Code is available at https://github.com/nanomaoli/llm_reproducibility.
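The two ideas in the abstract can be made concrete with a short PyTorch sketch: floating-point addition at bfloat16 precision is not associative, so a different reduction order (as induced by batch size or GPU count) can change results, and LayerCast mitigates this by keeping weights in 16-bit storage while running the actual matmuls in FP32. The snippet below is an illustrative approximation, not the authors' released implementation (see the linked repository for the official code); the `LayerCastLinear` class and all tensor shapes are made up for demonstration.

```python
import torch

# (1) Non-associativity under bfloat16: the same numbers summed in a different
# order give different results, because each intermediate sum is rounded to
# bfloat16 (the spacing around 256 is 2, so 256 + 1 rounds back to 256).
a = torch.tensor(256.0, dtype=torch.bfloat16)
b = torch.tensor(1.0, dtype=torch.bfloat16)
print(((a + b) + b).item())  # 256.0
print((a + (b + b)).item())  # 258.0 -> the order of accumulation matters

# (2) LayerCast-style linear layer (hypothetical sketch): weights stay in
# bfloat16 to save memory, but are upcast just in time so the matmul and its
# accumulation run in FP32.
class LayerCastLinear(torch.nn.Module):
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.weight = torch.nn.Parameter(
            torch.randn(out_features, in_features, dtype=torch.bfloat16))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x.float() @ self.weight.float().t()  # all arithmetic in FP32

layer = LayerCastLinear(in_features=4096, out_features=4096)
y = layer(torch.randn(2, 4096, dtype=torch.bfloat16))
print(y.dtype)  # torch.float32
```

Storing the parameters in 16-bit halves weight memory relative to FP32, while the just-in-time cast makes the accumulation far less sensitive to reduction order, at the cost of FP32 compute.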
Related papers
- Dual Precision Quantization for Efficient and Accurate Deep Neural Networks Inference [3.7687375904925484]
We propose a novel hardware-efficient quantization and inference scheme that exploits hardware advantages with minimal accuracy degradation.
We develop a novel quantization algorithm, dubbed Dual Precision Quantization (DPQ), that leverages the unique structure of our scheme without introducing additional inference overhead.
arXiv Detail & Related papers (2025-05-20T17:26:12Z)
- Accelerated zero-order SGD under high-order smoothness and overparameterized regime [79.85163929026146]
We present a novel gradient-free algorithm to solve convex optimization problems.
Such problems are encountered in medicine, physics, and machine learning.
We provide convergence guarantees for the proposed algorithm under both types of noise.
arXiv Detail & Related papers (2024-11-21T10:26:17Z)
- Impacts of floating-point non-associativity on reproducibility for HPC and deep learning applications [0.0]
Run-to-run variability in parallel programs caused by floating-point non-associativity has been known to significantly affect algorithms.
We investigate the statistical properties of floating-point non-associativity within modern parallel programming models.
We examine the recently-added deterministic options in PyTorch within the context of GPU deployment for deep learning.
arXiv Detail & Related papers (2024-08-09T16:07:37Z)
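The deterministic options referred to in the entry above boil down to a few PyTorch switches. A minimal sketch (not taken from that paper, and dependent on the PyTorch/CUDA version) might look like:

```python
import os
import torch

# Must be set before any CUDA work so cuBLAS uses a fixed, deterministic workspace.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

torch.manual_seed(0)                        # fix RNG state
torch.use_deterministic_algorithms(True)    # raise an error if a nondeterministic op is hit
torch.backends.cudnn.benchmark = False      # disable autotuning, which may pick different kernels
torch.backends.cudnn.deterministic = True   # restrict cuDNN to deterministic algorithms
```

Note that these flags remove run-to-run nondeterminism on a fixed setup but do not make results identical across different GPUs, GPU counts, or batch sizes, which is the gap the main paper above targets.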
- Bridging Precision and Confidence: A Train-Time Loss for Calibrating Object Detection [58.789823426981044]
We propose a novel auxiliary loss formulation that aims to align the class confidence of bounding boxes with the accurateness of predictions.
Our results reveal that our train-time loss surpasses strong calibration baselines in reducing calibration error for both in and out-domain scenarios.
arXiv Detail & Related papers (2023-03-25T08:56:21Z)
- Fixflow: A Framework to Evaluate Fixed-point Arithmetic in Light-Weight CNN Inference [0.0]
Convolutional neural networks (CNNs) are widely used in resource-constrained devices in IoT applications.
This paper investigates the effect of different fixed-point hardware units on CNN inference accuracy.
arXiv Detail & Related papers (2023-02-19T12:48:52Z)
- Numerical Stability of DeepGOPlus Inference [1.5361702135159845]
Convolutional neural networks (CNNs) are currently among the most widely-used deep neural network (DNN) architectures.
Recent works have highlighted numerical stability challenges in DNNs, which also relates to their known sensitivity to noise injection.
This paper investigates DeepGOPlus, a CNN that predicts protein function.
arXiv Detail & Related papers (2022-12-13T03:52:14Z)
- LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale [80.86029795281922]
We develop a procedure for Int8 matrix multiplication for feed-forward and attention projection layers in transformers.
A 175B parameter 16/32-bit checkpoint can be loaded, converted to Int8, and used immediately without performance degradation.
arXiv Detail & Related papers (2022-08-15T17:08:50Z)
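For context, the LLM.int8() scheme above is commonly used through the bitsandbytes integration in Hugging Face Transformers. The sketch below is not code from that paper; the checkpoint name is only a placeholder, and recent versions of transformers, accelerate, and bitsandbytes are assumed.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "facebook/opt-1.3b"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # Int8 weight quantization
    device_map="auto",  # place layers on the available GPUs/CPU
)

inputs = tokenizer("Floating-point arithmetic is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```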
- I-BERT: Integer-only BERT Quantization [78.43819756382103]
We propose I-BERT, a novel quantization scheme for Transformer based models.
I-BERT performs an end-to-end integer-only BERT inference without any floating point calculation.
We show that for both cases, I-BERT achieves similar (and slightly higher) accuracy as compared to the full-precision baseline.
arXiv Detail & Related papers (2021-01-05T02:42:58Z)
- AQD: Towards Accurate Fully-Quantized Object Detection [94.06347866374927]
We propose an Accurate Quantized object Detection solution, termed AQD, to get rid of floating-point computation.
Our AQD achieves comparable or even better performance compared with the full-precision counterpart under extremely low-bit schemes.
arXiv Detail & Related papers (2020-07-14T09:07:29Z)
- A Framework for Semi-Automatic Precision and Accuracy Analysis for Fast and Rigorous Deep Learning [1.5863809575305419]
Many papers experimentally observe that Deep Neural Networks (DNNs) can successfully run at almost ridiculously low precision.
This paper sheds some theoretical light upon why a DNN's FP accuracy stays high for low FP precision.
We present a software framework for FP error analysis for the inference phase of deep-learning.
arXiv Detail & Related papers (2020-02-10T15:33:19Z)