Related papers: Deterministic Inference across Tensor Parallel Sizes That Eliminates Training-Inference Mismatch

Deterministic Inference across Tensor Parallel Sizes That Eliminates Training-Inference Mismatch

URL: http://arxiv.org/abs/2511.17826v1
Date: Fri, 21 Nov 2025 22:40:00 GMT
Title: Deterministic Inference across Tensor Parallel Sizes That Eliminates Training-Inference Mismatch
Authors: Ziyang Zhang, Xinheng Ding, Jiayi Yuan, Rixin Liu, Huizi Mao, Jiarong Xing, Zirui Liu,
Abstract summary: Existing LLM serving frameworks exhibit non-deterministic behavior.<n>This arises from the non-associativity of floating-point arithmetic.<n>We propose Tree-Based Invariant Kernels (TBIK), a set of TP-invariant matrix multiplication and reduction primitives.
Score: 21.951981326540878
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Deterministic inference is increasingly critical for large language model (LLM) applications such as LLM-as-a-judge evaluation, multi-agent systems, and Reinforcement Learning (RL). However, existing LLM serving frameworks exhibit non-deterministic behavior: identical inputs can yield different outputs when system configurations (e.g., tensor parallel (TP) size, batch size) vary, even under greedy decoding. This arises from the non-associativity of floating-point arithmetic and inconsistent reduction orders across GPUs. While prior work has addressed batch-size-related nondeterminism through batch-invariant kernels, determinism across different TP sizes remains an open problem, particularly in RL settings, where the training engine typically uses Fully Sharded Data Parallel (i.e., TP = 1) while the rollout engine relies on multi-GPU TP to maximize the inference throughput, creating a natural mismatch between the two. This precision mismatch problem may lead to suboptimal performance or even collapse for RL training. We identify and analyze the root causes of TP-induced inconsistency and propose Tree-Based Invariant Kernels (TBIK), a set of TP-invariant matrix multiplication and reduction primitives that guarantee bit-wise identical results regardless of TP size. Our key insight is to align intra- and inter-GPU reduction orders through a unified hierarchical binary tree structure. We implement these kernels in Triton and integrate them into vLLM and FSDP. Experiments confirm zero probability divergence and bit-wise reproducibility for deterministic inference across different TP sizes. Also, we achieve bit-wise identical results between vLLM and FSDP in RL training pipelines with different parallel strategy. Code is available at https://github.com/nanomaoli/llm_reproducibility.

Related papers

$V_1$: Unifying Generation and Self-Verification for Parallel Reasoners [69.66089681814013]
$V_$ is a framework that unifies generation and verification through efficient pairwise ranking.<n>$V_$-Infer improves Pass@1 by up to $10%$ over pointwise verification.<n>$V_$-PairRL achieves $7$--$9%$ test-time scaling gains over standard RL and pointwise joint training.
arXiv Detail & Related papers (2026-03-04T17:22:16Z)
LLM-42: Enabling Determinism in LLM Inference with Verified Speculation [9.080984801328606]
In LLM inference, the same prompt may yield different outputs across different runs.<n>This nondeterminism arises from floating-point non-associativity combined with dynamic tokens.<n>We present LLM-42, a scheduling-based approach to enable determinism in inference.
arXiv Detail & Related papers (2026-01-25T09:58:57Z)
DeepPrune: Parallel Scaling without Inter-trace Redundancy [53.62015294143274]
Over 80% of parallel reasoning traces yield identical final answers, representing substantial wasted computation.<n>We propose DeepPrune, a novel framework that enables efficient parallel scaling through dynamic pruning.<n>Our work establishes a new standard for efficient parallel reasoning, making high-performance reasoning more efficient.
arXiv Detail & Related papers (2025-10-09T17:24:54Z)
Rethinking Thinking Tokens: LLMs as Improvement Operators [80.12087211785949]
Reasoning training incentivizes LLMs to produce long chains of thought (long CoT), which allows them to explore solution strategies with self-checking.<n>This results in higher accuracy, but inflates context length, token/compute cost, and answer latency.<n>We ask: Can current models leverage their metacognition to provide other combinations on this Pareto frontier?<n>We identify an interesting inference family Parallel-Distill-Refine (PDR), which performs the following: (i) generate diverse drafts in parallel; (ii) distill them into a bounded, textual workspace; and (iii) refine conditioned on this workspace
arXiv Detail & Related papers (2025-10-01T17:08:59Z)
PT$^2$-LLM: Post-Training Ternarization for Large Language Models [52.4629647715623]
Large Language Models (LLMs) have shown impressive capabilities across diverse tasks, but their large memory and compute demands hinder deployment.<n>We propose PT$2$-LLM, a post-training ternarization framework tailored for LLMs.<n>At its core is an Asymmetric Ternary Quantizer equipped with a two-stage refinement pipeline.
arXiv Detail & Related papers (2025-09-27T03:01:48Z)
Understanding and Mitigating Numerical Sources of Nondeterminism in LLM Inference [31.2331188304598]
Changing system configuration, such as evaluation batch size, GPU count, and GPU version, can introduce significant differences in the generated responses.<n>We trace the root cause of this variability to the non-associative nature of floating-point arithmetic under limited numerical precision.<n>Inspired by this, we develop a lightweight inference pipeline, dubbed LayerCast, that stores weights in 16-bit precision but performs all computations in FP32.
arXiv Detail & Related papers (2025-06-11T08:23:53Z)
Robustness of deep learning classification to adversarial input on GPUs: asynchronous parallel accumulation is a source of vulnerability [4.054484966653432]
A key measure of machine learning (ML) classification models' safety and reliability is their ability to resist small, targeted input perturbations.<n>We show that floating-point non-associativity coupled with asynchronous parallel programming on GPU is sufficient to result in misclassification.<n>We also show that standard adversarial robustness results may be overestimated up to 4.6 when not considering machine-level details.
arXiv Detail & Related papers (2025-03-21T14:19:45Z)
RTP: Rethinking Tensor Parallelism with Memory Deduplication [3.036340414461332]
Rotated Parallelism (RTP) is an innovative approach that focuses on memory deduplication in distributed training environments. Our empirical evaluations underscore RTP's efficiency, revealing that its memory consumption during distributed system training is remarkably close to the optimal.
arXiv Detail & Related papers (2023-11-02T23:12:42Z)
Distributed Extra-gradient with Optimal Complexity and Communication Guarantees [60.571030754252824]
We consider monotone variational inequality (VI) problems in multi-GPU settings where multiple processors/workers/clients have access to local dual vectors. Extra-gradient, which is a de facto algorithm for monotone VI problems, has not been designed to be communication-efficient. We propose a quantized generalized extra-gradient (Q-GenX), which is an unbiased and adaptive compression method tailored to solve VIs.
arXiv Detail & Related papers (2023-08-17T21:15:04Z)
Tensor Slicing and Optimization for Multicore NPUs [2.670309629218727]
This paper proposes a compiler optimization pass for Multicore NPUs, called Slicing Optimization (TSO) TSO identifies the best tensor slicing that minimizes execution time for a set of CNN models. Results show that TSO is capable of identifying the best tensor slicing that minimizes execution time for a set of CNN models.
arXiv Detail & Related papers (2023-04-06T12:03:03Z)
Scaling Structured Inference with Randomization [64.18063627155128]
We propose a family of dynamic programming (RDP) randomized for scaling structured models to tens of thousands of latent states. Our method is widely applicable to classical DP-based inference. It is also compatible with automatic differentiation so can be integrated with neural networks seamlessly.
arXiv Detail & Related papers (2021-12-07T11:26:41Z)

This list is automatically generated from the titles and abstracts of the papers in this site.