LLM-42: Enabling Determinism in LLM Inference with Verified Speculation
- URL: http://arxiv.org/abs/2601.17768v2
- Date: Fri, 30 Jan 2026 17:59:09 GMT
- Title: LLM-42: Enabling Determinism in LLM Inference with Verified Speculation
- Authors: Raja Gond, Aditya K Kamath, Ramachandran Ramjee, Ashish Panwar,
- Abstract summary: In LLM inference, the same prompt may yield different outputs across different runs. This nondeterminism arises from floating-point non-associativity combined with dynamic batching. We present LLM-42, a scheduling-based approach to enable determinism in inference.
- Score: 9.080984801328606
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In LLM inference, the same prompt may yield different outputs across different runs. At the system level, this non-determinism arises from floating-point non-associativity combined with dynamic batching and GPU kernels whose reduction orders vary with batch size. A straightforward way to eliminate non-determinism is to disable dynamic batching during inference, but doing so severely degrades throughput. Another approach is to make kernels batch-invariant; however, this tightly couples determinism to kernel design, requiring new implementations. This coupling also imposes fixed runtime overheads, regardless of how much of the workload actually requires determinism. Inspired by ideas from speculative decoding, we present LLM-42, a scheduling-based approach to enable determinism in LLM inference. Our key observation is that if a sequence is in a consistent state, the next emitted token is likely to be consistent even with dynamic batching. Moreover, most GPU kernels use shape-consistent reductions. Leveraging these insights, LLM-42 decodes tokens using a non-deterministic fast path and enforces determinism via a lightweight verify-rollback loop. The verifier replays candidate tokens under a fixed-shape reduction schedule, commits those that are guaranteed to be consistent across runs, and rolls back those violating determinism. LLM-42 mostly re-uses existing kernels unchanged and incurs overhead only in proportion to the traffic that requires determinism.
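The verify-rollback loop described in the abstract can be sketched in miniature. This is a toy simulation, not LLM-42's actual API: `base_token`, `fast_path_decode`, and `verify_and_commit` are hypothetical stand-ins, with random perturbation modeling the occasional token that dynamic batching renders inconsistent.

```python
import random

def base_token(prompt: str, i: int) -> int:
    # Deterministic toy "token" standing in for the model's output under
    # a fixed-shape reduction schedule (always the same across runs).
    return (sum(ord(c) for c in prompt) * 31 + i * 7) % 100

def fast_path_decode(prompt: str, n: int, seed: int = 0) -> list[int]:
    # Non-deterministic fast path: dynamic batching may occasionally
    # perturb a token, modeled here as a rare random flip.
    rng = random.Random(seed)
    return [base_token(prompt, i) + (1 if rng.random() < 0.15 else 0)
            for i in range(n)]

def verify_and_commit(prompt: str, candidates: list[int]) -> list[int]:
    # Verifier: replay each candidate under the deterministic schedule,
    # commit the matching prefix, and roll back from the first mismatch.
    committed = []
    for i, cand in enumerate(candidates):
        if cand != base_token(prompt, i):
            break  # rollback: discard this token and everything after it
        committed.append(cand)
    return committed

candidates = fast_path_decode("hello", 16)
committed = verify_and_commit("hello", candidates)
print(f"committed {len(committed)} of {len(candidates)} candidate tokens")
```

In the real system, decoding would resume from the rollback point rather than stop; the invariant the sketch illustrates is that every committed token matches the fixed-schedule replay, so any two runs emit identical committed prefixes.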
Related papers
- $\nabla$-Reasoner: LLM Reasoning via Test-Time Gradient Descent in Latent Space [71.23672814629448]
$\nabla$-Reasoner is an iterative generation framework that integrates differentiable optimization over token logits into the decoding loop. $\nabla$-Reasoner achieves over 20% accuracy improvement on a challenging mathematical reasoning benchmark.
arXiv Detail & Related papers (2026-03-05T08:42:54Z) - Stochastic CHAOS: Why Deterministic Inference Kills, and Distributional Variability Is the Heartbeat of Artificial Cognition [14.945980804235885]
We argue that, for LLMs, deterministic inference kills. It kills the ability to model uncertainty, suppresses emergent abilities, collapses reasoning into a single brittle path, and weakens safety alignment by hiding tail risks.
arXiv Detail & Related papers (2026-01-12T06:19:09Z) - Deterministic Inference across Tensor Parallel Sizes That Eliminates Training-Inference Mismatch [21.951981326540878]
Existing LLM serving frameworks exhibit non-deterministic behavior. This arises from the non-associativity of floating-point arithmetic. We propose Tree-Based Invariant Kernels (TBIK), a set of TP-invariant matrix multiplication and reduction primitives.
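The floating-point non-associativity that both this paper and LLM-42 point to is easy to demonstrate: because every operation rounds its result, the grouping of a reduction changes the answer, which is why kernels whose reduction order varies with batch size or tensor-parallel degree produce different outputs.

```python
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c    # one reduction order
right = a + (b + c)   # another reduction order

print(left)           # 0.6000000000000001
print(right)          # 0.6
print(left == right)  # False: same operands, different grouping
```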
arXiv Detail & Related papers (2025-11-21T22:40:00Z) - Reasoning with Confidence: Efficient Verification of LLM Reasoning Steps via Uncertainty Heads [104.9566359759396]
We propose a lightweight alternative for step-level reasoning verification based on data-driven uncertainty scores. Our findings suggest that the internal states of LLMs encode their uncertainty and can serve as reliable signals for reasoning verification.
arXiv Detail & Related papers (2025-11-09T03:38:29Z) - R-Stitch: Dynamic Trajectory Stitching for Efficient Reasoning [80.104336426172]
Chain-of-thought (CoT) enhances the problem-solving ability of large language models. CoT incurs substantial inference cost due to long autoregressive trajectories. We introduce R-Stitch, a training-free hybrid decoding framework.
arXiv Detail & Related papers (2025-07-23T08:14:36Z) - Overclocking LLM Reasoning: Monitoring and Controlling Thinking Path Lengths in LLMs [52.663816303997194]
A key factor influencing answer quality is the length of the thinking stage. This paper explores and exploits the mechanisms by which LLMs understand and regulate the length of their reasoning. Our results demonstrate that this "overclocking" method mitigates overthinking, improves answer accuracy, and reduces inference latency.
arXiv Detail & Related papers (2025-06-08T17:54:33Z) - DINGO: Constrained Inference for Diffusion LLMs [5.971462597321995]
Diffusion models lack the ability to provably enforce user-specified formal constraints. We propose DINGO, a dynamic programming-based decoding strategy that is both efficient and provably distribution-preserving.
arXiv Detail & Related papers (2025-05-29T04:04:54Z) - SoftCoT: Soft Chain-of-Thought for Efficient Reasoning with LLMs [48.28847964704554]
Chain-of-Thought (CoT) reasoning enables Large Language Models (LLMs) to solve complex reasoning tasks. We propose a novel approach for continuous-space reasoning that does not require modifying the LLM.
arXiv Detail & Related papers (2025-02-17T18:52:29Z) - Efficiently Scaling LLM Reasoning with Certaindex [25.549811985276488]
Test-time reasoning algorithms can wastefully generate many tokens without improving accuracy. We introduce Certaindex, an algorithm-agnostic metric measuring when further computation is unlikely to alter the final result. Certaindex is lightweight, can accelerate reasoning program inference via early exit, and enables dynamic token allocation.
arXiv Detail & Related papers (2024-12-30T14:57:53Z) - COrAL: Order-Agnostic Language Modeling for Efficient Iterative Refinement [80.18490952057125]
Iterative refinement has emerged as an effective paradigm for enhancing the capabilities of large language models (LLMs) on complex tasks.
We propose Context-Wise Order-Agnostic Language Modeling (COrAL) to overcome these challenges.
Our approach models multiple token dependencies within manageable context windows, enabling the model to perform iterative refinement internally.
arXiv Detail & Related papers (2024-10-12T23:56:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.