Related papers: Reliability Under Randomness: An Empirical Analysis of Sparse and Dense Language Models Across Decoding Temperatures

URL: http://arxiv.org/abs/2601.00942v1
Date: Fri, 02 Jan 2026 18:10:10 GMT
Title: Reliability Under Randomness: An Empirical Analysis of Sparse and Dense Language Models Across Decoding Temperatures
Authors: Kabir Grover
Abstract summary: We investigate whether conditional computation in sparse MoE models amplifies decoding-induced randomness, leading to reduced reliability as temperature increases. Results demonstrate that the sparse instruction-tuned model exhibits stability comparable to the dense instruction-tuned model across all decoding temperatures. We discuss the implications of these results for deploying sparse language models in reliability-critical applications.
Score: 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The increasing prevalence of sparse Mixture-of-Experts (MoE) architectures in large language models raises important questions regarding their reliability under stochastic decoding. While conditional computation enables substantial gains in computational efficiency, it remains unclear whether the interaction between sparse routing and temperature-based sampling compromises output stability relative to dense architectures. This work investigates whether conditional computation in MoE models amplifies decoding-induced randomness, leading to reduced reliability as temperature increases. We evaluate three representative models: OLMoE-7B (sparse base), Mixtral-8x7B (sparse instruction-tuned), and Qwen2.5-3B (dense instruction-tuned) on deterministic arithmetic reasoning tasks with objectively verifiable answers. Experiments span four decoding configurations, ranging from greedy decoding to T=1.0. Our evaluation encompasses accuracy, format compliance, output consistency across repeated generations, and confidence metrics, totaling 9,360 model generations. Results demonstrate that the sparse instruction-tuned model exhibits stability comparable to the dense instruction-tuned model across all decoding temperatures, while the sparse base model shows systematic degradation as temperature increases. These findings indicate that instruction tuning, rather than architectural sparsity, is the primary determinant of robustness to decoding randomness on deterministic tasks. We discuss the implications of these results for deploying sparse language models in reliability-critical applications, highlighting scenarios in which sparse architectures can be safely adopted without sacrificing output stability.
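
The evaluation dimensions named in the abstract (accuracy, format compliance, and output consistency across repeated generations at a given temperature) can be illustrated with a minimal Python sketch. This is not the paper's code: the abstract does not give metric definitions, so the `Answer: <integer>` format, the modal-share definition of consistency, and the function names `temperature_softmax` and `score_generations` are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical required output format; the paper's actual format is not stated.
ANSWER_RE = re.compile(r"^Answer:\s*(-?\d+)\s*$")


def temperature_softmax(logits, temperature):
    """Turn logits into sampling probabilities at the given temperature.

    A temperature of 0 is treated as greedy decoding: all probability
    mass goes to the highest-scoring token.
    """
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def score_generations(outputs, gold):
    """Score repeated generations for one prompt against a known answer.

    accuracy:    fraction of outputs whose parsed answer equals `gold`
    compliance:  fraction of outputs matching the assumed answer format
    consistency: share of the most frequent parsed answer (assumed
                 definition; unparseable outputs count as one category)
    """
    if not outputs:
        raise ValueError("outputs must not be empty")
    parsed = []
    for text in outputs:
        match = ANSWER_RE.match(text.strip())
        parsed.append(int(match.group(1)) if match else None)
    n = len(outputs)
    modal_count = Counter(parsed).most_common(1)[0][1]
    return {
        "accuracy": sum(p == gold for p in parsed) / n,
        "compliance": sum(p is not None for p in parsed) / n,
        "consistency": modal_count / n,
    }


if __name__ == "__main__":
    samples = ["Answer: 12", "Answer: 12", "The result is 12", "Answer: 13"]
    print(score_generations(samples, gold=12))
    print(temperature_softmax([2.0, 1.0, 0.0], temperature=1.0))
```

Under these assumed definitions, a model that stays reliable as temperature rises would keep all three scores roughly flat across decoding configurations, while a less robust model would show them falling as the sampling distribution flattens.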

Related papers

Explaining Machine Learning Predictive Models through Conditional Expectation Methods [0.0]
MUCE is a model-agnostic method for local explainability designed to capture prediction changes from feature interactions. Two quantitative indices, stability and uncertainty, summarize local behavior and assess model reliability. Results show that MUCE effectively captures complex local model behavior, while the stability and uncertainty indices provide meaningful insight into prediction confidence.
arXiv Detail & Related papers (2026-01-12T08:34:36Z)
Prompt Stability in Code LLMs: Measuring Sensitivity across Emotion- and Personality-Driven Variations [40.12950482269347]
We present PromptSE, a framework that creates semantically equivalent prompt variants with emotion and personality templates. Our study shows that performance and stability behave as largely decoupled optimization objectives. PromptSE enables practitioners to quantify performance-stability trade-offs for deployment and model selection.
arXiv Detail & Related papers (2025-09-17T04:17:42Z)
Analyzing the Instability of Large Language Models in Automated Bug Injection and Correction [0.0]
Large Language Models (LLMs) are used in software engineering tasks. When executed at different times with the same input, they can generate radically different code. This study examines how unstable an LLM is when it comes to fixing code bugs.
arXiv Detail & Related papers (2025-09-08T08:23:49Z)
Discrete Diffusion Models for Language Generation [0.0]
This thesis investigates the feasibility and performance of discrete diffusion models for natural language generation. We use Bits Per Token (BPT), Negative Log-Likelihood (NLL), Perplexity (PPL), and Batch Processing Speed to assess generative performance. The AR model outperforms in compression with a lower mean BPT of 4.59, but D3PM achieves higher processing speed, reaching up to 3.97 batches per sec.
arXiv Detail & Related papers (2025-07-02T23:43:02Z)
Accelerated Test-Time Scaling with Model-Free Speculative Sampling [58.69141724095398]
We introduce STAND (STochastic Adaptive N-gram Drafting), a novel model-free speculative decoding approach. We show that STAND reduces inference latency by 60-65% compared to standard autoregressive decoding. As a model-free approach, STAND can be applied to any existing language model without additional training.
arXiv Detail & Related papers (2025-06-05T07:31:18Z)
ConDiSim: Conditional Diffusion Models for Simulation Based Inference [2.1493648495606354]
ConDiSim is a conditional diffusion model for simulation-based inference of complex systems with intractable likelihoods. It is evaluated across ten benchmark problems and two real-world test problems, where it demonstrates effective posterior approximation accuracy.
arXiv Detail & Related papers (2025-05-13T09:58:23Z)
Latent Semantic Consensus For Deterministic Geometric Model Fitting [109.44565542031384]
We propose an effective method called Latent Semantic Consensus (LSC). LSC formulates the model fitting problem into two latent semantic spaces based on data points and model hypotheses. LSC is able to provide consistent and reliable solutions within only a few milliseconds for general multi-structural model fitting.
arXiv Detail & Related papers (2024-03-11T05:35:38Z)
Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution [67.9215891673174]
We propose score entropy as a novel loss that naturally extends score matching to discrete spaces. We test our Score Entropy Discrete Diffusion models on standard language modeling tasks.
arXiv Detail & Related papers (2023-10-25T17:59:12Z)
Numerically Stable Sparse Gaussian Processes via Minimum Separation using Cover Trees [57.67528738886731]
We study the numerical stability of scalable sparse approximations based on inducing points. For low-dimensional tasks such as geospatial modeling, we propose an automated method for computing inducing points satisfying these conditions.
arXiv Detail & Related papers (2022-10-14T15:20:17Z)
Robust Implicit Networks via Non-Euclidean Contractions [63.91638306025768]
Implicit neural networks show improved accuracy and significant reduction in memory consumption. They can suffer from ill-posedness and convergence instability. This paper provides a new framework to design well-posed and robust implicit neural networks.
arXiv Detail & Related papers (2021-06-06T18:05:02Z)
On the Discrepancy between Density Estimation and Sequence Generation [92.70116082182076]
Log-likelihood is highly correlated with BLEU when we consider models within the same family. We observe no correlation between rankings of models across different families.
arXiv Detail & Related papers (2020-02-17T20:13:35Z)

This list is automatically generated from the titles and abstracts of the papers on this site.