Related papers: Power-SMC: Low-Latency Sequence-Level Power Sampling for Training-Free LLM Reasoning

URL: http://arxiv.org/abs/2602.10273v1
Date: Tue, 10 Feb 2026 20:31:40 GMT
Title: Power-SMC: Low-Latency Sequence-Level Power Sampling for Training-Free LLM Reasoning
Authors: Seyedarmin Azizi, Erfan Baghaei Potraghloo, Minoo Ahmadi, Souvik Kundu, Massoud Pedram
Abstract summary: We introduce Power-SMC, a training-free Sequential Monte Carlo scheme that targets the same objective while remaining close to standard decoding latency. On MATH500, Power-SMC matches or exceeds MH power sampling while reducing latency from $16$--$28\times$ to $1.4$--$3.3\times$ over baseline decoding.
Score: 11.356198488445488
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Many recent reasoning gains in large language models can be explained as distribution sharpening: biasing generation toward high-likelihood trajectories already supported by the pretrained model, rather than modifying its weights. A natural formalization is the sequence-level power distribution $π_α(y\mid x)\propto p_θ(y\mid x)^α$ ($α>1$), which concentrates mass on whole sequences instead of adjusting token-level temperature. Prior work shows that Metropolis--Hastings (MH) sampling from this distribution recovers strong reasoning performance, but at order-of-magnitude inference slowdowns. We introduce Power-SMC, a training-free Sequential Monte Carlo scheme that targets the same objective while remaining close to standard decoding latency. Power-SMC advances a small particle set in parallel, corrects importance weights token-by-token, and resamples when necessary, all within a single GPU-friendly batched decode. We prove that temperature $τ=1/α$ is the unique prefix-only proposal minimizing incremental weight variance, interpret residual instability via prefix-conditioned Rényi entropies, and introduce an exponent-bridging schedule that improves particle stability without altering the target. On MATH500, Power-SMC matches or exceeds MH power sampling while reducing latency from $16$--$28\times$ to $1.4$--$3.3\times$ over baseline decoding.
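
The particle scheme described in the abstract can be illustrated with a minimal sketch. This is a generic SMC loop under stated assumptions, not the authors' implementation: `next_token_probs` is a hypothetical toy stand-in for the model $p_θ$, sequences have a fixed length, and the resampling rule (multinomial resampling when the effective sample size drops below half the particle count) is a common default rather than something stated in the abstract. With the temperature-$1/α$ proposal $q(v\mid \text{prefix}) \propto p_θ(v\mid \text{prefix})^α$, the incremental weight reduces to the normalizer $Z(\text{prefix}) = \sum_v p_θ(v\mid \text{prefix})^α$, which depends only on the prefix and not on the sampled token.

```python
import numpy as np

VOCAB_SIZE = 4  # hypothetical toy vocabulary size


def next_token_probs(prefix):
    """Hypothetical toy model: a fixed next-token distribution per prefix."""
    seed = sum((tok + 1) * 31 ** i for i, tok in enumerate(prefix)) % (2 ** 32)
    return np.random.default_rng(seed).dirichlet(np.ones(VOCAB_SIZE))


def normalize(log_w):
    """Turn log-weights into normalized weights, stably."""
    w = np.exp(log_w - log_w.max())
    return w / w.sum()


def power_smc(alpha, num_particles, length, rng, ess_threshold=0.5):
    """Sketch of SMC targeting pi(y) proportional to p(y)**alpha."""
    particles = [[] for _ in range(num_particles)]
    log_w = np.zeros(num_particles)
    for _ in range(length):
        for i in range(num_particles):
            p = next_token_probs(particles[i])
            powered = p ** alpha
            z = powered.sum()              # normalizer Z(prefix)
            q = powered / z                # temperature-1/alpha proposal
            tok = int(rng.choice(VOCAB_SIZE, p=q))
            particles[i] = particles[i] + [tok]
            # Incremental weight p**alpha / q equals Z(prefix) for any token.
            log_w[i] += np.log(z)
        w = normalize(log_w)
        ess = 1.0 / np.sum(w ** 2)         # effective sample size
        if ess < ess_threshold * num_particles:
            # Multinomial resampling, then reset to uniform weights.
            idx = rng.choice(num_particles, size=num_particles, p=w)
            particles = [list(particles[j]) for j in idx]
            log_w = np.zeros(num_particles)
    return particles, normalize(log_w)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    particles, weights = power_smc(alpha=4.0, num_particles=8, length=5, rng=rng)
    for seq, w in zip(particles, weights):
        print(seq, round(float(w), 3))
```

In this sketch, setting `alpha=1.0` makes every normalizer equal to 1, so the weights stay uniform and the loop reduces to ordinary ancestral sampling. In a real decoder the inner loop over particles would be a single batched forward pass.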

Related papers

$\nabla$-Reasoner: LLM Reasoning via Test-Time Gradient Descent in Latent Space [71.23672814629448]
$\nabla$-Reasoner is an iterative generation framework that integrates differentiable optimization over token logits into the decoding loop. $\nabla$-Reasoner achieves over 20% accuracy improvement on a challenging mathematical reasoning benchmark.
arXiv Detail & Related papers (2026-03-05T08:42:54Z)
Self-Rewarding Sequential Monte Carlo for Masked Diffusion Language Models [58.946955321428845]
This work presents self-rewarding sequential Monte Carlo (SMC). Our algorithm stems from the observation that most existing MDLMs rely on a confidence-based sampling strategy. We introduce the trajectory-level confidence as a self-rewarding signal for assigning particle importance weights.
arXiv Detail & Related papers (2026-02-02T09:21:45Z)
Synchrony-Gated Plasticity with Dopamine Modulation for Spiking Neural Networks [6.085945372100414]
Dopamine-Modulated Spike-Synchrony-Dependent Plasticity (DA-SSDP) is a synchrony-based rule that is sensitive to loss. DA-SSDP condenses spike patterns into a synchrony metric at the batch level.
arXiv Detail & Related papers (2025-12-08T06:10:44Z)
Robust Layerwise Scaling Rules by Proper Weight Decay Tuning [50.11170157029911]
In modern scale-invariant architectures, training quickly enters a degrading-governed steady state. We introduce a weight-decay scaling rule for AdamW that preserves sublayer gain across widths. Our results extend $\mu$P beyond the near-init regime by explicitly controlling the steady-state scales set by parameters.
arXiv Detail & Related papers (2025-10-17T02:58:35Z)
Spectral gap of Metropolis-within-Gibbs under log-concavity [1.4466802614938334]
The Metropolis-within-Gibbs (MwG) algorithm is a widely used Markov Chain Monte Carlo method for sampling from high-dimensional distributions. We study MwG with Random Walk Metropolis (RWM) updates, using proposal variances tuned to match the target's conditional variances. The result shows that MwG can mix substantially faster with variance-adaptive proposals and that its mixing performance is just a constant factor worse than that of the exact Gibbs sampler.
arXiv Detail & Related papers (2025-09-30T12:31:22Z)
PT$^2$-LLM: Post-Training Ternarization for Large Language Models [52.4629647715623]
Large Language Models (LLMs) have shown impressive capabilities across diverse tasks, but their large memory and compute demands hinder deployment. We propose PT$^2$-LLM, a post-training ternarization framework tailored for LLMs. At its core is an Asymmetric Ternary Quantizer equipped with a two-stage refinement pipeline.
arXiv Detail & Related papers (2025-09-27T03:01:48Z)
Inference-Time Scaling of Diffusion Language Models with Particle Gibbs Sampling [70.8832906871441]
We study how to steer generation toward desired rewards without retraining the models. Prior methods typically resample or filter within a single denoising trajectory, optimizing rewards step-by-step without trajectory-level refinement. We introduce particle Gibbs sampling for diffusion language models (PG-DLM), a novel inference-time algorithm enabling trajectory-level refinement while preserving generation perplexity.
arXiv Detail & Related papers (2025-07-11T08:00:47Z)
FedSVD: Adaptive Orthogonalization for Private Federated Learning with LoRA [68.44043212834204]
Low-Rank Adaptation (LoRA) is widely used for efficient fine-tuning of language models in federated learning (FL).
arXiv Detail & Related papers (2025-05-19T07:32:56Z)
FlatQuant: Flatness Matters for LLM Quantization [58.28221892035609]
We propose FlatQuant, a new post-training quantization approach that enhances the flatness of weights and activations. Our approach identifies optimal affine transformations for each linear layer, calibrated in hours via a lightweight objective. It achieves less than 1% accuracy drop for W4A4 quantization on the LLaMA-3-70B model, surpassing SpinQuant by 7.5%.
arXiv Detail & Related papers (2024-10-12T08:10:28Z)
Langevin Quasi-Monte Carlo [6.146093081175471]
Langevin Monte Carlo (LMC) and its gradient versions are powerful algorithms for sampling from complex high-dimensional distributions. We show that the estimation error of LMC can also be reduced by using quasi-random samples.
arXiv Detail & Related papers (2023-09-22T07:15:18Z)
Quasi-Newton Quasi-Monte Carlo for variational Bayes [8.75682288556859]
We study the use of randomized quasi-Monte Carlo (RQMC) sampling for such problems. We prove that improved sampling accuracy translates directly to $O(n^{-1/2})$ in favorable settings.
arXiv Detail & Related papers (2021-04-07T02:34:03Z)
AMAGOLD: Amortized Metropolis Adjustment for Efficient Stochastic Gradient MCMC [37.768023232677244]
Stochastic gradient Hamiltonian Monte Carlo (SGHMC) is an efficient method for sampling from continuous distributions. We propose a novel second-order SG-MCMC algorithm, AMAGOLD, that infrequently uses Metropolis-Hastings (M-H) corrections to remove bias. We prove AMAGOLD converges to the target distribution with a fixed, rather than a diminishing, step size, and that its convergence rate is at most a constant factor slower than a full-batch baseline.
arXiv Detail & Related papers (2020-02-29T06:57:43Z)