Power-SMC: Low-Latency Sequence-Level Power Sampling for Training-Free LLM Reasoning
- URL: http://arxiv.org/abs/2602.10273v1
- Date: Tue, 10 Feb 2026 20:31:40 GMT
- Title: Power-SMC: Low-Latency Sequence-Level Power Sampling for Training-Free LLM Reasoning
- Authors: Seyedarmin Azizi, Erfan Baghaei Potraghloo, Minoo Ahmadi, Souvik Kundu, Massoud Pedram,
- Abstract summary: We introduce Power-SMC, a training-free Sequential Monte Carlo scheme that targets the same objective while remaining close to standard decoding latency.<n>On MATH500, Power-SMC matches or exceeds MH power sampling while reducing latency from $16$--$28times$ to $1.4$--$3.3times$ over baseline decoding.
- Score: 11.356198488445488
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Many recent reasoning gains in large language models can be explained as distribution sharpening: biasing generation toward high-likelihood trajectories already supported by the pretrained model, rather than modifying its weights. A natural formalization is the sequence-level power distribution $π_α(y\mid x)\propto p_θ(y\mid x)^α$ ($α>1$), which concentrates mass on whole sequences instead of adjusting token-level temperature. Prior work shows that Metropolis--Hastings (MH) sampling from this distribution recovers strong reasoning performance, but at order-of-magnitude inference slowdowns. We introduce Power-SMC, a training-free Sequential Monte Carlo scheme that targets the same objective while remaining close to standard decoding latency. Power-SMC advances a small particle set in parallel, corrects importance weights token-by-token, and resamples when necessary, all within a single GPU-friendly batched decode. We prove that temperature $τ=1/α$ is the unique prefix-only proposal minimizing incremental weight variance, interpret residual instability via prefix-conditioned Rényi entropies, and introduce an exponent-bridging schedule that improves particle stability without altering the target. On MATH500, Power-SMC matches or exceeds MH power sampling while reducing latency from $16$--$28\times$ to $1.4$--$3.3\times$ over baseline decoding.
Related papers
- $\
abla$-Reasoner: LLM Reasoning via Test-Time Gradient Descent in Latent Space [71.23672814629448]
$nabla$-Reasoner is an iterative generation framework that integrates differentiable optimization over token logits into the decoding loop.<n>$nabla$-Reasoner achieves over 20% accuracy improvement on a challenging mathematical reasoning benchmark.
arXiv Detail & Related papers (2026-03-05T08:42:54Z) - Self-Rewarding Sequential Monte Carlo for Masked Diffusion Language Models [58.946955321428845]
This work presents self-rewarding sequential Monte Carlo (SMC)<n>Our algorithm stems from the observation that most existing MDLMs rely on a confidence-based sampling strategy.<n>We introduce the trajectory-level confidence as a self-rewarding signal for assigning particle importance weights.
arXiv Detail & Related papers (2026-02-02T09:21:45Z) - Synchrony-Gated Plasticity with Dopamine Modulation for Spiking Neural Networks [6.085945372100414]
Dopamine-Modulated Spike-Synchrony-Dependent Plasticity (DA-SSDP) is a synchrony-based rule that is sensitive to loss.<n>DA-SSDP condenses spike patterns into a synchrony metric at the batch level.
arXiv Detail & Related papers (2025-12-08T06:10:44Z) - Robust Layerwise Scaling Rules by Proper Weight Decay Tuning [50.11170157029911]
In modern scale-invariant architectures, training quickly enters an degrading-governed steady state.<n>We introduce a weight-decay scaling rule for AdamW that preserves sublayer gain across widths.<n>Our results extend $mu$P beyond the near-init regime by explicitly controlling the steady-state scales set by parameters.
arXiv Detail & Related papers (2025-10-17T02:58:35Z) - Spectral gap of Metropolis-within-Gibbs under log-concavity [1.4466802614938334]
The Metropolis-within-Gibbs (MwG) algorithm is a widely used Markov Chain Monte Carlo method for sampling from high-dimensional distributions.<n>We study MwG with Random Walk Metropolis (RWM) updates, using proposal variances tuned to match the target's conditional variances.<n>The result shows that MwG can mix substantially faster with variance-adaptive proposals and that its mixing performance is just a constant factor worse than that of the exact Gibbs sampler.
arXiv Detail & Related papers (2025-09-30T12:31:22Z) - PT$^2$-LLM: Post-Training Ternarization for Large Language Models [52.4629647715623]
Large Language Models (LLMs) have shown impressive capabilities across diverse tasks, but their large memory and compute demands hinder deployment.<n>We propose PT$2$-LLM, a post-training ternarization framework tailored for LLMs.<n>At its core is an Asymmetric Ternary Quantizer equipped with a two-stage refinement pipeline.
arXiv Detail & Related papers (2025-09-27T03:01:48Z) - Inference-Time Scaling of Diffusion Language Models with Particle Gibbs Sampling [70.8832906871441]
We study how to steer generation toward desired rewards without retraining the models.<n>Prior methods typically resample or filter within a single denoising trajectory, optimizing rewards step-by-step without trajectory-level refinement.<n>We introduce particle Gibbs sampling for diffusion language models (PG-DLM), a novel inference-time algorithm enabling trajectory-level refinement while preserving generation perplexity.
arXiv Detail & Related papers (2025-07-11T08:00:47Z) - FedSVD: Adaptive Orthogonalization for Private Federated Learning with LoRA [68.44043212834204]
Low-Rank Adaptation (LoRA) is widely used for efficient fine-tuning of language models in learning (FL)<n>Low-Rank Adaptation (LoRA) is widely used for efficient fine-tuning of language models in learning (FL)
arXiv Detail & Related papers (2025-05-19T07:32:56Z) - FlatQuant: Flatness Matters for LLM Quantization [58.28221892035609]
We propose FlatQuant, a new post-training quantization approach that enhances the flatness of weights and activations.<n>Our approach identifies optimal affine transformations for each linear layer, calibrated in hours via a lightweight objective.<n>It achieves less than 1% accuracy drop for W4A4 quantization on the LLaMA-3-70B model, surpassing SpinQuant by 7.5%.
arXiv Detail & Related papers (2024-10-12T08:10:28Z) - Langevin Quasi-Monte Carlo [6.146093081175471]
Langevin Monte Carlo (LMC) and its gradient versions are powerful algorithms for sampling from complex high-dimensional distributions.
We show that the estimation error of LMC can also be reduced by using quasi-random samples.
arXiv Detail & Related papers (2023-09-22T07:15:18Z) - Quasi-Newton Quasi-Monte Carlo for variational Bayes [8.75682288556859]
We study the use of randomized quasi-Monte Carlo (RQMC) sampling for such problems.
We prove that improved sampling accuracy translates directly to $O(n-1/2)$ in favorable settings.
arXiv Detail & Related papers (2021-04-07T02:34:03Z) - AMAGOLD: Amortized Metropolis Adjustment for Efficient Stochastic
Gradient MCMC [37.768023232677244]
Hamiltonian Monte Carlo (SGHMC) is an efficient method for sampling from continuous distributions.
We propose a novel second-order SG-MCMC algorithm---AMAGOLD---that infrequently uses Metropolis-Hastings (M-H) corrections to remove bias.
We prove AMAGOLD converges to the target distribution with a fixed, rather than a diminishing, step size, and that its convergence rate is at most a constant factor slower than a full-batch baseline.
arXiv Detail & Related papers (2020-02-29T06:57:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.