Test-Time Alignment of LLMs via Sampling-Based Optimal Control in pre-logit space
- URL: http://arxiv.org/abs/2510.26219v1
- Date: Thu, 30 Oct 2025 07:52:14 GMT
- Title: Test-Time Alignment of LLMs via Sampling-Based Optimal Control in pre-logit space
- Authors: Sekitoshi Kanai, Tsukasa Yoshida, Hiroshi Takahashi, Haru Kuroki, Kazumune Hashimoto
- Abstract summary: We propose a new test-time alignment method called adaptive importance sampling on pre-logits (AISP). AISP applies a perturbation to the pre-logits, the outputs of the penultimate layer, so as to maximize the expected reward with respect to the mean of the perturbation. AISP outperforms best-of-n sampling in terms of reward as a function of the number of samples used and achieves higher rewards than other reward-based test-time alignment methods.
- Score: 15.104280833614157
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Test-time alignment of large language models (LLMs) has attracted attention because fine-tuning LLMs incurs high computational costs. In this paper, we propose a new test-time alignment method called adaptive importance sampling on pre-logits (AISP), built on sampling-based model predictive control with stochastic control inputs. AISP applies a Gaussian perturbation to the pre-logits, i.e., the outputs of the penultimate layer, so as to maximize the expected reward with respect to the mean of the perturbation. We demonstrate that the optimal mean is obtained by importance sampling with sampled rewards. AISP outperforms best-of-n sampling in terms of reward as a function of the number of samples used, and it achieves higher rewards than other reward-based test-time alignment methods.
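To make the mechanism concrete, the following is a minimal NumPy sketch of a reward-weighted mean update of the kind the abstract describes, assuming an MPPI-style exponential weighting of sampled rewards; `reward_fn`, `sigma`, `temperature`, and the exact update form are illustrative assumptions rather than the paper's formulation. In the paper's setting, the reward would come from decoding text with the perturbed pre-logits and scoring it with a reward model.

```python
import numpy as np

def aisp_mean_update(pre_logit, reward_fn, n_samples=16, sigma=1.0,
                     temperature=1.0, n_iters=3, rng=None):
    """Toy sketch of adaptive importance sampling on pre-logits.

    pre_logit: 1-D array, output of the penultimate layer.
    reward_fn: hypothetical black box mapping a perturbed pre-logit
        vector to a scalar reward.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    mu = np.zeros_like(pre_logit)  # mean of the Gaussian perturbation
    for _ in range(n_iters):
        # Draw Gaussian perturbations around the current mean.
        eps = mu + sigma * rng.standard_normal((n_samples,) + pre_logit.shape)
        rewards = np.array([reward_fn(pre_logit + e) for e in eps])
        # Importance weights from sampled rewards; subtracting the max
        # is the usual numerical-stability trick for softmax weights.
        w = np.exp((rewards - rewards.max()) / temperature)
        w /= w.sum()
        # Reward-weighted update of the perturbation mean.
        mu = np.tensordot(w, eps, axes=1)
    return pre_logit + mu  # perturbed pre-logit to decode from
```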
Related papers
- Learnable Chernoff Baselines for Inference-Time Alignment [64.81256817158851]
We introduce Learnable Chernoff Baselines (LCB) as a method for efficiently and approximately sampling from exponentially tilted kernels. We establish total-variation guarantees with respect to the ideal aligned model and demonstrate, in both continuous and discrete diffusion settings, that LCB sampling closely matches ideal rejection sampling (a toy sketch of that ideal procedure follows this entry).
arXiv Detail & Related papers (2026-02-08T00:09:40Z)
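For context on the comparison above, here is a minimal sketch of ideal rejection sampling from an exponentially tilted kernel p(x) ∝ q(x) exp(r(x)); `base_sampler`, `reward_fn`, and the fixed upper bound `r_max` are hypothetical placeholders, and this shows only the ideal reference procedure, not the LCB method itself.

```python
import numpy as np

def rejection_sample_tilted(base_sampler, reward_fn, r_max, rng,
                            max_tries=10_000):
    """Ideal rejection sampling from p(x) proportional to q(x) * exp(reward_fn(x)).

    base_sampler(rng) draws x ~ q; r_max must upper-bound reward_fn so
    that exp(r - r_max) <= 1 is a valid acceptance probability.
    """
    for _ in range(max_tries):
        x = base_sampler(rng)
        if rng.random() < np.exp(reward_fn(x) - r_max):
            return x  # accepted: exactly distributed as the tilted kernel
    raise RuntimeError("no sample accepted; the bound r_max may be too loose")
```

The inefficiency of a loose fixed bound is what makes efficient approximate alternatives to ideal rejection sampling attractive.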
- G$^2$RPO: Granular GRPO for Precise Reward in Flow Models [74.21206048155669]
We propose a novel Granular-GRPO (G$^2$RPO) framework that achieves precise and comprehensive reward assessment of sampling directions. We introduce a Multi-Granularity Advantage Integration module that aggregates advantages computed at multiple diffusion scales. Our G$^2$RPO significantly outperforms existing flow-based GRPO baselines.
arXiv Detail & Related papers (2025-10-02T12:57:12Z)
- Reward-Shifted Speculative Sampling Is An Efficient Test-Time Weak-to-Strong Aligner [24.152878302325508]
We introduce the reward-shifted speculative sampling (SSS) algorithm, in which the draft model is aligned with human preferences while the target model remains unchanged. The algorithm achieves superior gold-reward scores at a significantly reduced inference cost in test-time weak-to-strong alignment experiments (the standard speculative acceptance rule it builds on is sketched after this entry).
arXiv Detail & Related papers (2025-08-20T20:10:56Z)
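As background for the entry above, the sketch below shows the standard speculative sampling acceptance rule for a single token, with the preference-aligned model acting as the draft and the target model unchanged; the paper's reward-shifted variant builds on this scheme, so treat the sketch as background rather than as their algorithm.

```python
import numpy as np

def speculative_step(p_target, p_draft, rng):
    """One token of vanilla speculative sampling.

    p_target, p_draft: probability vectors over the vocabulary from the
    unchanged target model and the aligned draft model.
    """
    x = rng.choice(len(p_draft), p=p_draft)  # draft proposes a token
    # Accept with probability min(1, p_target / p_draft).
    if rng.random() < min(1.0, p_target[x] / p_draft[x]):
        return x
    # On rejection, resample from the normalized residual distribution,
    # which keeps the overall sample exactly target-distributed.
    residual = np.maximum(p_target - p_draft, 0.0)
    residual /= residual.sum()
    return rng.choice(len(residual), p=residual)
```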
- Inference-Time Scaling of Diffusion Language Models with Particle Gibbs Sampling [70.8832906871441]
We study how to steer generation toward desired rewards without retraining the models. Prior methods typically resample or filter within a single denoising trajectory, optimizing rewards step by step without trajectory-level refinement. We introduce particle Gibbs sampling for diffusion language models (PG-DLM), a novel inference-time algorithm that enables trajectory-level refinement while preserving generation perplexity.
arXiv Detail & Related papers (2025-07-11T08:00:47Z)
- $\Psi$-Sampler: Initial Particle Sampling for SMC-Based Inference-Time Reward Alignment in Score Models [26.211711150915203]
Inference-time reward alignment with score-based generative models has gained significant traction. $\Psi$-Sampler is an SMC-based framework incorporating pCNL-based initial particle sampling.
arXiv Detail & Related papers (2025-06-02T05:02:33Z)
- Finding the Sweet Spot: Preference Data Construction for Scaling Preference Optimization [66.67988187816185]
We aim to scale up the number of on-policy samples via repeated random sampling to improve alignment performance. Our experiments reveal that this strategy leads to a decline in performance as the sample size increases. We therefore introduce a scalable preference data construction strategy that consistently enhances model performance as the sample scale increases (the naive pairing baseline is sketched after this entry).
arXiv Detail & Related papers (2025-02-24T04:22:57Z)
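For reference, one common way to turn repeated on-policy samples into a preference pair is to keep the highest- and lowest-reward responses; the sketch below shows that naive construction (`policy_sample` and `reward_model` are hypothetical callables), which is the kind of baseline whose scaling behavior the entry above examines, not the paper's proposed strategy.

```python
import numpy as np

def build_preference_pair(prompt, policy_sample, reward_model, n=8, rng=None):
    """Naive (chosen, rejected) pair from n on-policy samples.

    policy_sample(prompt, rng) -> response string (hypothetical).
    reward_model(prompt, response) -> scalar reward (hypothetical).
    """
    if rng is None:
        rng = np.random.default_rng(0)
    responses = [policy_sample(prompt, rng) for _ in range(n)]
    rewards = np.array([reward_model(prompt, r) for r in responses])
    # Pair the extremes: best response as chosen, worst as rejected.
    return responses[int(rewards.argmax())], responses[int(rewards.argmin())]
```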
- Optimizing Input Data Collection for Ranking and Selection [2.3708672042234213]
We design a sequential sampling algorithm that collects input and simulation data under a given budget. We show that MPB's posterior probability of optimality converges to one at an exponential rate as the sampling budget increases. We extend OSAR by adopting kernel ridge regression to improve prediction of the simulation output mean (a generic kernel ridge regression sketch follows this entry).
arXiv Detail & Related papers (2025-02-23T17:33:43Z)
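The kernel ridge regression component mentioned in the entry above is a standard tool; here is a generic NumPy sketch with an RBF kernel, unrelated to the paper's specific estimator.

```python
import numpy as np

def kernel_ridge_predict(X_train, y_train, X_test, lengthscale=1.0, ridge=1e-3):
    """Kernel ridge regression with an RBF kernel.

    Fits alpha = (K + ridge * I)^{-1} y on (X_train, y_train) and predicts
    the mean at X_test; X arrays are 2-D with rows as data points.
    """
    def rbf(A, B):
        sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-sq_dists / (2.0 * lengthscale ** 2))

    K = rbf(X_train, X_train)
    alpha = np.linalg.solve(K + ridge * np.eye(len(X_train)), y_train)
    return rbf(X_test, X_train) @ alpha
```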
- Bridging SFT and DPO for Diffusion Model Alignment with Self-Sampling Preference Optimization [67.8738082040299]
Self-Sampling Preference Optimization (SSPO) is a new alignment method for post-training reinforcement learning. SSPO eliminates the need for paired data and reward models while retaining the training stability of SFT. SSPO surpasses all previous approaches on text-to-image benchmarks and demonstrates outstanding performance on text-to-video benchmarks.
arXiv Detail & Related papers (2024-10-07T17:56:53Z)
- Take the Bull by the Horns: Hard Sample-Reweighted Continual Training Improves LLM Generalization [165.98557106089777]
A key challenge is to enhance the capabilities of large language models (LLMs) amid a looming shortage of high-quality training data.
Our study starts from an empirical strategy for the light continual training of LLMs using their original pre-training data sets.
We then formalize this strategy into a principled framework of Instance-Reweighted Distributionally Robust Optimization.
arXiv Detail & Related papers (2022-03-21T10:20:21Z)
- Boost Test-Time Performance with Closed-Loop Inference [85.43516360332646]
We propose to predict hard-classified test samples in a looped manner to boost the model performance.
We first devise a filtering criterion to identify those hard-classified test samples that need additional inference loops.
For each hard sample, we construct an additional auxiliary learning task based on its original top-$K$ predictions to calibrate the model (an illustrative confidence-based filter is sketched after this entry).
arXiv Detail & Related papers (2022-03-21T10:20:21Z)
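One simple instance of the kind of filtering criterion this entry describes is a top-1 confidence threshold; the sketch below is an illustrative assumption, not the paper's criterion.

```python
import numpy as np

def select_hard_samples(probs, threshold=0.7):
    """Indices of test samples whose top-1 softmax confidence is low.

    probs: (N, C) array of per-sample class probabilities.
    threshold: confidence cutoff below which a sample is treated as
        hard-classified and sent through additional inference loops.
    """
    return np.flatnonzero(probs.max(axis=1) < threshold)
```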