STARS: Segment-level Token Alignment with Rejection Sampling in Large Language Models
- URL: http://arxiv.org/abs/2511.03827v1
- Date: Wed, 05 Nov 2025 19:47:58 GMT
- Title: STARS: Segment-level Token Alignment with Rejection Sampling in Large Language Models
- Authors: Mohammad Atif Quamar, Mohammad Areeb, Mikhail Kuznetsov, Muslum Ozgur Ozmen, Z. Berkay Celik,
- Abstract summary: We propose STARS: Segment-level Token Alignment with Rejection Sampling.<n>It steers model generation by iteratively sampling, scoring, and rejecting/accepting short, fixed-size token segments.<n>Our work establishes granular, reward-guided sampling as a generalizable, robust, and efficient alternative to traditional fine-tuning and full-sequence ranking methods.
- Score: 14.772543098527786
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Aligning large language models with human values is crucial for their safe deployment; however, existing methods, such as fine-tuning, are computationally expensive and suboptimal. In contrast, inference-time approaches like Best-of-N sampling require practically infeasible computation to achieve optimal alignment. We propose STARS: Segment-level Token Alignment with Rejection Sampling, a decoding-time algorithm that steers model generation by iteratively sampling, scoring, and rejecting/accepting short, fixed-size token segments. This allows for early correction of the generation path, significantly improving computational efficiency and boosting alignment quality. Across a suite of six LLMs, we show that STARS outperforms Supervised Fine-Tuning (SFT) by up to 14.9 percentage points and Direct Preference Optimization (DPO) by up to 4.3 percentage points on win-rates, while remaining highly competitive with strong Best-of-N baselines. Our work establishes granular, reward-guided sampling as a generalizable, robust, and efficient alternative to traditional fine-tuning and full-sequence ranking methods for aligning LLMs.
Related papers
- Direct Preference Optimization with Rating Information: Practical Algorithms and Provable Gains [67.71020482405343]
We study how to design algorithms that can leverage additional information in the form of rating gap.<n>We present new algorithms that can achieve faster statistical rates than DPO in presence of accurate rating gap information.
arXiv Detail & Related papers (2026-01-31T08:38:21Z) - TrajSelector: Harnessing Latent Representations for Efficient and Effective Best-of-N in Large Reasoning Model [21.82904448561646]
Large language models (LLMs) have shown remarkable progress in complex reasoning tasks.<n>Best-of-N selection paradigm yields scalable performance improvements by selecting from multiple independently generated reasoning trajectories.<n>We introduce TrajSelector, an efficient and effective Best-of-N framework that exploit the hidden states in the sampler LLM for process-level scoring.
arXiv Detail & Related papers (2025-10-18T11:01:39Z) - Leveraging Robust Optimization for LLM Alignment under Distribution Shifts [51.74394601039711]
Preference alignment methods are increasingly critical for steering large language models to generate outputs consistent with human values.<n>We propose a novel distribution-aware optimization framework that improves preference alignment despite such shifts.
arXiv Detail & Related papers (2025-04-08T09:14:38Z) - InfAlign: Inference-aware language model alignment [58.66389179049758]
Language model alignment is a critical step in training modern generative language models.<n>We show that this train/test mismatch makes standard RLHF framework sub-optimal in view of inference-time methods.<n>We propose a framework for inference-aware alignment (InfAlign), which aims to optimize inference-time win rate of the aligned policy against the base model.
arXiv Detail & Related papers (2024-12-27T18:45:36Z) - Constrain Alignment with Sparse Autoencoders [45.131670081186]
Feature-level constrained Preference Optimization is a novel method designed to simplify the alignment process while ensuring stability.<n>Our approach enjoys efficiency by using sparse features activated in a well-trained sparse autoencoder and the quality of sequential KL divergence.
arXiv Detail & Related papers (2024-11-12T07:54:13Z) - Faster WIND: Accelerating Iterative Best-of-$N$ Distillation for LLM Alignment [81.84950252537618]
This paper reveals a unified game-theoretic connection between iterative BOND and self-play alignment.<n>We establish a novel framework, WIN rate Dominance (WIND), with a series of efficient algorithms for regularized win rate dominance optimization.
arXiv Detail & Related papers (2024-10-28T04:47:39Z) - Fast Best-of-N Decoding via Speculative Rejection [49.11955026456773]
Inference-time alignment methods avoid the complex post-training step.
Best-of-N requires vastly more resources at inference time than standard decoding strategies.
We introduce Speculative Rejection, a computationally-viable inference-time alignment algorithm.
arXiv Detail & Related papers (2024-10-26T23:20:48Z) - Adaptive Sampling for Best Policy Identification in Markov Decision
Processes [79.4957965474334]
We investigate the problem of best-policy identification in discounted Markov Decision (MDPs) when the learner has access to a generative model.
The advantages of state-of-the-art algorithms are discussed and illustrated.
arXiv Detail & Related papers (2020-09-28T15:22:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.