Why Self-Rewarding Works: Theoretical Guarantees for Iterative Alignment of Language Models
- URL: http://arxiv.org/abs/2601.22513v2
- Date: Tue, 03 Feb 2026 07:16:33 GMT
- Title: Why Self-Rewarding Works: Theoretical Guarantees for Iterative Alignment of Language Models
- Authors: Shi Fu, Yingjie Wang, Shengchao Hu, Peng Wang, Dacheng Tao
- Abstract summary: Self-Rewarding Language Models (SRLMs) achieve notable success in iteratively improving alignment without external feedback. This paper provides the first rigorous theoretical guarantees for SRLMs.
- Score: 50.248686344277246
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-Rewarding Language Models (SRLMs) achieve notable success in iteratively improving alignment without external feedback. Yet, despite their striking empirical progress, the core mechanisms driving their capabilities remain unelucidated, leaving a critical gap in theoretical understanding. This paper provides the first rigorous theoretical guarantees for SRLMs. We first establish a lower bound that characterizes the fundamental limits of a single update step, revealing a critical dependence on the quality of the initial model. We then derive finite-sample error bounds for the full iterative paradigm, showing that performance improves at a rate of $\widetilde{\mathcal{O}}\left(1/\sqrt{n}\right)$ with sample size $n$. Crucially, our analysis reveals that the dependence on the initial model decays exponentially with the number of iterations $T$. This provides a formal explanation for why self-rewarding succeeds: it robustly overcomes poor initialization by steering the dynamics toward internal stability and consistency. Finally, we instantiate our theoretical framework for the linear softmax model class, yielding tailored guarantees that connect our high-level insights to practical model architectures.
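As a rough numerical illustration (not taken from the paper), the shape of such a bound can be simulated: if each iteration contracts the initialization-dependent error by some factor $\rho < 1$ while adding a statistical floor of order $1/\sqrt{n}$, the influence of the initial model decays exponentially in $T$ while the total error settles near the floor. The constants $\rho = 0.5$ and $C = 1$ below are hypothetical and purely illustrative.

```python
import math

def simulated_error(e0: float, rho: float, C: float, n: int, T: int) -> float:
    """Iterate a toy recursion e_{t+1} = rho * e_t + C / sqrt(n).

    e0  : error of the initial model
    rho : per-iteration contraction factor (hypothetical, 0 < rho < 1)
    C   : constant absorbing model-class complexity (hypothetical)
    n   : sample size per iteration
    T   : number of self-rewarding iterations
    """
    e = e0
    for _ in range(T):
        e = rho * e + C / math.sqrt(n)
    return e

# The initialization term shrinks like rho**T, while the statistical
# term converges to C / (sqrt(n) * (1 - rho)) regardless of e0.
for T in (1, 5, 20):
    print(T, simulated_error(e0=10.0, rho=0.5, C=1.0, n=10_000, T=T))
```

Even starting from a large initial error, the value after many iterations is dominated by the $1/\sqrt{n}$ floor, which mirrors the paper's claim that self-rewarding robustly overcomes poor initialization.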
Related papers
- On Multi-Step Theorem Prediction via Non-Parametric Structural Priors [50.16583672681106]
In this work, we explore training-free theorem prediction through the lens of in-context learning (ICL). We propose Theorem Precedence Graphs, which encode temporal dependencies from historical solution traces as directed graphs, and impose explicit topological constraints that effectively prune the search space during inference. Experiments on the FormalGeo7k benchmark show that our method achieves 89.29% accuracy, substantially outperforming ICL baselines and matching state-of-the-art supervised models.
arXiv Detail & Related papers (2026-03-05T06:08:50Z)
- Recursive Think-Answer Process for LLMs and VLMs [54.52289112197118]
We propose an efficient Recursive Think-Answer Process (R-TAP). R-TAP enables models to engage in iterative reasoning cycles and generate more accurate answers. We show that R-TAP-enhanced models consistently outperform conventional single-pass methods.
arXiv Detail & Related papers (2026-03-02T17:20:10Z)
- A Task-Centric Theory for Iterative Self-Improvement with Easy-to-Hard Curricula [16.2171923772074]
Iterative self-improvement fine-tunes an autoregressive large language model (LLM) on reward-verified outputs. We make progress toward this goal by modeling each round of self-improvement as maximum-likelihood fine-tuning. Our analysis reveals an explicit feedback loop where better models accept more data per round, supporting sustained self-improvement.
arXiv Detail & Related papers (2026-02-10T17:36:41Z)
- Evaluating and Enhancing the Vulnerability Reasoning Capabilities of Large Language Models [15.849480549367684]
We propose DAGVul, a novel framework that models vulnerability reasoning as a Directed Acyclic Graph (DAG) generation task. By further introducing Reinforcement Learning with Verifiable Rewards (RLVR), we align model reasoning traces with program-intrinsic logic. Our framework improves the reasoning F1-score by an average of 18.9% over all the baselines.
arXiv Detail & Related papers (2026-02-06T13:19:45Z)
- Do Reasoning Models Enhance Embedding Models? [48.43242995118735]
State-of-the-art embedding models are increasingly derived from decoder-only Large Language Model backbones adapted via contrastive learning. We show that embedding models from RLVR-tuned backbones yield no consistent performance advantage over their base counterparts when subjected to identical training recipes.
arXiv Detail & Related papers (2026-01-29T02:48:34Z)
- Autoregressivity in the Latent Space of a GP-VAE Language Model: An Empirical Ablation Study [0.0]
Language models typically rely on an autoregressive factorization over tokens. We conduct a systematic ablation study of the role played by latent autoregression.
arXiv Detail & Related papers (2025-12-30T09:23:09Z)
- Incorporating Self-Rewriting into Large Language Model Reasoning Reinforcement [54.63337314382886]
We introduce a self-rewriting framework, in which a model rewrites its own reasoning texts and subsequently learns from the rewritten reasoning to improve the quality of its internal thought process. For algorithm design, we propose a selective rewriting approach wherein only "simple" samples, defined by the model's consistent correctness, are rewritten. Experiments on diverse tasks with different model sizes validate the effectiveness of self-rewriting.
arXiv Detail & Related papers (2025-11-20T13:10:52Z)
- Step-Aware Policy Optimization for Reasoning in Diffusion Large Language Models [57.42778606399764]
Diffusion language models (dLLMs) offer a promising, non-autoregressive paradigm for text generation. Current reinforcement learning approaches often rely on sparse, outcome-based rewards. We argue that this stems from a fundamental mismatch with the natural structure of reasoning.
arXiv Detail & Related papers (2025-10-02T00:34:15Z)
- Latent Principle Discovery for Language Model Self-Improvement [14.137106102563514]
We propose eliciting latent attributes that guide model reasoning towards human-preferred responses by explicitly modeling them in a self-correction setting. Our approach mines new principles from the LM itself and compresses the discovered elements into an interpretable set via clustering. We demonstrate that bootstrapping our algorithm over multiple iterations enables smaller language models to self-improve, achieving +8-10% in AlpacaEval win-rate, an average of +0.3 on MT-Bench, and +19-23% in principle-following win-rate on IFEval.
arXiv Detail & Related papers (2025-05-22T17:20:18Z)
- Self-Improvement in Language Models: The Sharpening Mechanism [70.9248553790022]
We offer a new perspective on the capabilities of self-improvement through a lens we refer to as sharpening. Motivated by the observation that language models are often better at verifying response quality than they are at generating correct responses, we formalize self-improvement as using the model itself as a verifier during post-training. We analyze two natural families of self-improvement algorithms based on SFT and RLHF.
arXiv Detail & Related papers (2024-12-02T20:24:17Z)
- Unveiling and Addressing Pseudo Forgetting in Large Language Models [17.888328120571245]
We show that the performance degradation on previous tasks is not attributed to a loss of capabilities, but rather to the failure of the instructions to activate the appropriate model abilities. We propose the Rationale-Guidance Difficulty based Replay (RGD-R) framework, which dynamically allocates replay data based on the model's ability to correctly leverage its intrinsic capabilities.
arXiv Detail & Related papers (2024-11-18T14:28:04Z)
- Understanding, Predicting and Better Resolving Q-Value Divergence in Offline-RL [86.0987896274354]
We first identify a fundamental pattern, self-excitation, as the primary cause of Q-value estimation divergence in offline RL.
We then propose a novel Self-Excite Eigenvalue Measure (SEEM) metric to measure the evolving property of Q-network at training.
For the first time, our theory can reliably decide whether the training will diverge at an early stage.
arXiv Detail & Related papers (2023-10-06T17:57:44Z)
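The self-excitation idea in the entry above can be caricatured with a linear toy model (a hypothetical sketch, not the paper's SEEM metric): if the linearized update matrix of an iterative process has spectral radius above 1, repeated application diverges, so a simple power-iteration estimate of that radius can flag divergence early. The two example matrices below are arbitrary.

```python
def spectral_radius(A, iters=200):
    """Estimate the largest-magnitude eigenvalue of a small square
    matrix via power iteration (pure Python, for illustration only)."""
    n = len(A)
    v = [1.0] * n
    lam = 0.0
    for _ in range(iters):
        # Apply the matrix, then normalize by the largest component.
        w = [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = max(abs(x) for x in w)
        if lam == 0.0:
            return 0.0
        v = [x / lam for x in w]
    return lam

# A contractive update (radius < 1) is stable under repetition; an
# expansive one (radius > 1) self-excites and diverges.
stable   = [[0.5, 0.1], [0.0, 0.3]]
unstable = [[1.2, 0.0], [0.3, 0.9]]
print(spectral_radius(stable))    # below 1: iterates shrink
print(spectral_radius(unstable))  # above 1: iterates blow up
```

For these triangular-ish examples the dominant eigenvalues can be read off directly (0.5 and 1.2), which makes the estimate easy to sanity-check.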
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.