Related papers: Towards a Theoretical Understanding to the Generalization of RLHF

URL: http://arxiv.org/abs/2601.16403v1
Date: Fri, 23 Jan 2026 02:30:16 GMT
Title: Towards a Theoretical Understanding to the Generalization of RLHF
Authors: Zhaochun Li, Mingyang Yi, Yue Wang, Shisheng Cui, Yong Liu
Abstract summary: We build the generalization theory on RLHF of LLMs under the linear reward model. We argue that our results provide new theoretical evidence for the empirically observed generalization of LLMs after RLHF.
Score: 15.278675771756541
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reinforcement Learning from Human Feedback (RLHF) and its variants have emerged as the dominant approaches for aligning Large Language Models with human intent. While empirically effective, the theoretical generalization properties of these methods in high-dimensional settings remain to be explored. To this end, we build the generalization theory on RLHF of LLMs under the linear reward model, through the framework of algorithmic stability. In contrast to the existing works built upon the consistency of maximum likelihood estimations on reward model, our analysis is presented under an end-to-end learning framework, which is consistent with practice. Concretely, we prove that under a key feature coverage condition, the empirical optima of policy model have a generalization bound of order $\mathcal{O}(n^{-\frac{1}{2}})$. Moreover, the results can be extrapolated to parameters obtained by gradient-based learning algorithms, i.e., Gradient Ascent (GA) and Stochastic Gradient Ascent (SGA). Thus, we argue that our results provide new theoretical evidence for the empirically observed generalization of LLMs after RLHF.

Related papers

On Multi-Step Theorem Prediction via Non-Parametric Structural Priors [50.16583672681106]
In this work, we explore training-free theorem prediction through the lens of in-context learning (ICL). We propose Theorem Precedence Graphs, which encode temporal dependencies from historical solution traces as directed graphs, and impose explicit topological constraints that effectively prune the search space during inference. Experiments on the FormalGeo7k benchmark show that our method achieves 89.29% accuracy, substantially outperforming ICL baselines and matching state-of-the-art supervised models.
arXiv Detail & Related papers (2026-03-05T06:08:50Z)
Conjugate Learning Theory: Uncovering the Mechanisms of Trainability and Generalization in Deep Neural Networks [0.0]
We develop a conjugate learning theoretical framework based on convex conjugate duality to characterize this learnability property. We demonstrate that training deep neural networks (DNNs) with mini-batch stochastic gradient descent (SGD) achieves global optima of empirical risk. We derive deterministic and probabilistic bounds on generalization error based on conditional generalized entropy measures.
arXiv Detail & Related papers (2026-02-18T04:26:55Z)
How and Why LLMs Generalize: A Fine-Grained Analysis of LLM Reasoning from Cognitive Behaviors to Low-Level Patterns [51.02752099869218]
Large Language Models (LLMs) display strikingly different generalization behaviors. We introduce a novel benchmark that decomposes reasoning into atomic core skills. We show that RL-tuned models maintain more stable behavioral profiles and resist collapse in reasoning skills, whereas SFT models exhibit sharper drift and overfit to surface patterns.
arXiv Detail & Related papers (2025-12-30T08:16:20Z)
OBLR-PO: A Theoretical Framework for Stable Reinforcement Learning [12.77713716713937]
We provide a unified theoretical framework that characterizes the statistical properties of commonly used policy-gradient estimators. We derive an adaptive learning-rate schedule governed by the signal-to-noise ratio (SNR) of gradients. We further show that the variance-optimal baseline is a gradient-weighted estimator, offering a new principle for variance reduction.
arXiv Detail & Related papers (2025-11-28T16:09:28Z)
Algorithm- and Data-Dependent Generalization Bounds for Score-Based Generative Models [27.78637798976204]
Score-based generative models (SGMs) have emerged as one of the most popular classes of generative models. This paper provides the first algorithmic- and data-dependent analysis for SGMs. In particular, we account for the dynamics of the learning algorithm, offering new insights into the behavior of SGMs.
arXiv Detail & Related papers (2025-06-04T11:33:04Z)
Model Steering: Learning with a Reference Model Improves Generalization Bounds and Scaling Laws [52.10468229008941]
This paper formalizes an emerging learning paradigm that uses a trained model as a reference to guide and enhance the training of a target model through strategic data selection or weighting. We provide theoretical insights into why this approach improves generalization and data efficiency compared to training without a reference model. Building on these insights, we introduce a novel method for Contrastive Language-Image Pretraining with a reference model, termed DRRho-CLIP.
arXiv Detail & Related papers (2025-05-10T16:55:03Z)
Can DPO Learn Diverse Human Values? A Theoretical Scaling Law [7.374590753074647]
Preference learning trains models to distinguish between preferred and non-preferred responses based on human feedback. This paper introduces a new theoretical framework to analyze how generalization scales with value diversity and sample quantity. Our framework rigorously assesses how well models generalize after a finite number of gradient steps.
arXiv Detail & Related papers (2024-08-06T22:11:00Z)
Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-Constraint [56.74058752955209]
This paper studies the alignment process of generative models with Reinforcement Learning from Human Feedback (RLHF). We first identify the primary challenges of existing popular methods like offline PPO and offline DPO as lacking in strategical exploration of the environment. We propose efficient algorithms with finite-sample theoretical guarantees.
arXiv Detail & Related papers (2023-12-18T18:58:42Z)
A PAC-Bayesian Perspective on the Interpolating Information Criterion [54.548058449535155]
We show how a PAC-Bayes bound is obtained for a general class of models, characterizing factors which influence performance in the interpolating regime. We quantify how the test error for overparameterized models achieving effectively zero training error depends on the quality of the implicit regularization imposed by e.g. the combination of model, parameter-initialization scheme.
arXiv Detail & Related papers (2023-11-13T01:48:08Z)
Provable Reward-Agnostic Preference-Based Reinforcement Learning [61.39541986848391]
Preference-based Reinforcement Learning (PbRL) is a paradigm in which an RL agent learns to optimize a task using pair-wise preference-based feedback over trajectories. We propose a theoretical reward-agnostic PbRL framework where exploratory trajectories that enable accurate learning of hidden reward functions are acquired.
arXiv Detail & Related papers (2023-05-29T15:00:09Z)
Deviance Matrix Factorization [6.509665408765348]
We investigate a general matrix factorization for deviance-based data losses, extending the ubiquitous singular value decomposition beyond squared error loss. Our method leverages classical statistical methodology from generalized linear models (GLMs) and provides an efficient algorithm that is flexible enough to allow for structural zeros via entry weights.
arXiv Detail & Related papers (2021-10-12T01:27:55Z)

This list is automatically generated from the titles and abstracts of the papers on this site.