Related papers: Semiparametric Preference Optimization: Your Language Model is Secretly a Single-Index Model

Semiparametric Preference Optimization: Your Language Model is Secretly a Single-Index Model

URL: http://arxiv.org/abs/2512.21917v1
Date: Fri, 26 Dec 2025 08:22:41 GMT
Title: Semiparametric Preference Optimization: Your Language Model is Secretly a Single-Index Model
Authors: Nathan Kallus,
Abstract summary: We study policy alignment to preferences under an unknown and unrestricted complexity.<n>We use first-order optimization suited to neural networks and batched data.
Score: 43.74350307533018
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Aligning large language models to preference data is commonly implemented by assuming a known link function between the distribution of observed preferences and the unobserved rewards (e.g., a logistic link as in Bradley-Terry). If the link is wrong, however, inferred rewards can be biased and policies be misaligned. We study policy alignment to preferences under an unknown and unrestricted link. We consider an $f$-divergence-constrained reward maximization problem and show that realizability of the solution in a policy class implies a semiparametric single-index binary choice model, where a scalar-valued index determined by a policy captures the dependence on demonstrations and the rest of the preference distribution is an unrestricted function thereof. Rather than focus on estimation of identifiable finite-dimensional structural parameters in the index as in econometrics, we focus on policy learning, focusing on error to the optimal policy and allowing unidentifiable and nonparametric indices. We develop a variety of policy learners based on profiling the link function, orthogonalizing the link function, and using link-agnostic bipartite ranking objectives. We analyze these and provide finite-sample policy error bounds that depend on generic functional complexity measures of the index class. We further consider practical implementations using first-order optimization suited to neural networks and batched data. The resulting methods are robust to unknown preference noise distribution and scale, while preserving the direct optimization of policies without explicitly fitting rewards.

Related papers

Prescribe-then-Select: Adaptive Policy Selection for Contextual Stochastic Optimization [4.154714580436713]
We propose a modular framework that first constructs a library of feasible candidate policies and then learns a meta-policy to select the best policy.<n>We implement the meta-policy using ensembles of Optimal Policy Trees trained via cross-validation on the training set, making policy choice entirely data-driven.<n>All the code to reproduce the results can be found at https://anonymous.4open.science/r/Prescribe-then-Select-TMLR.
arXiv Detail & Related papers (2025-09-09T23:56:16Z)
On Symmetric Losses for Robust Policy Optimization with Noisy Preferences [55.8615920580824]
This work focuses on reward modeling, a core component in reinforcement learning from human feedback.<n>We propose a principled framework for robust policy optimization under noisy preferences.<n>We prove that symmetric losses enable successful policy optimization even under noisy labels.
arXiv Detail & Related papers (2025-05-30T15:30:43Z)
What is the Alignment Objective of GRPO? [30.36318490634376]
We present a framework that enables us to characterise the stationary policies of the GRPO algorithm.<n>The precise form of preference aggregation arises from the way the reward preference model is defined and from the penalty function.<n>We provide explicit characterisations of the aggregate preference for binary questions, for groups of size two, and in the limit of large group size.
arXiv Detail & Related papers (2025-02-25T15:56:56Z)
Zeroth-Order Policy Gradient for Reinforcement Learning from Human Feedback without Reward Inference [15.038210624870656]
Reward inference is a critical intermediate step in the Reinforcement Learning from Human Feedback pipeline.<n>This paper develops two RLHF algorithms without reward inference for general RL problems beyond bandits and deterministic MDP bandit, and general preference models beyond the Bradley-Terry model.
arXiv Detail & Related papers (2024-09-25T22:20:11Z)
Provably Robust DPO: Aligning Language Models with Noisy Feedback [10.523790076060171]
We introduce a general framework for policy optimization in the presence of random preference flips. We design a novel loss function, which de-bias the effect of noise on average, making a policy trained by minimizing that loss robust to the noise. Our experiments on IMDb sentiment generation and Anthropic's helpful-harmless dataset show that rDPO is robust to noise in preference labels compared to vanilla DPO.
arXiv Detail & Related papers (2024-03-01T09:55:18Z)
Towards Efficient Exact Optimization of Language Model Alignment [93.39181634597877]
Direct preference optimization (DPO) was proposed to directly optimize the policy from preference data. We show that DPO derived based on the optimal solution of problem leads to a compromised mean-seeking approximation of the optimal solution in practice. We propose efficient exact optimization (EXO) of the alignment objective.
arXiv Detail & Related papers (2024-02-01T18:51:54Z)
Score-Aware Policy-Gradient and Performance Guarantees using Local Lyapunov Stability [2.180257135067774]
We introduce a policy-gradient method for model-based reinforcement learning (RL)<n>We exploit a type of stationary distributions commonly obtained from Markov decision processes (MDPs) in networks.<n>We show that SAGE-based policy-gradient locally converges, and we obtain its regret.
arXiv Detail & Related papers (2023-12-05T14:44:58Z)
Provable Offline Preference-Based Reinforcement Learning [95.00042541409901]
We investigate the problem of offline Preference-based Reinforcement Learning (PbRL) with human feedback. We consider the general reward setting where the reward can be defined over the whole trajectory. We introduce a new single-policy concentrability coefficient, which can be upper bounded by the per-trajectory concentrability.
arXiv Detail & Related papers (2023-05-24T07:11:26Z)
Randomized Policy Optimization for Optimal Stopping [0.0]
We propose a new methodology for optimal stopping based on randomized linear policies. We show that our approach can substantially outperform state-of-the-art methods.
arXiv Detail & Related papers (2022-03-25T04:33:15Z)
Off-Policy Evaluation with Policy-Dependent Optimization Response [90.28758112893054]
We develop a new framework for off-policy evaluation with a textitpolicy-dependent linear optimization response. We construct unbiased estimators for the policy-dependent estimand by a perturbation method. We provide a general algorithm for optimizing causal interventions.
arXiv Detail & Related papers (2022-02-25T20:25:37Z)
Understanding the Effect of Stochasticity in Policy Optimization [86.7574122154668]
We show that the preferability of optimization methods depends critically on whether exact gradients are used. Second, to explain these findings we introduce the concept of committal rate for policy optimization. Third, we show that in the absence of external oracle information, there is an inherent trade-off between exploiting geometry to accelerate convergence versus achieving optimality almost surely.
arXiv Detail & Related papers (2021-10-29T06:35:44Z)
Sparse Feature Selection Makes Batch Reinforcement Learning More Sample Efficient [62.24615324523435]
This paper provides a statistical analysis of high-dimensional batch Reinforcement Learning (RL) using sparse linear function approximation. When there is a large number of candidate features, our result sheds light on the fact that sparsity-aware methods can make batch RL more sample efficient.
arXiv Detail & Related papers (2020-11-08T16:48:02Z)

This list is automatically generated from the titles and abstracts of the papers in this site.