Off-Policy Learning in Large Action Spaces: Optimization Matters More Than Estimation
- URL: http://arxiv.org/abs/2509.03456v1
- Date: Wed, 03 Sep 2025 16:25:45 GMT
- Title: Off-Policy Learning in Large Action Spaces: Optimization Matters More Than Estimation
- Authors: Imad Aouali, Otmane Sakhi
- Abstract summary: Off-policy evaluation (OPE) and off-policy learning (OPL) are foundational for decision-making in offline contextual bandits. Recent advances in OPL primarily optimize OPE estimators with improved statistical properties. We argue this estimator-centric approach neglects a critical practical obstacle: challenging optimization landscapes.
- Score: 6.001574550157585
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Off-policy evaluation (OPE) and off-policy learning (OPL) are foundational for decision-making in offline contextual bandits. Recent advances in OPL primarily optimize OPE estimators with improved statistical properties, assuming that better estimators inherently yield superior policies. Although this estimator-centric approach is theoretically justified, we argue that it neglects a critical practical obstacle: challenging optimization landscapes. In this paper, we provide theoretical insights and extensive empirical evidence showing that current OPL methods encounter severe optimization issues, particularly as action spaces become large. We demonstrate that simpler weighted log-likelihood objectives enjoy substantially better optimization properties and still recover competitive, often superior, learned policies. Our findings emphasize the necessity of explicitly addressing optimization considerations in the development of OPL algorithms for large action spaces.
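To make the paper's central contrast concrete, here is a minimal, hedged sketch (not the authors' code) comparing an IPS-style policy-value objective with a weighted log-likelihood objective for a softmax policy over a large discrete action space. The synthetic logged data, the linear policy, and the reward-over-propensity weighting below are illustrative assumptions; the exact objectives, estimators, and experiments in the paper may differ.

```python
# Illustrative sketch only (not the authors' code). Assumptions: synthetic
# logged bandit data, a context-independent logging policy, a linear softmax
# policy, and reward-over-propensity weights for the log-likelihood objective.
import torch

torch.manual_seed(0)
n, d, K = 4096, 16, 1000             # logged samples, context dim, number of actions

# Synthetic logged feedback: contexts X, actions A ~ pi_0, rewards R, propensities.
X = torch.randn(n, d)
pi0 = torch.softmax(torch.randn(K), dim=0)        # logging policy (context-free here)
A = torch.multinomial(pi0, n, replacement=True)   # logged actions
R = torch.rand(n)                                 # placeholder rewards in [0, 1]
prop = pi0[A]                                     # pi_0(a_i | x_i)

def ips_objective(policy):
    # Negative IPS value estimate: -(1/n) sum_i [pi_theta(a_i|x_i) / pi_0(a_i|x_i)] r_i
    logp = torch.log_softmax(policy(X), dim=1)[torch.arange(n), A]
    return -(torch.exp(logp) / prop * R).mean()

def weighted_loglik_objective(policy):
    # Negative weighted log-likelihood: -(1/n) sum_i w_i log pi_theta(a_i|x_i),
    # with w_i = r_i / pi_0(a_i|x_i) as one common (assumed) weighting choice.
    logp = torch.log_softmax(policy(X), dim=1)[torch.arange(n), A]
    return -((R / prop) * logp).mean()

def train(objective, steps=200, lr=0.1):
    policy = torch.nn.Linear(d, K)                # linear softmax policy pi_theta(a|x)
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = objective(policy)
        loss.backward()
        opt.step()
    return loss.item()

print("final IPS objective:         ", train(ips_objective))
print("final weighted log-lik. loss:", train(weighted_loglik_objective))
```

The contrast is about the optimization landscape rather than the estimator: the IPS objective multiplies rewards by importance weights pi_theta(a|x) / pi_0(a|x), which involve small logging propensities and, as the abstract argues, become increasingly ill-behaved as the action space grows, whereas the weighted log-likelihood gradient is a standard weighted cross-entropy gradient that optimizes substantially better while still recovering competitive policies.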
Related papers
- Reinforcement Learning-assisted Constraint Relaxation for Constrained Expensive Optimization [14.12072551134237]
We propose learning an effective, adaptive, and generalizable constraint-handling policy through reinforcement learning. Specifically, a tailored Markov Decision Process is first formulated in which, given optimization-dynamics features, a deep Q-network-based policy controls the constraint relaxation level. Such adaptive constraint handling provides a flexible tradeoff between objective-oriented exploitation and feasible-region-oriented exploration.
arXiv Detail & Related papers (2026-01-31T05:52:36Z) - Stabilizing Policy Gradients for Sample-Efficient Reinforcement Learning in LLM Reasoning [77.92320830700797]
Reinforcement Learning has played a central role in enabling reasoning capabilities of Large Language Models. We propose a tractable computational framework that tracks and leverages curvature information during policy updates. The algorithm, Curvature-Aware Policy Optimization (CAPO), identifies samples that contribute to unstable updates and masks them out.
arXiv Detail & Related papers (2025-10-01T12:29:32Z) - On-Policy RL with Optimal Reward Baseline [109.47676554514193]
On-Policy RL with Optimal reward baseline (OPO) is a novel and simplified reinforcement learning algorithm. OPO emphasizes the importance of exact on-policy training, which empirically stabilizes the training process and enhances exploration. Results demonstrate OPO's superior performance and training stability without additional models or regularization terms.
arXiv Detail & Related papers (2025-05-29T15:58:04Z) - AAPO: Enhance the Reasoning Capabilities of LLMs with Advantage Momentum [45.135858299101386]
Reinforcement learning (RL) has emerged as an effective approach for enhancing the reasoning capabilities of large language models (LLMs). Group relative advantage estimation has attracted considerable attention for eliminating the dependency on the value model. We propose Advantage-Augmented Policy Optimization (AAPO), a novel RL algorithm that optimizes the cross-entropy loss using advantages enhanced through a momentum-based estimation scheme.
arXiv Detail & Related papers (2025-05-20T12:13:44Z) - A Survey of Direct Preference Optimization [103.59317151002693]
Large Language Models (LLMs) have demonstrated unprecedented generative capabilities. Their alignment with human values remains critical for ensuring helpful and harmless deployments. Direct Preference Optimization (DPO) has recently gained prominence as a streamlined alternative to reinforcement learning from human feedback (RLHF).
arXiv Detail & Related papers (2025-03-12T08:45:15Z) - Accelerated Preference Optimization for Large Language Model Alignment [60.22606527763201]
Reinforcement Learning from Human Feedback (RLHF) has emerged as a pivotal tool for aligning large language models (LLMs) with human preferences.
Direct Preference Optimization (DPO) formulates RLHF as a policy optimization problem without explicitly estimating the reward function.
We propose a general Accelerated Preference Optimization (APO) framework, which unifies many existing preference optimization algorithms.
arXiv Detail & Related papers (2024-10-08T18:51:01Z) - Preference-Guided Reinforcement Learning for Efficient Exploration [14.058764537783086]
We introduce LOPE (Learning Online with trajectory Preference guidancE), an end-to-end preference-guided RL framework. Our intuition is that LOPE directly adjusts the focus of online exploration by considering human feedback as guidance. LOPE outperforms several state-of-the-art methods in terms of convergence rate and overall performance.
arXiv Detail & Related papers (2024-07-09T02:11:12Z) - Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences. To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model. Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
arXiv Detail & Related papers (2024-05-26T05:38:50Z) - Logarithmic Smoothing for Pessimistic Off-Policy Evaluation, Selection and Learning [7.085987593010675]
This work investigates the offline formulation of the contextual bandit problem.
The goal is to leverage past interactions collected under a behavior policy to evaluate, select, and learn new, potentially better-performing, policies.
We introduce novel, fully empirical concentration bounds for a broad class of importance weighting risk estimators; a pessimistic selection rule built from such a bound is sketched after this list.
arXiv Detail & Related papers (2024-05-23T09:07:27Z) - Overcoming Reward Overoptimization via Adversarial Policy Optimization with Lightweight Uncertainty Estimation [46.61909578101735]
Adversarial Policy Optimization (AdvPO) is a novel solution to the pervasive issue of reward over-optimization in Reinforcement Learning from Human Feedback.
In this paper, we introduce a lightweight way to quantify uncertainties in rewards, relying solely on the last layer embeddings of the reward model.
arXiv Detail & Related papers (2024-03-08T09:20:12Z) - Towards Efficient Exact Optimization of Language Model Alignment [93.39181634597877]
Direct preference optimization (DPO) was proposed to directly optimize the policy from preference data.
We show that DPO, derived from the optimal solution of the problem, leads to a compromised mean-seeking approximation of the optimal solution in practice.
We propose efficient exact optimization (EXO) of the alignment objective.
arXiv Detail & Related papers (2024-02-01T18:51:54Z) - Pessimistic Off-Policy Multi-Objective Optimization [22.525654101072252]
We study offline optimization of multi-objective policies from data collected by an existing policy.
We propose a pessimistic estimator for the multi-objective policy values that can be easily plugged into existing formulas for hypervolume computation and optimized.
arXiv Detail & Related papers (2023-10-28T06:50:15Z)
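As referenced in the Logarithmic Smoothing entry above, the following is a minimal, hedged sketch of the general idea behind pessimistic off-policy selection: rank candidate policies by a lower confidence bound on an importance-weighted value estimate rather than by the point estimate. The clipped-IPS estimator, the Hoeffding-style penalty, and the context-free toy data are illustrative assumptions, not the tighter estimator-specific bounds derived in the cited papers.

```python
# Illustrative sketch only (not the cited papers' bounds). Assumptions:
# clipped-IPS value estimates, a Hoeffding-style penalty, context-free
# candidate policies over K actions, and synthetic logged data.
import numpy as np

rng = np.random.default_rng(0)

def clipped_ips_values(pi, actions, rewards, propensities, clip=10.0):
    """Per-sample clipped-IPS contributions to a policy's value estimate."""
    w = np.minimum(pi[actions] / propensities, clip)   # clipped importance weights
    return w * rewards

def pessimistic_select(candidates, actions, rewards, propensities,
                       delta=0.05, clip=10.0):
    """Pick the candidate with the largest lower confidence bound (LCB)."""
    n = len(actions)
    penalty = clip * np.sqrt(np.log(1.0 / delta) / (2.0 * n))  # Hoeffding-style penalty
    lcbs = [clipped_ips_values(pi, actions, rewards, propensities, clip).mean() - penalty
            for pi in candidates]
    best = int(np.argmax(lcbs))
    return best, lcbs[best]

# Toy usage: uniform logging policy, random rewards, random candidate policies.
K, n = 50, 2000
pi0 = np.full(K, 1.0 / K)
actions = rng.integers(0, K, size=n)
rewards = rng.random(n)
propensities = pi0[actions]
candidates = [rng.dirichlet(np.ones(K)) for _ in range(5)]
print(pessimistic_select(candidates, actions, rewards, propensities))
```

In this sketch the penalty shrinks with more data and grows with the clipping level, so pessimism only overrides the point estimate when the logged data are too scarce to trust it.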