POTEC: Off-Policy Learning for Large Action Spaces via Two-Stage Policy
Decomposition
- URL: http://arxiv.org/abs/2402.06151v1
- Date: Fri, 9 Feb 2024 03:01:13 GMT
- Title: POTEC: Off-Policy Learning for Large Action Spaces via Two-Stage Policy
Decomposition
- Authors: Yuta Saito, Jihan Yao, Thorsten Joachims
- Abstract summary: We study off-policy learning of contextual bandit policies in large discrete action spaces.
We propose a novel two-stage algorithm, called Policy Optimization via Two-Stage Policy Decomposition (POTEC).
We show that POTEC provides substantial improvements in OPL effectiveness, particularly in large and structured action spaces.
- Score: 40.851324484481275
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study off-policy learning (OPL) of contextual bandit policies in large
discrete action spaces where existing methods -- most of which rely crucially
on reward-regression models or importance-weighted policy gradients -- fail due
to excessive bias or variance. To overcome these issues in OPL, we propose a
novel two-stage algorithm, called Policy Optimization via Two-Stage Policy
Decomposition (POTEC). It leverages clustering in the action space and learns
two different policies via policy- and regression-based approaches,
respectively. In particular, we derive a novel low-variance gradient estimator
that enables us to efficiently learn a first-stage policy for cluster selection
via a policy-based approach. To select a specific action within the cluster
sampled by the first-stage policy, POTEC uses a second-stage policy derived
from a regression-based approach within each cluster. We show that a local
correctness condition, which only requires that the regression model preserves
the relative expected reward differences of the actions within each cluster,
ensures that our policy-gradient estimator is unbiased and the second-stage
policy is optimal. We also show that POTEC provides a strict generalization of
policy- and regression-based approaches and their associated assumptions.
Comprehensive experiments demonstrate that POTEC provides substantial
improvements in OPL effectiveness, particularly in large and structured action
spaces.
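
As a concrete reading of the abstract, the local correctness condition only asks the regression model \hat{f} to preserve within-cluster differences of the expected reward q(x, a): for every context x and all actions a, b assigned to the same cluster,

    q(x, a) - q(x, b) = \hat{f}(x, a) - \hat{f}(x, b),

rather than requiring \hat{f} to be accurate in absolute terms. The short Python sketch below illustrates the resulting two-stage decision rule under stated assumptions: the fixed action clustering, the softmax first-stage policy, the linear toy regression model, and all names are illustrative choices, and the paper's low-variance policy-gradient estimator for training the first stage is not reproduced here.

# Minimal sketch of a POTEC-style two-stage policy (illustrative, not the paper's exact method):
#   stage 1: a learned policy pi_theta(c | x) picks a cluster of actions,
#   stage 2: a reward-regression model f_hat(x, a) picks the best action inside that cluster.
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_clusters, d = 50, 5, 8
cluster_of = rng.integers(n_clusters, size=n_actions)  # fixed clustering c(a): action -> cluster
theta = rng.normal(size=(d, n_clusters))               # first-stage policy parameters
W = rng.normal(size=(d, n_actions))                    # toy reward-regression weights


def f_hat(x, a):
    """Reward-regression model; only within-cluster differences need to be correct."""
    return float(x @ W[:, a])


def first_stage_cluster(x):
    """Sample a cluster from a softmax first-stage policy pi_theta(c | x)."""
    logits = x @ theta
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(rng.choice(n_clusters, p=probs))


def second_stage_action(x, c):
    """Greedily pick the action with the highest regressed reward inside cluster c."""
    candidates = np.flatnonzero(cluster_of == c)
    return int(candidates[np.argmax([f_hat(x, a) for a in candidates])])


x = rng.normal(size=d)
c = first_stage_cluster(x)
a = second_stage_action(x, c)
print(f"context -> cluster {c} -> action {a}")
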
Related papers
- Last-Iterate Global Convergence of Policy Gradients for Constrained Reinforcement Learning [62.81324245896717]
We introduce an exploration-agnostic algorithm, called C-PG, which exhibits global last-iterate convergence guarantees under (weak) gradient domination assumptions.
We numerically validate our algorithms on constrained control problems, and compare them with state-of-the-art baselines.
arXiv Detail & Related papers (2024-07-15T14:54:57Z)
- Fast Policy Learning for Linear Quadratic Control with Entropy Regularization [10.771650397337366]
This paper proposes and analyzes two new policy learning methods: regularized policy gradient (RPG) and iterative policy optimization (IPO), for a class of discounted linear-quadratic control (LQC) problems.
Assuming access to the exact policy evaluation, both proposed approaches are proven to converge linearly in finding optimal policies of the regularized LQC.
arXiv Detail & Related papers (2023-11-23T19:08:39Z)
- Clipped-Objective Policy Gradients for Pessimistic Policy Optimization [3.2996723916635275]
Policy gradient methods seek to produce monotonic improvement through bounded changes in policy outputs.
In this work, we find that the performance of PPO, when applied to continuous action spaces, may be consistently improved through a simple change in objective.
We show that the clipped-objective policy gradient (COPG) objective is on average "pessimistic" compared to the PPO objective, and that this pessimism promotes enhanced exploration.
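
For reference, with probability ratio r_t(\theta) = \pi_\theta(a_t|s_t) / \pi_{\theta_{old}}(a_t|s_t) and advantage estimate \hat{A}_t, the standard PPO surrogate and the clip-only variant suggested by the COPG name can be written as

    L^{PPO}(\theta)  = \hat{E}_t[ \min( r_t(\theta) \hat{A}_t, \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t ) ],
    L^{COPG}(\theta) = \hat{E}_t[ \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t ].

The PPO form is standard; the COPG form here is an assumption based on the abstract's "simple change in objective" and should be checked against the paper itself.
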
arXiv Detail & Related papers (2023-11-10T03:02:49Z)
- Statistically Efficient Variance Reduction with Double Policy Estimation for Off-Policy Evaluation in Sequence-Modeled Reinforcement Learning [53.97273491846883]
We propose DPE: an RL algorithm that blends offline sequence modeling and offline reinforcement learning with Double Policy Estimation.
We validate our method in multiple tasks of OpenAI Gym with D4RL benchmarks.
arXiv Detail & Related papers (2023-08-28T20:46:07Z)
- Last-Iterate Convergent Policy Gradient Primal-Dual Methods for Constrained MDPs [107.28031292946774]
We study the problem of computing an optimal policy of an infinite-horizon discounted constrained Markov decision process (constrained MDP).
We develop two single-time-scale policy-based primal-dual algorithms with non-asymptotic convergence of their policy iterates to an optimal constrained policy.
To the best of our knowledge, this is the first non-asymptotic last-iterate convergence result for the policy iterates of single-time-scale algorithms in constrained MDPs.
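
As background, and as a generic formulation rather than the paper's exact algorithm, policy-based primal-dual methods for a constrained MDP with reward value V_r^\pi, cost value V_c^\pi, and cost budget b work on the Lagrangian

    \max_{\pi} \min_{\lambda \ge 0} L(\pi, \lambda) = V_r^{\pi} - \lambda ( V_c^{\pi} - b ),

alternating a policy-gradient ascent step on L with a projected descent step on the multiplier \lambda; "single-time-scale" means both updates use step sizes of the same order rather than a fast/slow separation.
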
arXiv Detail & Related papers (2023-06-20T17:27:31Z)
- Coordinate Ascent for Off-Policy RL with Global Convergence Guarantees [8.610425739792284]
We revisit the domain of off-policy policy optimization in RL.
One commonly-used approach is to leverage the off-policy policy gradient to optimize a surrogate objective.
This approach has been shown to suffer from the distribution mismatch issue.
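
Concretely, using the standard off-policy actor-critic surrogate as an illustration (not necessarily the exact objective analyzed in the paper), the surrogate scores the target policy \pi_\theta under the behavior policy's discounted state distribution d_\mu,

    J_\mu(\theta) = E_{s \sim d_\mu}[ \sum_a \pi_\theta(a|s) Q^{\pi_\theta}(s, a) ],

whereas the true objective weights states by d_{\pi_\theta}; the gap between d_\mu and d_{\pi_\theta} is the distribution mismatch referred to above.
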
arXiv Detail & Related papers (2022-12-10T07:47:04Z)
- Supervised Off-Policy Ranking [145.3039527243585]
Off-policy evaluation (OPE) leverages data generated by other policies to evaluate a target policy.
We propose supervised off-policy ranking that learns a policy scoring model by correctly ranking training policies with known performance.
Our method outperforms strong baseline OPE methods in terms of both rank correlation and performance gap between the truly best and the best of the ranked top three policies.
arXiv Detail & Related papers (2021-07-03T07:01:23Z)
- Projection-Based Constrained Policy Optimization [34.555500347840805]
We propose a new algorithm, Projection-Based Constrained Policy Optimization (PCPO).
PCPO achieves more than 3.5 times less constraint violation and around 15% higher reward compared to state-of-the-art methods.
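
Schematically, projection-based constrained policy optimization alternates two steps per iteration; the formulation below is a hedged paraphrase of the projection idea, not the paper's exact update rules:

    1. Reward step:     \pi^{k+1/2} = \arg\max_{\pi} J_R(\pi)   subject to   D_{KL}(\pi \| \pi^k) \le \delta
    2. Projection step: \pi^{k+1}   = \arg\min_{\pi} D(\pi, \pi^{k+1/2})   subject to   J_C(\pi) \le d

where J_R is the expected return, J_C the expected constraint cost with limit d, and D a distance such as the KL divergence.
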
arXiv Detail & Related papers (2020-10-07T04:22:45Z)
- Stable Policy Optimization via Off-Policy Divergence Regularization [50.98542111236381]
Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) are among the most successful policy gradient approaches in deep reinforcement learning (RL).
We propose a new algorithm which stabilizes the policy improvement through a proximity term that constrains the discounted state-action visitation distribution induced by consecutive policies to be close to one another.
Our proposed method can have a beneficial effect on stability and improve final performance in benchmark high-dimensional control tasks.
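
In symbols, and as a sketch of the idea described above rather than the paper's exact objective, let d^\pi denote the discounted state-action visitation distribution; the update augments the usual surrogate improvement with a proximity penalty between consecutive policies,

    \pi_{k+1} = \arg\max_{\pi} \hat{J}_{\pi_k}(\pi) - \alpha D( d^{\pi}, d^{\pi_k} ),

where \hat{J}_{\pi_k} is an importance-weighted surrogate for the improvement over \pi_k, D is a divergence between visitation distributions, and \alpha > 0 trades off improvement against stability.
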
arXiv Detail & Related papers (2020-03-09T13:05:47Z)