Occupancy Information Ratio: Infinite-Horizon, Information-Directed,
Parameterized Policy Search
- URL: http://arxiv.org/abs/2201.08832v2
- Date: Thu, 28 Dec 2023 05:11:22 GMT
- Title: Occupancy Information Ratio: Infinite-Horizon, Information-Directed,
Parameterized Policy Search
- Authors: Wesley A. Suttle, Alec Koppel, Ji Liu
- Abstract summary: We propose an information-directed objective for infinite-horizon reinforcement learning (RL) called the occupancy information ratio (OIR).
The OIR enjoys rich underlying structure and presents an objective to which scalable, model-free policy search methods naturally apply.
We show by leveraging connections between quasiconcave optimization and the linear programming theory for Markov decision processes that the OIR problem can be transformed and solved via concave programming methods when the underlying model is known.
- Score: 21.850348833971722
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we propose an information-directed objective for
infinite-horizon reinforcement learning (RL), called the occupancy information
ratio (OIR), inspired by the information ratio objectives used in previous
information-directed sampling schemes for multi-armed bandits and Markov
decision processes as well as recent advances in general utility RL. The OIR,
comprised of a ratio between the average cost of a policy and the entropy of
its induced state occupancy measure, enjoys rich underlying structure and
presents an objective to which scalable, model-free policy search methods
naturally apply. Specifically, we show by leveraging connections between
quasiconcave optimization and the linear programming theory for Markov decision
processes that the OIR problem can be transformed and solved via concave
programming methods when the underlying model is known. Since model knowledge
is typically lacking in practice, we lay the foundations for model-free OIR
policy search methods by establishing a corresponding policy gradient theorem.
Building on this result, we subsequently derive REINFORCE- and
actor-critic-style algorithms for solving the OIR problem in policy parameter
space. Crucially, exploiting the powerful hidden quasiconcavity property
implied by the concave programming transformation of the OIR problem, we
establish finite-time convergence of the REINFORCE-style scheme to global
optimality and asymptotic convergence of the actor-critic-style scheme to
(near) global optimality under suitable conditions. Finally, we experimentally
illustrate the utility of OIR-based methods over vanilla methods in
sparse-reward settings, supporting the OIR as an alternative to existing RL
objectives.
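The abstract specifies the OIR as the ratio of a policy's average cost to the entropy of its induced state occupancy measure; writing this loosely as rho(theta) = J(theta) / H(theta), its gradient decomposes by the quotient rule as grad rho = (H * grad J - J * grad H) / H^2. The sketch below is a minimal illustration of a REINFORCE-style estimator built on that decomposition, not the authors' algorithm: the tabular softmax parameterization, the single-trajectory estimates, and the one-step advantage proxies are all simplifying assumptions, and the paper's exact OIR definition and policy gradient theorem give the precise form.

import numpy as np

def softmax_policy(theta, s):
    # Tabular softmax policy: pi(a|s) proportional to exp(theta[s, a]).
    logits = theta[s]
    p = np.exp(logits - logits.max())
    return p / p.sum()

def rollout(theta, P, c, T, rng):
    # Sample one length-T trajectory from a tabular MDP with transition tensor
    # P[s, a] (a distribution over next states) and cost matrix c[s, a].
    n_states, n_actions = c.shape
    s = rng.integers(n_states)
    states, actions, costs = [], [], []
    for _ in range(T):
        p = softmax_policy(theta, s)
        a = rng.choice(n_actions, p=p)
        states.append(s); actions.append(a); costs.append(c[s, a])
        s = rng.choice(n_states, p=P[s, a])
    return np.array(states), np.array(actions), np.array(costs)

def oir_gradient_estimate(theta, P, c, T, rng):
    # Quotient rule: grad rho = (H * grad J - J * grad H) / H^2, with J, H and
    # their gradients replaced by crude single-trajectory sample estimates.
    states, actions, costs = rollout(theta, P, c, T, rng)
    n_states, _ = c.shape
    d = np.bincount(states, minlength=n_states) / T        # empirical occupancy
    H = -np.sum(d[d > 0] * np.log(d[d > 0]))                # occupancy entropy
    J = costs.mean()                                        # average cost
    grad_J = np.zeros_like(theta)
    grad_H = np.zeros_like(theta)
    for t in range(T):
        s, a = states[t], actions[t]
        p = softmax_policy(theta, s)
        score = -p
        score[a] += 1.0          # grad_theta[s] of log pi(a|s) for softmax
        # One-step "advantage" proxies; the paper's policy gradient theorem
        # supplies the exact (differential-value-based) expressions instead.
        grad_J[s] += score * (costs[t] - J) / T
        grad_H[s] += score * (-np.log(max(d[s], 1e-12)) - H) / T
    return (H * grad_J - J * grad_H) / max(H, 1e-8) ** 2

# Toy usage: random 5-state, 2-action MDP; gradient descent on the ratio.
rng = np.random.default_rng(0)
nS, nA = 5, 2
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a] is a next-state distribution
c = rng.random((nS, nA))
theta = np.zeros((nS, nA))
for _ in range(200):
    theta -= 0.5 * oir_gradient_estimate(theta, P, c, T=500, rng=rng)

Because the objective is a cost-to-entropy ratio, the toy loop performs gradient descent, which simultaneously favors low average cost and a high-entropy (exploratory) occupancy measure, in line with the sparse-reward motivation in the abstract.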
Related papers
- Policy Gradient for Robust Markov Decision Processes [16.281897051782863]
This paper introduces a novel policy gradient method, Double-Loop Robust Policy Mirror Descent (DRPMD), for solving robust Markov Decision Processes (MDPs).
DRPMD employs a general mirror descent update rule for policy optimization with an adaptive tolerance per iteration, guaranteeing convergence to a globally optimal policy.
We provide a comprehensive analysis of DRPMD, including new convergence results under both direct and softmax parameterizations, and novel insights into the inner-problem solution through Transition Mirror Ascent (TMA).
arXiv Detail & Related papers (2024-10-29T15:16:02Z) - Last-Iterate Global Convergence of Policy Gradients for Constrained Reinforcement Learning [62.81324245896717]
We introduce an exploration-agnostic algorithm, called C-PG, which exhibits global last-iterate convergence guarantees under (weak) gradient domination assumptions.
We numerically validate our algorithms on constrained control problems, and compare them with state-of-the-art baselines.
arXiv Detail & Related papers (2024-07-15T14:54:57Z) - Constrained Reinforcement Learning with Average Reward Objective: Model-Based and Model-Free Algorithms [34.593772931446125]
This monograph explores various model-based and model-free approaches for constrained reinforcement learning in the context of average-reward Markov Decision Processes (MDPs).
The primal-dual policy gradient-based algorithm is explored as a solution for constrained MDPs.
arXiv Detail & Related papers (2024-06-17T12:46:02Z) - PARL: A Unified Framework for Policy Alignment in Reinforcement Learning from Human Feedback [106.63518036538163]
We present a novel unified bilevel optimization-based framework, PARL, formulated to address the recently highlighted critical issue of policy alignment in reinforcement learning.
Our framework addresses these concerns by explicitly parameterizing the distribution of the upper-level alignment objective (reward design) by the optimal variable of the lower level.
Our empirical results substantiate that the proposed PARL can address alignment concerns in RL, showing significant improvements.
arXiv Detail & Related papers (2023-08-03T18:03:44Z) - Reparameterized Policy Learning for Multimodal Trajectory Optimization [61.13228961771765]
We investigate the challenge of parametrizing policies for reinforcement learning in high-dimensional continuous action spaces.
We propose a principled framework that models the continuous RL policy as a generative model of optimal trajectories.
We present a practical model-based RL method, which leverages the multimodal policy parameterization and learned world model.
arXiv Detail & Related papers (2023-07-20T09:05:46Z) - Provable Offline Preference-Based Reinforcement Learning [95.00042541409901]
We investigate the problem of offline Preference-based Reinforcement Learning (PbRL) with human feedback.
We consider the general reward setting where the reward can be defined over the whole trajectory.
We introduce a new single-policy concentrability coefficient, which can be upper bounded by the per-trajectory concentrability.
arXiv Detail & Related papers (2023-05-24T07:11:26Z) - Stochastic first-order methods for average-reward Markov decision processes [10.023632561462712]
We study average-reward Markov decision processes (AMDPs) and develop novel first-order methods with strong theoretical guarantees for both policy optimization and policy evaluation.
By combining the policy evaluation and policy optimization parts, we establish sample complexity results for solving AMDPs under both generative and Markovian noise models.
arXiv Detail & Related papers (2022-05-11T23:02:46Z) - Policy Mirror Descent for Regularized Reinforcement Learning: A
Generalized Framework with Linear Convergence [60.20076757208645]
This paper proposes a general policy mirror descent (GPMD) algorithm for solving regularized RL.
We demonstrate that our algorithm converges linearly over an entire range of learning rates, in a dimension-free fashion, to the global solution (a generic form of the underlying mirror descent update is sketched after this list).
arXiv Detail & Related papers (2021-05-24T02:21:34Z) - Privacy-Constrained Policies via Mutual Information Regularized Policy Gradients [54.98496284653234]
We consider the task of training a policy that maximizes reward while minimizing disclosure of certain sensitive state variables through the actions.
We solve this problem by introducing a regularizer based on the mutual information between the sensitive state and the actions.
We develop a model-based estimator for optimization of privacy-constrained policies.
arXiv Detail & Related papers (2020-12-30T03:22:35Z) - Policy Gradient Methods for the Noisy Linear Quadratic Regulator over a
Finite Horizon [3.867363075280544]
We explore reinforcement learning methods for finding the optimal policy in the linear quadratic regulator (LQR) problem.
We produce a global linear convergence guarantee for the setting of finite time horizon and state dynamics under weak assumptions.
We show results both for the case where a model of the underlying dynamics is assumed and for the case where the method is applied directly to data.
arXiv Detail & Related papers (2020-11-20T09:51:49Z) - Structured Policy Iteration for Linear Quadratic Regulator [40.52288246664592]
We introduce Structured Policy Iteration (S-PI) for LQR, a method capable of deriving a structured linear policy.
Such a structured policy with (block) sparsity or low-rank can have significant advantages over the standard LQR policy.
In both the known-model and model-free settings, we provide a convergence analysis under a proper choice of parameters.
arXiv Detail & Related papers (2020-07-13T06:03:15Z)
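As a companion to the GPMD entry above, the following is the standard closed-form policy mirror descent step for entropy-regularized RL with a KL proximal term; it is a generic textbook form under these particular choices of regularizer and Bregman divergence, not necessarily the exact scheme analyzed in that paper, which accommodates more general convex regularizers:

\[
\pi_{t+1}(\cdot \mid s) \;=\; \arg\max_{p \in \Delta(\mathcal{A})}
\Big\{ \eta \,\big\langle Q^{\pi_t}_{\tau}(s,\cdot),\, p \big\rangle
\;-\; \eta \tau \sum_{a} p(a)\log p(a)
\;-\; \mathrm{KL}\big(p \,\|\, \pi_t(\cdot \mid s)\big) \Big\},
\]
\[
\pi_{t+1}(a \mid s) \;\propto\;
\pi_t(a \mid s)^{\frac{1}{1+\eta\tau}}
\exp\!\Big( \tfrac{\eta}{1+\eta\tau}\, Q^{\pi_t}_{\tau}(s,a) \Big),
\]

where \eta is the step size, \tau the regularization strength, and Q^{\pi_t}_{\tau} the regularized action-value function of the current policy.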
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.