Policy Optimization as Online Learning with Mediator Feedback
- URL: http://arxiv.org/abs/2012.08225v1
- Date: Tue, 15 Dec 2020 11:34:29 GMT
- Title: Policy Optimization as Online Learning with Mediator Feedback
- Authors: Alberto Maria Metelli, Matteo Papini, Pierluca D'Oro, and Marcello
Restelli
- Abstract summary: Policy Optimization (PO) is a widely used approach to address continuous control tasks.
In this paper, we introduce the notion of mediator feedback that frames PO as an online learning problem over the policy space.
We propose an algorithm, RANDomized-exploration policy Optimization via Multiple Importance Sampling with Truncation (RANDOMIST), for regret minimization.
- Score: 46.845765216238135
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Policy Optimization (PO) is a widely used approach to address continuous
control tasks. In this paper, we introduce the notion of mediator feedback that
frames PO as an online learning problem over the policy space. The additional
available information, compared to the standard bandit feedback, allows reusing
samples generated by one policy to estimate the performance of other policies.
Based on this observation, we propose an algorithm, RANDomized-exploration
policy Optimization via Multiple Importance Sampling with Truncation
(RANDOMIST), for regret minimization in PO, that employs a randomized
exploration strategy, differently from the existing optimistic approaches. When
the policy space is finite, we show that under certain circumstances, it is
possible to achieve constant regret, while always enjoying logarithmic regret.
We also derive problem-dependent regret lower bounds. Then, we extend RANDOMIST
to compact policy spaces. Finally, we provide numerical simulations on finite
and compact policy spaces, in comparison with PO and bandit baselines.
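To make the mediator-feedback idea concrete, below is a minimal sketch of a truncated balance-heuristic multiple importance sampling estimator, assuming one-dimensional Gaussian policies and a toy return function. The function names, the fixed truncation threshold, and the example data are illustrative assumptions; this does not reproduce RANDOMIST's actual truncation schedule or randomized exploration strategy.

```python
import numpy as np

def truncated_mis_estimate(target_mean, behavior_means, samples, returns,
                           sigma=1.0, truncation=10.0):
    """Balance-heuristic multiple importance sampling with truncated weights.

    A minimal sketch: each behavior policy k is a Gaussian N(behavior_means[k], sigma^2)
    over a one-dimensional action/parameter; samples[k] and returns[k] hold the draws
    and observed returns collected under policy k. The target policy
    N(target_mean, sigma^2) is evaluated by reusing all samples, which is the
    "mediator feedback" idea of estimating one policy's performance from data
    generated by other policies.
    """
    n_k = np.array([len(s) for s in samples], dtype=float)
    n_tot = n_k.sum()

    def pdf(x, mu):
        return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

    estimate = 0.0
    for xs, rs in zip(samples, returns):
        xs, rs = np.asarray(xs, dtype=float), np.asarray(rs, dtype=float)
        # Balance-heuristic mixture density over all behavior policies.
        mix = sum((n_j / n_tot) * pdf(xs, mu_j)
                  for n_j, mu_j in zip(n_k, behavior_means))
        w = pdf(xs, target_mean) / mix
        w = np.minimum(w, truncation)  # truncate weights to control variance
        estimate += np.sum(w * rs) / n_tot
    return estimate

# Toy usage: two behavior policies, returns favor actions near 1.0.
rng = np.random.default_rng(0)
behavior_means = [0.0, 1.0]
samples = [rng.normal(mu, 1.0, size=50) for mu in behavior_means]
returns = [-(x - 1.0) ** 2 for x in samples]
print(truncated_mis_estimate(0.8, behavior_means, samples, returns))
```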
Related papers
- Importance-Weighted Offline Learning Done Right [16.4989952150404]
We study the problem of offline policy optimization in contextual bandit problems.
The goal is to learn a near-optimal policy based on a dataset of decision data collected by a suboptimal behavior policy.
We show that a simple alternative approach based on the "implicit exploration" estimator of Neu (2015) yields performance guarantees that are superior in nearly all possible terms to all previous results.
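A minimal sketch of an implicit-exploration (IX) style importance-weighted value estimate is below; the function name and the fixed `gamma` are assumptions for illustration, not the paper's exact estimator or its guarantees.

```python
import numpy as np

def ix_policy_value(rewards, behavior_probs, target_probs, gamma=0.1):
    """Implicit-exploration (IX) style importance-weighted value estimate.

    Sketch under assumptions: behavior_probs[i] is the probability the logging
    policy assigned to the action it actually took on round i, target_probs[i]
    is the probability the evaluated policy assigns to that same action, and
    rewards[i] is the observed reward. The IX correction adds gamma to the
    denominator, trading a small downward bias for much better concentration
    than the plain importance-weighted estimator.
    """
    behavior_probs = np.asarray(behavior_probs, dtype=float)
    target_probs = np.asarray(target_probs, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    weights = target_probs / (behavior_probs + gamma)  # IX-smoothed weights
    return float(np.mean(weights * rewards))
```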
arXiv Detail & Related papers (2023-09-27T16:42:10Z) - Last-Iterate Convergent Policy Gradient Primal-Dual Methods for
Constrained MDPs [107.28031292946774]
We study the problem of computing an optimal policy of an infinite-horizon discounted constrained Markov decision process (constrained MDP).
We develop two single-time-scale policy-based primal-dual algorithms with non-asymptotic convergence of their policy iterates to an optimal constrained policy.
To the best of our knowledge, this work appears to be the first non-asymptotic policy last-iterate convergence result for single-time-scale algorithms in constrained MDPs.
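As a rough illustration of the primal-dual idea (not the paper's single-time-scale algorithm), the following sketch performs one Lagrangian step: the policy parameters ascend the reward objective penalized by the constraint cost, while the multiplier is updated on the constraint violation. All names, step sizes, and the gradient inputs are hypothetical stand-ins for policy-gradient estimates.

```python
import numpy as np

def primal_dual_step(theta, lam, grad_reward, grad_cost, cost_value,
                     budget, lr_theta=0.05, lr_lam=0.05):
    """One generic Lagrangian primal-dual step for a constrained policy problem.

    Sketch: theta ascends the Lagrangian L = J_r(theta) - lam * (J_c(theta) - budget),
    and the multiplier lam ascends the constraint violation, projected onto [0, inf).
    grad_reward / grad_cost stand in for policy-gradient estimates of the reward
    and cost objectives at the current policy.
    """
    theta = theta + lr_theta * (grad_reward - lam * grad_cost)  # primal ascent
    lam = max(0.0, lam + lr_lam * (cost_value - budget))        # projected dual update
    return theta, lam
```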
arXiv Detail & Related papers (2023-06-20T17:27:31Z) - SPEED: Experimental Design for Policy Evaluation in Linear
Heteroscedastic Bandits [13.02672341061555]
We study the problem of optimal data collection for policy evaluation in linear bandits.
We first formulate an optimal design for weighted least squares estimates in the heteroscedastic linear bandit setting.
We then use this formulation to derive the optimal allocation of samples per action during data collection.
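The following is a minimal sketch of the weighted least-squares estimator that such an optimal design targets, assuming the per-observation noise variances are known; the function and argument names are illustrative, and the allocation/design step itself is not shown.

```python
import numpy as np

def weighted_least_squares(features, rewards, noise_vars, reg=1e-6):
    """Weighted least-squares estimate of the reward parameter in a linear model.

    Sketch under assumptions: features is an (n, d) matrix of action features,
    rewards the observed rewards, and noise_vars[i] the (known) noise variance of
    observation i. Each sample is weighted by 1 / noise_vars[i], i.e. inverse
    variance weighting, which is the estimator an optimal design for the
    heteroscedastic setting is built around.
    """
    X = np.asarray(features, dtype=float)
    y = np.asarray(rewards, dtype=float)
    w = 1.0 / np.asarray(noise_vars, dtype=float)
    A = X.T @ (w[:, None] * X) + reg * np.eye(X.shape[1])
    b = X.T @ (w * y)
    return np.linalg.solve(A, b)
```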
arXiv Detail & Related papers (2023-01-29T04:33:13Z) - CAMEO: Curiosity Augmented Metropolis for Exploratory Optimal Policies [62.39667564455059]
We study a distribution of optimal policies.
In experimental simulations we show that CAMEO indeed obtains policies that all solve classic control problems.
We further show that the different policies we sample present different risk profiles, corresponding to interesting practical applications in interpretability.
arXiv Detail & Related papers (2022-05-19T09:48:56Z) - Zeroth-Order Actor-Critic [6.5158195776494]
We propose the Zeroth-Order Actor-Critic algorithm (ZOAC), which unifies zeroth-order and first-order optimization methods into an on-policy actor-critic architecture.
We evaluate our proposed method on a range of challenging continuous control benchmarks using different types of policies, where ZOAC outperforms zeroth-order and first-order baseline algorithms.
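For intuition, here is a minimal sketch of a derivative-free (zeroth-order) gradient estimate over policy parameters using antithetic Gaussian perturbations; the actor-critic coupling that defines ZOAC is not reproduced, and `evaluate_return` is an assumed rollout routine.

```python
import numpy as np

def zeroth_order_gradient(theta, evaluate_return, sigma=0.1, n_directions=32,
                          rng=None):
    """Zeroth-order gradient estimate of an episodic return w.r.t. policy parameters.

    Sketch: evaluate_return(theta) is assumed to roll out the policy with
    parameters theta and return a scalar return. Antithetic Gaussian perturbations
    give a derivative-free gradient estimate; ZOAC additionally couples such
    estimates with a first-order critic, which is not shown here.
    """
    rng = rng or np.random.default_rng()
    grad = np.zeros_like(theta, dtype=float)
    for _ in range(n_directions):
        u = rng.standard_normal(theta.shape)
        r_plus = evaluate_return(theta + sigma * u)
        r_minus = evaluate_return(theta - sigma * u)
        grad += (r_plus - r_minus) / (2.0 * sigma) * u
    return grad / n_directions
```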
arXiv Detail & Related papers (2022-01-29T07:09:03Z) - Near Optimal Policy Optimization via REPS [33.992374484681704]
Relative entropy policy search (REPS) has demonstrated successful policy learning on a number of simulated and real-world robotic domains.
There exist no guarantees on REPS's performance when using gradient-based solvers.
We introduce a technique that uses generative access to the underlying decision process to compute parameter updates that maintain favorable convergence to the optimal regularized policy.
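As a rough illustration, the sketch below computes REPS-style exponential sample weights; in full REPS the temperature comes from minimizing a convex dual under the relative-entropy constraint, whereas here it is treated as a fixed hyperparameter, so this is illustrative only.

```python
import numpy as np

def reps_weights(advantages, eta=1.0):
    """REPS-style sample weights: exponentiated advantages under a KL constraint.

    Sketch under assumptions: advantages are advantage (or return) estimates for
    the sampled state-action pairs and eta is the temperature associated with the
    relative-entropy constraint. In full REPS eta is obtained from a convex dual
    problem; here it is simply a fixed hyperparameter for illustration.
    """
    a = np.asarray(advantages, dtype=float)
    w = np.exp((a - a.max()) / eta)  # shift by the max for numerical stability
    return w / w.sum()
```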
arXiv Detail & Related papers (2021-03-17T16:22:59Z) - Optimization Issues in KL-Constrained Approximate Policy Iteration [48.24321346619156]
Many reinforcement learning algorithms can be seen as versions of approximate policy iteration (API).
While standard API often performs poorly, it has been shown that learning can be stabilized by regularizing each policy update by the KL-divergence to the previous policy.
Popular practical algorithms such as TRPO, MPO, and VMPO replace regularization by a constraint on KL-divergence of consecutive policies.
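For intuition, a KL-penalized update over a discrete action distribution has a simple closed form, sketched below; this is a toy illustration of the penalty-versus-constraint distinction, not an implementation of TRPO, MPO, or VMPO.

```python
import numpy as np

def kl_penalized_update(prev_policy, advantages, beta=1.0):
    """KL-regularized policy update for a discrete action distribution.

    Sketch: maximizing E_pi[A(a)] - beta * KL(pi || prev_policy) in closed form
    gives pi(a) proportional to prev_policy(a) * exp(A(a) / beta). TRPO, MPO, and
    VMPO instead enforce a hard KL constraint between consecutive policies;
    lowering beta here plays a role similar to loosening that constraint.
    """
    p = np.asarray(prev_policy, dtype=float)
    a = np.asarray(advantages, dtype=float)
    new_p = p * np.exp((a - a.max()) / beta)  # shift by the max for stability
    return new_p / new_p.sum()
```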
arXiv Detail & Related papers (2021-02-11T19:35:33Z) - Variational Policy Propagation for Multi-agent Reinforcement Learning [68.26579560607597]
We propose a collaborative multi-agent reinforcement learning algorithm named variational policy propagation (VPP) to learn a joint policy through the interactions over agents.
We prove that the joint policy is a Markov Random Field under some mild conditions, which in turn reduces the policy space effectively.
We integrate variational inference as special differentiable layers in the policy, such that the actions can be efficiently sampled from the Markov Random Field and the overall policy is differentiable.
arXiv Detail & Related papers (2020-04-19T15:42:55Z) - Deep Reinforcement Learning with Robust and Smooth Policy [90.78795857181727]
We propose to learn a smooth policy that behaves smoothly with respect to states.
We develop a new framework, Smooth Regularized Reinforcement Learning (SR^2L), where the policy is trained with smoothness-inducing regularization.
Such regularization effectively constrains the search space, and enforces smoothness in the learned policy.
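A minimal sketch of one common way to implement a smoothness-inducing penalty is shown below: penalize the change in the policy's output under small state perturbations. The function names and noise scale are assumptions, not the paper's exact SR^2L objective.

```python
import numpy as np

def smoothness_penalty(policy_fn, states, epsilon=0.01, rng=None):
    """Smoothness-inducing penalty: the policy should change little under small state noise.

    Sketch under assumptions: policy_fn(states) returns the policy's (deterministic)
    action for each state in a batch. The penalty is the mean squared change in the
    action when states are perturbed by noise of scale epsilon; adding it to the RL
    loss constrains the search space toward smooth policies, in the spirit of the
    smoothness-inducing regularization described above.
    """
    rng = rng or np.random.default_rng()
    states = np.asarray(states, dtype=float)
    perturbed = states + epsilon * rng.standard_normal(states.shape)
    diff = policy_fn(perturbed) - policy_fn(states)
    return float(np.mean(diff ** 2))
```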
arXiv Detail & Related papers (2020-03-21T00:10:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.