Zeroth-Order Supervised Policy Improvement
- URL: http://arxiv.org/abs/2006.06600v2
- Date: Mon, 5 Jul 2021 07:18:16 GMT
- Title: Zeroth-Order Supervised Policy Improvement
- Authors: Hao Sun, Ziping Xu, Yuhang Song, Meng Fang, Jiechao Xiong, Bo Dai,
Bolei Zhou
- Abstract summary: Policy gradient (PG) algorithms have been widely used in reinforcement learning (RL)
We propose Zeroth-Order Supervised Policy Improvement (ZOSPI)
ZOSPI exploits the estimated value function $Q$ globally while preserving the local exploitation of the PG methods.
- Score: 94.0748002906652
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Policy gradient (PG) algorithms have been widely used in reinforcement
learning (RL). However, PG algorithms rely on exploiting the value function
being learned with the first-order update locally, which results in limited
sample efficiency. In this work, we propose an alternative method called
Zeroth-Order Supervised Policy Improvement (ZOSPI). ZOSPI exploits the
estimated value function $Q$ globally while preserving the local exploitation
of the PG methods based on zeroth-order policy optimization. This learning
paradigm follows Q-learning but overcomes the difficulty of efficiently
operating argmax in continuous action space. It finds max-valued action within
a small number of samples. The policy learning of ZOSPI has two steps: First,
it samples actions and evaluates those actions with a learned value estimator,
and then it learns to perform the action with the highest value through
supervised learning. We further demonstrate such a supervised learning
framework can learn multi-modal policies. Experiments show that ZOSPI achieves
competitive results on the continuous control benchmarks with a remarkable
sample efficiency.
Related papers
- Iterated $Q$-Network: Beyond One-Step Bellman Updates in Deep Reinforcement Learning [19.4531905603925]
i-QN is a principled approach that enables multiple consecutive Bellman updates by learning a tailored sequence of action-value functions.
We show that i-QN is theoretically grounded and that it can be seamlessly used in value-based and actor-critic methods.
arXiv Detail & Related papers (2024-03-04T15:07:33Z) - Seizing Serendipity: Exploiting the Value of Past Success in Off-Policy Actor-Critic [42.57662196581823]
Learning high-quality $Q$-value functions plays a key role in the success of many modern off-policy deep reinforcement learning (RL) algorithms.
Deviating from the common viewpoint, we observe that $Q$-values are often underestimated in the latter stage of the RL training process.
We propose the Blended Exploitation and Exploration (BEE) operator, a simple yet effective approach that updates $Q$-value using both historical best-performing actions and the current policy.
arXiv Detail & Related papers (2023-06-05T13:38:14Z) - Offline RL with No OOD Actions: In-Sample Learning via Implicit Value
Regularization [90.9780151608281]
In-sample learning (IQL) improves the policy by quantile regression using only data samples.
We make a key finding that the in-sample learning paradigm arises under the textitImplicit Value Regularization (IVR) framework.
We propose two practical algorithms, Sparse $Q$-learning (EQL) and Exponential $Q$-learning (EQL), which adopt the same value regularization used in existing works.
arXiv Detail & Related papers (2023-03-28T08:30:01Z) - Inapplicable Actions Learning for Knowledge Transfer in Reinforcement
Learning [3.194414753332705]
We show that learning inapplicable actions greatly improves the sample efficiency of RL algorithms.
Thanks to the transferability of the knowledge acquired, it can be reused in other tasks and domains to make the learning process more efficient.
arXiv Detail & Related papers (2022-11-28T17:45:39Z) - Exploration via Planning for Information about the Optimal Trajectory [67.33886176127578]
We develop a method that allows us to plan for exploration while taking the task and the current knowledge into account.
We demonstrate that our method learns strong policies with 2x fewer samples than strong exploration baselines.
arXiv Detail & Related papers (2022-10-06T20:28:55Z) - Offline Reinforcement Learning with Implicit Q-Learning [85.62618088890787]
Current offline reinforcement learning methods need to query the value of unseen actions during training to improve the policy.
We propose an offline RL method that never needs to evaluate actions outside of the dataset.
This method enables the learned policy to improve substantially over the best behavior in the data through generalization.
arXiv Detail & Related papers (2021-10-12T17:05:05Z) - Direct Random Search for Fine Tuning of Deep Reinforcement Learning
Policies [5.543220407902113]
We show that a direct random search is very effective at fine-tuning DRL policies by directly optimizing them using deterministic rollouts.
Our results show that this method yields more consistent and higher performing agents on the environments we tested.
arXiv Detail & Related papers (2021-09-12T20:12:46Z) - Learning Sampling Policy for Faster Derivative Free Optimization [100.27518340593284]
We propose a new reinforcement learning based ZO algorithm (ZO-RL) with learning the sampling policy for generating the perturbations in ZO optimization instead of using random sampling.
Our results show that our ZO-RL algorithm can effectively reduce the variances of ZO gradient by learning a sampling policy, and converge faster than existing ZO algorithms in different scenarios.
arXiv Detail & Related papers (2021-04-09T14:50:59Z) - Provably Efficient Reward-Agnostic Navigation with Linear Value
Iteration [143.43658264904863]
We show how iteration under a more standard notion of low inherent Bellman error, typically employed in least-square value-style algorithms, can provide strong PAC guarantees on learning a near optimal value function.
We present a computationally tractable algorithm for the reward-free setting and show how it can be used to learn a near optimal policy for any (linear) reward function.
arXiv Detail & Related papers (2020-08-18T04:34:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.