Careful at Estimation and Bold at Exploration
- URL: http://arxiv.org/abs/2308.11348v1
- Date: Tue, 22 Aug 2023 10:52:46 GMT
- Title: Careful at Estimation and Bold at Exploration
- Authors: Xing Chen, Yijun Liu, Zhaogeng Liu, Hechang Chen, Hengshuai Yao, Yi
Chang
- Abstract summary: Policy-based exploration is beneficial for continuous action space in deterministic policy reinforcement learning.
However, policy-based exploration has two prominent issues: aimless exploration and policy divergence.
We introduce a novel exploration strategy to mitigate these issues, separate from the policy gradient.
- Score: 21.518406902400432
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Exploration strategies in continuous action spaces are often heuristic because
the action set is infinite, and such methods rarely admit general conclusions. Prior work
has shown that policy-based exploration is beneficial for continuous action spaces in
deterministic policy reinforcement learning (DPRL). However, policy-based exploration in
DPRL has two prominent issues, aimless exploration and policy divergence, and the policy
gradient used for exploration is only sometimes helpful because of inaccurate value
estimation. Building on the double-Q-function framework, we introduce a novel exploration
strategy, separate from the policy gradient, that mitigates these issues. We first propose
a greedy Q softmax update scheme for the Q-value update: the expected Q value is obtained
as a weighted sum of the conservative Q values over actions, where the weights are given
by a softmax over the corresponding greedy Q values. The greedy Q takes the maximum of the
two Q functions, and the conservative Q takes the minimum. For practicality, this
theoretical basis is then extended so that action exploration can be combined with the
Q-value update, provided that we have a surrogate policy that behaves like the exploration
policy. In practice, we construct the exploration policy from a few sampled actions and,
to satisfy this premise, learn the surrogate policy by minimizing the KL divergence
between the target policy and the exploration policy constructed from the conservative Q.
We evaluate our method on the MuJoCo benchmark and demonstrate superior performance
compared to previous state-of-the-art methods across various environments, particularly in
the most complex Humanoid environment.
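The update described above lends itself to a compact illustration. The following is a minimal sketch, not the authors' implementation: it assumes two critics evaluated on a handful of sampled candidate actions, a softmax temperature `tau`, and a softmax form for the exploration policy built from the conservative Q; all of these specifics are illustrative assumptions.

```python
# Minimal sketch of the greedy-Q softmax target and the KL term for the
# surrogate policy, under the assumptions stated above (not the paper's code).
import numpy as np

def greedy_q_softmax_target(q1_vals, q2_vals, tau=1.0):
    """Weighted sum of the conservative Q over sampled actions, with the
    weights given by a softmax over the greedy Q."""
    greedy_q = np.maximum(q1_vals, q2_vals)        # max of the two critics
    conservative_q = np.minimum(q1_vals, q2_vals)  # min of the two critics
    logits = greedy_q / tau
    w = np.exp(logits - logits.max())
    w /= w.sum()                                   # softmax weights over candidate actions
    return float(np.sum(w * conservative_q))

def exploration_weights(q1_vals, q2_vals, tau=1.0):
    """Exploration policy over the sampled actions, built from the conservative Q
    (the softmax form is an assumption of this sketch)."""
    conservative_q = np.minimum(q1_vals, q2_vals)
    logits = conservative_q / tau
    w = np.exp(logits - logits.max())
    return w / w.sum()

def kl_divergence(p, q, eps=1e-8):
    """KL(p || q) for two discrete distributions over the same sampled actions;
    minimizing this w.r.t. the surrogate policy aligns it with the exploration policy."""
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

# Toy usage: four candidate actions scored by the two critics for one state.
q1_vals = np.array([1.2, 0.7, 1.5, 0.9])
q2_vals = np.array([1.0, 0.9, 1.1, 1.3])
target = greedy_q_softmax_target(q1_vals, q2_vals, tau=0.5)
expl_w = exploration_weights(q1_vals, q2_vals, tau=0.5)
surrogate_w = np.full(4, 0.25)             # placeholder surrogate policy probabilities
loss = kl_divergence(surrogate_w, expl_w)  # KL direction per the paper's choice (not specified here)
```

In a full agent, the candidate actions would be sampled around the current policy, and the surrogate policy would be updated by gradient descent on the KL term.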
Related papers
- Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, stochastic (hyper)policies are learned, but only their deterministic version is deployed.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z)
- Wasserstein Actor-Critic: Directed Exploration via Optimism for Continuous-Actions Control [41.7453231409493]
Wasserstein Actor-Critic (WAC) is an actor-critic architecture inspired by Wasserstein Q-Learning (WQL).
WAC enforces exploration in a principled way by guiding policy learning with the optimization of an upper bound on the Q-value estimates (a minimal sketch of this optimistic idea appears after this list).
arXiv Detail & Related papers (2023-03-04T10:52:20Z)
- Sampling Efficient Deep Reinforcement Learning through Preference-Guided Stochastic Exploration [8.612437964299414]
We propose a preference-guided $\epsilon$-greedy exploration algorithm for Deep Q-Networks (DQN).
Preference-guided exploration motivates the DQN agent to take diverse actions: actions with larger Q-values are sampled more frequently, while actions with smaller Q-values still retain a chance of being explored (see the sketch after this list).
arXiv Detail & Related papers (2022-06-20T08:23:49Z)
- Guarantees for Epsilon-Greedy Reinforcement Learning with Function Approximation [69.1524391595912]
Myopic exploration policies such as epsilon-greedy, softmax, or Gaussian noise fail to explore efficiently in some reinforcement learning tasks.
This paper presents a theoretical analysis of such policies and provides the first regret and sample-complexity bounds for reinforcement learning with myopic exploration.
arXiv Detail & Related papers (2022-06-19T14:44:40Z)
- Sigmoidally Preconditioned Off-policy Learning: a new exploration method for reinforcement learning [14.991913317341417]
We focus on an off-policy Actor-Critic architecture and propose a novel method called Preconditioned Proximal Policy Optimization (P3O).
P3O can control the high variance of importance sampling by applying a preconditioner to the Conservative Policy Iteration (CPI) objective.
Results show that our P3O maximizes the CPI objective better than PPO during the training process.
arXiv Detail & Related papers (2022-05-20T09:38:04Z)
- Offline Reinforcement Learning with Implicit Q-Learning [85.62618088890787]
Current offline reinforcement learning methods need to query the value of unseen actions during training to improve the policy.
We propose an offline RL method that never needs to evaluate actions outside of the dataset.
This method enables the learned policy to improve substantially over the best behavior in the data through generalization.
arXiv Detail & Related papers (2021-10-12T17:05:05Z)
- Goal-Conditioned Reinforcement Learning with Imagined Subgoals [89.67840168694259]
We propose to incorporate imagined subgoals into policy learning to facilitate learning of complex tasks.
Imagined subgoals are predicted by a separate high-level policy, which is trained simultaneously with the policy and its critic.
We evaluate our approach on complex robotic navigation and manipulation tasks and show that it outperforms existing methods by a large margin.
arXiv Detail & Related papers (2021-07-01T15:30:59Z)
- Task-Agnostic Exploration via Policy Gradient of a Non-Parametric State Entropy Estimate [40.97686031763918]
In a reward-free environment, what is a suitable intrinsic objective for an agent to pursue so that it can learn an optimal task-agnostic exploration policy?
We argue that the entropy of the state distribution induced by finite-horizon trajectories is a sensible target.
We present a novel and practical policy-search algorithm, Maximum Entropy POLicy optimization (MEPOL), to learn a policy that maximizes a non-parametric, $k$-nearest neighbors estimate of the state distribution entropy (a minimal sketch of the $k$-NN entropy proxy appears after this list).
arXiv Detail & Related papers (2020-07-09T08:44:39Z)
- Doubly Robust Off-Policy Value and Gradient Estimation for Deterministic Policies [80.42316902296832]
We study the estimation of policy value and gradient of a deterministic policy from off-policy data when actions are continuous.
In this setting, standard importance sampling and doubly robust estimators for policy value and gradient fail because the density ratio does not exist.
We propose several new doubly robust estimators based on different kernelization approaches (a single-step sketch of the kernelized weighting appears after this list).
arXiv Detail & Related papers (2020-06-06T15:52:05Z)
- Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation [49.502277468627035]
This paper studies the statistical theory of batch data reinforcement learning with function approximation.
Consider the off-policy evaluation problem, which is to estimate the cumulative value of a new target policy from logged history.
arXiv Detail & Related papers (2020-02-21T19:20:57Z)
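For the Wasserstein Actor-Critic entry above, the following is a rough sketch of optimism-guided action scoring, not WAC's actual bound (which is derived from Wasserstein Q-Learning): the upper bound is approximated here as mean plus `beta` times the standard deviation across an assumed ensemble of critics.

```python
# Rough sketch of optimism via an upper bound on Q estimates (illustrative only).
import numpy as np

def optimistic_upper_bound(q_estimates, beta=1.0):
    """Mean + beta * std across critic estimates for each candidate action.
    This particular bound is an assumption of the sketch, not WAC's construction."""
    q = np.asarray(q_estimates, dtype=float)  # shape: (n_critics, n_actions)
    return q.mean(axis=0) + beta * q.std(axis=0)

# Three critics scoring three candidate actions for one state.
q_estimates = [[1.0, 0.4, 0.9],
               [1.2, 0.5, 0.7],
               [0.8, 0.6, 1.1]]
ub = optimistic_upper_bound(q_estimates, beta=0.5)
chosen = int(np.argmax(ub))  # policy learning is guided toward the optimistic estimate
```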
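For the preference-guided exploration entry above, here is a minimal sketch of an $\epsilon$-greedy rule whose random branch samples from a softmax over Q-values instead of uniformly; the softmax preference and the temperature `tau` are assumptions of the sketch and may differ from the paper's exact preference distribution.

```python
# Minimal preference-guided epsilon-greedy sketch (assumptions noted above).
import numpy as np

rng = np.random.default_rng(0)

def preference_guided_epsilon_greedy(q_values, epsilon=0.1, tau=1.0):
    """With probability 1 - epsilon act greedily; otherwise sample an action in
    proportion to a softmax over Q-values, so high-valued actions are explored
    more often while low-valued actions keep a nonzero probability."""
    q = np.asarray(q_values, dtype=float)
    if rng.random() > epsilon:
        return int(np.argmax(q))
    logits = q / tau
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return int(rng.choice(len(q), p=p))

action = preference_guided_epsilon_greedy([0.2, 1.5, 0.9, -0.3], epsilon=0.2)
```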
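For the task-agnostic exploration (MEPOL) entry above, this sketch computes a $k$-nearest-neighbor entropy proxy over visited states; the additive and multiplicative constants of the full Kozachenko-Leonenko-style estimator are dropped because, for a fixed batch size, $k$, and state dimension, they do not change what maximizes the objective. The brute-force distance computation is for illustration only.

```python
# k-NN entropy proxy over a batch of visited states (constants omitted; see note above).
import numpy as np

def knn_entropy_proxy(states, k=4):
    """Average log distance to the k-th nearest neighbor, scaled by the state
    dimension: a non-parametric proxy for the entropy of the state distribution."""
    x = np.asarray(states, dtype=float)
    dists = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)         # exclude self-distances
    kth = np.sort(dists, axis=1)[:, k - 1]  # distance to the k-th nearest neighbor
    return float(x.shape[1] * np.mean(np.log(kth + 1e-12)))

# Toy batch of 2-D states collected by an exploration policy.
states = np.random.default_rng(1).normal(size=(256, 2))
entropy_proxy = knn_entropy_proxy(states, k=4)
```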
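For the doubly robust estimation entry above, this is a single-step (contextual-bandit-style) sketch of kernel-smoothed doubly robust value estimation for a deterministic policy: the Dirac density ratio is replaced by a kernel centered at $\pi(s)$. The Gaussian kernel, the bandwidth `h`, and the one-dimensional action space are assumptions of the sketch; the paper's estimators for the sequential setting and for gradients are more involved.

```python
# Single-step sketch of a kernelized doubly robust value estimate for a deterministic policy.
import numpy as np

def gaussian_kernel(u, h):
    """K_h(u) = K(u / h) / h with a standard Gaussian K."""
    return np.exp(-0.5 * (u / h) ** 2) / (np.sqrt(2.0 * np.pi) * h)

def kernelized_dr_value(states, actions, rewards, behavior_density, pi, q_hat, h=0.3):
    """Direct-model term plus a kernel-weighted correction on the logged data."""
    pi_a = np.array([pi(s) for s in states])                        # deterministic target actions
    direct = np.array([q_hat(s, a) for s, a in zip(states, pi_a)])  # model evaluated at pi(s)
    w = gaussian_kernel(actions - pi_a, h) / behavior_density       # smoothed density ratio
    residual = rewards - np.array([q_hat(s, a) for s, a in zip(states, actions)])
    return float(np.mean(direct + w * residual))

# Toy logged data with 1-D states/actions, behavior policy N(0, 1).
rng = np.random.default_rng(2)
states = rng.normal(size=200)
actions = rng.normal(size=200)
rewards = -(actions - states) ** 2 + 0.1 * rng.normal(size=200)
behavior_density = gaussian_kernel(actions, 1.0)  # N(0, 1) density at the logged actions
value = kernelized_dr_value(states, actions, rewards, behavior_density,
                            pi=lambda s: s, q_hat=lambda s, a: -(a - s) ** 2)
```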