Off-Policy Deep Reinforcement Learning with Analogous Disentangled Exploration
- URL: http://arxiv.org/abs/2002.10738v2
- Date: Thu, 27 Feb 2020 22:19:22 GMT
- Title: Off-Policy Deep Reinforcement Learning with Analogous Disentangled Exploration
- Authors: Anji Liu, Yitao Liang, Guy Van den Broeck
- Abstract summary: Off-policy reinforcement learning (RL) is concerned with learning a rewarding policy by executing another policy that gathers samples of experience.
While the former policy is rewarding but inexpressive (in most cases, deterministic), doing well in the latter task instead requires an expressive policy that offers guided and effective exploration.
We propose Analogous Disentangled Actor-Critic (ADAC) to mitigate this problem.
- Score: 33.25932244741268
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Off-policy reinforcement learning (RL) is concerned with learning a rewarding
policy by executing another policy that gathers samples of experience. While
the former policy (i.e. the target policy) is rewarding but inexpressive (in
most cases, deterministic), doing well in the latter task instead requires an
expressive policy (i.e. the behavior policy) that offers guided and effective
exploration. Contrary to most methods, which trade off optimality against
expressiveness, disentangled frameworks explicitly decouple the two objectives,
each of which is handled by a distinct, separate policy. Although this allows
the two policies to be freely designed and optimized with respect to their own
objectives, naively disentangling them can lead to inefficient learning or
stability issues. To mitigate this problem, our proposed method, Analogous
Disentangled Actor-Critic (ADAC), designs analogous pairs of actors and
critics. Specifically, ADAC leverages a key property of Stein variational
gradient descent (SVGD) to constrain the expressive, energy-based behavior
policy with respect to the target one for effective exploration. Additionally,
an analogous critic pair is introduced to incorporate intrinsic rewards in a
principled manner, with theoretical guarantees on the overall learning
stability and effectiveness. We empirically evaluate environment-reward-only
ADAC on 14 continuous-control tasks and report state-of-the-art results on 10
of them. We further demonstrate that ADAC, when paired with intrinsic rewards,
outperforms alternatives on exploration-challenging tasks.
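To make the SVGD-based constraint more concrete, below is a minimal NumPy sketch of a generic Stein variational gradient descent update that refines a set of action particles toward an energy-based distribution p(a|s) ∝ exp(Q(s, a)/τ) peaked at a deterministic target action. The toy quadratic critic, RBF kernel, bandwidth, step size, and particle count are illustrative assumptions rather than ADAC's actual construction; the sketch only shows how SVGD can keep an expressive behavior distribution anchored to the target policy while the kernel's repulsive term preserves spread for exploration.

```python
import numpy as np

def rbf_kernel(particles, bandwidth=1.0):
    """RBF kernel matrix k(a_j, a_i) and its gradient w.r.t. the first argument."""
    diffs = particles[:, None, :] - particles[None, :, :]   # (n, n, d), diffs[j, i] = a_j - a_i
    sq_dists = np.sum(diffs ** 2, axis=-1)                  # (n, n)
    K = np.exp(-sq_dists / (2.0 * bandwidth ** 2))
    grad_K = -diffs * K[..., None] / bandwidth ** 2         # (n, n, d), grad wrt a_j
    return K, grad_K

def svgd_step(particles, grad_log_p, step_size=0.05, bandwidth=1.0):
    """One SVGD update: attraction toward p plus kernel repulsion between particles."""
    n = particles.shape[0]
    K, grad_K = rbf_kernel(particles, bandwidth)
    # phi(a_i) = (1/n) * sum_j [ k(a_j, a_i) * grad_a log p(a_j) + grad_{a_j} k(a_j, a_i) ]
    phi = (K.T @ grad_log_p + grad_K.sum(axis=0)) / n
    return particles + step_size * phi

# Toy energy-based action distribution p(a|s) ∝ exp(Q(s, a) / tau),
# with a quadratic "critic" peaked at the deterministic target action (assumed values).
target_action = np.array([0.5, -0.2])
tau = 0.5

def grad_log_p(actions):
    # grad_a log p(a|s) = grad_a Q(s, a) / tau for Q(s, a) = -||a - a_target||^2
    return -2.0 * (actions - target_action) / tau

# Behavior "policy": a set of action particles refined by SVGD around the target action.
rng = np.random.default_rng(0)
particles = target_action + 0.3 * rng.standard_normal((16, 2))
for _ in range(50):
    particles = svgd_step(particles, grad_log_p(particles))

print("particle mean:", particles.mean(axis=0))  # close to target_action
print("particle std: ", particles.std(axis=0))   # residual spread supports exploration
```

In an actual disentangled actor-critic setup, grad_log_p would come from differentiating a learned critic with respect to the action (e.g. via automatic differentiation), and the resulting particles would be used to sample exploratory actions from the behavior policy at each state.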
Related papers
- Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, (hyper)policies are learned only so that their deterministic version can be deployed at convergence.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z)
- Coherent Soft Imitation Learning [17.345411907902932]
Imitation learning methods seek to learn from an expert either through behavioral cloning (BC) of the policy or inverse reinforcement learning (IRL) of the reward.
This work derives an imitation method that captures the strengths of both BC and IRL.
arXiv Detail & Related papers (2023-05-25T21:54:22Z)
- Conformal Off-Policy Evaluation in Markov Decision Processes [53.786439742572995]
Reinforcement Learning aims at identifying and evaluating efficient control policies from data.
Most methods for this learning task, referred to as Off-Policy Evaluation (OPE), do not come with accuracy and certainty guarantees.
We present a novel OPE method based on Conformal Prediction that outputs an interval containing the true reward of the target policy with a prescribed level of certainty.
arXiv Detail & Related papers (2023-04-05T16:45:11Z)
- Contrastive Explanations for Comparing Preferences of Reinforcement Learning Agents [16.605295052893986]
In complex tasks where the reward function is not straightforward, multiple reinforcement learning (RL) policies can be trained by adjusting the impact of individual objectives on the reward function.
In this work we compare the behavior of two policies trained on the same task, but with different preferences over objectives.
We propose a method for distinguishing between differences in behavior that stem from different abilities and those that are a consequence of the two RL agents' opposing preferences.
arXiv Detail & Related papers (2021-12-17T11:57:57Z)
- Off-policy Reinforcement Learning with Optimistic Exploration and Distribution Correction [73.77593805292194]
We train a separate exploration policy to maximize an approximate upper confidence bound of the critics in an off-policy actor-critic framework.
To mitigate the off-policy-ness, we adapt the recently introduced DICE framework to learn a distribution correction ratio for off-policy actor-critic training.
arXiv Detail & Related papers (2021-10-22T22:07:51Z)
- Goal-Conditioned Reinforcement Learning with Imagined Subgoals [89.67840168694259]
We propose to incorporate imagined subgoals into policy learning to facilitate learning of complex tasks.
Imagined subgoals are predicted by a separate high-level policy, which is trained simultaneously with the policy and its critic.
We evaluate our approach on complex robotic navigation and manipulation tasks and show that it outperforms existing methods by a large margin.
arXiv Detail & Related papers (2021-07-01T15:30:59Z)
- Learn Goal-Conditioned Policy with Intrinsic Motivation for Deep Reinforcement Learning [9.014110264448371]
We propose a novel unsupervised learning approach named goal-conditioned policy with intrinsic motivation (GPIM).
GPIM jointly learns both an abstract-level policy and a goal-conditioned policy.
Experiments on various robotic tasks demonstrate the effectiveness and efficiency of our proposed GPIM method.
arXiv Detail & Related papers (2021-04-11T16:26:10Z)
- Off-policy Evaluation in Infinite-Horizon Reinforcement Learning with Latent Confounders [62.54431888432302]
We study an OPE problem in an infinite-horizon, ergodic Markov decision process with unobserved confounders.
We show how, given only a latent variable model for states and actions, policy value can be identified from off-policy data.
arXiv Detail & Related papers (2020-07-27T22:19:01Z)
- Efficient Evaluation of Natural Stochastic Policies in Offline Reinforcement Learning [80.42316902296832]
We study the efficient off-policy evaluation of natural policies, which are defined in terms of deviations from the behavior policy.
This is a departure from the literature on off-policy evaluation, where most work considers the evaluation of explicitly specified policies.
arXiv Detail & Related papers (2020-06-06T15:08:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.