Reward-Conditioned Policies
- URL: http://arxiv.org/abs/1912.13465v1
- Date: Tue, 31 Dec 2019 18:07:43 GMT
- Title: Reward-Conditioned Policies
- Authors: Aviral Kumar, Xue Bin Peng, Sergey Levine
- Abstract summary: Imitation learning requires near-optimal expert data.
Can we learn effective policies via supervised learning without demonstrations?
We show how such an approach can be derived as a principled method for policy search.
- Score: 100.64167842905069
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement learning offers the promise of automating the acquisition of
complex behavioral skills. However, compared to commonly used and
well-understood supervised learning methods, reinforcement learning algorithms
can be brittle, difficult to use and tune, and sensitive to seemingly innocuous
implementation decisions. In contrast, imitation learning utilizes standard and
well-understood supervised learning methods, but requires near-optimal expert
data. Can we learn effective policies via supervised learning without
demonstrations? The main idea that we explore in this work is that non-expert
trajectories collected from sub-optimal policies can be viewed as optimal
supervision, not for maximizing the reward, but for matching the reward of the
given trajectory. By then conditioning the policy on the numerical value of the
reward, we can obtain a policy that generalizes to larger returns. We show how
such an approach can be derived as a principled method for policy search,
discuss several variants, and compare the method experimentally to a variety of
current reinforcement learning methods on standard benchmarks.
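The core recipe is simple enough to sketch in a few lines. Below is a minimal, illustrative implementation of reward-conditioned supervised learning, not the authors' code: the network architecture, the helper names (RewardConditionedPolicy, train_step, act), and the choice to condition on the total trajectory return (rather than the other variants discussed in the paper) are assumptions made for the example.

```python
# Illustrative sketch of reward-conditioned supervised learning (PyTorch).
# Assumes a discrete-action setting and conditioning on total trajectory return;
# names and architecture are hypothetical, not taken from the paper.
import torch
import torch.nn as nn


class RewardConditionedPolicy(nn.Module):
    """pi(a | s, R): an MLP over the state concatenated with a target return."""

    def __init__(self, state_dim, num_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, state, target_return):
        # state: [B, state_dim] float, target_return: [B] float
        x = torch.cat([state, target_return.unsqueeze(-1)], dim=-1)
        return self.net(x)  # action logits, [B, num_actions]


def train_step(policy, optimizer, batch):
    """Supervised update: each (s, a) pair from a sub-optimal trajectory is
    treated as optimal supervision for achieving that trajectory's return R."""
    states, actions, returns = batch  # [B, state_dim] float, [B] long, [B] float
    logits = policy(states, returns)
    loss = nn.functional.cross_entropy(logits, actions)  # maximize log pi(a | s, R)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


@torch.no_grad()
def act(policy, state, target_return):
    """Sample an action while conditioning on a chosen target return."""
    logits = policy(state.unsqueeze(0), torch.tensor([float(target_return)]))
    return torch.distributions.Categorical(logits=logits).sample().item()
```

At evaluation time, the policy would be queried with a target return somewhat above the best return seen in the data, relying on the generalization to larger returns described in the abstract.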
Related papers
- Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, stochastic (hyper)policies are learned, but only their deterministic version is deployed.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z)
- Language-Conditioned Semantic Search-Based Policy for Robotic Manipulation Tasks [2.1332830068386217]
We propose a language-conditioned semantic search-based method to produce an online search-based policy.
Our approach surpasses the performance of the baselines on the CALVIN benchmark and exhibits strong zero-shot adaptation capabilities.
arXiv Detail & Related papers (2023-12-10T16:17:00Z)
- RLIF: Interactive Imitation Learning as Reinforcement Learning [56.997263135104504]
We show how off-policy reinforcement learning can enable improved performance under assumptions that are similar to, but potentially even more practical than, those of interactive imitation learning.
Our proposed method uses reinforcement learning with user intervention signals themselves as rewards.
This relaxes the assumption that intervening experts in interactive imitation learning should be near-optimal and enables the algorithm to learn behaviors that improve over a potentially suboptimal human expert.
arXiv Detail & Related papers (2023-11-21T21:05:21Z)
- Deconfounding Imitation Learning with Variational Inference [19.99248795957195]
Standard imitation learning can fail when the expert demonstrators have different sensory inputs than the imitating agent.
This is because partial observability gives rise to hidden confounders in the causal graph.
We propose to train a variational inference model to infer the expert's latent information and use it to train a latent-conditional policy.
arXiv Detail & Related papers (2022-11-04T18:00:02Z)
- Online Learning with Off-Policy Feedback [18.861989132159945]
We study the problem of online learning in adversarial bandit problems under a partial observability model called off-policy feedback.
We propose a set of algorithms that guarantee regret bounds that scale with a natural notion of mismatch between any comparator policy and the behavior policy.
arXiv Detail & Related papers (2022-07-18T21:57:16Z)
- Imitating Past Successes can be Very Suboptimal [145.70788608016755]
We show that existing outcome-conditioned imitation learning methods do not necessarily improve the policy.
We show that a simple modification results in a method that does guarantee policy improvement.
Our aim is not to develop an entirely new method, but rather to explain how a variant of outcome-conditioned imitation learning can be used to maximize rewards.
arXiv Detail & Related papers (2022-06-07T15:13:43Z)
- Off-Policy Imitation Learning from Observations [78.30794935265425]
Learning from Observations (LfO) is a practical reinforcement learning scenario from which many applications can benefit.
We propose a sample-efficient LfO approach that enables off-policy optimization in a principled manner.
Our approach is comparable with the state of the art on locomotion tasks in terms of both sample-efficiency and performance.
arXiv Detail & Related papers (2021-02-25T21:33:47Z)
- Regularized Inverse Reinforcement Learning [49.78352058771138]
Inverse Reinforcement Learning (IRL) aims to facilitate a learner's ability to imitate expert behavior.
Regularized IRL applies strongly convex regularizers to the learner's policy.
We propose tractable solutions, and practical methods to obtain them, for regularized IRL.
arXiv Detail & Related papers (2020-10-07T23:38:47Z)
- Average Reward Adjusted Discounted Reinforcement Learning: Near-Blackwell-Optimal Policies for Real-World Applications [0.0]
Reinforcement learning aims at finding the best stationary policy for a given Markov Decision Process.
This paper provides deep theoretical insights into the widely applied standard discounted reinforcement learning framework.
We establish a novel near-Blackwell-optimal reinforcement learning algorithm.
arXiv Detail & Related papers (2020-04-02T08:05:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.