Residual Q-Learning: Offline and Online Policy Customization without Value
- URL: http://arxiv.org/abs/2306.09526v3
- Date: Mon, 15 Jan 2024 04:37:48 GMT
- Title: Residual Q-Learning: Offline and Online Policy Customization without Value
- Authors: Chenran Li, Chen Tang, Haruki Nishimura, Jean Mercat, Masayoshi Tomizuka, Wei Zhan
- Abstract summary: Imitation Learning (IL) is a widely used framework for learning imitative behavior from demonstrations.
We formulate a new problem setting called policy customization.
We propose a novel framework, Residual Q-learning, which can solve the formulated MDP by leveraging the prior policy.
- Score: 53.47311900133564
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Imitation Learning (IL) is a widely used framework for learning imitative
behavior from demonstrations. It is especially appealing for solving complex
real-world tasks where handcrafting a reward function is difficult, or when the
goal is to mimic human expert behavior. However, the learned imitative policy
can only follow the behavior in the demonstration. When applying the imitative
policy, we may need to customize the policy behavior to meet different
requirements coming from diverse downstream tasks. Meanwhile, we still want the
customized policy to maintain its imitative nature. To this end, we formulate a
new problem setting called policy customization. It defines the learning task
as training a policy that inherits the characteristics of the prior policy
while satisfying some additional requirements imposed by a target downstream
task. We propose a novel and principled approach to interpret and determine the
trade-off between the two task objectives. Specifically, we formulate the
customization problem as a Markov Decision Process (MDP) with a reward function
that combines 1) the inherent reward of the demonstration; and 2) the add-on
reward specified by the downstream task. We propose a novel framework, Residual
Q-learning, which can solve the formulated MDP by leveraging the prior policy
without knowing the inherent reward or value function of the prior policy. We
derive a family of residual Q-learning algorithms that can realize offline and
online policy customization, and show that the proposed algorithms can
effectively accomplish policy customization tasks in various environments. Demo
videos and code are available on our website:
https://sites.google.com/view/residualq-learning.
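
As a concrete illustration of the idea in the abstract, below is a minimal tabular sketch written under maximum-entropy assumptions, with a single temperature shared by the prior and customized policies and unit weights on the inherent and add-on rewards. The chain MDP, reward values, and hyper-parameters are invented for the example; the paper's offline and online algorithms generalize this backup (weighted rewards, function approximation), so treat this as a sketch rather than the authors' exact procedure.

# Toy policy customization: a residual soft-Q backup that uses only the prior
# policy's action probabilities and an add-on reward, never the prior reward
# or the prior value function. All quantities below are invented for the example.
import numpy as np

n_states, n_actions = 5, 2          # toy chain MDP: action 0 = stay, action 1 = move right
gamma, alpha = 0.9, 0.5             # discount factor and entropy temperature

# Transition tensor P[s, a, s'] and the (hidden) prior reward that produced the prior policy.
P = np.zeros((n_states, n_actions, n_states))
for s in range(n_states):
    P[s, 0, s] = 1.0                              # "stay"
    P[s, 1, min(s + 1, n_states - 1)] = 1.0       # "move right"
r_prior = np.zeros((n_states, n_actions))
r_prior[n_states - 2, 1] = 1.0                    # prior task: reach the right end

def soft_value_iteration(reward, iters=500):
    """Soft (max-entropy) value iteration; returns the soft-optimal Q-table."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(iters):
        V = alpha * np.log(np.exp(Q / alpha).sum(axis=1))   # soft state value
        Q = reward + gamma * P @ V                            # soft Bellman backup
    return Q

# Step 1: the prior policy. In practice it would come from imitation learning; here we
# obtain it by solving the prior MDP, but only the policy table is used afterwards.
Q_prior = soft_value_iteration(r_prior)
pi_prior = np.exp(Q_prior / alpha)
pi_prior /= pi_prior.sum(axis=1, keepdims=True)

# Step 2: customization. Only pi_prior and the add-on reward enter the backup.
r_add = np.zeros((n_states, n_actions))
r_add[:, 0] += 0.2                                # downstream preference: small bonus for staying put

Q_R = np.zeros((n_states, n_actions))
for _ in range(500):
    # Residual soft backup:
    # Q_R(s,a) = r_add(s,a) + gamma * E_s'[ alpha * log sum_a' pi_prior(a'|s') * exp(Q_R(s',a')/alpha) ]
    V_R = alpha * np.log((pi_prior * np.exp(Q_R / alpha)).sum(axis=1))
    Q_R = r_add + gamma * P @ V_R

# Customized policy: reweight the prior policy by the residual Q-values.
pi_custom = pi_prior * np.exp(Q_R / alpha)
pi_custom /= pi_custom.sum(axis=1, keepdims=True)

# Sanity check: solving the combined-reward MDP directly (which needs the hidden
# prior reward) recovers the same customized policy.
Q_full = soft_value_iteration(r_prior + r_add)
pi_full = np.exp(Q_full / alpha)
pi_full /= pi_full.sum(axis=1, keepdims=True)
print("max policy difference:", np.abs(pi_custom - pi_full).max())

The key point mirrors the abstract: the customization step touches only the prior policy's action probabilities and the add-on reward, so it remains applicable when the inherent reward and value function of the prior policy are unknown.
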
Related papers
- Online Policy Distillation with Decision-Attention [23.807761525617384]
Policy Distillation (PD) has become an effective method for improving performance on deep reinforcement learning tasks.
We study the knowledge transfer between different policies that can learn diverse knowledge from the same environment.
We propose Online Policy Distillation (OPD) with Decision-Attention (DA).
(arXiv 2024-06-08)
- On the Value of Myopic Behavior in Policy Reuse [67.37788288093299]
Leveraging learned strategies in unfamiliar scenarios is fundamental to human intelligence.
In this work, we present a framework called Selective Myopic bEhavior Control (SMEC).
SMEC adaptively aggregates the sharable short-term behaviors of prior policies and the long-term behaviors of the task policy, leading to coordinated decisions.
(arXiv 2023-05-28)
- Optimistic Linear Support and Successor Features as a Basis for Optimal Policy Transfer [7.970144204429356]
We introduce an SF-based extension of the Optimistic Linear Support algorithm to learn a set of policies whose SFs form a convex coverage set.
We prove that policies in this set can be combined via generalized policy improvement to construct optimal behaviors for any new linearly-expressible tasks.
(arXiv 2022-06-22)
- Planning to Practice: Efficient Online Fine-Tuning by Composing Goals in Latent Space [76.46113138484947]
General-purpose robots require diverse repertoires of behaviors to complete challenging tasks in real-world unstructured environments.
To address this issue, goal-conditioned reinforcement learning aims to acquire policies that can reach goals for a wide range of tasks on command.
We propose Planning to Practice, a method that makes it practical to train goal-conditioned policies for long-horizon tasks.
(arXiv 2022-05-17)
- Constructing a Good Behavior Basis for Transfer using Generalized Policy Updates [63.58053355357644]
We study the problem of learning a good set of policies, so that when combined together, they can solve a wide variety of unseen reinforcement learning tasks.
We show theoretically that having access to a specific set of diverse policies, which we call a set of independent policies, can allow for instantaneously achieving high-level performance.
(arXiv 2021-12-30)
- Offline Reinforcement Learning with Implicit Q-Learning [85.62618088890787]
Current offline reinforcement learning methods need to query the value of unseen actions during training to improve the policy.
We propose an offline RL method that never needs to evaluate actions outside of the dataset.
This method enables the learned policy to improve substantially over the best behavior in the data through generalization; a minimal sketch of this idea appears after this list.
(arXiv 2021-10-12)
- Goal-Conditioned Reinforcement Learning with Imagined Subgoals [89.67840168694259]
We propose to incorporate imagined subgoals into policy learning to facilitate learning of complex tasks.
Imagined subgoals are predicted by a separate high-level policy, which is trained simultaneously with the policy and its critic.
We evaluate our approach on complex robotic navigation and manipulation tasks and show that it outperforms existing methods by a large margin.
(arXiv 2021-07-01)
- Learn Goal-Conditioned Policy with Intrinsic Motivation for Deep Reinforcement Learning [9.014110264448371]
We propose a novel unsupervised learning approach named goal-conditioned policy with intrinsic motivation (GPIM).
GPIM jointly learns both an abstract-level policy and a goal-conditioned policy.
Experiments on various robotic tasks demonstrate the effectiveness and efficiency of our proposed GPIM method.
(arXiv 2021-04-11)
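
For context on the Implicit Q-Learning entry above, the following is a minimal tabular sketch of how value estimates can be fit without ever querying actions outside the offline dataset. The toy dataset, table-based value functions, and hyper-parameters are invented for illustration and stand in for the neural networks used in that paper.

# Toy sketch of the Implicit Q-Learning idea: expectile regression pulls V(s) toward
# Q(s, a) only for (s, a) pairs present in the dataset, so no out-of-dataset action is
# ever evaluated. All quantities below are invented for the example.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 3
gamma, tau, lr = 0.95, 0.9, 0.1      # discount, expectile (tau > 0.5 is optimistic), step size

# Synthetic offline dataset of transitions (s, a, r, s') from some behavior policy.
dataset = [(rng.integers(n_states), rng.integers(n_actions), rng.normal(), rng.integers(n_states))
           for _ in range(2000)]

Q = np.zeros((n_states, n_actions))
V = np.zeros(n_states)

for _ in range(200):
    for s, a, r, s_next in dataset:
        # Expectile regression: asymmetric squared loss makes V(s) track an upper expectile
        # of Q(s, a) over dataset actions, rather than a max over all actions.
        u = Q[s, a] - V[s]
        weight = tau if u > 0 else (1.0 - tau)
        V[s] += lr * weight * u
        # TD backup for Q bootstraps from V(s') -- again, no maximization over unseen actions.
        Q[s, a] += lr * (r + gamma * V[s_next] - Q[s, a])

# Policy extraction by advantage weighting (a simple stand-in for the advantage-weighted
# regression used in the paper).
beta = 3.0
policy = np.exp(beta * (Q - V[:, None]))
policy /= policy.sum(axis=1, keepdims=True)
print(policy.round(2))
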
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.