Training Transition Policies via Distribution Matching for Complex Tasks
- URL: http://arxiv.org/abs/2110.04357v1
- Date: Fri, 8 Oct 2021 19:57:37 GMT
- Title: Training Transition Policies via Distribution Matching for Complex Tasks
- Authors: Ju-Seung Byun, Andrew Perrault
- Abstract summary: Hierarchical reinforcement learning seeks to leverage lower-level policies for simple tasks to solve complex ones.
We introduce transition policies that smoothly connect lower-level policies by producing a distribution of states and actions that matches what is expected by the next policy.
We show that our method smoothly connects the lower-level policies, achieving higher success rates than previous methods.
- Score: 7.310043452300736
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Humans decompose novel complex tasks into simpler ones to exploit previously
learned skills. Analogously, hierarchical reinforcement learning seeks to
leverage lower-level policies for simple tasks to solve complex ones. However,
because each lower-level policy induces a different distribution of states,
transitioning from one lower-level policy to another may fail due to an
unexpected starting state. We introduce transition policies that smoothly
connect lower-level policies by producing a distribution of states and actions
that matches what is expected by the next policy. Training transition policies
is challenging because the natural reward signal -- whether the next policy can
execute its subtask successfully -- is sparse. By training transition policies
via adversarial inverse reinforcement learning to match the distribution of
expected states and actions, we avoid relying on task-based reward. To further
improve performance, we use deep Q-learning with a binary action space to
determine when to switch from a transition policy to the next pre-trained
policy, using the success or failure of the next subtask as the reward.
Although the reward is still sparse, the problem is less severe due to the
simple binary action space. We demonstrate our method on continuous bipedal
locomotion and arm manipulation tasks that require diverse skills. We show that
it smoothly connects the lower-level policies, achieving higher success rates
than previous methods that search for successful trajectories based on a reward
function, but do not match the state distribution.
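To make the two learned components described above concrete, here is a minimal PyTorch sketch, not the paper's implementation: an AIRL-style discriminator whose logit serves as the distribution-matching reward for a transition policy, and a Q-network over the binary action space that decides when to hand control to the next pre-trained lower-level policy. All module names, network sizes, and hyperparameters are illustrative assumptions.

```python
# Hypothetical sketch (not the authors' released code) of the two learned pieces
# the abstract describes, under assumed shapes and names.
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransitionDiscriminator(nn.Module):
    """Classifies (state, action) pairs: next-policy data vs. transition-policy data."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))  # raw logit

def matching_reward(disc: TransitionDiscriminator, state, action) -> torch.Tensor:
    """AIRL-style reward log D - log(1 - D), which equals the discriminator logit."""
    with torch.no_grad():
        return disc(state, action).squeeze(-1)

class SwitchQNet(nn.Module):
    """Q-values over the binary switching action space {0: continue, 1: switch}."""
    def __init__(self, state_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),
        )

    def forward(self, state):
        return self.net(state)

def select_switch_action(qnet: SwitchQNet, state: torch.Tensor, eps: float) -> int:
    """Epsilon-greedy choice: 0 = keep transition policy, 1 = hand over to next policy."""
    if random.random() < eps:
        return random.randint(0, 1)
    with torch.no_grad():
        return int(qnet(state.unsqueeze(0)).argmax(dim=1).item())

def switch_dqn_loss(qnet, target_qnet, s, a, r, s_next, done, gamma: float = 0.99):
    """One-step TD loss over batch tensors; r is the sparse subtask success/failure signal."""
    q_sa = qnet(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        td_target = r + gamma * (1.0 - done) * target_qnet(s_next).max(dim=1).values
    return F.mse_loss(q_sa, td_target)
```

In this sketch, the transition policy itself would be optimized with any standard RL algorithm against `matching_reward`, while the switcher is trained only from the sparse subtask outcome, mirroring the split described in the abstract.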
Related papers
- Time-Efficient Reinforcement Learning with Stochastic Stateful Policies [20.545058017790428]
We present a novel approach for training stateful policies by decomposing the latter into a stochastic internal state kernel and a stateless policy.
We introduce different versions of the stateful policy gradient theorem, enabling us to easily instantiate stateful variants of popular reinforcement learning algorithms.
arXiv Detail & Related papers (2023-11-07T15:48:07Z)
- IOB: Integrating Optimization Transfer and Behavior Transfer for Multi-Policy Reuse [50.90781542323258]
Reinforcement learning (RL) agents can transfer knowledge from source policies to a related target task.
Previous methods introduce additional components, such as hierarchical policies or estimations of source policies' value functions.
We propose a novel transfer RL method that selects the source policy without training extra components.
arXiv Detail & Related papers (2023-08-14T09:22:35Z)
- A State-Distribution Matching Approach to Non-Episodic Reinforcement Learning [61.406020873047794]
A major hurdle to real-world application is that most algorithms are developed in an episodic setting, whereas real-world deployment is non-episodic.
We propose a new method, MEDAL, that trains the backward policy to match the state distribution in the provided demonstrations.
Our experiments show that MEDAL matches or outperforms prior methods on three sparse-reward continuous control tasks.
arXiv Detail & Related papers (2022-05-11T00:06:29Z)
- Constructing a Good Behavior Basis for Transfer using Generalized Policy Updates [63.58053355357644]
We study the problem of learning a good set of policies, so that when combined together, they can solve a wide variety of unseen reinforcement learning tasks.
We show theoretically that having access to a specific set of diverse policies, which we call a set of independent policies, can allow for instantaneously achieving high-level performance.
arXiv Detail & Related papers (2021-12-30T12:20:46Z)
- Adversarial Skill Chaining for Long-Horizon Robot Manipulation via Terminal State Regularization [65.09725599705493]
We propose to chain multiple policies without excessively large initial state distributions.
We evaluate our approach on two complex long-horizon manipulation tasks of furniture assembly.
Our results show that our method is the first model-free reinforcement learning algorithm to solve these tasks.
arXiv Detail & Related papers (2021-11-15T18:59:03Z)
- Goal-Conditioned Reinforcement Learning with Imagined Subgoals [89.67840168694259]
We propose to incorporate imagined subgoals into policy learning to facilitate learning of complex tasks.
Imagined subgoals are predicted by a separate high-level policy, which is trained simultaneously with the policy and its critic.
We evaluate our approach on complex robotic navigation and manipulation tasks and show that it outperforms existing methods by a large margin.
arXiv Detail & Related papers (2021-07-01T15:30:59Z)
- DDPG++: Striving for Simplicity in Continuous-control Off-Policy Reinforcement Learning [95.60782037764928]
We show that simple Deterministic Policy Gradient works remarkably well as long as the overestimation bias is controlled.
Second, we pinpoint training instabilities, typical of off-policy algorithms, to the greedy policy update step.
Third, we show that ideas from the propensity estimation literature can be used to importance-sample transitions from the replay buffer and update the policy to prevent deterioration of performance.
arXiv Detail & Related papers (2020-06-26T20:21:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.