Quantification before Selection: Active Dynamics Preference for Robust
Reinforcement Learning
- URL: http://arxiv.org/abs/2209.11596v3
- Date: Sat, 20 May 2023 06:17:07 GMT
- Title: Quantification before Selection: Active Dynamics Preference for Robust
Reinforcement Learning
- Authors: Kang Xu, Yan Ma, Wei Li
- Abstract summary: We introduce Active Dynamics Preference (ADP), which quantifies the informativeness and density of sampled system parameters.
We validate our approach in four robotic locomotion tasks with various discrepancies between the training and testing environments.
- Score: 5.720802072821204
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Training a robust policy is critical for deploying policies in real-world systems and for handling unknown dynamics mismatches across different dynamic systems.
Domain Randomization (DR) is a simple and elegant approach that trains a
conservative policy to counter different dynamic systems without expert
knowledge about the target system parameters. However, existing works reveal
that the policy trained through DR tends to be over-conservative and performs
poorly in target domains. Our key insight is that dynamic systems with
different parameters provide different levels of difficulty for the policy, and
the difficulty of behaving well in a system is constantly changing due to the
evolution of the policy. If we can actively sample the systems with proper
difficulty for the policy on the fly, it will stabilize the training process
and prevent the policy from becoming over-conservative or over-optimistic. To
operationalize this idea, we introduce Active Dynamics Preference (ADP), which
quantifies the informativeness and density of sampled system parameters. ADP
actively selects system parameters with high informativeness and low density.
We validate our approach in four robotic locomotion tasks with various
discrepancies between the training and testing environments. Extensive results
demonstrate that our approach achieves superior robustness to system inconsistencies compared to several baselines.
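The abstract describes ADP's selection rule only at a high level: pick system parameters that are informative for the current policy and lie in sparsely sampled regions. Purely as an illustrative sketch, not the authors' implementation, the snippet below scores candidate dynamics parameters with a placeholder informativeness proxy and a Gaussian-kernel density over previously chosen parameters, then keeps the high-informativeness, low-density candidates for the next round of domain-randomized training. Every function name, the evaluation proxy, and the weight `beta` are assumptions for illustration.

```python
import numpy as np

def informativeness(params, policy_eval):
    # Proxy for how much the current policy would learn from training in a
    # system with these dynamics parameters (e.g. TD error or return
    # variance); the paper defines its own measure, this is a placeholder.
    return policy_eval(params)

def density(params, history, bandwidth=0.5):
    # Gaussian-kernel density of `params` w.r.t. previously selected
    # parameters, so that already well-covered regions are down-weighted.
    if not history:
        return 0.0
    dists = np.linalg.norm(np.asarray(history) - params, axis=1)
    return float(np.mean(np.exp(-0.5 * (dists / bandwidth) ** 2)))

def select_dynamics(candidates, history, policy_eval, k=8, beta=1.0):
    # Rank candidates by (informativeness - beta * density) and keep the
    # top-k for the next round of domain-randomized training.
    scores = [informativeness(p, policy_eval) - beta * density(p, history)
              for p in candidates]
    top = np.argsort(scores)[::-1][:k]
    return [candidates[i] for i in top]

# Usage sketch: sample a pool of dynamics parameters (e.g. mass, friction),
# keep the most useful ones, train the policy on them, and repeat.
rng = np.random.default_rng(0)
history = []
for _ in range(3):
    pool = list(rng.uniform([0.5, 0.1], [2.0, 1.5], size=(64, 2)))
    chosen = select_dynamics(pool, history, policy_eval=lambda p: rng.random())
    history.extend(chosen)
    # train_policy_on(chosen)  # placeholder for the DR training update
```

The single weight `beta` stands in for whatever trade-off between informativeness and redundancy the paper actually uses; the point of the sketch is only the select-then-train loop that replaces uniform domain randomization.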
Related papers
- OMPO: A Unified Framework for RL under Policy and Dynamics Shifts [42.57662196581823]
Training reinforcement learning policies using environment interaction data collected from varying policies or dynamics presents a fundamental challenge.
Existing works often overlook the distribution discrepancies induced by policy or dynamics shifts, or rely on specialized algorithms with task priors.
In this paper, we identify a unified strategy for online RL policy learning under diverse settings of policy and dynamics shifts: transition occupancy matching.
arXiv Detail & Related papers (2024-05-29T13:36:36Z)
- Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, stochastic (hyper)policies are learned only to deploy their deterministic version.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z)
- Reparameterized Policy Learning for Multimodal Trajectory Optimization [61.13228961771765]
We investigate the challenge of parametrizing policies for reinforcement learning in high-dimensional continuous action spaces.
We propose a principled framework that models the continuous RL policy as a generative model of optimal trajectories.
We present a practical model-based RL method, which leverages the multimodal policy parameterization and learned world model.
arXiv Detail & Related papers (2023-07-20T09:05:46Z)
- End-to-End Stable Imitation Learning via Autonomous Neural Dynamic Policies [2.7941001040182765]
State-of-the-art sensorimotor learning algorithms offer policies that can often produce unstable behaviors.
Traditional robot learning relies on dynamical system-based policies that can be analyzed for stability/safety.
In this work, we bridge the gap between generic neural network policies and dynamical system-based policies.
arXiv Detail & Related papers (2023-05-22T10:10:23Z)
- Non-Parametric Stochastic Policy Gradient with Strategic Retreat for Non-Stationary Environment [1.5229257192293197]
We propose a systematic methodology to learn a sequence of optimal control policies non-parametrically.
Our methodology has outperformed the well-established DDPG and TD3 methods by a sizeable margin in terms of learning performance.
arXiv Detail & Related papers (2022-03-24T21:41:13Z)
- Robust Policy Learning over Multiple Uncertainty Sets [91.67120465453179]
Reinforcement learning (RL) agents need to be robust to variations in safety-critical environments.
We develop an algorithm that enjoys the benefits of both system identification and robust RL.
arXiv Detail & Related papers (2022-02-14T20:06:28Z)
- Learning Robust Policy against Disturbance in Transition Dynamics via State-Conservative Policy Optimization [63.75188254377202]
Deep reinforcement learning algorithms can perform poorly in real-world tasks due to the discrepancy between source and target environments.
We propose State-Conservative Policy Optimization (SCPO), a novel model-free actor-critic algorithm that learns robust policies without modeling the disturbance in advance.
Experiments in several robot control tasks demonstrate that SCPO learns robust policies against the disturbance in transition dynamics.
arXiv Detail & Related papers (2021-12-20T13:13:05Z)
- Learning a subspace of policies for online adaptation in Reinforcement Learning [14.7945053644125]
In control systems, the robot on which a policy is learned might differ from the robot on which a policy will run.
There is a need to develop RL methods that generalize well to variations of the training conditions.
In this article, we consider the simplest yet hard-to-tackle generalization setting, where the test environment is unknown at train time.
arXiv Detail & Related papers (2021-10-11T11:43:34Z)
- Hierarchical Neural Dynamic Policies [50.969565411919376]
We tackle the problem of generalization to unseen configurations for dynamic tasks in the real world while learning from high-dimensional image input.
We use a hierarchical deep policy learning framework called Hierarchical Neural Dynamic Policies (H-NDPs).
H-NDPs form a curriculum by learning local dynamical system-based policies on small regions in state-space.
We show that H-NDPs are easily integrated with both imitation and reinforcement learning setups and achieve state-of-the-art results.
arXiv Detail & Related papers (2021-07-12T17:59:58Z)
- Neural Dynamic Policies for End-to-End Sensorimotor Learning [51.24542903398335]
The current dominant paradigm in sensorimotor control, whether imitation or reinforcement learning, is to train policies directly in raw action spaces.
We propose Neural Dynamic Policies (NDPs) that make predictions in trajectory distribution space.
NDPs outperform the prior state-of-the-art in terms of either efficiency or performance across several robotic control tasks.
arXiv Detail & Related papers (2020-12-04T18:59:32Z)