Direct Random Search for Fine Tuning of Deep Reinforcement Learning Policies
- URL: http://arxiv.org/abs/2109.05604v1
- Date: Sun, 12 Sep 2021 20:12:46 GMT
- Title: Direct Random Search for Fine Tuning of Deep Reinforcement Learning Policies
- Authors: Sean Gillen, Asutay Ozmen, Katie Byl
- Abstract summary: We show that a direct random search is very effective at fine-tuning DRL policies by directly optimizing them using deterministic rollouts.
Our results show that this method yields more consistent and higher performing agents on the environments we tested.
- Score: 5.543220407902113
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Researchers have demonstrated that Deep Reinforcement Learning (DRL) is a
powerful tool for finding policies that perform well on complex robotic
systems. However, these policies are often unpredictable and can induce highly
variable behavior when evaluated with only slightly different initial
conditions. Training considerations constrain DRL algorithm designs in that
most algorithms must use stochastic policies during training. The resulting
policy used during deployment, however, can be, and frequently is, a deterministic
one that uses the Maximum Likelihood Action (MLA) at each step. In this work,
we show that a direct random search is very effective at fine-tuning DRL
policies by directly optimizing them using deterministic rollouts. We
illustrate this across a large collection of reinforcement learning
environments, using a wide variety of policies obtained from different
algorithms. Our results show that this method yields more consistent and higher
performing agents on the environments we tested. Furthermore, we demonstrate
how this method can be used to extend our previous work on shrinking the
dimensionality of the reachable state space of closed-loop systems run under
Deep Neural Network (DNN) policies.
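The fine-tuning approach described in the abstract can be illustrated with a minimal sketch of an ARS-style direct random search: antithetic Gaussian perturbations of the policy parameters are scored with deterministic rollouts, and the parameters are stepped along the resulting finite-difference estimate. The environment, linear policy, and all hyperparameter values below are illustrative stand-ins, not the paper's actual setup.

```python
import numpy as np

def rollout_return(theta, env_step, horizon=50):
    """Total reward of one deterministic rollout of the linear policy u = theta @ x."""
    x = np.ones(theta.shape[1])           # fixed initial condition -> deterministic rollout
    total = 0.0
    for _ in range(horizon):
        u = theta @ x                     # deterministic (MLA-style) action
        x, r = env_step(x, u)
        total += r
    return total

def direct_random_search(theta, env_step, iters=200, n_dirs=8,
                         step=0.02, noise=0.05, seed=0):
    """Fine-tune policy parameters by basic (antithetic) random search,
    scoring each perturbed policy with deterministic rollouts."""
    rng = np.random.default_rng(seed)
    for _ in range(iters):
        deltas = rng.standard_normal((n_dirs, *theta.shape))
        grad = np.zeros_like(theta)
        for d in deltas:
            r_plus = rollout_return(theta + noise * d, env_step)
            r_minus = rollout_return(theta - noise * d, env_step)
            grad += (r_plus - r_minus) * d     # finite-difference ascent direction
        theta = theta + step / (n_dirs * noise) * grad
    return theta

# Toy deterministic environment: drive a 2-D state toward the origin.
def toy_step(x, u):
    x_next = 0.9 * x + 0.1 * u
    reward = -float(x_next @ x_next)      # penalize distance from the origin
    return x_next, reward

theta0 = np.zeros((2, 2))                 # stand-in for a pretrained policy
theta = direct_random_search(theta0, toy_step)
```

Because only total rollout return is used, the same loop applies to any deployed deterministic policy, regardless of which DRL algorithm produced it.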
Related papers
- CDSA: Conservative Denoising Score-based Algorithm for Offline Reinforcement Learning [25.071018803326254]
Distribution shift is a major obstacle in offline reinforcement learning.
Previous conservative offline RL algorithms struggle to generalize to unseen actions.
We propose to use the gradient fields of the dataset density generated from a pre-trained offline RL algorithm to adjust the original actions.
arXiv Detail & Related papers (2024-06-11T17:59:29Z)
- Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, convergence (hyper)policies are learned only to deploy their deterministic version.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z)
- Iteratively Refined Behavior Regularization for Offline Reinforcement Learning [57.10922880400715]
In this paper, we propose a new algorithm that substantially enhances behavior-regularization based on conservative policy iteration.
By iteratively refining the reference policy used for behavior regularization, conservative policy updates guarantee gradual improvement.
Experimental results on the D4RL benchmark indicate that our method outperforms previous state-of-the-art baselines in most tasks.
arXiv Detail & Related papers (2023-06-09T07:46:24Z)
- Model-based Safe Deep Reinforcement Learning via a Constrained Proximal Policy Optimization Algorithm [4.128216503196621]
We propose an On-policy Model-based Safe Deep RL algorithm in which we learn the transition dynamics of the environment in an online manner.
We show that our algorithm is more sample efficient and results in lower cumulative hazard violations as compared to constrained model-free approaches.
arXiv Detail & Related papers (2022-10-14T06:53:02Z)
- Exploration via Planning for Information about the Optimal Trajectory [67.33886176127578]
We develop a method that allows us to plan for exploration while taking the task and the current knowledge into account.
We demonstrate that our method learns strong policies with 2x fewer samples than strong exploration baselines.
arXiv Detail & Related papers (2022-10-06T20:28:55Z)
- Verifying Learning-Based Robotic Navigation Systems [61.01217374879221]
We show how modern verification engines can be used for effective model selection.
Specifically, we use verification to detect and rule out policies that may demonstrate suboptimal behavior.
Our work is the first to demonstrate the use of verification backends for recognizing suboptimal DRL policies in real-world robots.
arXiv Detail & Related papers (2022-05-26T17:56:43Z)
- POLTER: Policy Trajectory Ensemble Regularization for Unsupervised Reinforcement Learning [30.834631947104498]
We present POLTER - a method to regularize the pretraining that can be applied to any URL algorithm.
We evaluate POLTER on the Unsupervised Reinforcement Learning Benchmark (URLB), which consists of 12 tasks in 3 domains.
We demonstrate the generality of our approach by improving the performance of a diverse set of data- and knowledge-based URL algorithms by 19% on average and up to 40% in the best case.
arXiv Detail & Related papers (2022-05-23T14:42:38Z)
- Jump-Start Reinforcement Learning [68.82380421479675]
We present a meta algorithm that can use offline data, demonstrations, or a pre-existing policy to initialize an RL policy.
In particular, we propose Jump-Start Reinforcement Learning (JSRL), an algorithm that employs two policies to solve tasks.
We show via experiments that JSRL is able to significantly outperform existing imitation and reinforcement learning algorithms.
arXiv Detail & Related papers (2022-04-05T17:25:22Z)
- Learning Sampling Policy for Faster Derivative Free Optimization [100.27518340593284]
We propose a new reinforcement learning based ZO algorithm (ZO-RL) that learns the sampling policy for generating the perturbations in ZO optimization, instead of using random sampling.
Our results show that ZO-RL can effectively reduce the variance of the ZO gradient estimate by learning a sampling policy, and converges faster than existing ZO algorithms in different scenarios.
arXiv Detail & Related papers (2021-04-09T14:50:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.