Policy Search using Dynamic Mirror Descent MPC for Model Free Off Policy
RL
- URL: http://arxiv.org/abs/2110.12239v1
- Date: Sat, 23 Oct 2021 15:16:49 GMT
- Title: Policy Search using Dynamic Mirror Descent MPC for Model Free Off Policy
RL
- Authors: Soumya Rani Samineni
- Abstract summary: Recent works in Reinforcement Learning (RL) combine model-free (Mf)-RL algorithms with model-based (Mb)-RL approaches.
We propose a hierarchical framework that integrates online learning for the Mb-trajectory optimization with off-policy methods for the Mf-RL.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Recent works in Reinforcement Learning (RL) combine model-free (Mf)-RL
algorithms with model-based (Mb)-RL approaches to get the best from both:
asymptotic performance of Mf-RL and high sample-efficiency of Mb-RL. Inspired
by these works, we propose a hierarchical framework that integrates online
learning for the Mb-trajectory optimization with off-policy methods for the
Mf-RL. In particular, two loops are proposed, where the Dynamic Mirror Descent
based Model Predictive Control (DMD-MPC) is used as the inner loop to obtain an
optimal sequence of actions. These actions are in turn used to significantly
accelerate the outer loop Mf-RL. We show that our formulation is generic for a
broad class of MPC based policies and objectives, and includes some of the
well-known Mb-Mf approaches. Based on the framework we define two algorithms to
increase sample efficiency of Off Policy RL and to guide end to end RL
algorithms for online adaption respectively. Thus we finally introduce two
novel algorithms: Dynamic-Mirror Descent Model Predictive RL(DeMoRL), which
uses the method of elite fractions for the inner loop and Soft Actor-Critic
(SAC) as the off-policy RL for the outer loop and Dynamic-Mirror Descent Model
Predictive Layer(DeMo Layer), a special case of the hierarchical framework
which guides linear policies trained using Augmented Random Search(ARS). Our
experiments show faster convergence of the proposed DeMo RL, and better or
equal performance compared to other Mf-Mb approaches on benchmark MuJoCo
control tasks. The DeMo Layer was tested on classical Cartpole and custom-built
Quadruped trained using Linear Policy.
Related papers
- Understanding Reinforcement Learning-Based Fine-Tuning of Diffusion Models: A Tutorial and Review [63.31328039424469]
This tutorial provides a comprehensive survey of methods for fine-tuning diffusion models to optimize downstream reward functions.
We explain the application of various RL algorithms, including PPO, differentiable optimization, reward-weighted MLE, value-weighted sampling, and path consistency learning.
arXiv Detail & Related papers (2024-07-18T17:35:32Z) - Operator World Models for Reinforcement Learning [37.69110422996011]
Policy Mirror Descent is not directly applicable to Reinforcement Learning (RL)
We introduce a novel approach based on learning a world model of the environment using conditional mean embeddings.
We then leverage the operatorial formulation of RL to express the action-value function in terms of this quantity in closed form via matrix operations.
arXiv Detail & Related papers (2024-06-28T12:05:47Z) - ArCHer: Training Language Model Agents via Hierarchical Multi-Turn RL [80.10358123795946]
We develop a framework for building multi-turn RL algorithms for fine-tuning large language models.
Our framework adopts a hierarchical RL approach and runs two RL algorithms in parallel.
Empirically, we find that ArCHer significantly improves efficiency and performance on agent tasks.
arXiv Detail & Related papers (2024-02-29T18:45:56Z) - Maximize to Explore: One Objective Function Fusing Estimation, Planning,
and Exploration [87.53543137162488]
We propose an easy-to-implement online reinforcement learning (online RL) framework called textttMEX.
textttMEX integrates estimation and planning components while balancing exploration exploitation automatically.
It can outperform baselines by a stable margin in various MuJoCo environments with sparse rewards.
arXiv Detail & Related papers (2023-05-29T17:25:26Z) - Deep Black-Box Reinforcement Learning with Movement Primitives [15.184283143878488]
We present a new algorithm for deep reinforcement learning (RL)
It is based on differentiable trust region layers, a successful on-policy deep RL algorithm.
We compare our ERL algorithm to state-of-the-art step-based algorithms in many complex simulated robotic control tasks.
arXiv Detail & Related papers (2022-10-18T06:34:52Z) - Model Predictive Control via On-Policy Imitation Learning [28.96122879515294]
We develop new sample complexity results and performance guarantees for data-driven Model Predictive Control.
Our algorithm uses the structure of constrained linear MPC, and our analysis uses the properties of the explicit MPC solution to theoretically bound the number of online MPC trajectories needed to achieve optimal performance.
arXiv Detail & Related papers (2022-10-17T16:06:06Z) - Jump-Start Reinforcement Learning [68.82380421479675]
We present a meta algorithm that can use offline data, demonstrations, or a pre-existing policy to initialize an RL policy.
In particular, we propose Jump-Start Reinforcement Learning (JSRL), an algorithm that employs two policies to solve tasks.
We show via experiments that JSRL is able to significantly outperform existing imitation and reinforcement learning algorithms.
arXiv Detail & Related papers (2022-04-05T17:25:22Z) - Behavioral Priors and Dynamics Models: Improving Performance and Domain
Transfer in Offline RL [82.93243616342275]
We introduce Offline Model-based RL with Adaptive Behavioral Priors (MABE)
MABE is based on the finding that dynamics models, which support within-domain generalization, and behavioral priors, which support cross-domain generalization, are complementary.
In experiments that require cross-domain generalization, we find that MABE outperforms prior methods.
arXiv Detail & Related papers (2021-06-16T20:48:49Z) - Pareto Deterministic Policy Gradients and Its Application in 5G Massive
MIMO Networks [32.099949375036495]
We consider jointly optimizing cell load balance and network throughput via a reinforcement learning (RL) approach.
Our rationale behind using RL is to circumvent the challenges of analytically modeling user mobility and network dynamics.
To accomplish this joint optimization, we integrate vector rewards into the RL value network and conduct RL action via a separate policy network.
arXiv Detail & Related papers (2020-12-02T15:35:35Z) - MOPO: Model-based Offline Policy Optimization [183.6449600580806]
offline reinforcement learning (RL) refers to the problem of learning policies entirely from a large batch of previously collected data.
We show that an existing model-based RL algorithm already produces significant gains in the offline setting.
We propose to modify the existing model-based RL methods by applying them with rewards artificially penalized by the uncertainty of the dynamics.
arXiv Detail & Related papers (2020-05-27T08:46:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.