MBB: Model-Based Baseline for Global Guidance of Model-Free
Reinforcement Learning via Lower-Dimensional Solutions
- URL: http://arxiv.org/abs/2011.02073v4
- Date: Sat, 23 Oct 2021 00:28:56 GMT
- Title: MBB: Model-Based Baseline for Global Guidance of Model-Free
Reinforcement Learning via Lower-Dimensional Solutions
- Authors: Xubo Lyu, Site Li, Seth Siriya, Ye Pu, Mo Chen
- Abstract summary: We show how to solve complex robotic tasks with high-dimensional (hi-dim) state spaces in two stages.
First, we compute a low-dimensional (lo-dim) value function for the lo-dim version of the problem.
Then, the lo-dim value function is used as a baseline function to warm-start the model-free RL process.
- Score: 8.6216807235051
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: One spectrum on which robotic control paradigms lie is the degree to which a
model of the environment is involved, from methods that are completely
model-free such as model-free RL, to methods that require a known model such as
optimal control, with other methods such as model-based RL somewhere in the
middle. On one end of the spectrum, model-free RL can learn control policies
for high-dimensional (hi-dim), complex robotic tasks through trial-and-error
without knowledge of a model of the environment, but tends to require a large
amount of data. On the other end, "classical methods" such as optimal control
generate solutions without collecting data, but assume that an accurate model
of the system and environment is known and are mostly limited to problems with
low-dimensional (lo-dim) state spaces. In this paper, we bring the two ends of
the spectrum together. Although models of hi-dim systems and environments may
not exist, lo-dim approximations of these systems and environments are widely
available, especially in robotics. Therefore, we propose to solve hi-dim,
complex robotic tasks in two stages. First, assuming a coarse model of the
hi-dim system, we compute a lo-dim value function for the lo-dim version of the
problem using classical methods (e.g., value iteration and optimal control).
Then, the lo-dim value function is used as a baseline function to warm-start
the model-free RL process that learns hi-dim policies. The lo-dim value
function provides global guidance for model-free RL, alleviating the data
inefficiency of model-free RL. We demonstrate our approach on two robot
learning tasks with hi-dim state spaces and observe significant improvement in
policy performance and learning efficiency. We also give an empirical analysis
of our method with a third task.
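As a concrete illustration of the two-stage recipe, the following minimal Python sketch (not the authors' implementation) runs value iteration on a hypothetical coarse 2-D grid model to obtain a lo-dim value function V_lo, then queries it through an assumed state-projection map as the baseline when computing advantages for a model-free policy-gradient learner. The grid model, projection, reward, and discount factor are all assumptions made for the example.

import numpy as np

# --- Stage 1: value iteration on a coarse, lo-dim model (hypothetical 2-D grid) ---
# States are (x, y) cells, actions are 4-neighbour moves, reward is -1 per step and 0 at the goal.
N, GAMMA, GOAL = 21, 0.95, (20, 20)
ACTIONS = [(1, 0), (-1, 0), (0, 1), (0, -1)]

V_lo = np.zeros((N, N))
for _ in range(500):                                  # sweep until approximately converged
    V_new = np.empty_like(V_lo)
    for x in range(N):
        for y in range(N):
            if (x, y) == GOAL:
                V_new[x, y] = 0.0
                continue
            candidates = []
            for dx, dy in ACTIONS:
                nx = min(max(x + dx, 0), N - 1)
                ny = min(max(y + dy, 0), N - 1)
                candidates.append(-1.0 + GAMMA * V_lo[nx, ny])
            V_new[x, y] = max(candidates)
    delta = np.max(np.abs(V_new - V_lo))
    V_lo = V_new
    if delta < 1e-6:
        break

# --- Stage 2: use V_lo as a baseline for the model-free learner ---
def project(hi_dim_state):
    """Assumed mapping from the hi-dim state to a coarse grid cell (e.g. keep position only)."""
    x, y = hi_dim_state[0], hi_dim_state[1]
    return (min(max(int(round(x)), 0), N - 1),
            min(max(int(round(y)), 0), N - 1))

def advantage(returns, hi_dim_states):
    """Monte-Carlo returns minus the lo-dim value baseline, i.e. the global guidance signal."""
    baselines = np.array([V_lo[project(s)] for s in hi_dim_states])
    return np.asarray(returns) - baselines

How exactly the lo-dim baseline enters the learner's update (here, a simple advantage computation) is an assumption of this sketch; per the abstract, the lo-dim value function may also come from optimal control on a coarse model rather than value iteration.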
Related papers
- KIPPO: Koopman-Inspired Proximal Policy Optimization [4.46358470535211]
Reinforcement Learning (RL) has made significant strides in various domains. Policy gradient methods like Proximal Policy Optimization (PPO) have gained popularity due to their balance of performance, stability, and computational efficiency.
arXiv Detail & Related papers (2025-05-20T16:25:41Z)
- Preference-Guided Reinforcement Learning for Efficient Exploration [7.83845308102632]
We introduce LOPE: Learning Online with trajectory Preference guidancE, an end-to-end preference-guided RL framework.
Our intuition is that LOPE directly adjusts the focus of online exploration by considering human feedback as guidance.
LOPE outperforms several state-of-the-art methods regarding convergence rate and overall performance.
arXiv Detail & Related papers (2024-07-09T02:11:12Z)
- Off-OAB: Off-Policy Policy Gradient Method with Optimal Action-Dependent Baseline [47.16115174891401]
We propose an off-policy policy gradient method with the optimal action-dependent baseline (Off-OAB) to mitigate the high variance of off-policy policy gradient estimators.
We evaluate the proposed Off-OAB method on six representative tasks from OpenAI Gym and MuJoCo, where it demonstrably surpasses state-of-the-art methods on the majority of these tasks.
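For context, a generic action-dependent baseline $b(s,a)$ (assumed independent of the policy parameters $\theta$) enters a policy-gradient estimator as
$\nabla_\theta J(\theta) = \mathbb{E}_{s,a}\!\left[\nabla_\theta \log \pi_\theta(a\mid s)\,\big(\hat{Q}(s,a) - b(s,a)\big)\right] + \mathbb{E}_{s}\!\left[\nabla_\theta\, \mathbb{E}_{a\sim\pi_\theta(\cdot\mid s)}\big[b(s,a)\big]\right],$
where the second term restores unbiasedness. This is a textbook form rather than the exact Off-OAB estimator; the design question is how to choose $b(s,a)$ so that variance is minimized while the correction term stays tractable.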
arXiv Detail & Related papers (2024-05-04T05:21:28Z)
- Discovering Behavioral Modes in Deep Reinforcement Learning Policies Using Trajectory Clustering in Latent Space [0.0]
We introduce a new approach for investigating the behavior modes of DRL policies.
Specifically, we use Pairwise Controlled Manifold Approximation Projection (PaCMAP) for dimensionality reduction and TRACLUS for trajectory clustering.
Our methodology helps identify diverse behavior patterns and suboptimal choices by the policy, thus allowing for targeted improvements.
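A rough sketch of this kind of analysis pipeline is shown below (not the authors' code): per-step states or latents are embedded to 2-D with the pacmap package, and trajectories are then clustered on simple summary descriptors. DBSCAN stands in for TRACLUS here, and the descriptor choice is an assumption.

import numpy as np
import pacmap                                # pip install pacmap
from sklearn.cluster import DBSCAN

def behavior_modes(trajectories, eps=1.0, min_samples=5):
    """trajectories: list of arrays of shape (T_i, d) holding visited states or policy latents."""
    lengths = [len(t) for t in trajectories]
    stacked = np.concatenate(trajectories, axis=0)

    # 1) reduce every visited state to 2-D with PaCMAP
    embedded = pacmap.PaCMAP(n_components=2).fit_transform(stacked)

    # 2) crude per-trajectory descriptor (mean and spread of its embedded points);
    #    the paper clusters the embedded trajectories with TRACLUS instead
    descriptors, start = [], 0
    for n in lengths:
        segment = embedded[start:start + n]
        descriptors.append(np.concatenate([segment.mean(axis=0), segment.std(axis=0)]))
        start += n

    # 3) cluster trajectory descriptors; each cluster is treated as one behavior mode
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(np.array(descriptors))
    return embedded, labels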
arXiv Detail & Related papers (2024-02-20T11:50:50Z)
- Statistically Efficient Variance Reduction with Double Policy Estimation for Off-Policy Evaluation in Sequence-Modeled Reinforcement Learning [53.97273491846883]
We propose DPE: an RL algorithm that blends offline sequence modeling and offline reinforcement learning with Double Policy Estimation.
We validate our method in multiple tasks of OpenAI Gym with D4RL benchmarks.
arXiv Detail & Related papers (2023-08-28T20:46:07Z)
- Reparameterized Policy Learning for Multimodal Trajectory Optimization [61.13228961771765]
We investigate the challenge of parametrizing policies for reinforcement learning in high-dimensional continuous action spaces.
We propose a principled framework that models the continuous RL policy as a generative model of optimal trajectories.
We present a practical model-based RL method, which leverages the multimodal policy parameterization and learned world model.
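Schematically (an illustration, not the paper's exact construction), such a multimodal policy can be written as a latent-variable generative model, $\pi_\theta(a\mid s) = \int p_\theta(a\mid s, z)\, p_\theta(z\mid s)\, dz$, so that different latent samples $z$ select different trajectory modes, and the learned world model is then used to steer this generative policy toward near-optimal trajectories.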
arXiv Detail & Related papers (2023-07-20T09:05:46Z)
- Policy Gradient for Reinforcement Learning with General Utilities [50.65940899590487]
In Reinforcement Learning (RL), the goal of agents is to discover an optimal policy that maximizes the expected cumulative rewards.
Many supervised and unsupervised RL problems are not covered by this linear RL framework, in which the objective is linear in the state-action occupancy measure.
We derive the policy gradient theorem for RL with general utilities.
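In generic notation (assumed here; the paper's conventions may differ), the objective becomes $J(\theta) = F(\lambda^{\pi_\theta})$ for a utility $F$ of the discounted state-action occupancy measure $\lambda^{\pi_\theta}$, and the chain rule gives $\nabla_\theta J(\theta) = (\nabla_\theta \lambda^{\pi_\theta})^{\top} \nabla_\lambda F(\lambda^{\pi_\theta})$. This recovers the classical policy gradient theorem when $F$ is linear, $F(\lambda) = \langle r, \lambda \rangle$, and otherwise makes $\nabla_\lambda F(\lambda^{\pi_\theta})$ act as a policy-dependent pseudo-reward.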
arXiv Detail & Related papers (2022-10-03T14:57:46Z)
- Jump-Start Reinforcement Learning [68.82380421479675]
We present a meta algorithm that can use offline data, demonstrations, or a pre-existing policy to initialize an RL policy.
In particular, we propose Jump-Start Reinforcement Learning (JSRL), an algorithm that employs two policies to solve tasks.
We show via experiments that JSRL is able to significantly outperform existing imitation and reinforcement learning algorithms.
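The jump-start mechanism can be sketched in a few lines: the guide policy acts for the first h steps of each episode, the learning (exploration) policy takes over afterwards, and h shrinks as the learner improves. The environment interface and curriculum schedule below are assumptions, not the authors' code.

def jsrl_rollout(env, guide_policy, explore_policy, h):
    """One episode: guide policy controls the first h steps, the learner controls the rest."""
    transitions = []
    state, done, t = env.reset(), False, 0
    while not done:
        policy = guide_policy if t < h else explore_policy
        action = policy(state)
        next_state, reward, done = env.step(action)   # simplified (state, reward, done) interface
        transitions.append((state, action, reward, next_state, done, t >= h))
        state, t = next_state, t + 1
    return transitions

def shrink_guide_horizon(h, eval_return, threshold, step=5):
    """Assumed curriculum: shorten the guide roll-in once the combined policy performs well enough."""
    return max(h - step, 0) if eval_return >= threshold else h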
arXiv Detail & Related papers (2022-04-05T17:25:22Z)
- Improving Actor-Critic Reinforcement Learning via Hamiltonian Policy [11.34520632697191]
Approximating optimal policies is often necessary in real-world reinforcement learning (RL) scenarios.
In this work, inspired by the previous use of Hamiltonian Monte Carlo (HMC) in variational inference (VI), we propose to integrate policy optimization with HMC.
We show that the proposed approach is a data-efficient and easy-to-implement improvement over previous policy optimization methods.
arXiv Detail & Related papers (2021-03-22T17:26:43Z)
- Provably Correct Optimization and Exploration with Non-linear Policies [65.60853260886516]
ENIAC is an actor-critic method that allows non-linear function approximation in the critic.
We show that under certain assumptions, the learner finds a near-optimal policy in $O(poly(d))$ exploration rounds.
We empirically evaluate this adaptation and show that it outperforms prior approaches inspired by linear methods.
arXiv Detail & Related papers (2021-03-22T03:16:33Z)
- Inverse Reinforcement Learning from a Gradient-based Learner [41.8663538249537]
Inverse Reinforcement Learning addresses the problem of inferring an expert's reward function from demonstrations.
In this paper, we propose a new algorithm for this setting, in which the goal is to recover the reward function being optimized by an agent.
arXiv Detail & Related papers (2020-07-15T16:41:00Z)
- Discrete Action On-Policy Learning with Action-Value Critic [72.20609919995086]
Reinforcement learning (RL) in discrete action space is ubiquitous in real-world applications, but its complexity grows exponentially with the action-space dimension.
We construct a critic to estimate action-value functions, apply it to correlated actions, and combine these critic-estimated action values to control the variance of gradient estimation.
These efforts result in a new discrete action on-policy RL algorithm that empirically outperforms related on-policy algorithms relying on variance control techniques.
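As a generic illustration of critic-based variance control for discrete actions (not necessarily the paper's exact estimator), the sampled score-function term can be replaced by its exact expectation over the action set, $\nabla_\theta J(\theta) \approx \mathbb{E}_{s}\!\big[\sum_{a \in \mathcal{A}} \nabla_\theta \pi_\theta(a\mid s)\, \hat{Q}(s,a)\big]$, which removes the sampling variance over actions at the cost of one critic evaluation per action; the exponential growth of $\mathcal{A}$ with the action-space dimension is exactly the complexity issue noted above.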
arXiv Detail & Related papers (2020-02-10T04:23:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.