Mixed Policy Gradient: off-policy reinforcement learning driven jointly
by data and model
- URL: http://arxiv.org/abs/2102.11513v2
- Date: Sat, 24 Feb 2024 15:38:46 GMT
- Title: Mixed Policy Gradient: off-policy reinforcement learning driven jointly
by data and model
- Authors: Yang Guan, Jingliang Duan, Shengbo Eben Li, Jie Li, Jianyu Chen, Bo
Cheng
- Abstract summary: Reinforcement learning (RL) shows great potential in sequential decision-making.
Mainstream RL algorithms are data-driven; they usually yield better asymptotic performance but much slower convergence than model-driven methods.
This paper proposes the mixed policy gradient (MPG) algorithm, which fuses empirical data and the transition model in the policy gradient (PG) to accelerate convergence without performance degradation.
- Score: 32.61834127169759
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement learning (RL) shows great potential in sequential
decision-making. At present, mainstream RL algorithms are data-driven, which
usually yield better asymptotic performance but much slower convergence
compared with model-driven methods. This paper proposes the mixed policy gradient
(MPG) algorithm, which fuses the empirical data and the transition model in
policy gradient (PG) to accelerate convergence without performance degradation.
Formally, MPG is constructed as a weighted average of the data-driven and
model-driven PGs, where the former is the derivative of the learned Q-value
function, and the latter is that of the model-predictive return. To guide the
weight design, we analyze and compare the upper bound of each PG error. Relying
on that, a rule-based method is employed to heuristically adjust the weights.
In particular, to obtain a better PG, the weight of the data-driven PG is designed
to grow along the learning process while that of the model-driven PG decreases. Simulation
results show that the MPG method achieves the best asymptotic performance and
convergence speed compared with other baseline algorithms.
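For illustration only (not the authors' implementation), the fused gradient can be written as a convex combination of the two PGs, with a hypothetical linear schedule standing in for the paper's rule-based weight adjustment:

```python
def mixed_policy_gradient(data_pg, model_pg, step, total_steps):
    """Sketch of the MPG fusion rule described in the abstract.

    data_pg:  data-driven PG, the derivative of the learned Q-value function.
    model_pg: model-driven PG, the derivative of the model-predictive return.
    The linear schedule below is a hypothetical rule-based choice; the paper
    only states that the data-driven weight grows during training while the
    model-driven weight decreases, guided by the PG error bounds.
    """
    w_data = min(step / total_steps, 1.0)          # grows toward 1
    return w_data * data_pg + (1.0 - w_data) * model_pg
```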
Related papers
- Model-Based Reparameterization Policy Gradient Methods: Theory and
Practical Algorithms [88.74308282658133]
Reparameterization (RP) Policy Gradient Methods (PGMs) have been widely adopted for continuous control tasks in robotics and computer graphics.
Recent studies have revealed that, when applied to long-term reinforcement learning problems, model-based RP PGMs may experience chaotic and non-smooth optimization landscapes.
We propose a spectral normalization method to mitigate the exploding variance issue caused by long model unrolls.
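As a rough sketch of the idea, assuming a PyTorch MLP serves as the learned transition model, spectral normalization can be applied to each linear layer to bound the model's Lipschitz constant and thus keep gradients from exploding over long unrolls (the class and dimensions are illustrative):

```python
import torch
import torch.nn as nn

class SmoothDynamics(nn.Module):
    """Hypothetical one-step transition model with spectrally normalized
    layers, so gradients backpropagated through long model unrolls are
    less prone to exploding."""
    def __init__(self, state_dim=4, action_dim=2, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.utils.spectral_norm(nn.Linear(state_dim + action_dim, hidden)),
            nn.Tanh(),
            nn.utils.spectral_norm(nn.Linear(hidden, state_dim)),
        )

    def forward(self, state, action):
        # Residual form: predict the state change rather than the next state.
        return state + self.net(torch.cat([state, action], dim=-1))
```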
arXiv Detail & Related papers (2023-10-30T18:43:21Z)
- When to Update Your Model: Constrained Model-based Reinforcement
Learning [50.74369835934703]
We propose a novel and general theoretical scheme that provides a non-decreasing performance guarantee for model-based RL (MBRL).
The derived bounds reveal the relationship between model shifts and performance improvement.
A further example demonstrates that learning models from a dynamically varying number of explorations benefits the eventual returns.
arXiv Detail & Related papers (2022-10-15T17:57:43Z)
- Adaptive Latent Factor Analysis via Generalized Momentum-Incorporated
Particle Swarm Optimization [6.2303427193075755]
A stochastic gradient descent (SGD) algorithm is an effective learning strategy for building a latent factor analysis (LFA) model on a high-dimensional and incomplete (HDI) matrix.
A particle swarm optimization (PSO) algorithm is commonly adopted to make an SGD-based LFA model's hyperparameters, i.e., the learning rate and regularization coefficient, self-adaptive.
This paper incorporates more historical information into each particle's evolutionary process to avoid premature convergence.
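A minimal sketch of a momentum-augmented PSO velocity update (a generic illustration, not the paper's exact generalized-momentum scheme), where each particle encodes a candidate (learning rate, regularization coefficient) pair:

```python
import numpy as np

def gm_pso_step(pos, vel, pbest, gbest, w=0.7, c1=1.5, c2=1.5, beta=0.3, rng=None):
    """One swarm update; `beta` blends in historical velocity (momentum).
    pos, vel, pbest: arrays of shape (n_particles, 2); gbest: shape (2,)."""
    rng = rng if rng is not None else np.random.default_rng()
    r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
    new_vel = (w * vel
               + c1 * r1 * (pbest - pos)      # pull toward personal best
               + c2 * r2 * (gbest - pos))     # pull toward global best
    new_vel = beta * vel + (1.0 - beta) * new_vel  # extra historical momentum
    return pos + new_vel, new_vel
```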
arXiv Detail & Related papers (2022-08-04T03:15:07Z)
- Training Discrete Deep Generative Models via Gapped Straight-Through
Estimator [72.71398034617607]
We propose a Gapped Straight-Through (GST) estimator to reduce the variance without incurring resampling overhead.
This estimator is inspired by the essential properties of Straight-Through Gumbel-Softmax.
Experiments demonstrate that the proposed GST estimator enjoys better performance compared to strong baselines on two discrete deep generative modeling tasks.
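For context (this is the baseline that GST builds on, not the GST estimator itself), a straight-through Gumbel-Softmax returns hard one-hot samples in the forward pass while routing gradients through the soft relaxation:

```python
import torch
import torch.nn.functional as F

def st_gumbel_softmax(logits, tau=1.0):
    """Straight-Through Gumbel-Softmax, written out explicitly (equivalent
    to F.gumbel_softmax(logits, tau=tau, hard=True))."""
    soft = F.gumbel_softmax(logits, tau=tau, hard=False)     # relaxed sample
    index = soft.argmax(dim=-1, keepdim=True)
    hard = torch.zeros_like(soft).scatter_(-1, index, 1.0)   # one-hot forward
    return hard + soft - soft.detach()                       # soft gradient backward
```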
arXiv Detail & Related papers (2022-06-15T01:46:05Z)
- Reinforcement Learning from Demonstrations by Novel Interactive Expert
and Application to Automatic Berthing Control Systems for Unmanned Surface
Vessel [12.453219390225428]
Two novel practical methods of Reinforcement Learning from Demonstration (RLfD) are developed and applied to automatic berthing control systems for unmanned surface vessels.
A new expert data generation method, called Model Predictive Based Expert (MPBE), is developed to provide high quality supervision data for RLfD algorithms.
Another novel RLfD algorithm based on MP-DDPG, called Self-Guided Actor-Critic (SGAC), is presented, which can effectively leverage MPBE by continuously querying it to generate high-quality expert data online.
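A hypothetical sketch of that online querying loop: at states visited by the learning agent, the model-predictive expert is asked for its action and the pair is stored as a demonstration (all interfaces here are illustrative, not the paper's API):

```python
def relabel_with_expert(visited_states, mpc_expert, demo_buffer):
    """Continuously turn on-policy states into fresh expert supervision."""
    for state in visited_states:
        expert_action = mpc_expert.act(state)   # query the MPC-based expert online
        demo_buffer.add(state, expert_action)   # supervision data for the actor
```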
arXiv Detail & Related papers (2022-02-23T06:45:59Z)
- COMBO: Conservative Offline Model-Based Policy Optimization [120.55713363569845]
Uncertainty estimation with complex models, such as deep neural networks, can be difficult and unreliable.
We develop a new model-based offline RL algorithm, COMBO, that regularizes the value function on out-of-support state-actions.
We find that COMBO consistently performs as well as or better than prior offline model-free and model-based methods.
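A minimal sketch of a conservative critic regularizer in COMBO's spirit: the Q-function is pushed down on model-generated (possibly out-of-support) samples and up on real dataset samples; the coefficient and batch interfaces are illustrative:

```python
def conservative_penalty(q_net, model_s, model_a, data_s, data_a, beta=1.0):
    # Added to the usual Bellman loss; penalizes optimistic values on
    # model rollouts relative to values on the offline dataset.
    return beta * (q_net(model_s, model_a).mean() - q_net(data_s, data_a).mean())
```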
arXiv Detail & Related papers (2021-02-16T18:50:32Z)
- Logistic Q-Learning [87.00813469969167]
We propose a new reinforcement learning algorithm derived from a regularized linear-programming formulation of optimal control in MDPs.
The main feature of our algorithm is a convex loss function for policy evaluation that serves as a theoretically sound alternative to the widely used squared Bellman error.
arXiv Detail & Related papers (2020-10-21T17:14:31Z)
- Modeling Stochastic Microscopic Traffic Behaviors: a Physics Regularized
Gaussian Process Approach [1.6242924916178285]
This study presents a microscopic traffic model that can capture randomness and measurement errors in the real world.
Since one unique feature of the proposed framework is its ability to capture both car-following and lane-changing behaviors with a single model, numerical tests are carried out on two separate datasets.
arXiv Detail & Related papers (2020-07-17T06:03:32Z)
- Model-Augmented Actor-Critic: Backpropagating through Paths [81.86992776864729]
Current model-based reinforcement learning approaches use the model simply as a learned black-box simulator.
We show how to make more effective use of the model by exploiting its differentiability.
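As a sketch of what exploiting differentiability means here, one can unroll the learned model for a few steps and backpropagate the accumulated return directly into the policy; all names below are hypothetical, and the paper's full objective also involves a learned critic at the horizon:

```python
def pathwise_return(policy, dynamics, reward_fn, s0, horizon=5, gamma=0.99):
    """Differentiable H-step return: gradients flow through the learned
    dynamics model instead of treating it as a black-box simulator."""
    ret, s = 0.0, s0
    for t in range(horizon):
        a = policy(s)                  # reparameterized action keeps the gradient path
        ret = ret + (gamma ** t) * reward_fn(s, a)
        s = dynamics(s, a)             # differentiable transition
    return ret                         # ascend via loss = -ret.mean()
```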
arXiv Detail & Related papers (2020-05-16T19:18:10Z)
- Stochastic Recursive Momentum for Policy Gradient Methods [28.277961340108313]
We propose a novel algorithm named STOchastic Recursive Momentum for Policy Gradient (STORM-PG).
STORM-PG enjoys a provably sharp $O(1/\epsilon^3)$ sample complexity bound, matching the best-known convergence rate for policy gradient algorithms.
Numerical experiments demonstrate the superiority of our algorithm over comparable policy gradient algorithms.
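The core STORM-style recursive-momentum estimator that STORM-PG adapts to policy gradients looks roughly like the following (importance-weighting corrections for the shifting policy distribution are omitted):

```python
def storm_direction(grad_new, grad_old, d_prev, a=0.1):
    """Recursive momentum: d_t = g(x_t; th_t) + (1 - a) * (d_{t-1} - g(x_t; th_{t-1})).
    grad_new / grad_old are gradients at the same sample under the current and
    previous parameters; larger `a` interpolates toward plain stochastic gradient."""
    return grad_new + (1.0 - a) * (d_prev - grad_old)
```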
arXiv Detail & Related papers (2020-03-09T17:59:03Z)
This list is automatically generated from the titles and abstracts of the papers on this site.