On-line Policy Improvement using Monte-Carlo Search
- URL: http://arxiv.org/abs/2501.05407v1
- Date: Thu, 09 Jan 2025 18:05:05 GMT
- Title: On-line Policy Improvement using Monte-Carlo Search
- Authors: Gerald Tesauro, Gregory R. Galperin
- Abstract summary: We present a Monte-Carlo simulation algorithm for real-time policy improvement of an adaptive controller.
Our algorithm is easily parallelizable and has been implemented on the IBM SP1 and SP2 parallel-RISC supercomputers.
- Score: 5.787890316746393
- Abstract: We present a Monte-Carlo simulation algorithm for real-time policy improvement of an adaptive controller. In the Monte-Carlo simulation, the long-term expected reward of each possible action is statistically measured, using the initial policy to make decisions in each step of the simulation. The action maximizing the measured expected reward is then taken, resulting in an improved policy. Our algorithm is easily parallelizable and has been implemented on the IBM SP1 and SP2 parallel-RISC supercomputers. We have obtained promising initial results in applying this algorithm to the domain of backgammon. Results are reported for a wide variety of initial policies, ranging from a random policy to TD-Gammon, an extremely strong multi-layer neural network. In each case, the Monte-Carlo algorithm gives a substantial reduction, by as much as a factor of 5 or more, in the error rate of the base players. The algorithm is also potentially useful in many other adaptive control applications in which it is possible to simulate the environment.
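The rollout procedure described in the abstract can be sketched in a few lines. This is an illustrative sketch only, assuming a generic simulator interface (`simulate_step`, `legal_actions`) that is not specified in the paper; function names and parameters are hypothetical.

```python
def rollout_value(simulate_step, state, action, base_policy,
                  n_rollouts=100, horizon=50):
    """Estimate the long-term expected reward of taking `action` in `state`
    by Monte-Carlo simulation, following `base_policy` thereafter."""
    total = 0.0
    for _ in range(n_rollouts):
        s, reward = simulate_step(state, action)  # take the candidate action once
        for _ in range(horizon):                  # then follow the base policy
            s, r = simulate_step(s, base_policy(s))
            reward += r
        total += reward
    return total / n_rollouts

def improved_policy(simulate_step, legal_actions, base_policy):
    """One-step rollout improvement: select the action whose Monte-Carlo
    estimate of expected reward is highest."""
    def policy(state):
        return max(legal_actions(state),
                   key=lambda a: rollout_value(simulate_step, state, a,
                                               base_policy))
    return policy
```

Because each candidate action's rollouts are independent, the outer loop over actions (and over rollouts) parallelizes trivially, which is what made the SP1/SP2 implementation straightforward.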
Related papers
- Truncating Trajectories in Monte Carlo Reinforcement Learning [48.97155920826079]
In Reinforcement Learning (RL), an agent acts in an unknown environment to maximize the expected cumulative discounted sum of an external reward signal.
We propose an a-priori budget allocation strategy that leads to the collection of trajectories of different lengths.
We show that an appropriate truncation of the trajectories can succeed in improving performance.
arXiv Detail & Related papers (2023-05-07T19:41:57Z)
- Efficient Policy Evaluation with Offline Data Informed Behavior Policy Design [18.326126953667842]
We propose novel methods that improve the data efficiency of online Monte Carlo estimators.
We first propose a tailored closed-form behavior policy that provably reduces the variance of an online Monte Carlo estimator.
We then design efficient algorithms to learn this closed-form behavior policy from previously collected offline data.
arXiv Detail & Related papers (2023-01-31T16:12:31Z)
- Maximum-Likelihood Inverse Reinforcement Learning with Finite-Time Guarantees [56.848265937921354]
Inverse reinforcement learning (IRL) aims to recover the reward function and the associated optimal policy.
Many algorithms for IRL have an inherently nested structure.
We develop a novel single-loop algorithm for IRL that does not compromise reward estimation accuracy.
arXiv Detail & Related papers (2022-10-04T17:13:45Z)
- Multi-Objective Policy Gradients with Topological Constraints [108.10241442630289]
We present a new policy gradient algorithm for TMDPs, obtained by a simple extension of the proximal policy optimization (PPO) algorithm.
We demonstrate this on a real-world multiple-objective navigation problem with an arbitrary ordering of objectives both in simulation and on a real robot.
arXiv Detail & Related papers (2022-09-15T07:22:58Z)
- Robust optimal well control using an adaptive multi-grid reinforcement learning framework [0.0]
Reinforcement learning is a promising tool to solve robust optimal well control problems.
The proposed framework is demonstrated using a state-of-the-art, model-free policy-based RL algorithm.
Prominent gains in computational efficiency are observed using the proposed framework, which saves around 60-70% of the computational cost of its single fine-grid counterpart.
arXiv Detail & Related papers (2022-07-07T12:08:57Z)
- Policy Gradient Algorithms with Monte Carlo Tree Learning for Non-Markov Decision Processes [3.9311044240639568]
Policy gradient (PG) is a reinforcement learning (RL) approach that optimizes a parameterized policy model to maximize expected return via gradient ascent.
While PG can work well even in non-Markovian environments, it may encounter plateaus or peakiness issues.
In this work, we first introduce Monte Carlo Tree Learning (MCTL), an adaptation of MCTS for online RL. We then explore a combined policy approach of PG and MCTL to leverage their strengths.
arXiv Detail & Related papers (2022-06-02T12:21:40Z)
- Learning Optimal Antenna Tilt Control Policies: A Contextual Linear Bandit Approach [65.27783264330711]
Controlling antenna tilts in cellular networks is imperative to reach an efficient trade-off between network coverage and capacity.
We devise algorithms learning optimal tilt control policies from existing data.
We show that they can produce an optimal tilt update policy using far fewer data samples than naive or existing rule-based learning algorithms.
arXiv Detail & Related papers (2022-01-06T18:24:30Z)
- Learning Sampling Policy for Faster Derivative Free Optimization [100.27518340593284]
We propose a new reinforcement learning based ZO algorithm (ZO-RL) that learns the sampling policy for generating perturbations in ZO optimization, instead of using random sampling.
Our results show that our ZO-RL algorithm can effectively reduce the variance of the ZO gradient estimates by learning a sampling policy, and converge faster than existing ZO algorithms in different scenarios.
arXiv Detail & Related papers (2021-04-09T14:50:59Z)
- Recurrent Model Predictive Control [19.047059454849897]
We propose an off-line algorithm, called Recurrent Model Predictive Control (RMPC), to solve general nonlinear finite-horizon optimal control problems.
Our algorithm employs a recurrent function to approximate the optimal policy, which maps the system states and reference values directly to the control inputs.
arXiv Detail & Related papers (2021-02-23T15:01:36Z)
- Model-Based Policy Search Using Monte Carlo Gradient Estimation with Real Systems Application [12.854118767247453]
We present a Model-Based Reinforcement Learning (MBRL) algorithm named Monte Carlo Probabilistic Inference for Learning COntrol (MC-PILCO).
The algorithm relies on Gaussian Processes (GPs) to model the system dynamics and on a Monte Carlo approach to estimate the policy gradient.
Numerical comparisons in a simulated cart-pole environment show that MC-PILCO exhibits better data efficiency and control performance.
arXiv Detail & Related papers (2021-01-28T17:01:15Z)
- Deep Bayesian Quadrature Policy Optimization [100.81242753620597]
Deep Bayesian quadrature policy gradient (DBQPG) is a high-dimensional generalization of Bayesian quadrature for policy gradient estimation.
We show that DBQPG can substitute for Monte-Carlo estimation in policy gradient methods, and demonstrate its effectiveness on a set of continuous control benchmarks.
arXiv Detail & Related papers (2020-06-28T15:44:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.