On the Convergence of Reinforcement Learning with Monte Carlo Exploring Starts
- URL: http://arxiv.org/abs/2007.10916v1
- Date: Tue, 21 Jul 2020 16:19:09 GMT
- Title: On the Convergence of Reinforcement Learning with Monte Carlo Exploring Starts
- Authors: Jun Liu
- Abstract summary: A basic simulation-based reinforcement learning algorithm is the Monte Carlo Exploring Starts (MCES) method.
We investigate the convergence of this algorithm for the case with undiscounted costs, also known as the stochastic shortest path problem.
As a side result, we also provide a proof of a version of the supermartingale convergence theorem commonly used in stochastic approximation.
- Score: 5.137144629366217
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A basic simulation-based reinforcement learning algorithm is the Monte Carlo
Exploring Starts (MCES) method, also known as optimistic policy iteration, in
which the value function is approximated by simulated returns and a greedy
policy is selected at each iteration. The convergence of this algorithm in the
general setting has been an open question. In this paper, we investigate the
convergence of this algorithm for the case with undiscounted costs, also known
as the stochastic shortest path problem. The results complement existing
partial results on this topic and thereby help further settle the open
problem. As a side result, we also provide a proof of a version of the
supermartingale convergence theorem commonly used in stochastic approximation.
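A minimal sketch of the MCES loop described in the abstract may help fix ideas: each episode starts from a uniformly random state-action pair (the "exploring start"), values are estimated by undiscounted simulated returns (the stochastic shortest path setting), and the policy is made greedy, i.e. cost-minimizing, at each iteration. The environment interface (`env.step`), the state and action sets, and all hyperparameters are illustrative assumptions, not the paper's construction.

```python
import random
from collections import defaultdict

# Sketch of Monte Carlo Exploring Starts (MCES) on a tabular MDP with costs.
# The environment API and hyperparameters are illustrative assumptions.

def simulate(env, policy, s0, a0, max_steps=1000):
    """Roll out one episode from the exploring start (s0, a0)."""
    episode, s, a = [], s0, a0
    for _ in range(max_steps):
        s_next, cost, done = env.step(s, a)  # assumed environment API
        episode.append((s, a, cost))
        if done:
            break
        s, a = s_next, policy[s_next]
    return episode

def mces(env, states, actions, num_iterations=10000):
    Q = defaultdict(float)       # state-action value estimates
    counts = defaultdict(int)    # visit counts for running averages
    policy = {s: random.choice(actions) for s in states}

    for _ in range(num_iterations):
        # Exploring start: a uniformly random state-action pair.
        s0, a0 = random.choice(states), random.choice(actions)
        episode = simulate(env, policy, s0, a0)

        # First-visit Monte Carlo update with undiscounted returns.
        G, visited = 0.0, set()
        for (s, a, cost) in reversed(episode):
            G += cost
            if (s, a) not in visited:
                visited.add((s, a))
                counts[(s, a)] += 1
                Q[(s, a)] += (G - Q[(s, a)]) / counts[(s, a)]

        # Policy improvement: greedy (cost-minimizing) action selection.
        for s in {s for (s, _, _) in episode}:
            policy[s] = min(actions, key=lambda a: Q[(s, a)])
    return policy, Q
```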
Related papers
- Finite-Sample Analysis of the Monte Carlo Exploring Starts Algorithm for Reinforcement Learning [0.0]
We prove a novel result on the convergence rate of the Monte Carlo Exploring Starts algorithm.
We show that the algorithm returns an optimal policy after $\tilde{O}(SAK^{3}\log^{3}\frac{1}{\delta})$ sampled episodes.
arXiv Detail & Related papers (2024-10-03T21:11:29Z)
- Automatic Rao-Blackwellization for Sequential Monte Carlo with Belief Propagation [4.956977275061968]
Exact Bayesian inference on state-space models (SSMs) is in general intractable.
We propose a mixed inference algorithm that computes closed-form solutions using belief propagation as much as possible.
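The Rao-Blackwellization principle behind this mixed scheme can be shown in miniature: sample only the variables you cannot handle analytically, and replace the rest with their closed-form conditional expectations, which never increases variance. The toy model below is an illustrative assumption, unrelated to the paper's SSM experiments.

```python
import numpy as np

# Toy Rao-Blackwellization: X ~ N(0, 1), Y | X ~ N(X, 1); estimate E[Y^2] = 2.
rng = np.random.default_rng(0)
xs = rng.standard_normal(10_000)

# Plain Monte Carlo: sample both X and Y.
ys = xs + rng.standard_normal(10_000)
plain = np.mean(ys ** 2)

# Rao-Blackwellized: integrate Y out in closed form, E[Y^2 | X] = X^2 + 1.
rb = np.mean(xs ** 2 + 1.0)

print(plain, rb)  # both near 2.0; the RB estimate has lower variance
```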
arXiv Detail & Related papers (2023-12-15T15:05:25Z)
- Truncating Trajectories in Monte Carlo Reinforcement Learning [48.97155920826079]
In Reinforcement Learning (RL), an agent acts in an unknown environment to maximize the expected cumulative discounted sum of an external reward signal.
We propose an a-priori budget allocation strategy that leads to the collection of trajectories of different lengths.
We show that an appropriate truncation of the trajectories can succeed in improving performance.
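A rough sketch of the budget-allocation idea: given a fixed budget of environment steps, collect trajectories truncated at several pre-chosen horizons instead of a single fixed one. The gym-style `env.step` interface, the length schedule, and the discount factor are illustrative assumptions, not the paper's allocation rule.

```python
import numpy as np

def collect_with_budget(env, policy, budget, lengths=(10, 50, 200), gamma=0.99):
    """Spend `budget` interaction steps on trajectories of varying lengths."""
    trajectories, i = [], 0
    while budget > 0:
        horizon = min(lengths[i % len(lengths)], budget)  # a-priori schedule
        obs, rewards = env.reset(), []
        for _ in range(horizon):
            obs, r, done, _ = env.step(policy(obs))  # gym-style API assumed
            rewards.append(r)
            if done:
                break
        budget -= len(rewards)
        # Discounted return of the (possibly truncated) trajectory.
        ret = sum(r * gamma ** t for t, r in enumerate(rewards))
        trajectories.append((len(rewards), ret))
        i += 1
    return trajectories
```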
arXiv Detail & Related papers (2023-05-07T19:41:57Z)
- Improved Regret for Efficient Online Reinforcement Learning with Linear Function Approximation [69.0695698566235]
We study reinforcement learning with linear function approximation and adversarially changing cost functions.
We present a computationally efficient policy optimization algorithm for the challenging general setting of unknown dynamics and bandit feedback.
arXiv Detail & Related papers (2023-01-30T17:26:39Z)
- Regret Bounds for Expected Improvement Algorithms in Gaussian Process Bandit Optimization [63.8557841188626]
The expected improvement (EI) algorithm is one of the most popular strategies for optimization under uncertainty.
We propose a variant of EI with a standard incumbent defined via the GP predictive mean.
We show that our algorithm converges, and achieves a cumulative regret bound of $\mathcal{O}(\gamma_T\sqrt{T})$.
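For concreteness, a sketch of the EI acquisition for minimization, with the incumbent taken as the best GP posterior mean over the observed inputs, in the spirit of the variant above. The GP predictive routines (`gp_mean`, `gp_std`) and the candidate set are assumed given.

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, incumbent):
    """EI for minimization given GP predictive mean/std at candidate points."""
    sigma = np.maximum(sigma, 1e-12)  # guard against zero predictive std
    z = (incumbent - mu) / sigma
    return (incumbent - mu) * norm.cdf(z) + sigma * norm.pdf(z)

# Usage sketch (gp_mean/gp_std are placeholder names):
#   incumbent = gp_mean(X_observed).min()
#   scores = expected_improvement(gp_mean(X_cand), gp_std(X_cand), incumbent)
#   next_x = X_cand[np.argmax(scores)]
```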
arXiv Detail & Related papers (2022-03-15T13:17:53Z)
- Parallel Stochastic Mirror Descent for MDPs [72.75921150912556]
We consider the problem of learning the optimal policy for infinite-horizon Markov decision processes (MDPs).
A variant of Mirror Descent is proposed for convex programming problems with Lipschitz-continuous functionals.
We analyze this algorithm in a general case and obtain an estimate of the convergence rate that does not accumulate errors during the operation of the method.
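As a reminder of the underlying primitive, here is a minimal sketch of mirror descent on the probability simplex with negative entropy as the mirror map (the exponentiated-gradient update). The objective, gradient oracle, and step size are illustrative placeholders; the paper's parallel and MDP-specific machinery is not shown.

```python
import numpy as np

def mirror_descent_simplex(grad, x0, steps=100, eta=0.1):
    """Entropic mirror descent over the simplex; returns the averaged iterate."""
    x = np.asarray(x0, dtype=float)
    avg = np.zeros_like(x)
    for _ in range(steps):
        g = grad(x)               # (sub)gradient oracle
        x = x * np.exp(-eta * g)  # mirror step in the dual (log) space
        x /= x.sum()              # Bregman projection back onto the simplex
        avg += x
    return avg / steps

# Example: minimize <c, x> over the simplex; mass concentrates on argmin(c).
#   c = np.array([3.0, 1.0, 2.0])
#   x = mirror_descent_simplex(lambda x: c, np.ones(3) / 3)
```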
arXiv Detail & Related papers (2021-02-27T19:28:39Z)
- Plug-And-Play Learned Gaussian-mixture Approximate Message Passing [71.74028918819046]
We propose a plug-and-play compressed sensing (CS) recovery algorithm suitable for any i.i.d. source prior.
Our algorithm builds upon Borgerding's learned AMP (LAMP), yet significantly improves it by adopting a universal denoising function within the algorithm.
Numerical evaluation shows that the L-GM-AMP algorithm achieves state-of-the-art performance without any knowledge of the source prior.
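To situate the contribution, a sketch of the generic AMP recursion for compressed sensing, with a plain soft-threshold denoiser standing in for the paper's learned Gaussian-mixture denoiser; dimensions, threshold, and iteration count are illustrative.

```python
import numpy as np

def soft_threshold(r, tau):
    return np.sign(r) * np.maximum(np.abs(r) - tau, 0.0)

def amp(A, y, iters=30, tau=0.1):
    """Basic AMP for y = A x + noise, with a soft-threshold denoiser."""
    m, n = A.shape
    x, z = np.zeros(n), y.copy()
    for _ in range(iters):
        r = x + A.T @ z                 # pseudo-data estimate
        x_new = soft_threshold(r, tau)  # denoising step
        # Onsager correction: (n/m) * z * mean derivative of the denoiser.
        z = y - A @ x_new + (n / m) * z * np.mean(np.abs(r) > tau)
        x = x_new
    return x
```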
arXiv Detail & Related papers (2020-11-18T16:40:45Z)
- Pathwise Conditioning of Gaussian Processes [72.61885354624604]
Conventional approaches for simulating Gaussian process posteriors view samples as draws from marginal distributions of process values at finite sets of input locations.
This distribution-centric characterization leads to generative strategies that scale cubically in the size of the desired random vector.
Viewing conditioning instead at the level of sampled functions, we show how this pathwise interpretation gives rise to a general family of approximations that lend themselves to efficiently sampling Gaussian process posteriors.
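The pathwise view admits a compact sketch via Matheron's update: a posterior sample is a prior sample plus a data-driven correction, so conditioning acts on sampled functions rather than on marginal distributions. The RBF kernel, noise level, and jitter below are illustrative choices.

```python
import numpy as np

def rbf(a, b, ls=1.0):
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / ls) ** 2)

def pathwise_posterior_sample(X, y, Xs, noise=1e-2, seed=0):
    """Draw a GP posterior sample at Xs by updating a prior sample:
    (f | y)(.) = f(.) + K(., X) (K(X, X) + noise I)^{-1} (y - f(X) - eps)."""
    rng = np.random.default_rng(seed)
    Xall = np.concatenate([X, Xs])
    K = rbf(Xall, Xall) + 1e-10 * np.eye(len(Xall))  # jitter for stability
    f = np.linalg.cholesky(K) @ rng.standard_normal(len(Xall))  # prior draw
    fX, fXs = f[:len(X)], f[len(X):]
    eps = rng.normal(0.0, np.sqrt(noise), len(X))
    Kxx = rbf(X, X) + noise * np.eye(len(X))
    return fXs + rbf(Xs, X) @ np.linalg.solve(Kxx, y - fX - eps)
```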
arXiv Detail & Related papers (2020-11-08T17:09:37Z)
- Monte-Carlo Tree Search as Regularized Policy Optimization [47.541849128047865]
We show that AlphaZero's search algorithms are an approximation to the solution of a specific regularized policy optimization problem.
We propose a variant of AlphaZero which uses the exact solution to this policy optimization problem, and show experimentally that it reliably outperforms the original algorithm in multiple domains.
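The regularized subproblem can be sketched directly: maximize $q^{\top}\pi - \lambda\,\mathrm{KL}(\pi_{\text{prior}}\,\|\,\pi)$ over the simplex, whose maximizer takes the form $\pi(a) \propto \lambda\,\pi_{\text{prior}}(a)/(\alpha - q(a))$ with the normalizer $\alpha$ found by a one-dimensional search. The numbers and bisection bounds below are illustrative, and this shows only the optimization subproblem, not the full search algorithm.

```python
import numpy as np

def regularized_policy(q, pi_prior, lam, iters=60):
    """Solve max_pi  q.pi - lam * KL(pi_prior || pi)  over the simplex."""
    # pi(a) = lam * pi_prior(a) / (alpha - q[a]); bisect alpha so sum(pi) = 1.
    lo, hi = q.max() + 1e-12, q.max() + lam  # sum >= 1 at lo, sum <= 1 at hi
    for _ in range(iters):
        alpha = 0.5 * (lo + hi)
        total = np.sum(lam * pi_prior / (alpha - q))
        lo, hi = (alpha, hi) if total > 1.0 else (lo, alpha)
    pi = lam * pi_prior / (0.5 * (lo + hi) - q)
    return pi / pi.sum()  # tidy up residual normalization error

# Example (illustrative numbers):
#   q = np.array([0.2, 0.5, 0.1]); prior = np.array([0.3, 0.4, 0.3])
#   print(regularized_policy(q, prior, lam=0.5))
```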
arXiv Detail & Related papers (2020-07-24T13:01:34Z)
- Variational Policy Gradient Method for Reinforcement Learning with General Utilities [38.54243339632217]
In recent years, reinforcement learning systems with general goals beyond a cumulative sum of rewards have gained traction.
In this paper, we consider policy optimization in Markov Decision Problems, where the objective is a general concave utility function of the state-action occupancy measure.
We derive a new Variational Policy Gradient Theorem for RL with general utilities.
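Concretely, the general-utility objective replaces the linear expected return with a concave function of the state-action occupancy measure; one standard way to write it (notation assumed here, not quoted from the abstract):

```latex
\max_{\pi}\; F\big(\lambda^{\pi}\big), \qquad
\lambda^{\pi}(s,a) \;=\; (1-\gamma)\sum_{t=0}^{\infty} \gamma^{t}\,
\mathbb{P}\big(s_t = s,\, a_t = a \mid \pi\big),
```

with $F$ concave; the linear choice $F(\lambda) = \langle r, \lambda\rangle$ recovers the usual (normalized) cumulative discounted reward.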
arXiv Detail & Related papers (2020-07-04T17:51:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.