Optimistic Planning by Regularized Dynamic Programming
- URL: http://arxiv.org/abs/2302.14004v3
- Date: Wed, 14 Jun 2023 12:50:26 GMT
- Title: Optimistic Planning by Regularized Dynamic Programming
- Authors: Antoine Moulin, Gergely Neu
- Abstract summary: We propose a new method for optimistic planning in infinite-horizon discounted Markov decision processes.
This technique allows us to avoid contraction and monotonicity arguments.
We show it achieves near-optimal statistical guarantees.
- Score: 12.411844611718958
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a new method for optimistic planning in infinite-horizon
discounted Markov decision processes based on the idea of adding regularization
to the updates of an otherwise standard approximate value iteration procedure.
This technique allows us to avoid contraction and monotonicity arguments
typically required by existing analyses of approximate dynamic programming
methods, and in particular to use approximate transition functions estimated
via least-squares procedures in MDPs with linear function approximation. We use
our method to recover known guarantees in tabular MDPs and to provide a
computationally efficient algorithm for learning near-optimal policies in
discounted linear mixture MDPs from a single stream of experience, and show it
achieves near-optimal statistical guarantees.
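For intuition, the recipe described in the abstract, in its simplest tabular form, is: build an optimistic Q-estimate from an estimated model plus an exploration bonus, then back it up with a regularized (log-sum-exp) operator instead of the hard max. The sketch below is only a rough illustration under assumed names (`P_hat`, `bonus`, `eta` are placeholders); the paper's actual algorithm operates on least-squares estimates in linear mixture MDPs and is not reproduced here.

```python
import numpy as np

def regularized_optimistic_vi(P_hat, r, bonus, gamma=0.9, eta=10.0, n_iters=200):
    """Tabular sketch of optimistic, regularized approximate value iteration.

    P_hat: estimated transition kernel, shape (S, A, S)
    r:     rewards in [0, 1], shape (S, A)
    bonus: exploration bonus, shape (S, A) (assumed given, e.g. count-based)

    The names and the specific log-sum-exp backup are illustrative choices,
    not the linear mixture MDP algorithm analyzed in the paper.
    """
    S, A, _ = P_hat.shape
    V = np.zeros(S)
    for _ in range(n_iters):
        # Optimistic Q-estimate: empirical model plus bonus, clipped to the
        # valid value range so optimism cannot blow up.
        Q = np.minimum(r + bonus + gamma * P_hat @ V, 1.0 / (1.0 - gamma))
        # Regularized backup: a log-sum-exp ("soft max") over actions instead
        # of a hard max; it approaches the hard max as eta grows.
        V = np.log(np.mean(np.exp(eta * Q), axis=1)) / eta
    # The output is a softmax (Boltzmann) policy over the final Q-values.
    pi = np.exp(eta * (Q - Q.max(axis=1, keepdims=True)))
    pi /= pi.sum(axis=1, keepdims=True)
    return V, pi
```

For finite eta the backup is smooth rather than a hard maximum, which is loosely the kind of structure the abstract alludes to when it mentions avoiding contraction and monotonicity arguments.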
Related papers
- Efficient Learning of POMDPs with Known Observation Model in Average-Reward Setting [56.92178753201331]
We propose the Observation-Aware Spectral (OAS) estimation technique, which enables the POMDP parameters to be learned from samples collected using a belief-based policy.
We show the consistency of the OAS procedure, and we prove a regret guarantee of order $\mathcal{O}(\sqrt{T \log(T)})$ for the proposed OAS-UCRL algorithm.
arXiv Detail & Related papers (2024-10-02T08:46:34Z)
- Locally Optimal Descent for Dynamic Stepsize Scheduling [45.6809308002043]
We introduce a novel dynamic learning-rate scheduling scheme grounded in theory, with the goal of simplifying the manual and time-consuming tuning of schedules in practice.
Our approach is based on estimating a locally-optimal stepsize along the current gradient direction.
Our findings indicate that our method needs minimal tuning when compared to existing approaches.
arXiv Detail & Related papers (2023-11-23T09:57:35Z)
- Efficient Global Planning in Large MDPs via Stochastic Primal-Dual Optimization [12.411844611718958]
We show that our method outputs a near-optimal policy after a number of queries to the generative model.
Our method is computationally efficient and comes with the major advantage that it outputs a single softmax policy that is compactly represented by a low-dimensional parameter vector; a schematic sketch of such a policy appears after this list.
arXiv Detail & Related papers (2022-10-21T15:49:20Z)
- Proximal Point Imitation Learning [48.50107891696562]
We develop new algorithms with rigorous efficiency guarantees for infinite horizon imitation learning.
We leverage classical tools from optimization, in particular, the proximal-point method (PPM) and dual smoothing.
We achieve convincing empirical performance for both linear and neural network function approximation.
arXiv Detail & Related papers (2022-09-22T12:40:21Z)
- Multi-Objective Policy Gradients with Topological Constraints [108.10241442630289]
We present a new policy gradient algorithm for TMDPs, obtained as a simple extension of the proximal policy optimization (PPO) algorithm.
We demonstrate this on a real-world multiple-objective navigation problem with an arbitrary ordering of objectives both in simulation and on a real robot.
arXiv Detail & Related papers (2022-09-15T07:22:58Z)
- Making Linear MDPs Practical via Contrastive Representation Learning [101.75885788118131]
It is common to address the curse of dimensionality in Markov decision processes (MDPs) by exploiting low-rank representations.
We consider an alternative definition of linear MDPs that automatically ensures normalization while allowing efficient representation learning.
We demonstrate superior performance over existing state-of-the-art model-based and model-free algorithms on several benchmarks.
arXiv Detail & Related papers (2022-07-14T18:18:02Z)
- Stochastic first-order methods for average-reward Markov decision processes [10.023632561462712]
We study average-reward Markov decision processes (AMDPs) and develop novel first-order methods with strong theoretical guarantees for both policy optimization and policy evaluation.
By combining the policy evaluation and policy optimization parts, we establish sample complexity results for solving AMDPs under both generative and Markovian noise models.
arXiv Detail & Related papers (2022-05-11T23:02:46Z)
- Stochastic convex optimization for provably efficient apprenticeship learning [1.0609815608017066]
We consider large-scale Markov decision processes (MDPs) with an unknown cost function.
We employ convex optimization tools to address the problem of imitation learning, which consists of learning a policy from a finite set of expert demonstrations.
arXiv Detail & Related papers (2021-12-31T19:47:57Z)
- Parallel Stochastic Mirror Descent for MDPs [72.75921150912556]
We consider the problem of learning the optimal policy for infinite-horizon Markov decision processes (MDPs).
A variant of Mirror Descent is proposed for convex programming problems with Lipschitz-continuous functionals.
We analyze this algorithm in a general case and obtain a convergence-rate estimate that does not accumulate errors as the method runs.
arXiv Detail & Related papers (2021-02-27T19:28:39Z)
- Logistic Q-Learning [87.00813469969167]
We propose a new reinforcement learning algorithm derived from a regularized linear-programming formulation of optimal control in MDPs.
The main feature of our algorithm is a convex loss function for policy evaluation that serves as a theoretically sound alternative to the widely used squared Bellman error.
arXiv Detail & Related papers (2020-10-21T17:14:31Z)
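As referenced in the "Efficient Global Planning in Large MDPs via Stochastic Primal-Dual Optimization" entry above, a softmax policy compactly represented by a low-dimensional parameter vector typically looks like the following sketch. The feature map `phi` and all other names are assumed placeholders for illustration, not objects defined in that paper.

```python
import numpy as np

def softmax_policy(theta, phi, state, actions):
    """Softmax policy stored as a single low-dimensional vector theta.

    phi(state, action) -> np.ndarray of shape (d,) is an assumed feature map;
    storing only theta (d numbers) instead of an |S| x |A| probability table
    is what makes the representation compact. Names are illustrative only.
    """
    logits = np.array([theta @ phi(state, a) for a in actions])
    logits -= logits.max()            # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()        # probability of each action
```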
This list is automatically generated from the titles and abstracts of the papers on this site.