Average-Reward Learning and Planning with Options
- URL: http://arxiv.org/abs/2110.13855v1
- Date: Tue, 26 Oct 2021 16:58:05 GMT
- Title: Average-Reward Learning and Planning with Options
- Authors: Yi Wan, Abhishek Naik, Richard S. Sutton
- Abstract summary: We extend the options framework for temporal abstraction in reinforcement learning from discounted Markov decision processes (MDPs) to average-reward MDPs.
Our contributions include general convergent off-policy inter-option learning algorithms, intra-option algorithms for learning values and models, as well as sample-based planning variants of our learning algorithms.
- Score: 9.258958295945467
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We extend the options framework for temporal abstraction in reinforcement
learning from discounted Markov decision processes (MDPs) to average-reward
MDPs. Our contributions include general convergent off-policy inter-option
learning algorithms, intra-option algorithms for learning values and models, as
well as sample-based planning variants of our learning algorithms. Our
algorithms and convergence proofs extend those recently developed by Wan, Naik,
and Sutton. We also extend the notion of option-interrupting behavior from the
discounted to the average-reward formulation. We show the efficacy of the
proposed algorithms with experiments on a continuing version of the Four-Room
domain.
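To make the inter-option learning idea concrete, below is a minimal tabular sketch of what an SMDP-style differential Q-learning update with options could look like: the reward accumulated while an option runs is compared against a learned reward-rate estimate multiplied by the option's duration, and the reward-rate estimate is adjusted by the same TD error, in the spirit of the Differential Q-learning scheme of Wan, Naik, and Sutton that the paper builds on. The function name, the helper options_at, and the step sizes are illustrative assumptions, not the paper's pseudocode.

```python
# Minimal sketch of an inter-option (SMDP-style) differential Q-learning
# update for the average-reward setting.  Illustrative only; the paper's
# algorithms and convergence conditions are the authoritative reference.
import numpy as np

def options_at(state):
    """Hypothetical helper returning the options available in `state`."""
    raise NotImplementedError

def inter_option_update(Q, r_bar, s, o, rewards, s_next, alpha=0.1, eta=0.1):
    """One update after option `o`, started in `s`, terminates in `s_next`.

    Q       : dict mapping (state, option) -> value estimate
    r_bar   : scalar estimate of the long-run reward rate
    rewards : list of one-step rewards observed while `o` executed
    Returns the updated reward-rate estimate; Q is updated in place.
    """
    tau = len(rewards)                        # option duration (steps)
    g = float(np.sum(rewards))                # cumulative, undiscounted reward
    next_best = max(Q[(s_next, o2)] for o2 in options_at(s_next))
    # Differential TD error: rewards are measured relative to the reward rate.
    delta = g - tau * r_bar + next_best - Q[(s, o)]
    Q[(s, o)] += alpha * delta
    r_bar += eta * alpha * delta              # reward-rate estimate tracks delta
    return r_bar
```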
Related papers
- Horseshoe-type Priors for Independent Component Estimation [0.4987670632802289]
Independent Component Estimation (ICE) has many applications in modern day machine learning.
Horseshoe-type priors are used to provide scalable algorithms.
We show how to implement conditional posteriors and envelope-based methods for optimization.
arXiv Detail & Related papers (2024-06-24T18:18:58Z)
- Distributional Bellman Operators over Mean Embeddings [37.5480897544168]
We propose a novel framework for distributional reinforcement learning, based on learning finite-dimensional mean embeddings of return distributions.
We derive several new algorithms for dynamic programming and temporal-difference learning based on this framework.
arXiv Detail & Related papers (2023-12-09T11:36:14Z)
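One way to read the mean-embedding framework in the entry above (a sketch under assumed notation, not the paper's exact construction): with a feature map $\varphi$ and return distribution $\eta_\pi(s,a)$, the finite-dimensional object being learned is the mean embedding

```latex
% Mean embedding of the return distribution (assumed notation):
m_\pi(s,a) \;=\; \mathbb{E}_{G \sim \eta_\pi(s,a)}\!\left[\varphi(G)\right],
% and the distributional Bellman equation  G \overset{d}{=} R + \gamma G'  gives
m_\pi(s,a) \;=\; \mathbb{E}\!\left[\varphi\!\left(R + \gamma G'\right)\right],
\qquad G' \sim \eta_\pi(S', A'),
```

so dynamic-programming and temporal-difference updates can operate on the finite-dimensional vectors $m_\pi$ rather than on full return distributions.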
- Provably Efficient UCB-type Algorithms For Learning Predictive State Representations [55.00359893021461]
The sequential decision-making problem is statistically learnable if it admits a low-rank structure modeled by predictive state representations (PSRs).
This paper proposes the first known UCB-type approach for PSRs, featuring a novel bonus term that upper bounds the total variation distance between the estimated and true models.
In contrast to existing approaches for PSRs, our UCB-type algorithms enjoy computational tractability, last-iterate guaranteed near-optimal policy, and guaranteed model accuracy.
arXiv Detail & Related papers (2023-07-01T18:35:21Z)
- Stochastic Unrolled Federated Learning [85.6993263983062]
We introduce UnRolled Federated learning (SURF), a method that expands algorithm unrolling to federated learning.
Our proposed method tackles two challenges of this expansion, namely the need to feed whole datasets to the unrolled optimizers and the decentralized nature of federated learning.
arXiv Detail & Related papers (2023-05-24T17:26:22Z)
- Sharp Variance-Dependent Bounds in Reinforcement Learning: Best of Both Worlds in Stochastic and Deterministic Environments [48.96971760679639]
We study variance-dependent regret bounds for Markov decision processes (MDPs).
We propose two new environment norms to characterize the fine-grained variance properties of the environment.
For model-based methods, we design a variant of the MVP algorithm.
In particular, this bound is simultaneously minimax optimal for both stochastic and deterministic MDPs.
arXiv Detail & Related papers (2023-01-31T06:54:06Z)
- Improved Algorithms for Neural Active Learning [74.89097665112621]
We improve the theoretical and empirical performance of neural-network(NN)-based active learning algorithms for the non-parametric streaming setting.
We introduce two regret metrics, based on minimizing the population loss, that are more suitable for active learning than the metric used in state-of-the-art (SOTA) related work.
arXiv Detail & Related papers (2022-10-02T05:03:38Z)
- Multi-Objective Policy Gradients with Topological Constraints [108.10241442630289]
We present a new policy gradient algorithm for TMDPs, obtained as a simple extension of the proximal policy optimization (PPO) algorithm.
We demonstrate this on a real-world multiple-objective navigation problem with an arbitrary ordering of objectives both in simulation and on a real robot.
arXiv Detail & Related papers (2022-09-15T07:22:58Z)
- Safe Learning and Optimization Techniques: Towards a Survey of the State of the Art [3.6954802719347413]
Safe learning and optimization deals with learning and optimization problems that avoid, as much as possible, the evaluation of non-safe input points.
A comprehensive survey of safe reinforcement learning algorithms was published in 2015, but related works in active learning and in optimization were not considered.
This paper reviews those algorithms from a number of domains including reinforcement learning, Gaussian process regression and classification, evolutionary algorithms, and active learning.
arXiv Detail & Related papers (2021-01-23T13:58:09Z)
- Logistic Q-Learning [87.00813469969167]
We propose a new reinforcement learning algorithm derived from a regularized linear-programming formulation of optimal control in MDPs.
The main feature of our algorithm is a convex loss function for policy evaluation that serves as a theoretically sound alternative to the widely used squared Bellman error.
arXiv Detail & Related papers (2020-10-21T17:14:31Z)
- Adaptive Approximate Policy Iteration [22.915651391812187]
We present a learning scheme which enjoys a $\tilde{O}(T^{2/3})$ regret bound for undiscounted, continuing learning in uniformly ergodic MDPs.
This is an improvement over the best existing bound of $\tilde{O}(T^{3/4})$ for the average-reward case with function approximation.
arXiv Detail & Related papers (2020-02-08T02:27:03Z)
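As a quick note on the regret bounds in the entry above (simple arithmetic, not an additional claim from that paper): dividing each bound by the horizon $T$ gives the implied per-step regret,

```latex
\frac{\tilde{O}(T^{2/3})}{T} = \tilde{O}\!\left(T^{-1/3}\right)
\qquad\text{vs.}\qquad
\frac{\tilde{O}(T^{3/4})}{T} = \tilde{O}\!\left(T^{-1/4}\right),
```

so the $\tilde{O}(T^{2/3})$ guarantee drives the average regret to zero at a faster rate.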
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.