Single-partition adaptive Q-learning
- URL: http://arxiv.org/abs/2007.06741v1
- Date: Tue, 14 Jul 2020 00:03:25 GMT
- Title: Single-partition adaptive Q-learning
- Authors: João Pedro Araújo, Mário Figueiredo, Miguel Ayala Botto
- Abstract summary: Single-partition adaptive Q-learning (SPAQL) is an algorithm for model-free episodic reinforcement learning.
Tests on episodes with a large number of time steps show that SPAQL has no problems scaling, unlike adaptive Q-learning (AQL).
We claim that SPAQL may have a higher sample efficiency than AQL, thus being a relevant contribution to the field of efficient model-free RL methods.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper introduces single-partition adaptive Q-learning (SPAQL), an
algorithm for model-free episodic reinforcement learning (RL), which adaptively
partitions the state-action space of a Markov decision process (MDP), while
simultaneously learning a time-invariant policy (i.e., the mapping from states
to actions does not depend explicitly on the episode time step) for maximizing
the cumulative reward. The trade-off between exploration and exploitation is
handled by using a mixture of upper confidence bounds (UCB) and Boltzmann
exploration during training, with a temperature parameter that is automatically
tuned as training progresses. The algorithm is an improvement over adaptive
Q-learning (AQL). It converges faster to the optimal solution, while also using
fewer arms. Tests on episodes with a large number of time steps show that SPAQL
has no problems scaling, unlike AQL. Based on this empirical evidence, we claim
that SPAQL may have a higher sample efficiency than AQL, thus being a relevant
contribution to the field of efficient model-free RL methods.
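As a reading aid, the exploration rule described in the abstract (UCB bonuses mixed with Boltzmann sampling under a temperature that decreases as training progresses) can be pictured with a minimal sketch. This is not the authors' SPAQL implementation: the adaptive partition is collapsed into a single tabular Q-function, the way the UCB and Boltzmann terms are combined is only one plausible choice, and the names (`select_action`, `annealed_temperature`, the constant `c`) are illustrative assumptions.

```python
import numpy as np

def select_action(q_values, counts, total_visits, temperature, c=1.0, rng=None):
    """Hedged sketch of UCB + Boltzmann exploration (not the official SPAQL code).

    q_values:     estimated value of each candidate action in the current cell
    counts:       number of times each action has been tried
    total_visits: total number of visits to the current state (or cell)
    temperature:  Boltzmann temperature; assumed to be annealed by the caller
    """
    rng = np.random.default_rng() if rng is None else rng
    # UCB bonus: rarely tried actions receive an optimistic boost.
    ucb_bonus = c * np.sqrt(np.log(total_visits + 1) / (counts + 1))
    scores = q_values + ucb_bonus
    # Boltzmann (softmax) sampling over the UCB-adjusted scores.
    logits = scores / max(temperature, 1e-8)
    logits = logits - logits.max()               # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return rng.choice(len(q_values), p=probs)

def annealed_temperature(episode, t0=1.0, decay=0.99):
    # Stand-in schedule; SPAQL tunes the temperature automatically,
    # so this fixed geometric decay is only an illustrative assumption.
    return t0 * (decay ** episode)

# Example usage with 4 candidate actions in the current cell
q = np.array([0.2, 0.5, 0.1, 0.4])
n = np.array([3, 1, 0, 5])
a = select_action(q, n, total_visits=int(n.sum()),
                  temperature=annealed_temperature(episode=10))
```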
Related papers
- Online Statistical Inference for Time-varying Sample-averaged Q-learning [2.2374171443798034]
This paper introduces a time-varying batch-averaged Q-learning algorithm, termed sample-averaged Q-learning.
We develop a novel framework that provides insights into the normality of the sample-averaged algorithm under mild conditions.
Numerical experiments conducted on classic OpenAI Gym environments show that the time-varying sample-averaged Q-learning method consistently outperforms both single-sample and constant-batch Q-learning (a batch-averaged update is sketched after this list).
arXiv Detail & Related papers (2024-10-14T17:17:19Z)
- Diffusion Policies creating a Trust Region for Offline Reinforcement Learning [66.17291150498276]
We introduce a dual policy approach, Diffusion Trusted Q-Learning (DTQL), which comprises a diffusion policy for pure behavior cloning and a practical one-step policy.
DTQL eliminates the need for iterative denoising sampling during both training and inference, making it remarkably computationally efficient.
We show that DTQL not only outperforms other methods on the majority of the D4RL benchmark tasks, but also demonstrates efficiency in training and inference speeds.
arXiv Detail & Related papers (2024-05-30T05:04:33Z)
- Projected Off-Policy Q-Learning (POP-QL) for Stabilizing Offline Reinforcement Learning [57.83919813698673]
Projected Off-Policy Q-Learning (POP-QL) is a novel actor-critic algorithm that simultaneously reweights off-policy samples and constrains the policy to prevent divergence and reduce value-approximation error.
In our experiments, POP-QL not only shows competitive performance on standard benchmarks, but also outperforms competing methods in tasks where the data-collection policy is significantly sub-optimal.
arXiv Detail & Related papers (2023-11-25T00:30:58Z)
- IDQL: Implicit Q-Learning as an Actor-Critic Method with Diffusion Policies [72.4573167739712]
Implicit Q-learning (IQL) trains a Q-function using only dataset actions through a modified Bellman backup.
It is unclear which policy actually attains the values represented by this trained Q-function.
We introduce Implicit Diffusion Q-learning (IDQL), combining a general IQL critic with a diffusion-based policy extraction method.
arXiv Detail & Related papers (2023-04-20T18:04:09Z)
- QLABGrad: a Hyperparameter-Free and Convergence-Guaranteed Scheme for Deep Learning [6.555832619920502]
We propose a novel learning rate adaptation scheme called QLABGrad.
QLABGrad automatically determines the learning rate by optimizing the Quadratic Loss Approximation-Based (QLAB) function for a given gradient descent direction.
arXiv Detail & Related papers (2023-02-01T05:29:10Z)
- Beyond Exponentially Fast Mixing in Average-Reward Reinforcement Learning via Multi-Level Monte Carlo Actor-Critic [61.968469104271676]
We propose an RL methodology attuned to the mixing time by employing a multi-level Monte Carlo estimator for the critic, the actor, and the average reward embedded within an actor-critic (AC) algorithm.
We experimentally show that these alleviated restrictions on the technical conditions required for stability translate to superior performance in practice for RL problems with sparse rewards.
arXiv Detail & Related papers (2023-01-28T04:12:56Z)
- Tightening the Dependence on Horizon in the Sample Complexity of Q-Learning [59.71676469100807]
This work sharpens the sample complexity of synchronous Q-learning to an order of $\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^4\varepsilon^2}$ for any $0 < \varepsilon < 1$.
Our finding unveils the effectiveness of vanilla Q-learning, which matches that of speedy Q-learning without requiring extra computation and storage.
arXiv Detail & Related papers (2021-02-12T14:22:05Z)
- Control with adaptive Q-learning [0.0]
This paper evaluates two algorithms for efficient model-free episodic reinforcement learning (RL): adaptive Q-learning (AQL) and single-partition adaptive Q-learning (SPAQL).
AQL adaptively partitions the state-action space of a Markov decision process (MDP), while learning the control policy.
SPAQL learns time-invariant policies, where the mapping from states to actions does not depend explicitly on the time step.
arXiv Detail & Related papers (2020-11-03T18:58:55Z)
- Lookahead-Bounded Q-Learning [8.738692817482526]
We introduce the lookahead-bounded Q-learning (LBQL) algorithm, a new, provably convergent variant of Q-learning.
Our approach is particularly appealing in problems that require expensive simulations or real-world interactions.
arXiv Detail & Related papers (2020-06-28T19:50:55Z)
- Conservative Q-Learning for Offline Reinforcement Learning [106.05582605650932]
We show that CQL substantially outperforms existing offline RL methods, often learning policies that attain 2-5 times higher final return.
We theoretically show that CQL produces a lower bound on the value of the current policy and that it can be incorporated into a policy learning procedure with theoretical improvement guarantees (a sketch of the conservative penalty appears after this list).
arXiv Detail & Related papers (2020-06-08T17:53:42Z)
- Simultaneously Evolving Deep Reinforcement Learning Models using Multifactorial Optimization [18.703421169342796]
This work proposes a framework capable of simultaneously evolving several DQL models towards solving interrelated Reinforcement Learning tasks.
Thorough experiments are presented and discussed to assess the performance of the framework.
arXiv Detail & Related papers (2020-02-25T10:36:57Z)
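Referring back to the sample-averaged Q-learning entry above, the sketch below illustrates the general idea of averaging the TD target over a batch of sampled transitions before applying a tabular update. It is a generic, hedged illustration, not the estimator from the cited paper; the batch composition and the function name `batch_averaged_q_update` are assumptions.

```python
import numpy as np

def batch_averaged_q_update(Q, s, a, transitions, alpha=0.1, gamma=0.99):
    """Update Q[s, a] from a batch of sampled outcomes of taking action a in state s.

    Instead of the usual single-sample TD target, the target is averaged over the
    whole batch, which reduces the variance of the update.
    Generic illustration only -- not the estimator from the cited paper.
    """
    targets = [r + gamma * np.max(Q[s_next]) for (r, s_next) in transitions]
    target = np.mean(targets)                      # batch-averaged TD target
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

# Example: 3 states, 2 actions, a batch of 4 sampled (reward, next_state) outcomes
Q = np.zeros((3, 2))
batch = [(1.0, 2), (0.5, 1), (1.0, 2), (0.0, 0)]
Q = batch_averaged_q_update(Q, s=0, a=1, transitions=batch)
```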
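For the Conservative Q-Learning entry, the lower bound on policy value comes from a conservative regularizer added to the usual TD objective: Q-values are pushed down across all actions (via a log-sum-exp term) and pushed up on actions that actually appear in the dataset. The sketch below shows one common discrete-action form of this objective; it is a schematic rendering under assumptions, not the authors' released implementation.

```python
import torch

def cql_loss(q_net, target_q_net, batch, alpha=1.0, gamma=0.99):
    """Schematic CQL-style objective for a discrete-action Q-network (sketch only).

    Standard TD error plus a conservative penalty,
        logsumexp_a Q(s, a)  -  Q(s, a_data),
    which discourages overestimated values on out-of-dataset actions.
    """
    s, a, r, s_next, done = batch             # tensors from the offline dataset

    q_all = q_net(s)                           # shape: (batch, num_actions)
    q_data = q_all.gather(1, a.unsqueeze(1)).squeeze(1)

    with torch.no_grad():
        target = r + gamma * (1 - done) * target_q_net(s_next).max(dim=1).values

    td_loss = torch.nn.functional.mse_loss(q_data, target)

    # Conservative penalty: push down Q on all actions, push up on dataset actions.
    conservative = (torch.logsumexp(q_all, dim=1) - q_data).mean()

    return td_loss + alpha * conservative
```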