Single-partition adaptive Q-learning
- URL: http://arxiv.org/abs/2007.06741v1
- Date: Tue, 14 Jul 2020 00:03:25 GMT
- Title: Single-partition adaptive Q-learning
- Authors: João Pedro Araújo, Mário Figueiredo, Miguel Ayala Botto
- Abstract summary: Single-partition adaptive Q-learning (SPAQL) is an algorithm for model-free episodic reinforcement learning.
Tests on episodes with a large number of time steps show that SPAQL has no problems scaling, unlike adaptive Q-learning (AQL).
We claim that SPAQL may have a higher sample efficiency than AQL, thus being a relevant contribution to the field of efficient model-free RL methods.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper introduces single-partition adaptive Q-learning (SPAQL), an
algorithm for model-free episodic reinforcement learning (RL), which adaptively
partitions the state-action space of a Markov decision process (MDP), while
simultaneously learning a time-invariant policy (i.e., the mapping from states
to actions does not depend explicitly on the episode time step) for maximizing
the cumulative reward. The trade-off between exploration and exploitation is
handled by using a mixture of upper confidence bounds (UCB) and Boltzmann
exploration during training, with a temperature parameter that is automatically
tuned as training progresses. The algorithm is an improvement over adaptive
Q-learning (AQL). It converges faster to the optimal solution, while also using
fewer arms. Tests on episodes with a large number of time steps show that SPAQL
has no problems scaling, unlike AQL. Based on this empirical evidence, we claim
that SPAQL may have a higher sample efficiency than AQL, thus being a relevant
contribution to the field of efficient model-free RL methods.
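As a reading aid, the exploration rule described in the abstract (UCB bonuses mixed with Boltzmann sampling under a temperature that decreases as training progresses) can be pictured with a minimal sketch. This is not the authors' SPAQL implementation: the adaptive partition is collapsed into a single tabular Q-function, the way the UCB and Boltzmann terms are combined is only one plausible choice, and the names (`select_action`, `annealed_temperature`, the constant `c`) are illustrative assumptions.

```python
import numpy as np

def select_action(q_values, counts, total_visits, temperature, c=1.0, rng=None):
    """Hedged sketch of UCB + Boltzmann exploration (not the official SPAQL code).

    q_values:     estimated value of each candidate action in the current cell
    counts:       number of times each action has been tried
    total_visits: total number of visits to the current state (or cell)
    temperature:  Boltzmann temperature; assumed to be annealed by the caller
    """
    rng = np.random.default_rng() if rng is None else rng
    # UCB bonus: rarely tried actions receive an optimistic boost.
    ucb_bonus = c * np.sqrt(np.log(total_visits + 1) / (counts + 1))
    scores = q_values + ucb_bonus
    # Boltzmann (softmax) sampling over the UCB-adjusted scores.
    logits = scores / max(temperature, 1e-8)
    logits = logits - logits.max()               # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return rng.choice(len(q_values), p=probs)

def annealed_temperature(episode, t0=1.0, decay=0.99):
    # Stand-in schedule; SPAQL tunes the temperature automatically,
    # so this fixed geometric decay is only an illustrative assumption.
    return t0 * (decay ** episode)

# Example usage with 4 candidate actions in the current cell
q = np.array([0.2, 0.5, 0.1, 0.4])
n = np.array([3, 1, 0, 5])
a = select_action(q, n, total_visits=int(n.sum()),
                  temperature=annealed_temperature(episode=10))
```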
Related papers
- Online Statistical Inference for Time-varying Sample-averaged Q-learning [2.2374171443798034]
This paper introduces a time-varying batch-averaged Q-learning algorithm, termed sample-averaged Q-learning.
We develop a novel framework that provides insights into the normality of the sample-averaged algorithm under mild conditions.
Numerical experiments conducted on classic OpenAI Gym environments show that the time-varying sample-averaged Q-learning method consistently outperforms both single-sample and constant-batch Q-learning (a batch-averaged update is sketched after this list).
arXiv Detail & Related papers (2024-10-14T17:17:19Z)
- Diffusion Policies creating a Trust Region for Offline Reinforcement Learning [66.17291150498276]
We introduce a dual policy approach, Diffusion Trusted Q-Learning (DTQL), which comprises a diffusion policy for pure behavior cloning and a practical one-step policy.
DTQL eliminates the need for iterative denoising sampling during both training and inference, making it remarkably computationally efficient.
We show that DTQL not only outperforms other methods on the majority of the D4RL benchmark tasks, but also demonstrates efficiency in training and inference speeds.
arXiv Detail & Related papers (2024-05-30T05:04:33Z)
- Projected Off-Policy Q-Learning (POP-QL) for Stabilizing Offline Reinforcement Learning [57.83919813698673]
Projected Off-Policy Q-Learning (POP-QL) is a novel actor-critic algorithm that simultaneously reweights off-policy samples and constrains the policy to prevent divergence and reduce value-approximation error.
In our experiments, POP-QL not only shows competitive performance on standard benchmarks, but also outperforms competing methods in tasks where the data-collection policy is significantly sub-optimal.
arXiv Detail & Related papers (2023-11-25T00:30:58Z)
- IDQL: Implicit Q-Learning as an Actor-Critic Method with Diffusion Policies [72.4573167739712]
Implicit Q-learning (IQL) trains a Q-function using only dataset actions through a modified Bellman backup.
It is unclear which policy actually attains the values represented by this trained Q-function.
We introduce Implicit Diffusion Q-learning (IDQL), combining a general IQL critic with a diffusion-based policy extraction method.
arXiv Detail & Related papers (2023-04-20T18:04:09Z)
- QLABGrad: a Hyperparameter-Free and Convergence-Guaranteed Scheme for Deep Learning [6.555832619920502]
We propose a novel learning rate adaptation scheme called QLABGrad.
QLABGrad automatically determines the learning rate by optimizing the Quadratic Loss Approximation-Based (QLAB) function for a given gradient descent direction.
arXiv Detail & Related papers (2023-02-01T05:29:10Z)
- Beyond Exponentially Fast Mixing in Average-Reward Reinforcement Learning via Multi-Level Monte Carlo Actor-Critic [61.968469104271676]
We propose an RL methodology attuned to the mixing time by employing a multi-level Monte Carlo estimator for the critic, the actor, and the average reward embedded within an actor-critic (AC) algorithm.
We experimentally show that these alleviated restrictions on the technical conditions required for stability translate to superior performance in practice for RL problems with sparse rewards.
arXiv Detail & Related papers (2023-01-28T04:12:56Z)
- Tightening the Dependence on Horizon in the Sample Complexity of Q-Learning [59.71676469100807]
This work sharpens the sample complexity of synchronous Q-learning to an order of $\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^4\varepsilon^2}$ for any $0 < \varepsilon < 1$.
Our finding unveils the effectiveness of vanilla Q-learning, which matches that of speedy Q-learning without requiring extra computation and storage.
arXiv Detail & Related papers (2021-02-12T14:22:05Z)
- Control with adaptive Q-learning [0.0]
This paper evaluates two algorithms for efficient model-free episodic reinforcement learning (RL): adaptive Q-learning (AQL) and single-partition adaptive Q-learning (SPAQL).
AQL adaptively partitions the state-action space of a Markov decision process (MDP), while learning the control policy.
SPAQL learns time-invariant policies, where the mapping from states to actions does not depend explicitly on the time step.
arXiv Detail & Related papers (2020-11-03T18:58:55Z)
- Lookahead-Bounded Q-Learning [8.738692817482526]
We introduce the lookahead-bounded Q-learning (LBQL) algorithm, a new, provably convergent variant of Q-learning.
Our approach is particularly appealing in problems that require expensive simulations or real-world interactions.
arXiv Detail & Related papers (2020-06-28T19:50:55Z)
- Conservative Q-Learning for Offline Reinforcement Learning [106.05582605650932]
We show that CQL substantially outperforms existing offline RL methods, often learning policies that attain 2-5 times higher final return.
We theoretically show that CQL produces a lower bound on the value of the current policy and that it can be incorporated into a policy learning procedure with theoretical improvement guarantees (a sketch of the conservative penalty appears after this list).
arXiv Detail & Related papers (2020-06-08T17:53:42Z)
- Simultaneously Evolving Deep Reinforcement Learning Models using Multifactorial Optimization [18.703421169342796]
This work proposes a framework capable of simultaneously evolving several DQL models towards solving interrelated Reinforcement Learning tasks.
Thorough experiments are presented and discussed to assess the performance of the framework.
arXiv Detail & Related papers (2020-02-25T10:36:57Z)
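Referring back to the sample-averaged Q-learning entry above, the sketch below illustrates the general idea of averaging the TD target over a batch of sampled transitions before applying a tabular update. It is a generic, hedged illustration, not the estimator from the cited paper; the batch composition and the function name `batch_averaged_q_update` are assumptions.

```python
import numpy as np

def batch_averaged_q_update(Q, s, a, transitions, alpha=0.1, gamma=0.99):
    """Update Q[s, a] from a batch of sampled outcomes of taking action a in state s.

    Instead of the usual single-sample TD target, the target is averaged over the
    whole batch, which reduces the variance of the update.
    Generic illustration only -- not the estimator from the cited paper.
    """
    targets = [r + gamma * np.max(Q[s_next]) for (r, s_next) in transitions]
    target = np.mean(targets)                      # batch-averaged TD target
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

# Example: 3 states, 2 actions, a batch of 4 sampled (reward, next_state) outcomes
Q = np.zeros((3, 2))
batch = [(1.0, 2), (0.5, 1), (1.0, 2), (0.0, 0)]
Q = batch_averaged_q_update(Q, s=0, a=1, transitions=batch)
```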
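For the Conservative Q-Learning entry, the lower bound on policy value comes from a conservative regularizer added to the usual TD objective: Q-values are pushed down across all actions (via a log-sum-exp term) and pushed up on actions that actually appear in the dataset. The sketch below shows one common discrete-action form of this objective; it is a schematic rendering under assumptions, not the authors' released implementation.

```python
import torch

def cql_loss(q_net, target_q_net, batch, alpha=1.0, gamma=0.99):
    """Schematic CQL-style objective for a discrete-action Q-network (sketch only).

    Standard TD error plus a conservative penalty,
        logsumexp_a Q(s, a)  -  Q(s, a_data),
    which discourages overestimated values on out-of-dataset actions.
    """
    s, a, r, s_next, done = batch             # tensors from the offline dataset

    q_all = q_net(s)                           # shape: (batch, num_actions)
    q_data = q_all.gather(1, a.unsqueeze(1)).squeeze(1)

    with torch.no_grad():
        target = r + gamma * (1 - done) * target_q_net(s_next).max(dim=1).values

    td_loss = torch.nn.functional.mse_loss(q_data, target)

    # Conservative penalty: push down Q on all actions, push up on dataset actions.
    conservative = (torch.logsumexp(q_all, dim=1) - q_data).mean()

    return td_loss + alpha * conservative
```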