Control with adaptive Q-learning
- URL: http://arxiv.org/abs/2011.02141v1
- Date: Tue, 3 Nov 2020 18:58:55 GMT
- Title: Control with adaptive Q-learning
- Authors: João Pedro Araújo, Mário A. T. Figueiredo, and Miguel Ayala Botto
- Abstract summary: This paper evaluates two algorithms for efficient model-free episodic reinforcement learning (RL).
AQL adaptively partitions the state-action space of a Markov decision process (MDP), while learning the control policy.
SPAQL learns time-invariant policies, where the mapping from states to actions does not depend explicitly on the time step.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper evaluates adaptive Q-learning (AQL) and single-partition adaptive
Q-learning (SPAQL), two algorithms for efficient model-free episodic
reinforcement learning (RL), in two classical control problems (Pendulum and
Cartpole). AQL adaptively partitions the state-action space of a Markov
decision process (MDP), while learning the control policy, i.e., the mapping
from states to actions. The main difference between AQL and SPAQL is that the
latter learns time-invariant policies, where the mapping from states to actions
does not depend explicitly on the time step. This paper also proposes the SPAQL
with terminal state (SPAQL-TS), an improved version of SPAQL tailored for the
design of regulators for control problems. The time-invariant policies are
shown to result in better performance than the time-variant ones in both
problems studied. These algorithms are particularly well suited to RL problems where
the action space is finite, as is the case with the Cartpole problem. SPAQL-TS
solves the OpenAI Gym Cartpole problem, while also displaying a higher sample
efficiency than trust region policy optimization (TRPO), a standard RL
algorithm for solving control tasks. Moreover, the policies learned by SPAQL
are interpretable, while TRPO policies are typically encoded as neural
networks, and therefore hard to interpret. The major advantages of SPAQL are
that it yields interpretable policies while remaining sample-efficient.
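The abstract describes the partitioning mechanism only at a high level. The sketch below illustrates the general idea of Q-learning over an adaptively refined partition of a continuous state-action space with a time-invariant (time-step-independent) policy. It is a hypothetical simplification assuming a one-dimensional state and action, both normalized to [0, 1]; the class names, splitting rule, step size, and exploration bonus are illustrative choices, not the exact rules of AQL or SPAQL.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    """Axis-aligned cell of the joint state-action space [0, 1] x [0, 1]."""
    s_lo: float
    s_hi: float
    a_lo: float
    a_hi: float
    q: float = 0.0      # Q-value estimate shared by the whole cell
    visits: int = 0     # number of updates this cell has received

    def contains_state(self, s: float) -> bool:
        return self.s_lo <= s <= self.s_hi

    def diameter(self) -> float:
        return max(self.s_hi - self.s_lo, self.a_hi - self.a_lo)


class AdaptivePartitionQLearner:
    """Time-invariant Q-learning over a single adaptive partition (a SPAQL-flavoured sketch)."""

    def __init__(self, split_visits: int = 16, gamma: float = 0.99, bonus: float = 1.0):
        self.cells = [Cell(0.0, 1.0, 0.0, 1.0)]   # start with one coarse cell
        self.split_visits = split_visits
        self.gamma = gamma
        self.bonus = bonus

    def _relevant(self, s: float) -> list:
        # Cells whose state interval contains the current state.
        return [c for c in self.cells if c.contains_state(s)]

    def select_action(self, s: float):
        # Greedy with respect to an upper-confidence estimate; return the
        # centre of the chosen cell's action interval, plus the cell itself.
        def ucb(c: Cell) -> float:
            return c.q + self.bonus / (1 + c.visits) ** 0.5
        best = max(self._relevant(s), key=ucb)
        return 0.5 * (best.a_lo + best.a_hi), best

    def update(self, cell: Cell, reward: float, next_state: float, done: bool) -> None:
        # One-step Q-learning target, bootstrapping from the best cell at s'.
        target = reward
        if not done:
            target += self.gamma * max(c.q for c in self._relevant(next_state))
        cell.visits += 1
        alpha = 1.0 / cell.visits                 # simple decaying step size
        cell.q += alpha * (target - cell.q)
        self._maybe_split(cell)

    def _maybe_split(self, cell: Cell) -> None:
        # Refine the partition where experience accumulates: a frequently
        # visited cell is replaced by four half-sized children that inherit
        # its Q estimate (smaller cells require proportionally more visits).
        if cell.visits < self.split_visits / cell.diameter():
            return
        self.cells.remove(cell)
        s_mid = 0.5 * (cell.s_lo + cell.s_hi)
        a_mid = 0.5 * (cell.a_lo + cell.a_hi)
        for s_lo, s_hi in ((cell.s_lo, s_mid), (s_mid, cell.s_hi)):
            for a_lo, a_hi in ((cell.a_lo, a_mid), (a_mid, cell.a_hi)):
                self.cells.append(Cell(s_lo, s_hi, a_lo, a_hi, q=cell.q))
```

A rollout would call select_action to obtain both an action and the cell it came from, step the environment, and feed the reward and next state back into update. The SPAQL-TS variant mentioned above additionally exploits knowledge of a terminal state when designing regulators; that refinement is not reflected in this sketch.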
Related papers
- AlignIQL: Policy Alignment in Implicit Q-Learning through Constrained Optimization [9.050431569438636]
Implicit Q-learning serves as a strong baseline for offline RL.
We introduce a different way to solve the implicit policy-finding problem (IPF) by formulating it as a constrained optimization problem.
Compared with IQL and IDQL, our method keeps the simplicity of IQL while solving the implicit policy-finding problem.
arXiv Detail & Related papers (2024-05-28T14:01:03Z) - Two-Stage ML-Guided Decision Rules for Sequential Decision Making under Uncertainty [55.06411438416805]
Sequential Decision Making under Uncertainty (SDMU) is ubiquitous in many domains such as energy, finance, and supply chains.
Some SDMU problems are naturally modeled as Multistage Problems (MSPs), but the resulting optimizations are notoriously challenging from a computational standpoint.
This paper introduces a novel approach Two-Stage General Decision Rules (TS-GDR) to generalize the policy space beyond linear functions.
The effectiveness of TS-GDR is demonstrated through an instantiation using Deep Recurrent Neural Networks named Two-Stage Deep Decision Rules (TS-LDR).
arXiv Detail & Related papers (2024-05-23T18:19:47Z) - Projected Off-Policy Q-Learning (POP-QL) for Stabilizing Offline
Reinforcement Learning [57.83919813698673]
Projected Off-Policy Q-Learning (POP-QL) is a novel actor-critic algorithm that simultaneously reweights off-policy samples and constrains the policy to prevent divergence and reduce value-approximation error.
In our experiments, POP-QL not only shows competitive performance on standard benchmarks, but also out-performs competing methods in tasks where the data-collection policy is significantly sub-optimal.
arXiv Detail & Related papers (2023-11-25T00:30:58Z) - IDQL: Implicit Q-Learning as an Actor-Critic Method with Diffusion
Policies [72.4573167739712]
Implicit Q-learning (IQL) trains a Q-function using only dataset actions through a modified Bellman backup (see the sketch after this list).
It is unclear which policy actually attains the values represented by this trained Q-function.
We introduce Implicit Diffusion Q-learning (IDQL), combining our general IQL critic with the policy extraction method.
arXiv Detail & Related papers (2023-04-20T18:04:09Z) - Offline RL with No OOD Actions: In-Sample Learning via Implicit Value
Regularization [90.9780151608281]
In-sample learning methods such as IQL improve the policy by quantile regression using only data samples.
We make a key finding that the in-sample learning paradigm arises under the Implicit Value Regularization (IVR) framework.
We propose two practical algorithms, Sparse $Q$-learning (SQL) and Exponential $Q$-learning (EQL), which adopt the same value regularization used in existing works.
arXiv Detail & Related papers (2023-03-28T08:30:01Z) - Distributed-Training-and-Execution Multi-Agent Reinforcement Learning
for Power Control in HetNet [48.96004919910818]
We propose a multi-agent deep reinforcement learning (MADRL) based power control scheme for the HetNet.
To promote cooperation among agents, we develop a penalty-based Q learning (PQL) algorithm for MADRL systems.
In this way, an agent's policy can be learned by other agents more easily, resulting in a more efficient collaboration process.
arXiv Detail & Related papers (2022-12-15T17:01:56Z) - Processing Network Controls via Deep Reinforcement Learning [0.0]
This dissertation is concerned with the theoretical justification and practical application of advanced policy gradient (APG) algorithms.
Policy improvement bounds play a crucial role in the theoretical justification of the APG algorithms.
arXiv Detail & Related papers (2022-05-01T04:34:21Z) - Single-partition adaptive Q-learning [0.0]
Single-partition adaptive Q-learning (SPAQL) is an algorithm for model-free episodic reinforcement learning.
Tests on episodes with a large number of time steps show that SPAQL scales without difficulty, unlike adaptive Q-learning (AQL).
We claim that SPAQL may have higher sample efficiency than AQL, making it a relevant contribution to the field of efficient model-free RL methods.
arXiv Detail & Related papers (2020-07-14T00:03:25Z) - Conservative Q-Learning for Offline Reinforcement Learning [106.05582605650932]
We show that conservative Q-learning (CQL) substantially outperforms existing offline RL methods, often learning policies that attain 2-5 times higher final return.
We theoretically show that CQL produces a lower bound on the value of the current policy and that it can be incorporated into a policy learning procedure with theoretical improvement guarantees.
arXiv Detail & Related papers (2020-06-08T17:53:42Z)
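Several entries above (IDQL and the in-sample learning paper) refer to critics trained with a modified Bellman backup that only queries dataset actions. Below is a minimal sketch of one common formulation of that idea: an expectile-style asymmetric regression for the value function together with in-sample TD targets for the Q-function. The function names and the choice of tau are illustrative assumptions, not taken from any of the listed papers.

```python
import numpy as np


def expectile_loss(diff: np.ndarray, tau: float = 0.7) -> float:
    """Asymmetric squared loss on diff = Q(s, a) - V(s).

    With tau > 0.5, positive errors are weighted more heavily, so V(s) is
    pushed toward the larger Q-values observed in the dataset.
    """
    weight = np.where(diff > 0.0, tau, 1.0 - tau)
    return float(np.mean(weight * diff ** 2))


def in_sample_td_targets(rewards: np.ndarray, next_v: np.ndarray,
                         dones: np.ndarray, gamma: float = 0.99) -> np.ndarray:
    """TD targets that never query actions outside the dataset:
    the bootstrap term uses V(s') rather than a max over Q(s', a')."""
    return rewards + gamma * (1.0 - dones) * next_v
```

Because the bootstrap term uses a value function fitted only from dataset actions, the targets never evaluate out-of-distribution actions, which is the property these entries emphasize.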
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.