Periodic agent-state based Q-learning for POMDPs
- URL: http://arxiv.org/abs/2407.06121v3
- Date: Tue, 29 Oct 2024 03:56:17 GMT
- Title: Periodic agent-state based Q-learning for POMDPs
- Authors: Amit Sinha, Matthieu Geist, Aditya Mahajan
- Abstract summary: A widely used alternative is to use an agent state, which is a model-free, recursively updateable function of the observation history.
We propose PASQL (periodic agent-state based Q-learning), which is a variant of agent-state-based Q-learning that learns periodic policies.
By combining ideas from periodic Markov chains and stochastic approximation, we rigorously establish that PASQL converges to a cyclic limit and characterize the approximation error of the converged periodic policy.
- Score: 23.296159073116264
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The standard approach for Partially Observable Markov Decision Processes (POMDPs) is to convert them to a fully observed belief-state MDP. However, the belief state depends on the system model and is therefore not viable in reinforcement learning (RL) settings. A widely used alternative is to use an agent state, which is a model-free, recursively updateable function of the observation history. Examples include frame stacking and recurrent neural networks. Since the agent state is model-free, it is used to adapt standard RL algorithms to POMDPs. However, standard RL algorithms like Q-learning learn a stationary policy. Our main thesis that we illustrate via examples is that because the agent state does not satisfy the Markov property, non-stationary agent-state based policies can outperform stationary ones. To leverage this feature, we propose PASQL (periodic agent-state based Q-learning), which is a variant of agent-state-based Q-learning that learns periodic policies. By combining ideas from periodic Markov chains and stochastic approximation, we rigorously establish that PASQL converges to a cyclic limit and characterize the approximation error of the converged periodic policy. Finally, we present a numerical experiment to highlight the salient features of PASQL and demonstrate the benefit of learning periodic policies over stationary policies.
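For intuition, the learning rule is ordinary tabular Q-learning run over a bank of per-phase Q-tables that bootstrap cyclically into one another. A minimal sketch consistent with the abstract's description, assuming a toy environment with an env.step(a) -> (obs, reward, done) interface and a user-supplied agent-state update f; names are illustrative, not the authors' code:
```python
import numpy as np

def pasql(env, f, n_agent_states, n_actions, period=2,
          alpha=0.1, gamma=0.99, eps=0.1, steps=100_000, seed=0):
    """Tabular sketch of periodic agent-state based Q-learning: one
    Q-table per phase of the period; the update at phase l bootstraps
    from the table at phase (l + 1) % period, so the greedy policy is
    periodic rather than stationary."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((period, n_agent_states, n_actions))
    z = 0                       # agent state, e.g. index of a frame stack
    obs = env.reset()
    for t in range(steps):
        l = t % period
        if rng.random() < eps:
            a = int(rng.integers(n_actions))       # epsilon-greedy exploration
        else:
            a = int(np.argmax(Q[l, z]))
        obs, r, done = env.step(a)                 # assumed env interface
        z_next = f(z, a, obs)                      # model-free agent-state update
        target = r if done else r + gamma * Q[(l + 1) % period, z_next].max()
        Q[l, z, a] += alpha * (target - Q[l, z, a])
        if done:
            z, obs = 0, env.reset()
        else:
            z = z_next
    return Q
```
With period=1 this reduces to standard agent-state based Q-learning; the benefit of period > 1 is what the paper's numerical experiment isolates.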
Related papers
- Agent-state based policies in POMDPs: Beyond belief-state MDPs [1.918334858770111]
We present a unified treatment of some approaches to learning in POMDPs.
We highlight the different classes of agent-state based policies and the various approaches that have been proposed in the literature to find good policies within each class.
We then present how ideas from the approximate information state approach have been used to improve Q-learning and actor-critic algorithms for learning in POMDPs.
arXiv Detail & Related papers (2024-09-24T03:32:10Z)
- SOAP-RL: Sequential Option Advantage Propagation for Reinforcement Learning in POMDP Environments [18.081732498034047]
This work compares ways of extending Reinforcement Learning algorithms to Partially Observed Markov Decision Processes (POMDPs) with options.
Two algorithms, PPOEM and SOAP, are proposed and studied in depth to address this problem.
arXiv Detail & Related papers (2024-07-26T17:59:55Z)
- Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, stochastic (hyper)policies are learned only to deploy their deterministic version.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
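As a rough illustration of the learn-stochastic, deploy-deterministic pattern the paper studies (a generic REINFORCE sketch with a linear-Gaussian policy, not the paper's algorithm; sigma is the exploration level whose tuning is the paper's subject):
```python
import numpy as np

def gaussian_pg_step(theta, states, actions, returns, sigma=0.5, lr=1e-2):
    """One REINFORCE update of a linear-Gaussian policy a ~ N(theta @ s,
    sigma^2 I). sigma controls exploration during learning only."""
    grad = np.zeros_like(theta)
    for s, a, G in zip(states, actions, returns):
        mu = theta @ s
        grad += G * np.outer((a - mu) / sigma**2, s)   # grad of log-density
    return theta + lr * grad / len(states)

def deploy(theta, s):
    return theta @ s   # act with the deterministic version of the policy
```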
arXiv Detail & Related papers (2024-05-03T16:45:15Z)
- Learning a Diffusion Model Policy from Rewards via Q-Score Matching [93.0191910132874]
We present a theoretical framework linking the structure of diffusion model policies to a learned Q-function.
We propose a new policy update method from this theory, which we denote Q-score matching.
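In spirit, Q-score matching regresses the diffusion policy's denoising score onto the action-gradient of a learned Q-function. A hedged PyTorch-style sketch; the network names and signatures are assumptions, not the paper's code:
```python
import torch

def q_score_matching_loss(score_net, q_net, states, actions, t):
    """Match the policy's score s_theta(a, s, t) to grad_a Q(s, a)."""
    actions = actions.detach().requires_grad_(True)
    q = q_net(states, actions).sum()
    dq_da = torch.autograd.grad(q, actions)[0]     # action-gradient of Q
    pred = score_net(actions, states, t)
    return ((pred - dq_da.detach()) ** 2).mean()
```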
arXiv Detail & Related papers (2023-12-18T23:31:01Z)
- When is Agnostic Reinforcement Learning Statistically Tractable? [76.1408672715773]
A new complexity measure, called the spanning capacity, depends solely on the set $\Pi$ and is independent of the MDP dynamics.
We show there exists a policy class $\Pi$ with a bounded spanning capacity that requires a superpolynomial number of samples to learn.
This reveals a surprising separation for learnability between generative access and online access models.
arXiv Detail & Related papers (2023-10-09T19:40:54Z)
- $K$-Nearest-Neighbor Resampling for Off-Policy Evaluation in Stochastic Control [0.6906005491572401]
We propose a novel $K$-nearest neighbor resampling procedure for estimating the performance of a policy from historical data.
Our analysis allows for the sampling of entire episodes, as is common practice in most applications.
Compared to other OPE methods, our algorithm does not require optimization, can be efficiently implemented via tree-based nearest neighbor search and parallelization, and does not explicitly assume a parametric model for the environment's dynamics.
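Roughly, the estimator rolls the target policy forward and, at each step, resamples the observed outcome of one of the $K$ historical transitions closest to the current state-action pair. A minimal NumPy sketch under a plain Euclidean metric (an illustrative choice, not necessarily the paper's exact procedure):
```python
import numpy as np

def knn_ope(S, A, R, S_next, policy, s0, K=5, horizon=50,
            n_episodes=200, gamma=1.0, seed=0):
    """Estimate the target policy's value by K-nearest-neighbor
    resampling of historical transitions (S, A, R, S_next)."""
    rng = np.random.default_rng(seed)
    returns = []
    for _ in range(n_episodes):
        s, G = np.asarray(s0, dtype=float), 0.0
        for t in range(horizon):
            a = policy(s)
            d = np.linalg.norm(S - s, axis=1) + np.linalg.norm(A - a, axis=1)
            i = rng.choice(np.argsort(d)[:K])   # sample one of K neighbors
            G += gamma**t * R[i]                # borrow its observed outcome
            s = S_next[i]
        returns.append(G)
    return float(np.mean(returns))
```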
arXiv Detail & Related papers (2023-06-07T23:55:12Z)
- The Wasserstein Believer: Learning Belief Updates for Partially Observable Environments through Reliable Latent Space Models [3.462371782084948]
We propose an RL algorithm that learns a latent model of the POMDP and an approximation of the belief update.
Our approach comes with theoretical guarantees on the quality of the approximation, ensuring that the beliefs we output allow for learning the optimal value function.
arXiv Detail & Related papers (2023-03-06T16:59:14Z)
- PAC Reinforcement Learning for Predictive State Representations [60.00237613646686]
We study online Reinforcement Learning (RL) in partially observable dynamical systems.
We focus on the Predictive State Representations (PSRs) model, which is an expressive model that captures other well-known models.
We develop a novel model-based algorithm for PSRs that can learn a near-optimal policy with sample complexity scaling polynomially in all relevant parameters.
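For orientation, a PSR represents state as a vector of predicted probabilities of "core tests" given the history; this is the standard definition, not anything specific to this paper:
```latex
% PSR prediction-vector state for core tests t_1, ..., t_d,
% where test t_i = a_1 o_1 \cdots a_k o_k:
q(h) = \big[\Pr(t_1 \mid h), \ldots, \Pr(t_d \mid h)\big],
\qquad
\Pr(t_i \mid h) = \Pr\big(o_1, \ldots, o_k \mid h;\ a_1, \ldots, a_k\big).
```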
arXiv Detail & Related papers (2022-07-12T17:57:17Z)
- Solving the non-preemptive two queue polling model with generally distributed service and switch-over durations and Poisson arrivals as a Semi-Markov Decision Process [0.0]
The polling system with switch-over durations is a useful model with several practical applications.
It is classified as a Discrete Event Dynamic System (DEDS) for which no single agreed-upon modelling approach exists.
This paper presents a Semi-Markov Decision Process (SMDP) formulation of the polling system so as to introduce additional modelling power.
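The formulation builds on the standard SMDP Bellman optimality equation, in which the discount is compounded over the random sojourn time $\tau$ between decision epochs (textbook form, stated here for orientation):
```latex
Q^*(s, a) \;=\; \mathbb{E}\!\left[\, R(s, a) \;+\; \gamma^{\tau}\,
\max_{a'} Q^*(s', a') \;\middle|\; s, a \right].
```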
arXiv Detail & Related papers (2021-12-13T11:40:55Z)
- Offline Reinforcement Learning with Implicit Q-Learning [85.62618088890787]
Current offline reinforcement learning methods need to query the value of unseen actions during training to improve the policy.
We propose an offline RL method that never needs to evaluate actions outside of the dataset.
This method enables the learned policy to improve substantially over the best behavior in the data through generalization.
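Concretely, implicit Q-learning fits the value function toward an upper expectile of dataset Q-values, which is what lets it avoid evaluating unseen actions. A minimal NumPy sketch of the expectile loss (the surrounding training loop is omitted):
```python
import numpy as np

def expectile_loss(q_sa, v_s, tau=0.7):
    """IQL-style asymmetric least squares: with tau > 0.5, V is pushed
    toward an upper expectile of Q over dataset actions only."""
    diff = q_sa - v_s
    weight = np.where(diff > 0, tau, 1.0 - tau)
    return float(np.mean(weight * diff**2))

# Q is then regressed onto r + gamma * V(s'), and the policy is extracted
# by advantage-weighted regression, all without querying unseen actions.
```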
arXiv Detail & Related papers (2021-10-12T17:05:05Z)
- MOPO: Model-based Offline Policy Optimization [183.6449600580806]
Offline reinforcement learning (RL) refers to the problem of learning policies entirely from a large batch of previously collected data.
We show that an existing model-based RL algorithm already produces significant gains in the offline setting.
We propose to modify the existing model-based RL methods by applying them with rewards artificially penalized by the uncertainty of the dynamics.
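A sketch of the penalized reward; the uncertainty here is ensemble disagreement on the predicted next state, a common heuristic instantiation rather than MOPO's exact penalty:
```python
import numpy as np

def penalized_reward(r_pred, ensemble_next_states, lam=1.0):
    """MOPO-style reward r_tilde = r_hat - lam * u(s, a), where the
    uncertainty u is taken as the largest per-dimension std of the
    ensemble's next-state predictions."""
    # ensemble_next_states: array of shape (n_models, batch, state_dim)
    u = ensemble_next_states.std(axis=0).max(axis=-1)   # (batch,)
    return r_pred - lam * u
```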
arXiv Detail & Related papers (2020-05-27T08:46:41Z)