Agent-state based policies in POMDPs: Beyond belief-state MDPs
- URL: http://arxiv.org/abs/2409.15703v1
- Date: Tue, 24 Sep 2024 03:32:10 GMT
- Title: Agent-state based policies in POMDPs: Beyond belief-state MDPs
- Authors: Amit Sinha, Aditya Mahajan
- Abstract summary: We present a unified treatment of some approaches to learning in POMDPs.
We highlight the different classes of agent-state based policies and the various approaches that have been proposed in the literature to find good policies within each class.
We then present how ideas from the approximate information state approach have been used to improve Q-learning and actor-critic algorithms for learning in POMDPs.
- Score: 1.918334858770111
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The traditional approach to POMDPs is to convert them into fully observed MDPs by considering a belief state as an information state. However, a belief-state based approach requires perfect knowledge of the system dynamics and is therefore not applicable in the learning setting where the system model is unknown. Various approaches to circumvent this limitation have been proposed in the literature. We present a unified treatment of some of these approaches by viewing them as models where the agent maintains a local recursively updateable agent state and chooses actions based on the agent state. We highlight the different classes of agent-state based policies and the various approaches that have been proposed in the literature to find good policies within each class. These include the designer's approach to find optimal non-stationary agent-state based policies, policy search approaches to find locally optimal stationary agent-state based policies, and the approximate information state approach to find approximately optimal stationary agent-state based policies. We then present how ideas from the approximate information state approach have been used to improve Q-learning and actor-critic algorithms for learning in POMDPs.
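A minimal tabular sketch of the general idea is given below: the agent maintains a finite agent state z, updates it recursively from its own actions and observations, and runs ordinary Q-learning on (z, a) pairs, never forming a belief state. The update rule `sigma`, the environment stub, and all sizes are illustrative assumptions rather than the algorithms analysed in the paper.

```python
# Minimal sketch (illustrative assumptions, not the paper's algorithm):
# Q-learning on a recursively updated agent state instead of a belief state.
import numpy as np

n_agent_states, n_actions, n_obs = 8, 3, 5
rng = np.random.default_rng(0)
Q = np.zeros((n_agent_states, n_actions))

def sigma(z, a, y):
    """Recursive agent-state update z' = sigma(z, a, y).
    A simple hash into a finite set; a stand-in for e.g. an RNN state."""
    return (z * 7 + a * 3 + y) % n_agent_states

def env_step(a):
    """Placeholder POMDP step: returns (observation, reward)."""
    return int(rng.integers(n_obs)), float(rng.normal())

alpha, gamma, eps = 0.1, 0.95, 0.1
z = 0
for t in range(10_000):
    a = int(rng.integers(n_actions)) if rng.random() < eps else int(Q[z].argmax())
    y, r = env_step(a)
    z_next = sigma(z, a, y)                      # agent state, not a belief state
    Q[z, a] += alpha * (r + gamma * Q[z_next].max() - Q[z, a])
    z = z_next
```

Replacing `sigma` with a learned recurrent update is one way to connect this sketch to the approximate information state ideas surveyed in the paper.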
Related papers
- Improving Controller Generalization with Dimensionless Markov Decision Processes [6.047438841182958]
We propose a model-based approach to increase generalization in which both the world model and the policy are trained in a dimensionless state-action space.
We demonstrate the applicability of our method on simulated actuated pendulum and cartpole systems, where policies trained on a single environment are robust to shifts in the distribution of the context.
arXiv Detail & Related papers (2025-04-14T09:08:53Z) - Periodic agent-state based Q-learning for POMDPs [23.296159073116264]
A widely used alternative is to use an agent state, which is a model-free, recursively updateable function of the observation history.
We propose PASQL (periodic agent-state based Q-learning), which is a variant of agent-state-based Q-learning that learns periodic policies.
By combining ideas from periodic Markov chains and stochastic approximation, we rigorously establish that PASQL converges to a cyclic limit and characterize the approximation error of the converged periodic policy.
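As a toy illustration of the periodic-policy idea (not PASQL itself), the sketch below keeps one Q-table per phase of a period L, so the greedy policy may vary cyclically with time; the update rule `sigma`, the environment stub, and all sizes are assumptions.

```python
# Illustrative sketch: agent-state Q-learning with one Q-table per phase
# of a period L, so the learned policy is periodic in time.
import numpy as np

L, n_agent_states, n_actions, n_obs = 3, 8, 2, 4
rng = np.random.default_rng(1)
Q = np.zeros((L, n_agent_states, n_actions))              # one table per phase

def sigma(z, a, y):
    return (z * 5 + a + y) % n_agent_states                # assumed agent-state update

def env_step(a):
    return int(rng.integers(n_obs)), float(rng.normal())   # placeholder POMDP

alpha, gamma, eps = 0.1, 0.95, 0.1
z = 0
for t in range(20_000):
    k, k_next = t % L, (t + 1) % L                         # current and next phase
    a = int(rng.integers(n_actions)) if rng.random() < eps else int(Q[k, z].argmax())
    y, r = env_step(a)
    z_next = sigma(z, a, y)
    Q[k, z, a] += alpha * (r + gamma * Q[k_next, z_next].max() - Q[k, z, a])
    z = z_next
```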
arXiv Detail & Related papers (2024-07-08T16:58:57Z) - Score-Aware Policy-Gradient Methods and Performance Guarantees using Local Lyapunov Conditions: Applications to Product-Form Stochastic Networks and Queueing Systems [1.747623282473278]
We introduce a policy-gradient method for model-based reinforcement learning (RL) that exploits a type of stationary distribution commonly obtained from Markov decision processes (MDPs) in stochastic networks.
Specifically, when the stationary distribution of the MDP is parametrized by policy parameters, we can improve existing policy-gradient methods for average-reward estimation.
arXiv Detail & Related papers (2023-12-05T14:44:58Z) - Local Optimization Achieves Global Optimality in Multi-Agent Reinforcement Learning [139.53668999720605]
We present a multi-agent PPO algorithm in which the local policy of each agent is updated similarly to vanilla PPO.
We prove that with standard regularity conditions on the Markov game and problem-dependent quantities, our algorithm converges to the globally optimal policy at a sublinear rate.
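For reference, the sketch below shows the standard vanilla-PPO clipped surrogate that such a local, per-agent update optimizes on that agent's own rollouts; the batch data and shapes are placeholders, and none of the paper's convergence analysis is reproduced.

```python
# Sketch of the vanilla-PPO clipped surrogate used for one agent's local update.
import torch

def local_ppo_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped surrogate: each agent maximizes this on its own rollouts,
    with the other agents' behaviour fixed in the sampled data."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

# Placeholder rollout batch for one agent (shapes and values are illustrative).
logp_old = torch.log(torch.full((64,), 0.25))
logp_new = (logp_old + 0.05 * torch.randn(64)).requires_grad_()
advantages = torch.randn(64)
loss = local_ppo_loss(logp_new, logp_old, advantages)
loss.backward()   # gradients flow into whatever parameterizes logp_new
```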
arXiv Detail & Related papers (2023-05-08T16:20:03Z) - Local Policy Improvement for Recommender Systems [8.617221361305901]
We show how to train a new policy given data collected from a previously-deployed policy.
We suggest an alternative approach of local policy improvement without off-policy correction.
This local policy improvement paradigm is ideal for recommender systems, as previous policies are typically of decent quality and policies are updated frequently.
arXiv Detail & Related papers (2022-12-22T00:47:40Z) - Plan Your Target and Learn Your Skills: Transferable State-Only Imitation Learning via Decoupled Policy Optimization [44.32548301913779]
We introduce Decoupled Policy Optimization (DePO), which explicitly decouples the policy as a high-level state planner and an inverse dynamics model.
With embedded decoupled policy gradient and generative adversarial training, DePO enables knowledge transfer to different action spaces or state transition dynamics.
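A rough sketch of the decoupling idea follows: a high-level planner proposes a target next state and an inverse-dynamics model maps (state, target) to an action. The network sizes and the names `planner` / `inverse_dynamics` are illustrative assumptions, not DePO's actual architecture or training losses.

```python
# Sketch of a decoupled policy: state planner + inverse dynamics model.
import torch
import torch.nn as nn

state_dim, action_dim = 4, 2

planner = nn.Sequential(             # h(s) -> proposed next state s'
    nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, state_dim))
inverse_dynamics = nn.Sequential(    # f(s, s') -> action a
    nn.Linear(2 * state_dim, 64), nn.Tanh(), nn.Linear(64, action_dim))

def decoupled_policy(state):
    target = planner(state)                                    # plan where to go
    return inverse_dynamics(torch.cat([state, target], -1))    # act to get there

action = decoupled_policy(torch.randn(state_dim))
```

In this decomposition only the inverse-dynamics part touches the action space, which is one way to read the transfer claim in the summary above.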
arXiv Detail & Related papers (2022-03-04T09:46:29Z) - Proximal Reinforcement Learning: Efficient Off-Policy Evaluation in Partially Observed Markov Decision Processes [65.91730154730905]
In applications of offline reinforcement learning to observational data, such as in healthcare or education, a general concern is that observed actions might be affected by unobserved factors.
Here we tackle this by considering off-policy evaluation in a partially observed Markov decision process (POMDP).
We extend the framework of proximal causal inference to our POMDP setting, providing a variety of settings where identification is made possible.
arXiv Detail & Related papers (2021-10-28T17:46:14Z) - Supervised Off-Policy Ranking [145.3039527243585]
Off-policy evaluation (OPE) leverages data generated by other policies to evaluate a target policy.
We propose supervised off-policy ranking that learns a policy scoring model by correctly ranking training policies with known performance.
Our method outperforms strong baseline OPE methods in terms of both rank correlation and performance gap between the truly best and the best of the ranked top three policies.
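The sketch below illustrates the supervised-ranking idea in its simplest form: fit a scoring model with a pairwise logistic ranking loss so that training policies with higher known performance get higher scores; the policy features, dimensions, and loss choice are assumptions, not the paper's exact model.

```python
# Sketch: learn a policy scoring model by ranking policies with known performance.
import torch
import torch.nn as nn
import torch.nn.functional as F

n_policies, feat_dim = 32, 16
policy_feats = torch.randn(n_policies, feat_dim)   # placeholder policy representations
true_perf = torch.randn(n_policies)                # known returns of training policies

scorer = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(scorer.parameters(), lr=1e-3)

for step in range(200):
    i = torch.randint(n_policies, (128,))
    j = torch.randint(n_policies, (128,))
    s_i = scorer(policy_feats[i]).squeeze(-1)
    s_j = scorer(policy_feats[j]).squeeze(-1)
    # Pairwise logistic loss: score gap should agree with the true performance gap.
    sign = torch.sign(true_perf[i] - true_perf[j])
    loss = F.softplus(-sign * (s_i - s_j)).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Unseen target policies would then be ranked by scorer(features).
```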
arXiv Detail & Related papers (2021-07-03T07:01:23Z) - Semi-On-Policy Training for Sample Efficient Multi-Agent Policy Gradients [51.749831824106046]
We introduce semi-on-policy (SOP) training as an effective and computationally efficient way to address the sample inefficiency of on-policy policy gradient methods.
We show that our methods perform as well or better than state-of-the-art value-based methods on a variety of SMAC tasks.
arXiv Detail & Related papers (2021-04-27T19:37:01Z) - Privacy-Constrained Policies via Mutual Information Regularized Policy Gradients [54.98496284653234]
We consider the task of training a policy that maximizes reward while minimizing disclosure of certain sensitive state variables through the actions.
We solve this problem by introducing a regularizer based on the mutual information between the sensitive state and the actions.
We develop a model-based estimator for optimization of privacy-constrained policies.
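As a loose illustration of such a penalty (a generic classifier-based surrogate, not the paper's model-based estimator), the sketch below penalizes the policy objective by how well an auxiliary adversary predicts the sensitive variable from the policy's action distribution; names, shapes, and the discrete setting are assumptions.

```python
# Sketch: mutual-information-style privacy penalty via an auxiliary adversary.
import torch
import torch.nn as nn

n_sensitive, n_actions = 4, 6
adversary = nn.Sequential(nn.Linear(n_actions, 32), nn.ReLU(), nn.Linear(32, n_sensitive))
ce = nn.CrossEntropyLoss()

def privacy_penalty(action_probs, sensitive_labels):
    """Average log-likelihood log q(u | a); adding the constant H(U) would give
    a variational lower bound on I(U; A). High when actions leak u."""
    logits = adversary(action_probs)
    return -ce(logits, sensitive_labels)   # negative cross-entropy = mean log-likelihood

# Placeholder batch: policy action distributions and the sensitive labels they came from.
action_probs = torch.softmax(torch.randn(128, n_actions), dim=-1)
sensitive = torch.randint(n_sensitive, (128,))

lam = 0.1
estimated_return = torch.tensor(0.0)       # placeholder return estimate
policy_objective = estimated_return - lam * privacy_penalty(action_probs, sensitive)
# In practice the adversary is trained in alternation to maximize its own likelihood.
```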
arXiv Detail & Related papers (2020-12-30T03:22:35Z) - Stable Policy Optimization via Off-Policy Divergence Regularization [50.98542111236381]
Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) are among the most successful policy gradient approaches in deep reinforcement learning (RL).
We propose a new algorithm which stabilizes the policy improvement through a proximity term that constrains the discounted state-action visitation distribution induced by consecutive policies to be close to one another.
Our proposed method can have a beneficial effect on stability and improve final performance in benchmark high-dimensional control tasks.
arXiv Detail & Related papers (2020-03-09T13:05:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.