Partially Observable Reference Policy Programming: Solving POMDPs Sans Numerical Optimisation
- URL: http://arxiv.org/abs/2507.12186v1
- Date: Wed, 16 Jul 2025 12:33:32 GMT
- Title: Partially Observable Reference Policy Programming: Solving POMDPs Sans Numerical Optimisation
- Authors: Edward Kim, Hanna Kurniawati
- Abstract summary: This paper proposes a novel anytime online approximate POMDP solver which samples meaningful future histories very deeply. We provide theoretical guarantees for the algorithm's underlying scheme which say that the performance loss is bounded by the average of the sampling approximation errors.
- Score: 4.258302855015618
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper proposes Partially Observable Reference Policy Programming, a novel anytime online approximate POMDP solver which samples meaningful future histories very deeply while simultaneously forcing a gradual policy update. We provide theoretical guarantees for the algorithm's underlying scheme which say that the performance loss is bounded by the average of the sampling approximation errors rather than the usual maximum, a crucial requirement given the sampling sparsity of online planning. Empirical evaluations on two large-scale problems with dynamically evolving environments -- including a helicopter emergency scenario in the Corsica region requiring approximately 150 planning steps -- corroborate the theoretical results and indicate that our solver considerably outperforms current online benchmarks.
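As a rough, self-contained illustration of the planning pattern the abstract describes (sampling long future histories from the current belief and making only a gradual, interpolated update to a reference policy), the sketch below runs the idea on the classic Tiger POMDP. It is a hypothetical toy, not the authors' Partially Observable Reference Policy Programming algorithm; the Tiger model, the time budget, and the interpolation step size are illustrative assumptions.

```python
# Illustrative anytime online POMDP planning loop (toy sketch, not the paper's method):
# sample deep histories under a reference policy, then nudge that policy gradually.
import random
import time

ACTIONS = ["listen", "open-left", "open-right"]
GAMMA = 0.95

def step(tiger_left, action):
    """Tiger POMDP dynamics: return (next_state, observation, reward)."""
    if action == "listen":
        correct = random.random() < 0.85
        obs = ("hear-left" if tiger_left else "hear-right") if correct else \
              ("hear-right" if tiger_left else "hear-left")
        return tiger_left, obs, -1.0
    opened_tiger_door = (action == "open-left") == tiger_left
    reward = -100.0 if opened_tiger_door else 10.0
    return random.random() < 0.5, "reset", reward      # problem resets after opening

def update_belief(b_left, action, obs):
    """Bayes update of P(tiger behind left door)."""
    if action != "listen":
        return 0.5                                      # opening a door resets the problem
    like_left = 0.85 if obs == "hear-left" else 0.15    # P(obs | tiger left)
    like_right = 1.0 - like_left                        # P(obs | tiger right)
    z = like_left * b_left + like_right * (1.0 - b_left)
    return like_left * b_left / z

def sample_action(policy):
    """Sample an action from a dict of action probabilities."""
    r, acc = random.random(), 0.0
    for a, p in policy.items():
        acc += p
        if r <= acc:
            return a
    return ACTIONS[-1]

def plan_anytime(b_left, policy, budget_s=0.05, horizon=30, step_size=0.2):
    """One planning step: sample deep histories, then gradually update the reference policy."""
    deadline = time.monotonic() + budget_s
    totals = {a: [0.0, 1e-9] for a in ACTIONS}          # action -> [return sum, count]
    while time.monotonic() < deadline:
        state, belief = random.random() < b_left, b_left
        first = sample_action(policy)
        a, ret = first, 0.0
        for t in range(horizon):                        # roll one deep sampled history
            state, obs, r = step(state, a)
            belief = update_belief(belief, a, obs)
            ret += (GAMMA ** t) * r
            a = sample_action(policy)
        totals[first][0] += ret
        totals[first][1] += 1
    best = max(ACTIONS, key=lambda a: totals[a][0] / totals[a][1])
    # Gradual update: interpolate toward the greedy choice instead of jumping to it.
    for a in ACTIONS:
        target = 1.0 if a == best else 0.0
        policy[a] = (1.0 - step_size) * policy[a] + step_size * target
    return best

policy = {a: 1.0 / len(ACTIONS) for a in ACTIONS}
print(plan_anytime(0.5, policy))
```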
Related papers
- Sequential Monte Carlo for Policy Optimization in Continuous POMDPs [9.690099639375456]
We introduce a novel policy optimization framework for continuous partially observable Markov decision processes (POMDPs). Our method casts policy learning as probabilistic inference in a non-Markovian Feynman--Kac model. We demonstrate the effectiveness of our algorithm across standard continuous POMDP benchmarks.
arXiv Detail & Related papers (2025-05-22T14:45:46Z) - A Two-Timescale Primal-Dual Framework for Reinforcement Learning via Online Dual Variable Guidance [3.4354636842203026]
We propose PGDA-RL, a primal-dual Projected Gradient Descent-Ascent algorithm for solving regularized Markov Decision Processes (MDPs). PGDA-RL integrates experience replay-based gradient estimation with a two-timescale decomposition of the underlying nested optimization problem. We prove that PGDA-RL converges almost surely to the optimal value function and policy of the regularized MDP.
arXiv Detail & Related papers (2025-05-07T15:18:43Z) - Efficient Learning of POMDPs with Known Observation Model in Average-Reward Setting [56.92178753201331]
We propose the Observation-Aware Spectral (OAS) estimation technique, which enables the POMDP parameters to be learned from samples collected using a belief-based policy.
We show the consistency of the OAS procedure, and we prove a regret guarantee of order $\mathcal{O}(\sqrt{T \log(T)})$ for the proposed OAS-UCRL algorithm.
arXiv Detail & Related papers (2024-10-02T08:46:34Z) - Semantic-Aware Remote Estimation of Multiple Markov Sources Under Constraints [9.514904359788156]
We exploit the semantics of information and consider that the remote actuator has different tolerances for the estimation errors. We find an optimal scheduling policy that minimizes the long-term state-dependent costs of estimation errors under a transmission frequency constraint.
arXiv Detail & Related papers (2024-03-25T15:18:23Z) - Learning Logic Specifications for Policy Guidance in POMDPs: an Inductive Logic Programming Approach [57.788675205519986]
We learn high-quality traces from POMDP executions generated by any solver.
We exploit data- and time-efficient Inductive Logic Programming (ILP) to generate interpretable belief-based policy specifications.
We show that learned specifications expressed in Answer Set Programming (ASP) yield performance superior to neural networks and similar to optimal handcrafted task-specific heuristics, within lower computational time.
arXiv Detail & Related papers (2024-02-29T15:36:01Z) - Offline Reinforcement Learning via Linear-Programming with Error-Bound Induced Constraints [26.008426384903764]
Offline reinforcement learning (RL) aims to find an optimal policy for Markov decision processes (MDPs) using a pre-collected dataset. In this work, we revisit the linear programming (LP) reformulation of Markov decision processes for offline RL.
arXiv Detail & Related papers (2022-12-28T15:28:12Z) - Pessimistic Q-Learning for Offline Reinforcement Learning: Towards Optimal Sample Complexity [51.476337785345436]
We study a pessimistic variant of Q-learning in the context of finite-horizon Markov decision processes.
A variance-reduced pessimistic Q-learning algorithm is proposed to achieve near-optimal sample complexity.
arXiv Detail & Related papers (2022-02-28T15:39:36Z) - Solving Multistage Stochastic Linear Programming via Regularized Linear Decision Rules: An Application to Hydrothermal Dispatch Planning [77.34726150561087]
We propose a novel regularization scheme for linear decision rules (LDR) based on the AdaSO (adaptive least absolute shrinkage and selection operator).
Experiments show that the overfit threat is non-negligible when using the classical non-regularized LDR to solve MSLP.
For the LHDP problem, our analysis highlights several benefits of the proposed framework in comparison to the non-regularized benchmark.
arXiv Detail & Related papers (2021-10-07T02:36:14Z) - Application-Driven Learning: A Closed-Loop Prediction and Optimization Approach Applied to Dynamic Reserves and Demand Forecasting [41.94295877935867]
We present application-driven learning, a new closed-loop framework in which the processes of forecasting and decision-making are merged and co-optimized.
We show that the proposed methodology is scalable and yields consistently better performance than the standard open-loop approach.
arXiv Detail & Related papers (2021-02-26T02:43:28Z) - A maximum-entropy approach to off-policy evaluation in average-reward MDPs [54.967872716145656]
This work focuses on off-policy evaluation (OPE) with function approximation in infinite-horizon undiscounted Markov decision processes (MDPs).
We provide the first finite-sample OPE error bound, extending existing results beyond the episodic and discounted cases.
We show that this results in an exponential-family distribution whose sufficient statistics are the features, paralleling maximum-entropy approaches in supervised learning.
arXiv Detail & Related papers (2020-06-17T18:13:37Z) - Optimizing for the Future in Non-Stationary MDPs [52.373873622008944]
We present a policy gradient algorithm that maximizes a forecast of future performance.
We show that our algorithm, called Prognosticator, is more robust to non-stationarity than two online adaptation techniques.
arXiv Detail & Related papers (2020-05-17T03:41:19Z) - Counterfactual Learning of Stochastic Policies with Continuous Actions [42.903292639112536]
We introduce a modelling strategy based on a joint kernel embedding of contexts and actions. We empirically show that the optimization aspect of counterfactual learning is important. We propose an evaluation protocol for offline policies in real-world logged systems.
arXiv Detail & Related papers (2020-04-22T07:42:30Z)
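As a loose, hypothetical illustration of the counterfactual evaluation setting in the last entry above, the following sketch estimates the value of a continuous-action target policy from logged data by replacing the hard action-matching indicator with a Gaussian kernel. It is not the paper's joint context-action kernel embedding; the bandwidth, the synthetic logging policy, and the reward model are illustrative assumptions.

```python
# Minimal kernel-smoothed off-policy evaluation for continuous actions (illustrative sketch).
import numpy as np

def kernel_ope(logged_actions, logged_rewards, logged_propensities,
               target_actions, bandwidth=0.5):
    """Estimate the value of a target policy from logged continuous-action data.

    logged_actions      : (n,) actions taken by the logging policy
    logged_rewards      : (n,) observed rewards
    logged_propensities : (n,) logging-policy densities at the logged actions
    target_actions      : (n,) actions the target policy would take in the same contexts
    """
    a = np.asarray(logged_actions)
    r = np.asarray(logged_rewards)
    pi0 = np.asarray(logged_propensities)
    t = np.asarray(target_actions)
    # Gaussian kernel stands in for the indicator 1{a == pi(x)}, which is
    # degenerate when actions are continuous.
    k = np.exp(-0.5 * ((a - t) / bandwidth) ** 2) / (bandwidth * np.sqrt(2 * np.pi))
    weights = k / pi0
    return np.sum(weights * r) / np.sum(weights)   # self-normalised estimate

# Synthetic example: logging policy N(0, 1), deterministic target action a = 0.3.
rng = np.random.default_rng(0)
acts = rng.normal(0.0, 1.0, size=5000)
rews = -(acts - 0.3) ** 2 + rng.normal(0.0, 0.1, size=5000)
props = np.exp(-0.5 * acts ** 2) / np.sqrt(2 * np.pi)
print(kernel_ope(acts, rews, props, np.full(5000, 0.3)))
```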