Online POMDP Planning via Simplification
- URL: http://arxiv.org/abs/2105.05296v1
- Date: Tue, 11 May 2021 18:46:08 GMT
- Title: Online POMDP Planning via Simplification
- Authors: Ori Sztyglic and Vadim Indelman
- Abstract summary: We develop a novel approach to POMDP planning considering belief-dependent rewards.
Our approach is guaranteed to find the optimal solution of the original problem but with substantial speedup.
We validate our approach in simulation using these bounds, with simplification corresponding to reducing the number of samples, and exhibit a significant computational speedup.
- Score: 10.508187462682306
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we consider online planning in partially observable domains.
Solving the corresponding POMDP problem is a very challenging task,
particularly in an online setting. Our key contribution is a novel algorithmic
approach, Simplified Information Theoretic Belief Space Planning (SITH-BSP),
which aims to speed up POMDP planning considering belief-dependent rewards,
without compromising on the solution's accuracy. We do so by mathematically
relating the simplified elements of the problem to the corresponding
counterparts of the original problem. Specifically, we focus on belief
simplification and use it to formulate bounds on the corresponding original
belief-dependent rewards. These bounds in turn are used to perform branch
pruning over the belief tree, in the process of calculating the optimal policy.
We further introduce the notion of adaptive simplification, re-using
calculations between different simplification levels, and exploit it to prune,
at each level in the belief tree, all branches but one. Therefore, our approach
is guaranteed to find the optimal solution of the original problem but with
substantial speedup. As a second key contribution, we derive novel analytical
bounds for differential entropy, considering a sampling-based belief
representation, which we believe are of interest on their own. We validate our
approach in simulation using these bounds, with simplification corresponding
to reducing the number of samples, and exhibit a significant computational
speedup while yielding the optimal solution.
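To make the pruning mechanism concrete, here is a minimal Python sketch of bound-based branch elimination with adaptive simplification. The bound function, the sample counts, and the action interface are hypothetical stand-ins; the sketch illustrates the exact-pruning idea (an action whose upper bound falls below another action's lower bound cannot be optimal), not the authors' SITH-BSP implementation.

```python
import math


def reward_bounds(samples):
    """Stand-in for analytical bounds on a belief-dependent reward
    (e.g., differential entropy) computed from a sampled belief.
    Here a point estimate is wrapped in a slack that shrinks as the
    number of samples grows; the paper derives such bounds analytically."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((s - mean) ** 2 for s in samples) / n
    estimate = 0.5 * math.log(2.0 * math.pi * math.e * (var + 1e-9))
    slack = 2.0 / math.sqrt(n)  # hypothetical: tighter with more samples
    return estimate - slack, estimate + slack


def prune_to_one(samples_per_action, levels=(20, 100, 500)):
    """Adaptive simplification: evaluate each candidate action at a coarse
    simplification level first, prune dominated branches, and refine only
    the survivors.  Taking prefixes of a single sample set mimics the
    re-use of calculations between simplification levels."""
    candidates = set(samples_per_action)
    for n in levels:
        bounds = {a: reward_bounds(samples_per_action[a][:n])
                  for a in candidates}
        best_lower = max(lo for lo, _ in bounds.values())
        # Exact pruning: a branch whose upper bound is below the best
        # lower bound cannot contain the optimal action.
        candidates = {a for a, (lo, hi) in bounds.items() if hi >= best_lower}
        if len(candidates) == 1:
            break
    return candidates
```

Because pruning only discards actions whose bound intervals are strictly dominated, the surviving branch is the same one full-resolution planning would select, which is the sense in which the speedup comes without compromising optimality.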
Related papers
- Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences.
To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model.
Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
arXiv Detail & Related papers (2024-05-26T05:38:50Z)
- No Compromise in Solution Quality: Speeding Up Belief-dependent Continuous POMDPs via Adaptive Multilevel Simplification [6.300736240833814]
Continuous POMDPs with general belief-dependent rewards are notoriously difficult to solve online.
We present a complete provable theory of adaptive multilevel simplification for the setting of a given externally constructed belief tree.
We present three algorithms to accelerate continuous POMDP online planning with belief-dependent rewards.
arXiv Detail & Related papers (2023-10-16T10:59:22Z)
- Online POMDP Planning with Anytime Deterministic Guarantees [11.157761902108692]
Planning under uncertainty can be mathematically formalized using partially observable Markov decision processes (POMDPs).
Finding an optimal plan for POMDPs can be computationally expensive and is feasible only for small tasks.
We derive a deterministic relationship between a simplified solution that is easier to obtain and the theoretically optimal one.
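The flavor of such a guarantee can be sketched as follows: if the gap between the simplified and original objectives admits a deterministic (not merely probabilistic) bound that shrinks with computational effort, an anytime loop can return a certified interval around the optimal value. The `evaluate` interface and the doubling schedule below are hypothetical illustrations, not the paper's algorithm.

```python
def anytime_certified_plan(evaluate, tolerance):
    """Generic anytime loop over a simplified planner.  `evaluate(effort)`
    is assumed to return (value, gap) where `gap` deterministically bounds
    |value - optimal value| at that effort level; the loop stops once the
    certificate is tight enough."""
    effort = 1
    value, gap = evaluate(effort)
    while gap > tolerance:
        effort *= 2  # invest more computation, tightening the bound
        value, gap = evaluate(effort)
    # On return, the optimal value provably lies in [value - gap, value + gap].
    return value, gap
```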
arXiv Detail & Related papers (2023-10-03T04:40:38Z)
- PARL: A Unified Framework for Policy Alignment in Reinforcement Learning from Human Feedback [106.63518036538163]
We present a novel unified bilevel optimization-based framework, PARL, formulated to address the recently highlighted critical issue of policy alignment in reinforcement learning.
Our framework addresses these concerns by explicitly parameterizing the distribution of the upper alignment objective (reward design) by the lower optimal variable.
Our empirical results substantiate that the proposed PARL can address the alignment concerns in RL by showing significant improvements.
arXiv Detail & Related papers (2023-08-03T18:03:44Z)
- Non-stationary Reinforcement Learning under General Function Approximation [60.430936031067006]
We first propose a new complexity metric called dynamic Bellman Eluder (DBE) dimension for non-stationary MDPs.
Based on the proposed complexity metric, we propose a novel confidence-set based model-free algorithm called SW-OPEA.
We show that SW-OPEA is provably efficient as long as the variation budget is not significantly large.
arXiv Detail & Related papers (2023-06-01T16:19:37Z)
- Simplified Continuous High Dimensional Belief Space Planning with Adaptive Probabilistic Belief-dependent Constraints [9.061408029414453]
Online decision making under uncertainty in partially observable domains, also known as Belief Space Planning, is a fundamental problem.
We present a technique to adaptively accept or discard a candidate action sequence with respect to a probabilistic belief-dependent constraint.
We apply our method to active SLAM, a highly challenging problem of high dimensional Belief Space Planning.
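A generic way to realize such an accept/discard test is a Monte Carlo check of the chance constraint over belief samples; the sketch below is a plain illustration of that idea with a hypothetical `constraint` predicate, not the paper's adaptive rule.

```python
def accept_candidate(belief_samples, constraint, delta):
    """Accept a candidate action sequence only if the belief-dependent
    constraint holds with estimated probability at least 1 - delta,
    the probability being taken over samples of the posterior belief."""
    satisfied = sum(1 for s in belief_samples if constraint(s))
    return satisfied / len(belief_samples) >= 1.0 - delta
```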
arXiv Detail & Related papers (2023-02-13T21:22:47Z)
- Near Instance-Optimal PAC Reinforcement Learning for Deterministic MDPs [24.256960622176305]
We propose the first (nearly) matching upper and lower bounds on the sample complexity of PAC RL in episodic Markov decision processes.
Our bounds feature a new notion of sub-optimality gap for state-action pairs that we call the deterministic return gap.
Their design and analyses employ novel ideas, including graph-theoretical concepts such as minimum flows and maximum cuts.
arXiv Detail & Related papers (2022-03-17T11:19:41Z)
- Sequential Information Design: Markov Persuasion Process and Its Efficient Reinforcement Learning [156.5667417159582]
This paper proposes a novel model of sequential information design, namely Markov persuasion processes (MPPs).
Planning in MPPs faces the unique challenge in finding a signaling policy that is simultaneously persuasive to the myopic receivers and inducing the optimal long-term cumulative utilities of the sender.
We design a provably efficient no-regret learning algorithm, the Optimism-Pessimism Principle for Persuasion Process (OP4), which features a novel combination of both optimism and pessimism principles.
arXiv Detail & Related papers (2022-02-22T05:41:43Z)
- Outlier-Robust Sparse Estimation via Non-Convex Optimization [73.18654719887205]
We explore the connection between high-dimensional robust statistics and non-convex optimization in the presence of sparsity constraints.
We develop novel and simple optimization formulations for these problems.
As a corollary, we obtain that any first-order method that efficiently converges to stationarity yields an efficient algorithm for these tasks.
arXiv Detail & Related papers (2021-09-23T17:38:24Z)
- Adaptive Sampling for Best Policy Identification in Markov Decision Processes [79.4957965474334]
We investigate the problem of best-policy identification in discounted Markov Decision Processes (MDPs) when the learner has access to a generative model.
The advantages of state-of-the-art algorithms are discussed and illustrated.
arXiv Detail & Related papers (2020-09-28T15:22:24Z)
- FedSplit: An algorithmic framework for fast federated optimization [40.42352500741025]
We introduce FedSplit, a class of algorithms for solving distributed convex minimization with additive structure.
Our theory shows that these methods are provably robust to inexact computation of intermediate local quantities.
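For reference, additive structure here means minimizing a sum of per-client objectives, which operator splitting handles naturally. The sketch below follows a Peaceman-Rachford-style consensus update under the assumption that each client exposes a proximal operator of its local objective; step-size handling and the inexact local solves that the robustness result concerns are omitted, so this is a simplified rendering, not the paper's exact specification.

```python
import numpy as np


def fedsplit(prox_ops, x0, rounds=100):
    """Operator-splitting sketch for minimizing f_1(x) + ... + f_m(x):
    each client applies its local proximal operator to a reflected copy
    of the server average, and the server re-averages the iterates."""
    z = [np.array(x0, dtype=float) for _ in prox_ops]
    for _ in range(rounds):
        x_bar = sum(z) / len(z)                  # server aggregation
        for j, prox in enumerate(prox_ops):
            half = prox(2.0 * x_bar - z[j])      # local proximal step
            z[j] = z[j] + 2.0 * (half - x_bar)   # reflected update
    return sum(z) / len(z)
```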
arXiv Detail & Related papers (2020-05-11T16:30:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.