ChronosPerseus: Randomized Point-based Value Iteration with Importance
Sampling for POSMDPs
- URL: http://arxiv.org/abs/2207.07825v1
- Date: Sat, 16 Jul 2022 03:31:47 GMT
- Title: ChronosPerseus: Randomized Point-based Value Iteration with Importance
Sampling for POSMDPs
- Authors: Richard Kohar, François Rivest and Alain Gosselin
- Abstract summary: In reinforcement learning, agents have successfully used environments modeled with Markov decision processes (MDPs).
In many problem domains, an agent may suffer from noisy observations or random times until its subsequent decision.
We propose that partially observable semi-Markov decision processes (POSMDPs) can be helpful in dealing with the unknown time aspect.
- Score: 2.3204178451683264
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In reinforcement learning, agents have successfully used environments modeled
with Markov decision processes (MDPs). However, in many problem domains, an
agent may suffer from noisy observations or random times until its subsequent
decision. While partially observable Markov decision processes (POMDPs) have
dealt with noisy observations, they have yet to deal with the unknown time
aspect. Of course, one could discretize the time, but this leads to Bellman's
Curse of Dimensionality. To incorporate continuous sojourn-time distributions
in the agent's decision making, we propose that partially observable
semi-Markov decision processes (POSMDPs) can be helpful in this regard. We
extend Spaan and Vlassis's (2005) randomized point-based value iteration (PBVI)
algorithm, Perseus, from POMDPs to POSMDPs by incorporating continuous
sojourn-time distributions and using importance sampling to reduce the solver's
complexity. We call this new PBVI algorithm with importance sampling for
POSMDPs ChronosPerseus. This further allows complex POMDPs that require
temporal state information to be compressed by moving that information into
the state sojourn times of a POSMDP. The second insight is that a set of
sampled times, each weighted by its likelihood, can be used within a single
backup; this helps further reduce the algorithm's complexity. The solver also
works on episodic and non-episodic problems. We conclude our paper with two
examples, an episodic bus problem and a non-episodic maintenance problem.
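The abstract's two ingredients, importance sampling over continuous sojourn times and reusing one shared set of sampled times across backups, can be illustrated with a toy sketch. The following Python/NumPy example is a minimal, hypothetical illustration, not the paper's implementation: the exponential sojourn-time density, the proposal distribution, and names such as sojourn_pdf, expected_discount, and backup are assumptions made for the example, and Perseus's randomized belief-selection step is omitted for brevity.

```python
import numpy as np

# Toy POSMDP with illustrative (not the paper's) components: small discrete
# state/action/observation spaces and a continuous sojourn-time density
# f(tau | s, a).  Rewards are treated as lump-sum per decision, a simplification.
rng = np.random.default_rng(0)
n_states, n_actions, n_obs = 3, 2, 2
gamma = 0.95                                                        # per-unit-time discount

T = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))   # T[a, s, s']
O = rng.dirichlet(np.ones(n_obs), size=(n_actions, n_states))      # O[a, s', o]
R = rng.normal(size=(n_actions, n_states))                         # R[a, s]

def sojourn_pdf(tau, s, a):
    """Assumed exponential sojourn-time density f(tau | s, a); any continuous
    density could be plugged in here."""
    rate = 0.5 + 0.5 * (s + 1) / n_states + 0.1 * a
    return rate * np.exp(-rate * tau)

# One shared set of sampled sojourn times drawn from a proposal q (here an
# exponential with rate 1).  The same samples are reused in every backup and
# reweighted by the likelihood ratio f / q -- the importance-sampling idea
# described in the abstract.
n_samples = 64
tau_samples = rng.exponential(scale=1.0, size=n_samples)
q_pdf = np.exp(-tau_samples)                                        # proposal density at the samples

def expected_discount(s, a):
    """Importance-sampling estimate of E[gamma^tau | s, a]."""
    w = sojourn_pdf(tau_samples, s, a) / q_pdf
    return np.mean(w * gamma ** tau_samples)

def backup(belief, alpha_set):
    """Point-based backup at a single belief, in the spirit of Perseus:
    return the best alpha-vector for this belief under the shared samples."""
    best_val, best_alpha = -np.inf, None
    for a in range(n_actions):
        # per-state discount estimated once from the shared sojourn-time samples
        disc = np.array([expected_discount(s, a) for s in range(n_states)])
        alpha_a = R[a].copy()
        for o in range(n_obs):
            # value of the successor belief under each candidate alpha-vector
            future = np.array([T[a] @ (O[a, :, o] * alpha) for alpha in alpha_set])
            best = future[np.argmax(future @ belief)]
            alpha_a += disc * best
        val = alpha_a @ belief
        if val > best_val:
            best_val, best_alpha = val, alpha_a
    return best_alpha

beliefs = rng.dirichlet(np.ones(n_states), size=10)                 # sampled belief set
alpha_set = [np.zeros(n_states)]                                    # initial value function
for _ in range(20):                                                 # simplified PBVI sweeps (randomized selection omitted)
    alpha_set = [backup(b, alpha_set) for b in beliefs]
print("value at uniform belief:",
      max(a @ np.ones(n_states) / n_states for a in alpha_set))
```

The point of the sketch is the weighting step: the same tau_samples are reused in every backup and reweighted by f(tau | s, a) / q(tau), so the continuous-time expectation of gamma^tau is estimated without resampling or discretizing time.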
Related papers
- Near-Optimal Learning and Planning in Separated Latent MDPs [70.88315649628251]
We study computational and statistical aspects of learning Latent Markov Decision Processes (LMDPs).
In this model, the learner interacts with an MDP drawn at the beginning of each epoch from an unknown mixture of MDPs.
arXiv Detail & Related papers (2024-06-12T06:41:47Z) - Prospective Side Information for Latent MDPs [80.00842638151558]
We study the class of LMDPs with prospective side information, when an agent receives additional, weakly revealing, information at the beginning of each episode.
We show that, surprisingly, this problem is not captured by contemporary settings and algorithms designed for partially observed environments.
We then establish that any sample-efficient algorithm must suffer at least $\Omega(K^{2/3})$-regret, as opposed to the standard $\Omega(\sqrt{K})$ lower bounds, and design an algorithm with a matching upper bound.
arXiv Detail & Related papers (2023-10-11T15:37:31Z) - Online POMDP Planning with Anytime Deterministic Guarantees [11.157761902108692]
Planning under uncertainty can be mathematically formalized using partially observable Markov decision processes (POMDPs).
Finding an optimal plan for POMDPs can be computationally expensive and is feasible only for small tasks.
We derive a deterministic relationship between a simplified solution that is easier to obtain and the theoretically optimal one.
arXiv Detail & Related papers (2023-10-03T04:40:38Z) - Twice Regularized Markov Decision Processes: The Equivalence between
Robustness and Regularization [64.60253456266872]
Robust Markov decision processes (MDPs) aim to handle changing or partially known system dynamics.
Regularized MDPs show more stability in policy learning without impairing time complexity.
Bellman operators enable us to derive planning and learning schemes with convergence and generalization guarantees.
arXiv Detail & Related papers (2023-03-12T13:03:28Z) - Intermittently Observable Markov Decision Processes [26.118176084782842]
We consider a scenario where the controller perceives the state information of the process via an unreliable communication channel.
The transmissions of state information over the whole time horizon are modeled as a Bernoulli lossy process.
We develop two finite-state approximations to the tree MDP to find near-optimal policies efficiently.
arXiv Detail & Related papers (2023-02-23T03:38:03Z) - Interval Markov Decision Processes with Continuous Action-Spaces [6.088695984060244]
We introduce continuous-action IMDPs (caIMDPs), where the bounds on transition probabilities are functions of the action variables.
We exploit the simple form of max problems to identify cases where value iteration over caIMDPs can be performed efficiently.
We demonstrate our results on a numerical example.
arXiv Detail & Related papers (2022-11-02T16:11:51Z) - B$^3$RTDP: A Belief Branch and Bound Real-Time Dynamic Programming
Approach to Solving POMDPs [17.956744635160568]
We propose an extension to the RTDP-Bel algorithm which we call Belief Branch and Bound RTDP (B$^3$RTDP).
Our algorithm uses a bounded value function representation and takes advantage of this in two novel ways.
We empirically demonstrate that B$3$RTDP can achieve greater returns in less time than the state-of-the-art SARSOP solver on known POMDP problems.
arXiv Detail & Related papers (2022-10-22T21:42:59Z) - Provable Reinforcement Learning with a Short-Term Memory [68.00677878812908]
We study a new subclass of POMDPs, whose latent states can be decoded by the most recent history of a short length $m$.
In particular, in the rich-observation setting, we develop new algorithms using a novel "moment matching" approach with a sample complexity that scales exponentially with the short length $m$ rather than the problem horizon.
Our results show that a short-term memory suffices for reinforcement learning in these environments.
arXiv Detail & Related papers (2022-02-08T16:39:57Z) - Markov Decision Process modeled with Bandits for Sequential Decision
Making in Linear-flow [73.1896399783641]
In membership/subscriber acquisition and retention, we sometimes need to recommend marketing content for multiple pages in sequence.
We propose to formulate the problem as an MDP with Bandits where Bandits are employed to model the transition probability matrix.
We observe the proposed MDP with Bandits algorithm outperforms Q-learning with $\epsilon$-greedy and decreasing $\epsilon$, independent Bandits, and interaction Bandits.
arXiv Detail & Related papers (2021-07-01T03:54:36Z) - Navigating to the Best Policy in Markov Decision Processes [68.8204255655161]
We investigate the active pure exploration problem in Markov Decision Processes.
The agent sequentially selects actions and, from the resulting system trajectory, aims at identifying the best policy as fast as possible.
arXiv Detail & Related papers (2021-06-05T09:16:28Z)