ChronosPerseus: Randomized Point-based Value Iteration with Importance
Sampling for POSMDPs
- URL: http://arxiv.org/abs/2207.07825v1
- Date: Sat, 16 Jul 2022 03:31:47 GMT
- Title: ChronosPerseus: Randomized Point-based Value Iteration with Importance
Sampling for POSMDPs
- Authors: Richard Kohar, François Rivest and Alain Gosselin
- Abstract summary: In reinforcement learning, agents have successfully used environments modeled with Markov decision processes (MDPs).
In many problem domains, an agent may suffer from noisy observations or random times until its subsequent decision.
We propose that partially observable semi-Markov decision processes (POSMDPs) can be helpful in dealing with the unknown time aspect.
- Score: 2.3204178451683264
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In reinforcement learning, agents have successfully used environments modeled
with Markov decision processes (MDPs). However, in many problem domains, an
agent may suffer from noisy observations or random times until its subsequent
decision. While partially observable Markov decision processes (POMDPs) have
dealt with noisy observations, they have yet to deal with the unknown time
aspect. Of course, one could discretize the time, but this leads to Bellman's
Curse of Dimensionality. To incorporate continuous sojourn-time distributions
in the agent's decision making, we propose that partially observable
semi-Markov decision processes (POSMDPs) can be helpful in this regard. We
extend Spaan and Vlassis's (2005) randomized point-based value iteration (PBVI)
algorithm, Perseus, from POMDPs to POSMDPs by incorporating continuous
sojourn-time distributions and using importance sampling to reduce the solver's
complexity. We call this new PBVI algorithm with importance sampling for
POSMDPs ChronosPerseus. This further allows complex POMDPs that require
temporal state information to be compressed by moving that information into
the state sojourn times of a POSMDP. The second insight is that a set of
sampled times, each weighted by its likelihood, can be used within a single
backup; this helps further reduce the algorithm's complexity. The solver also
works on episodic and non-episodic problems. We conclude our paper with two
examples, an episodic bus problem and a non-episodic maintenance problem.
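The abstract's two ingredients, importance sampling over continuous sojourn times and reusing one shared set of sampled times across backups, can be illustrated with a toy sketch. The following Python/NumPy example is a minimal, hypothetical illustration, not the paper's implementation: the exponential sojourn-time density, the proposal distribution, and names such as sojourn_pdf, expected_discount, and backup are assumptions made for the example, and Perseus's randomized belief-selection step is omitted for brevity.

```python
import numpy as np

# Toy POSMDP with illustrative (not the paper's) components: small discrete
# state/action/observation spaces and a continuous sojourn-time density
# f(tau | s, a).  Rewards are treated as lump-sum per decision, a simplification.
rng = np.random.default_rng(0)
n_states, n_actions, n_obs = 3, 2, 2
gamma = 0.95                                                        # per-unit-time discount

T = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))   # T[a, s, s']
O = rng.dirichlet(np.ones(n_obs), size=(n_actions, n_states))      # O[a, s', o]
R = rng.normal(size=(n_actions, n_states))                         # R[a, s]

def sojourn_pdf(tau, s, a):
    """Assumed exponential sojourn-time density f(tau | s, a); any continuous
    density could be plugged in here."""
    rate = 0.5 + 0.5 * (s + 1) / n_states + 0.1 * a
    return rate * np.exp(-rate * tau)

# One shared set of sampled sojourn times drawn from a proposal q (here an
# exponential with rate 1).  The same samples are reused in every backup and
# reweighted by the likelihood ratio f / q -- the importance-sampling idea
# described in the abstract.
n_samples = 64
tau_samples = rng.exponential(scale=1.0, size=n_samples)
q_pdf = np.exp(-tau_samples)                                        # proposal density at the samples

def expected_discount(s, a):
    """Importance-sampling estimate of E[gamma^tau | s, a]."""
    w = sojourn_pdf(tau_samples, s, a) / q_pdf
    return np.mean(w * gamma ** tau_samples)

def backup(belief, alpha_set):
    """Point-based backup at a single belief, in the spirit of Perseus:
    return the best alpha-vector for this belief under the shared samples."""
    best_val, best_alpha = -np.inf, None
    for a in range(n_actions):
        # per-state discount estimated once from the shared sojourn-time samples
        disc = np.array([expected_discount(s, a) for s in range(n_states)])
        alpha_a = R[a].copy()
        for o in range(n_obs):
            # value of the successor belief under each candidate alpha-vector
            future = np.array([T[a] @ (O[a, :, o] * alpha) for alpha in alpha_set])
            best = future[np.argmax(future @ belief)]
            alpha_a += disc * best
        val = alpha_a @ belief
        if val > best_val:
            best_val, best_alpha = val, alpha_a
    return best_alpha

beliefs = rng.dirichlet(np.ones(n_states), size=10)                 # sampled belief set
alpha_set = [np.zeros(n_states)]                                    # initial value function
for _ in range(20):                                                 # simplified PBVI sweeps (randomized selection omitted)
    alpha_set = [backup(b, alpha_set) for b in beliefs]
print("value at uniform belief:",
      max(a @ np.ones(n_states) / n_states for a in alpha_set))
```

The point of the sketch is the weighting step: the same tau_samples are reused in every backup and reweighted by f(tau | s, a) / q(tau), so the continuous-time expectation of gamma^tau is estimated without resampling or discretizing time.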
Related papers
- Near-Optimal Learning and Planning in Separated Latent MDPs [70.88315649628251]
We study computational and statistical aspects of learning Latent Markov Decision Processes (LMDPs).
In this model, the learner interacts with an MDP drawn at the beginning of each epoch from an unknown mixture of MDPs.
arXiv Detail & Related papers (2024-06-12T06:41:47Z) - Prospective Side Information for Latent MDPs [80.00842638151558]
We study the class of LMDPs with prospective side information, when an agent receives additional, weakly revealing, information at the beginning of each episode.
We show that, surprisingly, this problem is not captured by contemporary settings and algorithms designed for partially observed environments.
We then establish that any sample-efficient algorithm must suffer at least $\Omega(K^{2/3})$-regret, as opposed to the standard $\Omega(\sqrt{K})$ lower bounds, and design an algorithm with a matching upper bound.
arXiv Detail & Related papers (2023-10-11T15:37:31Z) - Online POMDP Planning with Anytime Deterministic Guarantees [11.157761902108692]
Planning under uncertainty can be mathematically formalized using partially observable Markov decision processes (POMDPs).
Finding an optimal plan for POMDPs can be computationally expensive and is feasible only for small tasks.
We derive a deterministic relationship between a simplified solution that is easier to obtain and the theoretically optimal one.
arXiv Detail & Related papers (2023-10-03T04:40:38Z) - Twice Regularized Markov Decision Processes: The Equivalence between
Robustness and Regularization [64.60253456266872]
Robust Markov decision processes (MDPs) aim to handle changing or partially known system dynamics.
Regularized MDPs show more stability in policy learning without impairing time complexity.
Bellman operators enable us to derive planning and learning schemes with convergence and generalization guarantees.
arXiv Detail & Related papers (2023-03-12T13:03:28Z) - Intermittently Observable Markov Decision Processes [26.118176084782842]
We consider a scenario where the controller perceives the state information of the process via an unreliable communication channel.
The transmissions of state information over the whole time horizon are modeled as a Bernoulli lossy process.
We develop two finite-state approximations to the tree MDP to find near-optimal policies efficiently.
arXiv Detail & Related papers (2023-02-23T03:38:03Z) - Interval Markov Decision Processes with Continuous Action-Spaces [6.088695984060244]
We introduce continuous-action IMDPs (caIMDPs), where the bounds on transition probabilities are functions of the action variables.
We exploit the simple form of max problems to identify cases where value iteration over caIMDPs can be performed efficiently.
We demonstrate our results on a numerical example.
arXiv Detail & Related papers (2022-11-02T16:11:51Z) - B$^3$RTDP: A Belief Branch and Bound Real-Time Dynamic Programming
Approach to Solving POMDPs [17.956744635160568]
We propose an extension to the RTDP-Bel algorithm which we call Belief Branch and Bound RTDP (B$^3$RTDP).
Our algorithm uses a bounded value function representation and takes advantage of this in two novel ways.
We empirically demonstrate that B$3$RTDP can achieve greater returns in less time than the state-of-the-art SARSOP solver on known POMDP problems.
arXiv Detail & Related papers (2022-10-22T21:42:59Z) - Provable Reinforcement Learning with a Short-Term Memory [68.00677878812908]
We study a new subclass of POMDPs, whose latent states can be decoded by the most recent history of a short length $m$.
In particular, in the rich-observation setting, we develop new algorithms using a novel "moment matching" approach with a sample complexity that scales exponentially with the short length $m$ rather than the problem horizon.
Our results show that a short-term memory suffices for reinforcement learning in these environments.
arXiv Detail & Related papers (2022-02-08T16:39:57Z) - Markov Decision Process modeled with Bandits for Sequential Decision
Making in Linear-flow [73.1896399783641]
In membership/subscriber acquisition and retention, we sometimes need to recommend marketing content for multiple pages in sequence.
We propose to formulate the problem as an MDP with Bandits where Bandits are employed to model the transition probability matrix.
We observe the proposed MDP with Bandits algorithm outperforms Q-learning with $\epsilon$-greedy and decreasing $\epsilon$, independent Bandits, and interaction Bandits.
arXiv Detail & Related papers (2021-07-01T03:54:36Z) - Navigating to the Best Policy in Markov Decision Processes [68.8204255655161]
We investigate the active pure exploration problem in Markov Decision Processes.
The agent sequentially selects actions and, from the resulting system trajectory, aims at identifying the best policy as fast as possible.
arXiv Detail & Related papers (2021-06-05T09:16:28Z)