Generalization in Monitored Markov Decision Processes (Mon-MDPs)
- URL: http://arxiv.org/abs/2505.08988v1
- Date: Tue, 13 May 2025 21:58:25 GMT
- Title: Generalization in Monitored Markov Decision Processes (Mon-MDPs)
- Authors: Montaser Mohammedalamen, Michael Bowling
- Abstract summary: In many real-world scenarios, rewards are not always observable, which can be modeled as a monitored Markov decision process (Mon-MDP). This work explores Mon-MDPs using function approximation (FA) and investigates the challenges involved. We show that combining function approximation with a learned reward model enables agents to generalize from monitored states with observable rewards to unmonitored environment states with unobservable rewards.
- Score: 9.81003561034599
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement learning (RL) typically models the interaction between the agent and environment as a Markov decision process (MDP), where the rewards that guide the agent's behavior are always observable. However, in many real-world scenarios, rewards are not always observable, which can be modeled as a monitored Markov decision process (Mon-MDP). Prior work on Mon-MDPs has been limited to simple, tabular cases, restricting its applicability to real-world problems. This work explores Mon-MDPs using function approximation (FA) and investigates the challenges involved. We show that combining function approximation with a learned reward model enables agents to generalize from monitored states with observable rewards to unmonitored environment states with unobservable rewards. We thereby demonstrate that such generalization with a reward model achieves near-optimal policies in environments formally defined as unsolvable. However, we identify a critical limitation of this function approximation: agents can incorrectly extrapolate rewards due to overgeneralization, resulting in undesirable behaviors. To mitigate overgeneralization, we propose a cautious policy optimization method leveraging reward uncertainty. This work serves as a step towards bridging the gap between Mon-MDP theory and real-world applications.
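As a concrete illustration of the approach described in the abstract, the sketch below shows one plausible way a learned reward model and a cautious, uncertainty-penalized value update could be combined in a Mon-MDP training loop. It is a minimal sketch under stated assumptions, not the paper's implementation: the names `RewardEnsemble` and `q_learning_step`, the linear feature model, and the pessimism coefficient `kappa` are hypothetical stand-ins for whatever function approximator and uncertainty estimate the authors actually use.

```python
# Illustrative sketch (not the authors' code): Q-learning in a Mon-MDP where the
# reward is only sometimes observable. A small ensemble of reward models is fit
# on monitored transitions; when the reward is unobserved, the agent substitutes
# a cautious (mean minus uncertainty) ensemble estimate, which discourages the
# optimistic overgeneralization described in the abstract.
import numpy as np


class RewardEnsemble:
    """Ensemble of linear reward models over state-action features (hypothetical)."""

    def __init__(self, feat_dim, n_models=5, lr=0.05, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.1, size=(n_models, feat_dim))
        self.lr = lr

    def update(self, phi, r):
        # One SGD step per ensemble member on the squared error to the observed reward.
        for i in range(self.W.shape[0]):
            pred = self.W[i] @ phi
            self.W[i] += self.lr * (r - pred) * phi

    def cautious_estimate(self, phi, kappa=1.0):
        # Pessimistic estimate: ensemble mean minus kappa times the ensemble spread.
        preds = self.W @ phi
        return preds.mean() - kappa * preds.std()


def q_learning_step(Q, reward_model, phi, s, a, r_obs, s_next, observed,
                    alpha=0.1, gamma=0.99):
    """One cautious Q-learning update; `observed` flags whether the reward was monitored."""
    if observed:
        reward_model.update(phi, r_obs)          # monitored: supervise the reward model
        r = r_obs
    else:
        r = reward_model.cautious_estimate(phi)  # unmonitored: fall back to pessimism
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q
```

In this sketch, monitored transitions update both the Q-values and the reward ensemble, while unmonitored transitions rely on the pessimistic estimate, so state-action pairs on which the ensemble disagrees are treated cautiously rather than optimistically extrapolated.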
Related papers
- Dynamic and Generalizable Process Reward Modeling [74.36829922727026]
We propose Dynamic and Generalizable Process Reward Modeling (DG-PRM), which features a reward tree to capture and store fine-grained, multi-dimensional reward criteria. Experimental results show that DG-PRM achieves stunning performance on prevailing benchmarks, significantly boosting model performance across tasks with dense rewards.
arXiv Detail & Related papers (2025-07-23T18:17:22Z) - Deontically Constrained Policy Improvement in Reinforcement Learning Agents [0.0]
Markov Decision Processes (MDPs) are the most common model for decision making under uncertainty in the Machine Learning community. An MDP captures non-determinism, probabilistic uncertainty, and an explicit model of action. A Reinforcement Learning (RL) agent learns to act in an MDP by maximizing a utility function.
arXiv Detail & Related papers (2025-06-08T01:01:06Z) - Model-Based Exploration in Monitored Markov Decision Processes [15.438015964569743]
Monitored Markov decision processes (Mon-MDPs) have recently been proposed as a model of such settings. Mon-MDP algorithms developed thus far do not fully exploit the problem structure. We introduce a model-based algorithm for Mon-MDPs that addresses all of these shortcomings.
arXiv Detail & Related papers (2025-02-24T01:35:32Z) - R-AIF: Solving Sparse-Reward Robotic Tasks from Pixels with Active Inference and World Models [50.19174067263255]
We introduce prior preference learning techniques and self-revision schedules to help the agent excel in sparse-reward, continuous action, goal-based robotic control POMDP environments.
We show that our agents offer improved performance over state-of-the-art models in terms of cumulative rewards, relative stability, and success rate.
arXiv Detail & Related papers (2024-09-21T18:32:44Z) - Tackling Decision Processes with Non-Cumulative Objectives using Reinforcement Learning [0.0]
We introduce a general mapping of non-cumulative Markov decision processes (NCMDPs) to standard MDPs. This allows all techniques developed to find optimal policies for MDPs to be directly applied to the larger class of NCMDPs. We show applications in a diverse set of tasks, including classical control, portfolio optimization in finance, and discrete optimization problems.
arXiv Detail & Related papers (2024-05-22T13:01:37Z) - Solving Non-Rectangular Reward-Robust MDPs via Frequency Regularization [39.740287682191884]
In robust Markov decision processes (RMDPs) it is assumed that the reward and the transition dynamics lie in a given uncertainty set.
This so-called rectangularity condition is solely motivated by computational concerns.
We introduce a policy-gradient method and prove its convergence.
arXiv Detail & Related papers (2023-09-03T07:34:26Z) - Twice Regularized Markov Decision Processes: The Equivalence between Robustness and Regularization [64.60253456266872]
Markov decision processes (MDPs) aim to handle changing or partially known system dynamics.
Regularized MDPs show more stability in policy learning without impairing time complexity.
Bellman operators enable us to derive planning and learning schemes with convergence and generalization guarantees.
arXiv Detail & Related papers (2023-03-12T13:03:28Z) - Explainable Reinforcement Learning via Model Transforms [18.385505289067023]
We argue that even if the underlying Markov Decision Process is not fully known, it can nevertheless be exploited to automatically generate explanations.
We suggest using formal MDP abstractions and transforms, previously used in the literature for expediting the search for optimal policies, to automatically produce explanations.
arXiv Detail & Related papers (2022-09-24T13:18:06Z) - Why Generalization in RL is Difficult: Epistemic POMDPs and Implicit
Partial Observability [92.95794652625496]
Generalization is a central challenge for the deployment of reinforcement learning systems.
We show that generalization to unseen test conditions from a limited number of training conditions induces implicit partial observability.
We recast the problem of generalization in RL as solving the induced partially observed Markov decision process.
arXiv Detail & Related papers (2021-07-13T17:59:25Z) - Reward is enough for convex MDPs [30.478950691312715]
We study convex MDPs in which goals are expressed as convex functions of the stationary distribution.
We propose a meta-algorithm for solving this problem and show that it unifies many existing algorithms in the literature.
arXiv Detail & Related papers (2021-06-01T17:46:25Z) - Modular Deep Reinforcement Learning for Continuous Motion Planning with Temporal Logic [59.94347858883343]
This paper investigates the motion planning of autonomous dynamical systems modeled by Markov decision processes (MDPs).
The novelty is to design an embedded product MDP (EP-MDP) between the LDGBA (limit-deterministic generalized Büchi automaton) and the MDP.
The proposed LDGBA-based reward shaping and discounting schemes for the model-free reinforcement learning (RL) only depend on the EP-MDP states.
arXiv Detail & Related papers (2021-02-24T01:11:25Z) - Maximizing Information Gain in Partially Observable Environments via Prediction Reward [64.24528565312463]
This paper tackles the challenge of using belief-based rewards for a deep RL agent.
We derive the exact error between negative entropy and the expected prediction reward.
This insight provides theoretical motivation for several fields using prediction rewards.
arXiv Detail & Related papers (2020-05-11T08:13:49Z) - Invariant Causal Prediction for Block MDPs [106.63346115341862]
Generalization across environments is critical to the successful application of reinforcement learning algorithms to real-world challenges.
We propose a method of invariant prediction to learn model-irrelevance state abstractions (MISA) that generalize to novel observations in the multi-environment setting.
arXiv Detail & Related papers (2020-03-12T21:03:01Z)