On Reward Structures of Markov Decision Processes
- URL: http://arxiv.org/abs/2308.14919v2
- Date: Thu, 31 Aug 2023 22:16:43 GMT
- Title: On Reward Structures of Markov Decision Processes
- Authors: Falcon Z. Dai
- Abstract summary: A Markov decision process can be parameterized by a transition kernel and a reward function.
We study various kinds of "costs" associated with reinforcement learning inspired by the demands in robotic applications.
We develop a novel estimator with an instance-specific error bound of $\tilde{O}(\sqrt{\frac{\tau_s}{n}})$ for estimating a single state value.
- Score: 4.13365552362244
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A Markov decision process can be parameterized by a transition kernel and a
reward function. Both play essential roles in the study of reinforcement
learning as evidenced by their presence in the Bellman equations. In our
inquiry of various kinds of "costs" associated with reinforcement learning
inspired by the demands in robotic applications, rewards are central to
understanding the structure of a Markov decision process and reward-centric
notions can elucidate important concepts in reinforcement learning.
Specifically, we study the sample complexity of policy evaluation and develop
a novel estimator with an instance-specific error bound of
$\tilde{O}(\sqrt{\frac{\tau_s}{n}})$ for estimating a single state value. Under
the online regret minimization setting, we refine the transition-based MDP
constant, diameter, into a reward-based constant, maximum expected hitting
cost, and with it, provide a theoretical explanation for how a well-known
technique, potential-based reward shaping, could accelerate learning with
expert knowledge. In an attempt to study safe reinforcement learning, we model
hazardous environments with irrecoverability and propose a quantitative notion
of safe learning via reset efficiency. In this setting, we modify a classic
algorithm to account for resets, achieving promising preliminary numerical
results. Lastly, for MDPs with multiple reward functions, we develop a planning
algorithm that efficiently computes Pareto-optimal stochastic
policies.
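The abstract credits potential-based reward shaping with accelerating learning when expert knowledge is available. For reference, the standard construction (Ng, Harada, and Russell, 1999) adds a shaping term F(s, s') = γΦ(s') − Φ(s) to the reward for a chosen potential function Φ, densifying feedback without changing the optimal policy. The sketch below illustrates that construction on a toy chain MDP; the environment, the potential function, and the Q-learning hyperparameters are illustrative assumptions and are not the construction or analysis from the thesis.
```python
import numpy as np

# Minimal sketch of potential-based reward shaping (Ng, Harada, and Russell, 1999)
# on a toy chain MDP. The environment, potential function, and tabular Q-learning
# loop are illustrative assumptions, not the construction analyzed in the thesis.

N_STATES, GOAL, GAMMA, ALPHA, EPS = 10, 9, 0.99, 0.1, 0.1
ACTIONS = (-1, +1)  # step left / step right


def step(state, action):
    """Deterministic chain dynamics with a single rewarding goal state."""
    nxt = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if nxt == GOAL else 0.0
    return nxt, reward, nxt == GOAL


def potential(state):
    """Expert potential Phi: states closer to the goal get higher potential."""
    return -abs(GOAL - state)


def q_learning(shaped, episodes=500, seed=0):
    rng = np.random.default_rng(seed)
    q = np.zeros((N_STATES, len(ACTIONS)))
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if rng.random() < EPS:
                a = int(rng.integers(len(ACTIONS)))
            else:  # greedy with random tie-breaking
                a = int(rng.choice(np.flatnonzero(q[state] == q[state].max())))
            nxt, r, done = step(state, ACTIONS[a])
            if shaped:
                # Shaping term F(s, s') = gamma * Phi(s') - Phi(s): densifies the
                # reward signal while leaving the optimal policy unchanged.
                r += GAMMA * potential(nxt) - potential(state)
            target = r + (0.0 if done else GAMMA * q[nxt].max())
            q[state, a] += ALPHA * (target - q[state, a])
            state = nxt
    return q


if __name__ == "__main__":
    for shaped in (False, True):
        greedy = q_learning(shaped).argmax(axis=1)
        print("shaped" if shaped else "unshaped", "greedy policy:", greedy)
```
Comparing the shaped and unshaped runs gives a qualitative sense of how a well-chosen potential speeds up credit assignment; the maximum expected hitting cost introduced in the paper is what quantifies this effect in the regret analysis.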
Related papers
- Burning RED: Unlocking Subtask-Driven Reinforcement Learning and Risk-Awareness in Average-Reward Markov Decision Processes [7.028778922533688]
Average-reward Markov decision processes (MDPs) provide a foundational framework for sequential decision-making under uncertainty.
We study a unique structural property of average-reward MDPs and utilize it to introduce Reward-Extended Differential (or RED) reinforcement learning.
arXiv Detail & Related papers (2024-10-14T14:52:23Z) - Let's reward step by step: Step-Level reward model as the Navigators for
Reasoning [64.27898739929734]
Process-Supervised Reward Model (PRM) furnishes LLMs with step-by-step feedback during the training phase.
We propose a greedy search algorithm that employs the step-level feedback from PRM to optimize the reasoning pathways explored by LLMs.
To explore the versatility of our approach, we develop a novel method to automatically generate a step-level reward dataset for coding tasks and observe similar performance improvements on code generation tasks.
arXiv Detail & Related papers (2023-10-16T05:21:50Z) - Value-Distributional Model-Based Reinforcement Learning [59.758009422067]
Quantifying uncertainty about a policy's long-term performance is important to solve sequential decision-making tasks.
We study the problem from a model-based Bayesian reinforcement learning perspective.
We propose Epistemic Quantile-Regression (EQR), a model-based algorithm that learns a value distribution function.
arXiv Detail & Related papers (2023-08-12T14:59:19Z) - Improved Regret for Efficient Online Reinforcement Learning with Linear
Function Approximation [69.0695698566235]
We study reinforcement learning with linear function approximation and adversarially changing cost functions.
We present a computationally efficient policy optimization algorithm for the challenging general setting of unknown dynamics and bandit feedback.
arXiv Detail & Related papers (2023-01-30T17:26:39Z) - Model-based Safe Deep Reinforcement Learning via a Constrained Proximal
Policy Optimization Algorithm [4.128216503196621]
We propose an On-policy Model-based Safe Deep RL algorithm in which we learn the transition dynamics of the environment in an online manner.
We show that our algorithm is more sample efficient and results in lower cumulative hazard violations as compared to constrained model-free approaches.
arXiv Detail & Related papers (2022-10-14T06:53:02Z) - Weighted Maximum Entropy Inverse Reinforcement Learning [22.269565708490468]
We study inverse reinforcement learning (IRL) and imitation learning (IM).
We propose a new way to improve the learning process by adding the maximum weight function to the entropy framework.
Our framework and algorithms allow learning both a reward (or policy) function and the structure of the entropy terms added to the Markov decision process.
arXiv Detail & Related papers (2022-08-20T06:02:07Z) - Anti-Concentrated Confidence Bonuses for Scalable Exploration [57.91943847134011]
Intrinsic rewards play a central role in handling the exploration-exploitation trade-off.
We introduce anti-concentrated confidence bounds for efficiently approximating the elliptical bonus.
We develop a practical variant for deep reinforcement learning that is competitive with contemporary intrinsic rewards on Atari benchmarks.
arXiv Detail & Related papers (2021-10-21T15:25:15Z) - Model-Augmented Q-learning [112.86795579978802]
We propose an MFRL framework that is augmented with the components of model-based RL.
Specifically, we propose to estimate not only the $Q$-values but also both the transition and the reward with a shared network.
We show that the proposed scheme, called Model-augmented $Q$-learning (MQL), obtains a policy-invariant solution identical to the solution obtained by learning with the true reward.
arXiv Detail & Related papers (2021-02-07T17:56:50Z) - Average Reward Adjusted Discounted Reinforcement Learning:
Near-Blackwell-Optimal Policies for Real-World Applications [0.0]
Reinforcement learning aims at finding the best stationary policy for a given Markov Decision Process.
This paper provides deep theoretical insights into the widely applied standard discounted reinforcement learning framework.
We establish a novel near-Blackwell-optimal reinforcement learning algorithm.
arXiv Detail & Related papers (2020-04-02T08:05:18Z) - Upper Confidence Primal-Dual Reinforcement Learning for CMDP with
Adversarial Loss [145.54544979467872]
We consider online learning for episodic constrained Markov decision processes (CMDPs).
We propose a new upper confidence primal-dual algorithm, which only requires the trajectories sampled from the transition model.
Our analysis incorporates a new high-probability drift analysis of Lagrange multiplier processes into the celebrated regret analysis of upper confidence reinforcement learning.
arXiv Detail & Related papers (2020-03-02T05:02:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.