Bayesian regularization of empirical MDPs
- URL: http://arxiv.org/abs/2208.02362v1
- Date: Wed, 3 Aug 2022 22:02:50 GMT
- Title: Bayesian regularization of empirical MDPs
- Authors: Samarth Gupta, Daniel N. Hill, Lexing Ying, Inderjit Dhillon
- Abstract summary: We take a Bayesian perspective and regularize the objective function of the Markov decision process with prior information.
We evaluate our proposed algorithms on synthetic simulations and on real-world search logs of a large scale online shopping store.
- Score: 11.3458118258705
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: In most applications of model-based Markov decision processes, the parameters
for the unknown underlying model are often estimated from the empirical data.
Due to noise, the policy learned from the estimated model is often far from the
optimal policy of the underlying model. When applied to the environment of the
underlying model, the learned policy results in suboptimal performance, thus
calling for solutions with better generalization performance. In this work we
take a Bayesian perspective and regularize the objective function of the Markov
decision process with prior information in order to obtain more robust
policies. Two approaches are proposed, one based on $L^1$ regularization and
the other on relative entropic regularization. We evaluate our proposed
algorithms on synthetic simulations and on real-world search logs of a large
scale online shopping store. Our results demonstrate the robustness of
regularized MDP policies against the noise present in the models.
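As a rough illustration of the relative-entropic variant, the sketch below runs value iteration on an empirical MDP whose greedy step is penalized by a per-state KL divergence to a prior policy. This is a minimal sketch under stated assumptions, not the paper's algorithm: the function name, the choice to regularize the policy (rather than, say, the transition estimates), the penalty weight `lam`, and the closed-form softmax backup are all illustrative assumptions.

```python
import numpy as np

def kl_regularized_value_iteration(P_hat, R, prior_pi, gamma=0.9, lam=0.1,
                                    n_iters=1000, tol=1e-8):
    """Value iteration on an empirical MDP with a relative-entropy penalty.

    Hypothetical sketch: at each state the greedy step maximizes
        E_{a ~ pi}[Q(s, a)] - lam * KL(pi(.|s) || prior_pi(.|s)),
    whose maximizer is the prior policy tilted exponentially by Q / lam.

    P_hat    : (S, A, S) empirical transition probabilities
    R        : (S, A)    empirical mean rewards
    prior_pi : (S, A)    prior policy supplying the regularization target
    lam      : regularization strength (lam -> 0 recovers plain value iteration)
    """
    S, A, _ = P_hat.shape
    V = np.zeros(S)
    for _ in range(n_iters):
        Q = R + gamma * (P_hat @ V)                # one-step lookahead, shape (S, A)
        # KL-regularized backup: V(s) = lam * log sum_a prior_pi(s,a) * exp(Q(s,a)/lam)
        logits = np.log(prior_pi + 1e-12) + Q / lam
        m = logits.max(axis=1, keepdims=True)      # stabilize the log-sum-exp
        V_new = lam * (m[:, 0] + np.log(np.exp(logits - m).sum(axis=1)))
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    # Regularized greedy policy: prior tilted by the (approximate) advantage.
    Q = R + gamma * (P_hat @ V)
    pi = prior_pi * np.exp((Q - V[:, None]) / lam)
    pi /= pi.sum(axis=1, keepdims=True)
    return V, pi
```

As lam grows, the returned policy stays close to the prior and is less sensitive to estimation noise in P_hat and R; as lam shrinks toward zero, it approaches the (possibly overfit) greedy policy of the empirical MDP. An analogous sketch for the $L^1$ variant would replace the KL term with a penalty on the $L^1$ distance to the prior.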
Related papers
- 1-2-3-Go! Policy Synthesis for Parameterized Markov Decision Processes via Decision-Tree Learning and Generalization [0.8795040582681393]
In particular, the state space often becomes extremely large when instantiating parameterized Markov decision processes.
We propose a learning-based approach to obtain a reasonable policy for such huge MDPs.
arXiv Detail & Related papers (2024-10-23T21:57:05Z) - Efficient Learning of POMDPs with Known Observation Model in Average-Reward Setting [56.92178753201331]
We propose the Observation-Aware Spectral (OAS) estimation technique, which enables the POMDP parameters to be learned from samples collected using a belief-based policy.
We show the consistency of the OAS procedure, and we prove a regret guarantee of order $\mathcal{O}(\sqrt{T \log(T)})$ for the proposed OAS-UCRL algorithm.
arXiv Detail & Related papers (2024-10-02T08:46:34Z) - Constrained Reinforcement Learning with Average Reward Objective: Model-Based and Model-Free Algorithms [34.593772931446125]
This monograph focuses on the exploration of various model-based and model-free approaches for constrained reinforcement learning within the context of average-reward Markov Decision Processes (MDPs).
The primal-dual policy gradient-based algorithm is explored as a solution for constrained MDPs.
arXiv Detail & Related papers (2024-06-17T12:46:02Z) - Sample Complexity Characterization for Linear Contextual MDPs [67.79455646673762]
Contextual Markov decision processes (CMDPs) describe a class of reinforcement learning problems in which the transition kernels and reward functions can change over time with different MDPs indexed by a context variable.
CMDPs serve as an important framework to model many real-world applications with time-varying environments.
We study CMDPs under two linear function approximation models: Model I with context-varying representations and common linear weights for all contexts; and Model II with common representations for all contexts and context-varying linear weights.
arXiv Detail & Related papers (2024-02-05T03:25:04Z) - Efficient Policy Iteration for Robust Markov Decision Processes via Regularization [49.05403412954533]
Robust Markov decision processes (MDPs) provide a framework to model decision problems where the system dynamics are changing or only partially known.
Recent work established the equivalence between $s$-rectangular $L_p$ robust MDPs and regularized MDPs, and derived a regularized policy iteration scheme that enjoys the same level of efficiency as standard MDPs.
In this work, we focus on the policy improvement step and derive concrete forms for the greedy policy and the optimal robust Bellman operators.
arXiv Detail & Related papers (2022-05-28T04:05:20Z) - Sample Complexity of Robust Reinforcement Learning with a Generative Model [0.0]
We propose a model-based reinforcement learning (RL) algorithm for learning an $\epsilon$-optimal robust policy.
We consider three different forms of uncertainty sets, characterized by the total variation distance, chi-square divergence, and KL divergence.
In addition to the sample complexity results, we also present a formal analytical argument on the benefit of using robust policies.
arXiv Detail & Related papers (2021-12-02T18:55:51Z) - Learning Robust Controllers Via Probabilistic Model-Based Policy Search [2.886634516775814]
We investigate whether controllers learned in such a way are robust and able to generalize under small perturbations of the environment.
We show that enforcing a lower bound to the likelihood noise in the Gaussian Process dynamics model regularizes the policy updates and yields more robust controllers.
arXiv Detail & Related papers (2021-10-26T11:17:31Z) - Local policy search with Bayesian optimization [73.0364959221845]
Reinforcement learning aims to find an optimal policy by interaction with an environment.
Policy gradients for local search are often obtained from random perturbations.
We develop an algorithm utilizing a probabilistic model of the objective function and its gradient.
arXiv Detail & Related papers (2021-06-22T16:07:02Z) - COMBO: Conservative Offline Model-Based Policy Optimization [120.55713363569845]
Uncertainty estimation with complex models, such as deep neural networks, can be difficult and unreliable.
We develop a new model-based offline RL algorithm, COMBO, that regularizes the value function on out-of-support state-actions.
We find that COMBO consistently performs as well as or better than prior offline model-free and model-based methods.
arXiv Detail & Related papers (2021-02-16T18:50:32Z) - Bayes-Adaptive Deep Model-Based Policy Optimisation [4.675381958034012]
We introduce a Bayesian (deep) model-based reinforcement learning method (RoMBRL) that can capture model uncertainty to achieve sample-efficient policy optimisation.
We propose to formulate the model-based policy optimisation problem as a Bayes-adaptive Markov decision process (BAMDP).
We show that RoMBRL outperforms existing approaches on many challenging control benchmark tasks in terms of sample complexity and task performance.
arXiv Detail & Related papers (2020-10-29T21:17:25Z) - Model-Augmented Actor-Critic: Backpropagating through Paths [81.86992776864729]
Current model-based reinforcement learning approaches use the model simply as a learned black-box simulator.
We show how to make more effective use of the model by exploiting its differentiability.
arXiv Detail & Related papers (2020-05-16T19:18:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.