Exploration in Model-based Reinforcement Learning with Randomized Reward
- URL: http://arxiv.org/abs/2301.03142v1
- Date: Mon, 9 Jan 2023 01:50:55 GMT
- Title: Exploration in Model-based Reinforcement Learning with Randomized Reward
- Authors: Lingxiao Wang and Ping Li
- Abstract summary: We show that under the kernelized nonlinear regulator (KNR) model, reward randomization guarantees a partial optimism.
We further extend our theory to generalized function approximation and identify conditions for reward randomization to attain provably efficient exploration.
- Score: 40.87376174638752
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Model-based Reinforcement Learning (MBRL) has been widely adopted due to its
sample efficiency. However, existing worst-case regret analyses typically
require optimistic planning, which is not realistic in general. In contrast,
empirical studies motivated by this theory use ensembles of models and
achieve state-of-the-art performance on various testing environments. This
gap between theory and practice leads us to ask whether randomized model
ensembles guarantee optimism, and hence the optimal worst-case regret. This
paper partially answers the question from the perspective of reward
randomization, a scarcely explored direction of exploration in MBRL.
We show that under the kernelized nonlinear regulator (KNR) model, reward
randomization guarantees a partial optimism, which in turn yields a
near-optimal worst-case regret in terms of the number of interactions. We
further extend our theory to generalized function approximation and identify
conditions under which reward randomization attains provably efficient
exploration. Correspondingly, we propose concrete examples of efficient reward
randomization. To the best of our knowledge, this is the first
worst-case regret analysis of randomized MBRL with function approximation.
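The paper's exact algorithm is not reproduced in this abstract, but the core idea of reward randomization can be sketched as follows: instead of planning with an explicit optimism bonus, the agent perturbs its estimated reward with noise shaped by its uncertainty and then plans greedily in the learned model, so optimism holds only with some probability ("partial optimism") rather than deterministically. The minimal sketch below is an illustrative assumption, not the authors' construction: it uses a toy MDP with known dynamics, one-hot reward features, and an RLSVI-style Gaussian perturbation; the names and hyperparameters (phi, sigma, the regularized least-squares update) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setting (illustrative): small MDP, known dynamics, unknown reward
# that is linear in one-hot state-action features.
n_states, n_actions, horizon = 5, 2, 10
phi_dim = n_states * n_actions

def phi(s, a):
    """One-hot feature vector for the state-action pair (s, a)."""
    v = np.zeros(phi_dim)
    v[s * n_actions + a] = 1.0
    return v

true_theta = rng.normal(size=phi_dim)                              # unknown reward parameters
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # transition kernel (known here)

def plan_greedy(theta):
    """Finite-horizon value iteration against the reward phi(s, a)^T theta."""
    V = np.zeros(n_states)
    pi = np.zeros((horizon, n_states), dtype=int)
    for h in reversed(range(horizon)):
        Q = np.array([[phi(s, a) @ theta + P[s, a] @ V for a in range(n_actions)]
                      for s in range(n_states)])
        pi[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return pi

# Reward-randomized exploration loop (sketch).
A = np.eye(phi_dim)     # regularized design matrix for least squares
b = np.zeros(phi_dim)
sigma = 0.5             # perturbation scale; the theory ties this to the noise level

for episode in range(20):
    theta_hat = np.linalg.solve(A, b)                       # least-squares reward estimate
    cov = sigma ** 2 * np.linalg.inv(A)                     # uncertainty-shaped covariance
    theta_tilde = rng.multivariate_normal(theta_hat, cov)   # randomized ("perturbed") reward
    pi = plan_greedy(theta_tilde)                           # plan greedily as if the sample were true

    s = 0
    for h in range(horizon):                                # roll out and collect reward data
        a = pi[h, s]
        r = phi(s, a) @ true_theta + 0.1 * rng.normal()
        A += np.outer(phi(s, a), phi(s, a))
        b += r * phi(s, a)
        s = rng.choice(n_states, p=P[s, a])
```

The design choice mirrored here is that exploration comes entirely from the sampled reward parameters theta_tilde; the planner itself stays greedy, which is what makes randomized exploration attractive compared with explicitly optimistic planning.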
Related papers
- Confronting Reward Overoptimization for Diffusion Models: A Perspective of Inductive and Primacy Biases [76.9127853906115]
Bridging the gap between diffusion models and human preferences is crucial for their integration into practical generative applications.
We propose Temporal Diffusion Policy Optimization with critic active neuron Reset (TDPO-R), a policy gradient algorithm that exploits the temporal inductive bias of diffusion models.
Empirical results demonstrate the superior efficacy of our methods in mitigating reward overoptimization.
arXiv Detail & Related papers (2024-02-13T15:55:41Z) - Bayesian Nonparametrics Meets Data-Driven Distributionally Robust Optimization [29.24821214671497]
Training machine learning and statistical models often involves optimizing a data-driven risk criterion.
We propose a novel robust criterion by combining insights from Bayesian nonparametric (i.e., Dirichlet process) theory and a recent decision-theoretic model of smooth ambiguity-averse preferences.
For practical implementation, we propose and study tractable approximations of the criterion based on well-known Dirichlet process representations.
arXiv Detail & Related papers (2024-01-28T21:19:15Z) - High Precision Causal Model Evaluation with Conditional Randomization [10.23470075454725]
We introduce a novel low-variance estimator for causal error, dubbed the pairs estimator.
By applying the same IPW estimator to both the model and true experimental effects, our estimator effectively cancels out the variance due to IPW and achieves a smaller variance.
Our method offers a simple yet powerful solution to evaluate causal inference models in conditional randomization settings without complicated modification of the IPW estimator itself.
arXiv Detail & Related papers (2023-11-03T13:22:27Z) - Provable Reward-Agnostic Preference-Based Reinforcement Learning [61.39541986848391]
Preference-based Reinforcement Learning (PbRL) is a paradigm in which an RL agent learns to optimize a task using pair-wise preference-based feedback over trajectories.
We propose a theoretical reward-agnostic PbRL framework that acquires exploratory trajectories enabling accurate learning of the hidden reward function.
arXiv Detail & Related papers (2023-05-29T15:00:09Z) - When to Update Your Model: Constrained Model-based Reinforcement
Learning [50.74369835934703]
We propose a novel and general theoretical scheme for a non-decreasing performance guarantee in model-based RL (MBRL).
The derived bounds reveal the relationship between model shifts and performance improvement.
A further example demonstrates that learning models from a dynamically varying number of explorations benefits the eventual returns.
arXiv Detail & Related papers (2022-10-15T17:57:43Z) - Understanding the stochastic dynamics of sequential decision-making
processes: A path-integral analysis of multi-armed bandits [7.05949591248206]
The multi-armed bandit (MAB) model is one of the most popular models to study decision-making in an uncertain environment.
In this paper, we employ techniques in statistical physics to analyze the MAB model.
arXiv Detail & Related papers (2022-08-11T09:32:03Z) - Local policy search with Bayesian optimization [73.0364959221845]
Reinforcement learning aims to find an optimal policy by interaction with an environment.
Policy gradients for local search are often obtained from random perturbations.
We develop an algorithm utilizing a probabilistic model of the objective function and its gradient.
arXiv Detail & Related papers (2021-06-22T16:07:02Z) - Refined bounds for randomized experimental design [7.899055512130904]
Experimental design is an approach for selecting samples among a given set so as to obtain the best estimator for a given criterion.
We provide theoretical guarantees for randomized strategies on E- and G-optimal design.
arXiv Detail & Related papers (2020-12-22T20:37:57Z) - Robust Sampling in Deep Learning [62.997667081978825]
Deep learning requires regularization mechanisms to reduce overfitting and improve generalization.
We address this problem by a new regularization method based on distributional robust optimization.
During training, samples are selected according to their accuracy, so that the worst-performing samples contribute the most to the optimization (a minimal sketch of this reweighting appears after this list).
arXiv Detail & Related papers (2020-06-04T09:46:52Z)
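As a rough illustration of the robust-sampling entry above, the sketch below reweights a minibatch so that the worst-performing samples dominate the training loss, a common distributionally robust heuristic; the CVaR-style top-k weighting, the alpha parameter, and all names are assumptions for illustration rather than the authors' method.

```python
import numpy as np

def worst_case_weights(per_sample_loss, alpha=0.3):
    """Distributionally robust weighting (sketch): put all mass on the worst
    alpha-fraction of samples (a CVaR-style reweighting), so the hardest
    examples drive the gradient; alpha is an illustrative hyperparameter."""
    n = len(per_sample_loss)
    k = max(1, int(np.ceil(alpha * n)))
    worst = np.argsort(per_sample_loss)[-k:]   # indices of the k largest losses
    w = np.zeros(n)
    w[worst] = 1.0 / k
    return w

# Toy usage: per-sample losses from some model on a minibatch.
rng = np.random.default_rng(0)
losses = rng.exponential(scale=1.0, size=16)
weights = worst_case_weights(losses, alpha=0.25)
robust_loss = float(weights @ losses)          # average loss over the worst samples
print(f"mean loss = {losses.mean():.3f}, robust (worst-case) loss = {robust_loss:.3f}")
```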