Towards Robust Off-policy Learning for Runtime Uncertainty
- URL: http://arxiv.org/abs/2202.13337v1
- Date: Sun, 27 Feb 2022 10:51:02 GMT
- Title: Towards Robust Off-policy Learning for Runtime Uncertainty
- Authors: Da Xu, Yuting Ye, Chuanwei Ruan, Bo Yang
- Abstract summary: Off-policy learning plays a pivotal role in optimizing and evaluating policies prior to online deployment.
Runtime uncertainty cannot be learned from logged data because of its abnormal and rare nature.
We bring runtime-uncertainty robustness to three major off-policy learning methods: the inverse propensity score method, reward-model method, and doubly robust method.
- Score: 28.425951919439783
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Off-policy learning plays a pivotal role in optimizing and evaluating
policies prior to online deployment. However, during real-time serving, we
observe a variety of interventions and constraints that cause inconsistency
between the online and offline settings, which we summarize and term runtime
uncertainty. Such uncertainty cannot be learned from the logged data because of
its abnormal and rare nature. To assert a certain level of robustness, we
perturb the off-policy estimators along an adversarial direction in view of the
runtime uncertainty. This allows the resulting estimators to be robust not only
to observed but also to unexpected runtime uncertainties. Leveraging this idea,
we bring runtime-uncertainty robustness to three major off-policy learning
methods: the inverse propensity score method, the reward-model method, and the
doubly robust method. We theoretically justify the robustness of our methods to
runtime uncertainty and demonstrate their effectiveness using both simulations
and real-world online experiments.
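The abstract names three standard off-policy value estimators and an adversarial perturbation applied to them. As a point of reference, below is a minimal sketch (in Python/NumPy) of the inverse propensity score, reward-model (direct method), and doubly robust estimators, together with an illustrative worst-case perturbation of the per-sample IPS terms within a multiplicative epsilon-ball. The function names and the epsilon-ball construction are assumptions for illustration, not the authors' exact formulation.

```python
import numpy as np

def ips_estimate(rewards, target_probs, logging_probs):
    """Inverse propensity score (IPS) estimate of the target policy's value.
    rewards[i]       -- observed reward for the i-th logged interaction
    target_probs[i]  -- probability the target policy assigns to the logged action
    logging_probs[i] -- probability the logging policy assigned to it
    """
    weights = target_probs / logging_probs
    return np.mean(weights * rewards)

def dm_estimate(predicted_rewards_target):
    """Reward-model (direct method) estimate: average the reward model's
    predictions for the actions the target policy would take."""
    return np.mean(predicted_rewards_target)

def dr_estimate(rewards, target_probs, logging_probs,
                predicted_rewards_logged, predicted_rewards_target):
    """Doubly robust estimate: direct-method baseline plus an IPS-weighted
    correction on the reward model's residuals."""
    weights = target_probs / logging_probs
    correction = weights * (rewards - predicted_rewards_logged)
    return np.mean(predicted_rewards_target + correction)

def worst_case_ips(rewards, target_probs, logging_probs, epsilon=0.1):
    """Illustrative adversarial variant (an assumption, not the paper's
    construction): evaluate IPS under the worst multiplicative perturbation
    of each per-sample term within [1 - epsilon, 1 + epsilon]. The adversary
    shrinks positive terms and inflates negative ones, so the result is a
    pessimistic (robust) lower bound on the plain IPS estimate."""
    terms = (target_probs / logging_probs) * rewards
    perturb = np.where(terms >= 0, 1.0 - epsilon, 1.0 + epsilon)
    return np.mean(perturb * terms)
```

Under this toy construction, given logged arrays of equal length, worst_case_ips returns a value no larger than ips_estimate on the same data, with the gap growing as epsilon widens the uncertainty set.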
Related papers
- Uncertainty for Active Learning on Graphs [70.44714133412592]
Uncertainty Sampling is an Active Learning strategy that aims to improve the data efficiency of machine learning models.
We benchmark Uncertainty Sampling beyond predictive uncertainty and highlight a significant performance gap compared to other Active Learning strategies.
We develop ground-truth Bayesian uncertainty estimates in terms of the data generating process and prove their effectiveness in guiding Uncertainty Sampling toward optimal queries.
arXiv Detail & Related papers (2024-05-02T16:50:47Z)
- One step closer to unbiased aleatoric uncertainty estimation [71.55174353766289]
We propose a new estimation method by actively de-noising the observed data.
By conducting a broad range of experiments, we demonstrate that our proposed approach provides a much closer approximation to the actual data uncertainty than the standard method.
arXiv Detail & Related papers (2023-12-16T14:59:11Z)
- Hindsight-DICE: Stable Credit Assignment for Deep Reinforcement Learning [11.084321518414226]
We adapt existing importance-sampling ratio estimation techniques for off-policy evaluation to drastically improve the stability and efficiency of so-called hindsight policy methods.
Our hindsight distribution correction facilitates stable, efficient learning across a broad range of environments where credit assignment plagues baseline methods.
arXiv Detail & Related papers (2023-07-21T20:54:52Z)
- Doubly Robust Interval Estimation for Optimal Policy Evaluation in Online Learning [8.736154600219685]
Policy evaluation in online learning is attracting increasing attention.
Yet the problem is particularly challenging because of the dependent data generated in the online environment.
We develop the doubly robust interval estimation (DREAM) method to infer the value under the estimated optimal policy in online learning.
arXiv Detail & Related papers (2021-10-29T02:38:54Z)
- Uncertainty Weighted Actor-Critic for Offline Reinforcement Learning [63.53407136812255]
Offline Reinforcement Learning promises to learn effective policies from previously-collected, static datasets without the need for exploration.
Existing Q-learning and actor-critic based off-policy RL algorithms fail when bootstrapping from out-of-distribution (OOD) actions or states.
We propose Uncertainty Weighted Actor-Critic (UWAC), an algorithm that detects OOD state-action pairs and down-weights their contribution in the training objectives accordingly (a minimal illustrative sketch of this down-weighting idea appears after this list).
arXiv Detail & Related papers (2021-05-17T20:16:46Z)
- Reliable Off-policy Evaluation for Reinforcement Learning [53.486680020852724]
In a sequential decision-making problem, off-policy evaluation estimates the expected cumulative reward of a target policy.
We propose a novel framework that provides robust and optimistic cumulative reward estimates using one or multiple logged datasets.
arXiv Detail & Related papers (2020-11-08T23:16:19Z)
- CoinDICE: Off-Policy Confidence Interval Estimation [107.86876722777535]
We study high-confidence behavior-agnostic off-policy evaluation in reinforcement learning.
We show in a variety of benchmarks that the confidence interval estimates are tighter and more accurate than existing methods.
arXiv Detail & Related papers (2020-10-22T12:39:11Z)
- Deep Learning based Uncertainty Decomposition for Real-time Control [9.067368638784355]
We propose a novel method for detecting the absence of training data using deep learning.
We show its advantages over existing approaches on synthetic and real-world datasets.
We further demonstrate the practicality of this uncertainty estimate in deploying online data-efficient control on a simulated quadcopter.
arXiv Detail & Related papers (2020-10-06T10:46:27Z)
- Temporal Difference Uncertainties as a Signal for Exploration [76.6341354269013]
An effective approach to exploration in reinforcement learning is to rely on an agent's uncertainty over the optimal policy.
In this paper, we highlight that value estimates are easily biased and temporally inconsistent.
We propose a novel method for estimating uncertainty over the value function that relies on inducing a distribution over temporal difference errors.
arXiv Detail & Related papers (2020-10-05T18:11:22Z)
- Real-Time Uncertainty Estimation in Computer Vision via Uncertainty-Aware Distribution Distillation [18.712408359052667]
We propose a simple, easy-to-optimize distillation method for learning the conditional predictive distribution of a pre-trained dropout model.
We empirically test the effectiveness of the proposed method on both semantic segmentation and depth estimation tasks.
arXiv Detail & Related papers (2020-07-31T05:40:39Z)
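For the Uncertainty Weighted Actor-Critic entry above, here is a minimal, hypothetical sketch of uncertainty-based down-weighting: each sample's temporal-difference loss is scaled by a decreasing function of an uncertainty estimate (e.g., predictive variance from MC dropout or an ensemble). The exponential weighting and all names here are illustrative assumptions, not the published algorithm.

```python
import numpy as np

def uncertainty_weights(variances, beta=1.0):
    """Map per-sample uncertainty estimates (e.g., predictive variance from
    MC dropout or an ensemble) to weights in (0, 1]; higher uncertainty
    yields a smaller weight. The exp(-beta * variance) form is an
    illustrative choice, not UWAC's exact weighting."""
    return np.exp(-beta * np.asarray(variances))

def weighted_td_loss(td_errors, variances, beta=1.0):
    """Uncertainty-weighted temporal-difference loss: samples suspected to
    be out-of-distribution (high variance) contribute less to training."""
    w = uncertainty_weights(variances, beta)
    return float(np.mean(w * np.square(td_errors)))
```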
This list is automatically generated from the titles and abstracts of the papers on this site.