Sayer: Using Implicit Feedback to Optimize System Policies
- URL: http://arxiv.org/abs/2110.14874v1
- Date: Thu, 28 Oct 2021 04:16:56 GMT
- Title: Sayer: Using Implicit Feedback to Optimize System Policies
- Authors: Mathias Lécuyer, Sang Hoon Kim, Mihir Nanavati, Junchen Jiang,
Siddhartha Sen, Amit Sharma, Aleksandrs Slivkins
- Abstract summary: We develop a methodology that leverages implicit feedback to evaluate and train new system policies.
Sayer builds on two ideas from reinforcement learning to leverage data collected by an existing policy.
We show that Sayer can evaluate arbitrary policies accurately, and train new policies that outperform the production policies.
- Score: 63.992191765269396
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We observe that many system policies that make threshold decisions involving
a resource (e.g., time, memory, cores) naturally reveal additional, or implicit
feedback. For example, if a system waits X min for an event to occur, then it
automatically learns what would have happened if it waited <X min, because time
has a cumulative property. This feedback tells us about alternative decisions,
and can be used to improve the system policy. However, leveraging implicit
feedback is difficult because it tends to be one-sided or incomplete, and may
depend on the outcome of the event. As a result, existing practices for using
feedback, such as simply incorporating it into a data-driven model, suffer from
bias.
We develop a methodology, called Sayer, that leverages implicit feedback to
evaluate and train new system policies. Sayer builds on two ideas from
reinforcement learning -- randomized exploration and unbiased counterfactual
estimators -- to leverage data collected by an existing policy to estimate the
performance of new candidate policies, without actually deploying those
policies. Sayer uses implicit exploration and implicit data augmentation to
generate implicit feedback in an unbiased form, which is then used by an
implicit counterfactual estimator to evaluate and train new policies. The key
idea underlying these techniques is to assign implicit probabilities to
decisions that are not actually taken but whose feedback can be inferred; these
probabilities are carefully calculated to ensure statistical unbiasedness. We
apply Sayer to two production scenarios in Azure, and show that it can evaluate
arbitrary policies accurately, and train new policies that outperform the
production policies.
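
To make the abstract's key idea concrete, below is a minimal sketch of an implicit counterfactual estimator for a single timeout decision. It is an illustration assembled from the abstract only, not Sayer's implementation: the timeout grid, penalty constant, cost model, and all function names are assumptions made for this example.

```python
import numpy as np

# Hypothetical setup: a randomized policy picks a wait threshold from a small
# grid, pays the wait time as cost, plus a penalty if it gives up too early.
TIMEOUTS = np.array([1.0, 2.0, 4.0, 8.0])  # candidate wait thresholds (min)
PENALTY = 10.0                             # assumed cost of giving up early

def cost(threshold, event_time):
    """Cost of waiting up to `threshold` when the event arrives at `event_time`."""
    return event_time if event_time <= threshold else threshold + PENALTY

def revealed(threshold, chosen, event_time):
    """Is `threshold`'s outcome revealed when the policy actually chose `chosen`?

    Time is cumulative: waiting `chosen` minutes also reveals what every
    shorter wait would have done, and if the event arrived within `chosen`
    minutes it reveals every longer wait as well.
    """
    return threshold <= chosen or event_time <= chosen

def reveal_probability(threshold, event_time, logging_probs):
    """Implicit probability that `threshold` is revealed under the randomized
    logging policy: the total probability of drawing any timeout that reveals
    it, not just the probability of drawing `threshold` itself."""
    mask = np.array([revealed(threshold, c, event_time) for c in TIMEOUTS])
    return float(logging_probs[mask].sum())

def implicit_counterfactual_cost(logs, logging_probs, target_probs):
    """Unbiased estimate of a target policy's expected cost from logged data.

    Each revealed (threshold, cost) pair is weighted by the target policy's
    probability of that threshold and divided by its reveal probability, so
    outcomes learned 'for free' still contribute without introducing bias.

    `logs` holds (chosen_threshold, event_time) pairs; `logging_probs` and
    `target_probs` are probability vectors over TIMEOUTS.
    """
    total = 0.0
    for chosen, event_time in logs:
        for i, t in enumerate(TIMEOUTS):
            if revealed(t, chosen, event_time):
                q = reveal_probability(t, event_time, logging_probs)
                total += target_probs[i] * cost(t, event_time) / q
    return total / len(logs)
```

Dividing by the reveal probability rather than the selection probability is what lets the implicitly observed outcomes enter the estimate without bias, which mirrors the abstract's point about implicit probabilities being "carefully calculated to ensure statistical unbiasedness."
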
Related papers
- Conformal Off-Policy Evaluation in Markov Decision Processes [53.786439742572995]
Reinforcement Learning aims at identifying and evaluating efficient control policies from data.
Most methods for this learning task, referred to as Off-Policy Evaluation (OPE), do not come with accuracy and certainty guarantees.
We present a novel OPE method based on Conformal Prediction that outputs an interval containing the true reward of the target policy with a prescribed level of certainty.
arXiv Detail & Related papers (2023-04-05T16:45:11Z)
- Quantile Off-Policy Evaluation via Deep Conditional Generative Learning [21.448553360543478]
Off-policy evaluation (OPE) is concerned with evaluating a new target policy using offline data generated by a potentially different behavior policy (a generic sketch of the standard importance-weighted and doubly robust estimators for this setup appears after this list).
We propose a doubly-robust inference procedure for quantile OPE in sequential decision making.
We demonstrate the advantages of this proposed estimator through both simulations and a real-world dataset from a short-video platform.
arXiv Detail & Related papers (2022-12-29T22:01:43Z)
- Universal Off-Policy Evaluation [64.02853483874334]
We take the first steps towards a universal off-policy estimator (UnO).
We use UnO for estimating and simultaneously bounding the mean, variance, quantiles/median, inter-quantile range, CVaR, and the entire cumulative distribution of returns.
arXiv Detail & Related papers (2021-04-26T18:54:31Z)
- Optimal Mixture Weights for Off-Policy Evaluation with Multiple Behavior Policies [3.855085732184416]
Off-policy evaluation is a key component of reinforcement learning which evaluates a target policy with offline data collected from behavior policies.
This paper discusses how to correctly mix estimators produced by different behavior policies.
Experiments on simulated recommender systems show that our methods are effective in reducing the Mean-Square Error of estimation.
arXiv Detail & Related papers (2020-11-29T12:57:54Z)
- Off-Policy Evaluation of Bandit Algorithm from Dependent Samples under Batch Update Policy [8.807587076209566]
The goal of off-policy evaluation (OPE) is to evaluate a new policy using historical data obtained via a behavior policy.
Because the contextual bandit updates the policy based on past observations, the samples are not independent and identically distributed.
This paper tackles this problem by constructing an estimator from a martingale difference sequence (MDS) for the dependent samples.
arXiv Detail & Related papers (2020-10-23T15:22:57Z)
- Distributionally Robust Batch Contextual Bandits [20.667213458836734]
Policy learning using historical observational data is an important problem that has found widespread applications.
Existing literature rests on the crucial assumption that the future environment where the learned policy will be deployed is the same as the past environment.
In this paper, we lift this assumption and aim to learn a distributionally robust policy with incomplete observational data.
arXiv Detail & Related papers (2020-06-10T03:11:40Z)
- Doubly Robust Off-Policy Value and Gradient Estimation for Deterministic Policies [80.42316902296832]
We study the estimation of policy value and gradient of a deterministic policy from off-policy data when actions are continuous.
In this setting, standard importance sampling and doubly robust estimators for policy value and gradient fail because the density ratio does not exist.
We propose several new doubly robust estimators based on different kernelization approaches.
arXiv Detail & Related papers (2020-06-06T15:52:05Z)
- Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation [49.502277468627035]
This paper studies the statistical theory of batch data reinforcement learning with function approximation.
Consider the off-policy evaluation problem, which is to estimate the cumulative value of a new target policy from logged history.
arXiv Detail & Related papers (2020-02-21T19:20:57Z)
- Efficient Policy Learning from Surrogate-Loss Classification Reductions [65.91730154730905]
We consider the estimation problem given by a weighted surrogate-loss classification reduction of policy learning.
We show that, under a correct specification assumption, the weighted classification formulation need not be efficient for policy parameters.
We propose an estimation approach based on generalized method of moments, which is efficient for the policy parameters.
arXiv Detail & Related papers (2020-02-12T18:54:41Z)
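
Most of the related papers above build on the same off-policy evaluation setup. For reference, here is a generic sketch of the standard inverse-propensity (IPS) and doubly robust value estimates for logged contextual-bandit data; the data layout, function names, and reward-model interface are illustrative assumptions, not the API of any listed paper.

```python
import numpy as np

def ips_estimate(logs, target_policy):
    """Inverse-propensity-scoring (IPS) estimate of a target policy's value.

    `logs` holds (context, action, reward, logging_prob) tuples collected by
    the behavior policy; `target_policy(context, action)` returns the target
    policy's probability of taking `action` in `context`.
    """
    weighted = [target_policy(x, a) / p * r for (x, a, r, p) in logs]
    return float(np.mean(weighted))

def doubly_robust_estimate(logs, target_policy, actions, reward_model):
    """Doubly robust estimate: a learned reward model provides a baseline for
    every action, and an importance-weighted residual corrects its bias."""
    values = []
    for (x, a, r, p) in logs:
        baseline = sum(target_policy(x, b) * reward_model(x, b) for b in actions)
        correction = target_policy(x, a) / p * (r - reward_model(x, a))
        values.append(baseline + correction)
    return float(np.mean(values))
```

Roughly speaking, Sayer's implicit estimator sketched earlier can be read as the IPS form with the selection probability replaced by the (larger) probability that a decision's feedback is revealed.
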
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.