Sales Channel Optimization via Simulations Based on Observational Data
with Delayed Rewards: A Case Study at LinkedIn
- URL: http://arxiv.org/abs/2209.07749v1
- Date: Fri, 16 Sep 2022 07:08:37 GMT
- Title: Sales Channel Optimization via Simulations Based on Observational Data
with Delayed Rewards: A Case Study at LinkedIn
- Authors: Diana M. Negoescu, Pasha Khosravi, Shadow Zhao, Nanyu Chen, Parvez
Ahammad, Humberto Gonzalez
- Abstract summary: Training models on data obtained from randomized experiments is ideal for making good decisions.
However, randomized experiments are often time-consuming, expensive, risky, infeasible or unethical to perform.
We built a discrete-time simulation that can handle our problem features and used it to evaluate different policies.
Our simulation results indicate that LinUCB, a simple MAB policy, consistently outperforms the other policies.
- Score: 4.6405223560607105
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Training models on data obtained from randomized experiments is ideal for
making good decisions. However, randomized experiments are often
time-consuming, expensive, risky, infeasible or unethical to perform, leaving
decision makers little choice but to rely on observational data collected under
historical policies when training models. This opens questions regarding not
only which decision-making policies would perform best in practice, but also
regarding the impact of different data collection protocols on the performance
of various policies trained on the data, or the robustness of policy
performance with respect to changes in problem characteristics such as action-
or reward-specific delays in observing outcomes. We aim to answer such
questions for the problem of optimizing sales channel allocations at LinkedIn,
where sales accounts (leads) need to be allocated to one of three channels,
with the goal of maximizing the number of successful conversions over a period
of time. A key feature of the problem is the presence of stochastic delays in
observing allocation outcomes, whose distribution is both channel- and outcome-
dependent. We built a discrete-time simulation that can handle our problem
features and used it to evaluate: a) a historical rule-based policy; b) a
supervised machine learning policy (XGBoost); and c) multi-armed bandit (MAB)
policies, under different scenarios involving: i) data collection used for
training (observational vs randomized); ii) lead conversion scenarios; iii)
delay distributions. Our simulation results indicate that LinUCB, a simple MAB
policy, consistently outperforms the other policies, achieving an 18-47% lift
relative to a rule-based policy.
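To make the problem setup concrete, the sketch below runs disjoint LinUCB inside a toy discrete-time loop in which each allocation's outcome is revealed only after a stochastic delay whose mean depends on both the channel and the outcome. Everything in it (the logistic conversion model, the geometric delay distributions, the feature dimension, and all constants) is an illustrative assumption, not the paper's simulator, data, or tuned policy.

```python
import numpy as np

class LinUCB:
    """Disjoint LinUCB: one ridge-regression model per arm (channel)."""
    def __init__(self, n_arms, dim, alpha=1.0):
        self.alpha = alpha
        self.A = [np.eye(dim) for _ in range(n_arms)]    # X^T X + I per arm
        self.b = [np.zeros(dim) for _ in range(n_arms)]  # X^T y per arm

    def choose(self, x):
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b
            # point estimate plus exploration bonus (upper confidence bound)
            scores.append(x @ theta + self.alpha * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x

# --- toy discrete-time simulation with channel- and outcome-dependent delays ---
rng = np.random.default_rng(0)
N_CHANNELS, DIM, HORIZON = 3, 5, 2000
true_theta = rng.normal(size=(N_CHANNELS, DIM))      # synthetic conversion model (assumption)
mean_delay = np.array([[2, 10], [5, 20], [1, 30]])   # mean delay in steps, indexed [channel, outcome]

policy = LinUCB(N_CHANNELS, DIM, alpha=1.0)
pending = []                                          # (reveal_time, arm, features, reward)
conversions = 0

for t in range(HORIZON):
    # reveal outcomes whose stochastic delay has elapsed, then update the policy
    ready = [p for p in pending if p[0] <= t]
    pending = [p for p in pending if p[0] > t]
    for _, arm, x, r in ready:
        policy.update(arm, x, r)

    x = rng.normal(size=DIM)                          # features of the arriving lead
    arm = policy.choose(x)
    p_convert = 1.0 / (1.0 + np.exp(-x @ true_theta[arm]))
    reward = int(rng.random() < p_convert)
    conversions += reward
    delay = rng.geometric(1.0 / mean_delay[arm, reward])  # delay depends on channel and outcome
    pending.append((t + delay, arm, x, reward))

print(f"conversions over horizon: {conversions}")
```

The mechanic mirrored from the abstract is that the policy only updates on outcomes whose delay has elapsed, so exploration has to cope with feedback that is temporarily censored in a channel- and outcome-dependent way.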
Related papers
- Stochastic Gradient Descent with Adaptive Data [4.119418481809095]
Stochastic gradient descent (SGD) is a powerful optimization technique that is particularly useful in online learning scenarios.
Applying SGD to policy optimization problems in operations research involves a distinct challenge: the policy changes the environment and thereby affects the data used to update the policy.
The influence of previous decisions on the data generated introduces bias in the gradient estimate, which presents a potential source of instability for online learning not present in the iid case.
We show that the convergence speed of SGD with adaptive data is largely similar to the classical iid setting, as long as the mixing time of the policy-induced dynamics is factored in.
arXiv Detail & Related papers (2024-10-02T02:58:32Z) - Fine-Tuning Language Models with Reward Learning on Policy [68.70065254564642]
Reinforcement learning from human feedback (RLHF) has emerged as an effective approach to aligning large language models (LLMs) to human preferences.
Despite its popularity, (fixed) reward models may become inaccurate off-distribution, since policy optimization continuously shifts the LLMs' data distribution.
We propose reward learning on policy (RLP), an unsupervised framework that refines a reward model using policy samples to keep it on-distribution.
arXiv Detail & Related papers (2024-03-28T10:02:10Z) - Pessimistic Causal Reinforcement Learning with Mediators for Confounded Offline Data [17.991833729722288]
We propose a novel policy learning algorithm, PESsimistic CAusal Learning (PESCAL).
Our key observation is that, by incorporating auxiliary variables that mediate the effect of actions on system dynamics, it is sufficient to learn a lower bound of the mediator distribution function, instead of the Q-function.
We provide theoretical guarantees for the algorithms we propose, and demonstrate their efficacy through simulations, as well as real-world experiments utilizing offline datasets from a leading ride-hailing platform.
arXiv Detail & Related papers (2024-03-18T14:51:19Z) - Projected Off-Policy Q-Learning (POP-QL) for Stabilizing Offline
Reinforcement Learning [57.83919813698673]
Projected Off-Policy Q-Learning (POP-QL) is a novel actor-critic algorithm that simultaneously reweights off-policy samples and constrains the policy to prevent divergence and reduce value-approximation error.
In our experiments, POP-QL not only shows competitive performance on standard benchmarks, but also outperforms competing methods in tasks where the data-collection policy is significantly sub-optimal.
arXiv Detail & Related papers (2023-11-25T00:30:58Z) - Off-Policy Evaluation for Large Action Spaces via Policy Convolution [60.6953713877886]
The Policy Convolution family of estimators uses latent structure within actions to strategically convolve the logging and target policies.
Experiments on synthetic and benchmark datasets demonstrate remarkable mean squared error (MSE) improvements when using PC.
arXiv Detail & Related papers (2023-10-24T01:00:01Z) - $K$-Nearest-Neighbor Resampling for Off-Policy Evaluation in Stochastic
Control [0.6906005491572401]
We propose a novel $K$-nearest neighbor resampling procedure for estimating the performance of a policy from historical data.
Our analysis allows for the sampling of entire episodes, as is common practice in most applications.
Compared to other OPE methods, our algorithm does not require optimization, can be efficiently implemented via tree-based nearest neighbor search and parallelization, and does not explicitly assume a parametric model for the environment's dynamics.
arXiv Detail & Related papers (2023-06-07T23:55:12Z) - Conformal Off-Policy Evaluation in Markov Decision Processes [53.786439742572995]
Reinforcement Learning aims at identifying and evaluating efficient control policies from data.
Most methods for this learning task, referred to as Off-Policy Evaluation (OPE), do not come with accuracy and certainty guarantees.
We present a novel OPE method based on Conformal Prediction that outputs an interval containing the true reward of the target policy with a prescribed level of certainty.
arXiv Detail & Related papers (2023-04-05T16:45:11Z) - Quantile Off-Policy Evaluation via Deep Conditional Generative Learning [21.448553360543478]
Off-policy evaluation (OPE) is concerned with evaluating a new target policy using offline data generated by a potentially different behavior policy.
We propose a doubly-robust inference procedure for quantile OPE in sequential decision making.
We demonstrate the advantages of this proposed estimator through both simulations and a real-world dataset from a short-video platform.
arXiv Detail & Related papers (2022-12-29T22:01:43Z) - Latent-Variable Advantage-Weighted Policy Optimization for Offline RL [70.01851346635637]
Offline reinforcement learning methods hold the promise of learning policies from pre-collected datasets without the need to query the environment for new transitions.
In practice, offline datasets are often heterogeneous, i.e., collected in a variety of scenarios.
We propose to leverage latent-variable policies that can represent a broader class of policy distributions.
Our method improves upon the average performance of the next best-performing offline reinforcement learning methods by 49% on heterogeneous datasets.
arXiv Detail & Related papers (2022-03-16T21:17:03Z) - Model Selection in Batch Policy Optimization [88.52887493684078]
We study the problem of model selection in batch policy optimization.
We identify three sources of error that any model selection algorithm should optimally trade-off in order to be competitive.
arXiv Detail & Related papers (2021-12-23T02:31:50Z) - Off-Policy Evaluation of Bandit Algorithm from Dependent Samples under
Batch Update Policy [8.807587076209566]
The goal of off-policy evaluation (OPE) is to evaluate a new policy using historical data obtained via a behavior policy.
Because the contextual bandit updates the policy based on past observations, the samples are not independent and identically distributed.
This paper tackles this problem by constructing an estimator from a martingale difference sequence (MDS) for the dependent samples.
arXiv Detail & Related papers (2020-10-23T15:22:57Z)
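Several entries above concern off-policy evaluation from logged bandit data. As a reference point for the last one, here is a minimal sketch (not that paper's estimator) of an inverse-probability-weighted value estimate in which the behavior policy is updated in batches, so the logged samples are dependent; each weighted reward still has the target policy's value as its conditional expectation given the past, which is the martingale-difference structure the entry alludes to. All policies, rewards, and propensities below are synthetic assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
N_ACTIONS, BATCH, N_BATCHES = 3, 100, 20
true_means = np.array([0.2, 0.5, 0.8])          # synthetic expected rewards per action

target = np.array([0.1, 0.1, 0.8])              # fixed target policy to evaluate
behavior = np.full(N_ACTIONS, 1.0 / N_ACTIONS)  # behavior policy, updated between batches
logged = []                                      # (action, reward, propensity at logging time)

for _ in range(N_BATCHES):
    counts = np.zeros(N_ACTIONS)
    sums = np.zeros(N_ACTIONS)
    for _ in range(BATCH):
        a = rng.choice(N_ACTIONS, p=behavior)
        r = float(rng.random() < true_means[a])
        logged.append((a, r, behavior[a]))       # store the propensity actually used
        counts[a] += 1
        sums[a] += r
    # batch update of the behavior policy (epsilon-greedy on empirical means),
    # which makes later samples depend on earlier ones
    means = np.where(counts > 0, sums / np.maximum(counts, 1), 0.0)
    behavior = np.full(N_ACTIONS, 0.1 / N_ACTIONS)
    behavior[np.argmax(means)] += 0.9

# IPW estimate of the target policy's value: each term uses the logging-time
# propensity, so the centered terms form a martingale difference sequence
ipw_terms = np.array([target[a] / p * r for a, r, p in logged])
print("IPW estimate:", ipw_terms.mean(), " true value:", target @ true_means)
```

The essential point is that each logged tuple stores the propensity of the behavior policy in effect at logging time; reusing a single "final" behavior policy for every sample would break the martingale argument that makes the estimate unbiased under dependent data.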
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.