Counterfactual Evaluation of Slate Recommendations with Sequential
Reward Interactions
- URL: http://arxiv.org/abs/2007.12986v2
- Date: Mon, 24 Aug 2020 01:34:40 GMT
- Title: Counterfactual Evaluation of Slate Recommendations with Sequential
Reward Interactions
- Authors: James McInerney, Brian Brost, Praveen Chandar, Rishabh Mehrotra, Ben
Carterette
- Abstract summary: Users of music streaming, video streaming, news recommendation, and e-commerce services often engage with content in a sequential manner.
Providing and evaluating good sequences of recommendations is therefore a central problem for these services.
We propose a new counterfactual estimator that allows for sequential interactions in the rewards with lower variance in an asymptotically unbiased manner.
- Score: 18.90946044396516
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Users of music streaming, video streaming, news recommendation, and
e-commerce services often engage with content in a sequential manner. Providing
and evaluating good sequences of recommendations is therefore a central problem
for these services. Prior reweighting-based counterfactual evaluation methods
either suffer from high variance or make strong independence assumptions about
rewards. We propose a new counterfactual estimator that allows for sequential
interactions in the rewards with lower variance in an asymptotically unbiased
manner. Our method uses graphical assumptions about the causal relationships of
the slate to reweight the rewards in the logging policy in a way that
approximates the expected sum of rewards under the target policy. Extensive
experiments in simulation and on a live recommender system show that our
approach outperforms existing methods in terms of bias and data efficiency for
the sequential track recommendations problem.
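
To make the trade-off between the two families of prior reweighting methods concrete, here is a minimal sketch of a slate-level inverse propensity scoring (IPS) estimator and a per-position variant that treats rewards as independent across positions. It is not the estimator proposed in the paper. It assumes that both policies factorize over slate positions and that per-position probabilities are logged; the function names, the tuple layout, and the toy numbers are hypothetical.

```python
from math import prod

# Each logged slate is a list of positions. Each position is a tuple
# (target_prob, logging_prob, reward) for the item that was actually shown.
# Assumption: both policies factorize over positions, so the slate-level
# importance weight is the product of the per-position ratios.

def standard_ips(logs):
    """Slate-level IPS: weight the slate's total reward by the full
    product of per-position ratios (no independence assumption on
    rewards, but the product can make the variance large)."""
    total = 0.0
    for slate in logs:
        weight = prod(pt / pl for pt, pl, _ in slate)
        total += weight * sum(r for _, _, r in slate)
    return total / len(logs)

def independent_ips(logs):
    """Per-position IPS: weight each position's reward only by that
    position's own ratio (lower variance, but it assumes a reward
    depends only on the item at its own position)."""
    total = 0.0
    for slate in logs:
        total += sum((pt / pl) * r for pt, pl, r in slate)
    return total / len(logs)

# Toy logged data: two slates of two positions each (made-up numbers).
logs = [
    [(0.5, 0.25, 1.0), (0.5, 0.5, 0.0)],
    [(0.2, 0.4, 1.0), (0.75, 0.25, 1.0)],
]

print(standard_ips(logs))
print(independent_ips(logs))
```

The two functions differ only in how much of the slate's weight is applied to each reward, which is the design axis the abstract describes: the full product makes no assumption about reward interactions but is noisy, while the per-position weight is stable but ignores interactions between positions.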
Related papers
- Estimating Treatment Effects under Recommender Interference: A Structured Neural Networks Approach [13.208141830901845]
We show that the standard difference-in-means estimator can lead to biased estimates due to recommender interference.
We propose a "recommender choice model" that describes which item gets exposed from a pool containing both treated and control items.
We show that the proposed estimator yields results comparable to the benchmark, whereas the standard difference-in-means estimator can exhibit significant bias and even produce reversed signs.
arXiv Detail & Related papers (2024-06-20T14:53:26Z)
- Provable Benefits of Policy Learning from Human Preferences in Contextual Bandit Problems [82.92678837778358]
Preference-based methods have demonstrated substantial success in empirical applications such as InstructGPT.
We show how human bias and uncertainty in feedback modeling can affect the theoretical guarantees of these approaches.
arXiv Detail & Related papers (2023-07-24T17:50:24Z)
- Uncertainty-Aware Instance Reweighting for Off-Policy Learning [63.31923483172859]
We propose an Uncertainty-aware Inverse Propensity Score estimator (UIPS) for improved off-policy learning.
Experiment results on synthetic and three real-world recommendation datasets demonstrate the advantageous sample efficiency of the proposed UIPS estimator.
arXiv Detail & Related papers (2023-03-11T11:42:26Z)
- Off-policy evaluation for learning-to-rank via interpolating the item-position model and the position-based model [83.83064559894989]
A critical need for industrial recommender systems is the ability to evaluate recommendation policies offline, before deploying them to production.
We develop a new estimator that mitigates the problems of the two most popular off-policy estimators for rankings.
In particular, the new estimator, called INTERPOL, addresses the bias of a potentially misspecified position-based model.
arXiv Detail & Related papers (2022-10-15T17:22:30Z)
- Reward Imputation with Sketching for Contextual Batched Bandits [48.80803376405073]
Contextual batched bandit (CBB) is a setting where a batch of rewards is observed from the environment at the end of each episode.
Existing approaches for CBB often ignore the rewards of the non-executed actions, leading to underutilization of feedback information.
We propose Sketched Policy Updating with Imputed Rewards (SPUIR) that completes the unobserved rewards using sketching.
arXiv Detail & Related papers (2022-10-13T04:26:06Z)
- Breaking Feedback Loops in Recommender Systems with Causal Inference [99.22185950608838]
Recent work has shown that feedback loops may compromise recommendation quality and homogenize user behavior.
We propose the Causal Adjustment for Feedback Loops (CAFL), an algorithm that provably breaks feedback loops using causal inference.
We show that CAFL improves recommendation quality when compared to prior correction methods.
arXiv Detail & Related papers (2022-07-04T17:58:39Z)
- Long-term Dynamics of Fairness Intervention in Connection Recommender Systems [5.048563042541915]
We study a connection recommender system patterned after the systems employed by web-scale social networks.
We find that, although seemingly fair in aggregate, common exposure and utility parity interventions fail to mitigate amplification of biases in the long term.
arXiv Detail & Related papers (2022-03-30T16:27:48Z)
- Correcting the User Feedback-Loop Bias for Recommendation Systems [34.44834423714441]
We propose a systematic and dynamic way to correct user feedback-loop bias in recommendation systems.
Our method includes a deep-learning component to learn each user's dynamic rating history embedding.
We empirically validated the existence of such user feedback-loop bias in real-world recommendation systems.
arXiv Detail & Related papers (2021-09-13T15:02:55Z)
- Control Variates for Slate Off-Policy Evaluation [112.35528337130118]
We study the problem of off-policy evaluation from batched contextual bandit data with multidimensional actions.
We obtain new estimators with risk improvement guarantees over both the PI and self-normalized PI estimators.
arXiv Detail & Related papers (2021-06-15T06:59:53Z)
- Optimal Mixture Weights for Off-Policy Evaluation with Multiple Behavior Policies [3.855085732184416]
Off-policy evaluation is a key component of reinforcement learning which evaluates a target policy with offline data collected from behavior policies.
This paper discusses how to correctly mix estimators produced by different behavior policies.
Experiments on simulated recommender systems show that our methods are effective in reducing the Mean-Square Error of estimation.
arXiv Detail & Related papers (2020-11-29T12:57:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.