Related papers: Counterfactual Evaluation of Slate Recommendations with Sequential Reward Interactions

Counterfactual Evaluation of Slate Recommendations with Sequential Reward Interactions

URL: http://arxiv.org/abs/2007.12986v2
Date: Mon, 24 Aug 2020 01:34:40 GMT
Title: Counterfactual Evaluation of Slate Recommendations with Sequential Reward Interactions
Authors: James McInerney, Brian Brost, Praveen Chandar, Rishabh Mehrotra, Ben Carterette
Abstract summary: Music streaming, video streaming, news recommendation, and e-commerce services often engage with content in a sequential manner. Providing and evaluating good sequences of recommendations is therefore a central problem for these services. We propose a new counterfactual estimator that allows for sequential interactions in the rewards with lower variance in anally unbiased manner.
Score: 18.90946044396516
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Users of music streaming, video streaming, news recommendation, and e-commerce services often engage with content in a sequential manner. Providing and evaluating good sequences of recommendations is therefore a central problem for these services. Prior reweighting-based counterfactual evaluation methods either suffer from high variance or make strong independence assumptions about rewards. We propose a new counterfactual estimator that allows for sequential interactions in the rewards with lower variance in an asymptotically unbiased manner. Our method uses graphical assumptions about the causal relationships of the slate to reweight the rewards in the logging policy in a way that approximates the expected sum of rewards under the target policy. Extensive experiments in simulation and on a live recommender system show that our approach outperforms existing methods in terms of bias and data efficiency for the sequential track recommendations problem.

Related papers

Off-Policy Evaluation and Learning for Matching Markets [15.585641615174623]
Off-Policy Evaluation (OPE) plays a crucial role by enabling the evaluation of recommendation policies using only offline logged data.<n>We propose novel OPE estimators, textitDiPS and textitDPR, specifically designed for matching markets.<n>Our methods combine elements of the Direct Method (DM), Inverse Propensity Score (IPS), and Doubly Robust (DR) estimators.
arXiv Detail & Related papers (2025-07-18T02:23:37Z)
Log-Sum-Exponential Estimator for Off-Policy Evaluation and Learning [50.93804891554481]
We introduce a novel estimator based on the log-sum-exponential (LSE) operator, which outperforms traditional inverse propensity score estimators.<n>Our LSE estimator demonstrates variance reduction and robustness under heavy-tailed conditions.<n>In the off-policy learning scenario, we establish bounds on the regret -- the performance gap between our LSE estimator and the optimal policy.
arXiv Detail & Related papers (2025-06-07T17:37:10Z)
Off-Policy Evaluation for Recommendations with Missing-Not-At-Random Rewards [0.0]
Unbiased recommender learning (URL) and off-policy evaluation/learning (OPE/L) techniques are effective in addressing the data bias caused by display position and logging policies. However, when both bias exits in the logged data, these estimators may suffer from significant bias. We propose a novel estimator that leverages two probabilities of logging policies and reward observations as propensity scores.
arXiv Detail & Related papers (2025-02-13T06:11:29Z)
Achieving $\widetilde{\mathcal{O}}(\sqrt{T})$ Regret in Average-Reward POMDPs with Known Observation Models [56.92178753201331]
We tackle average-reward infinite-horizon POMDPs with an unknown transition model. We present a novel and simple estimator that overcomes this barrier.
arXiv Detail & Related papers (2025-01-30T22:29:41Z)
Learning Recommender Systems with Soft Target: A Decoupled Perspective [49.83787742587449]
We propose a novel decoupled soft label optimization framework to consider the objectives as two aspects by leveraging soft labels. We present a sensible soft-label generation algorithm that models a label propagation algorithm to explore users' latent interests in unobserved feedback via neighbors.
arXiv Detail & Related papers (2024-10-09T04:20:15Z)
Measuring Recency Bias In Sequential Recommendation Systems [4.797371814812293]
Recency bias in a sequential recommendation system refers to the overly high emphasis placed on recent items within a user session. This bias can diminish the serendipity of recommendations and hinder the system's ability to capture users' long-term interests. We propose a simple yet effective novel metric specifically designed to quantify recency bias.
arXiv Detail & Related papers (2024-09-15T13:02:50Z)
CSRec: Rethinking Sequential Recommendation from A Causal Perspective [25.69446083970207]
The essence of sequential recommender systems (RecSys) lies in understanding how users make decisions. We propose a novel formulation of sequential recommendation, termed Causal Sequential Recommendation (CSRec) CSRec aims to predict the probability of a recommended item's acceptance within a sequential context and backtrack how current decisions are made.
arXiv Detail & Related papers (2024-08-23T23:19:14Z)
Revisiting Reciprocal Recommender Systems: Metrics, Formulation, and Method [60.364834418531366]
We propose five new evaluation metrics that comprehensively and accurately assess the performance of RRS. We formulate the RRS from a causal perspective, formulating recommendations as bilateral interventions. We introduce a reranking strategy to maximize matching outcomes, as measured by the proposed metrics.
arXiv Detail & Related papers (2024-08-19T07:21:02Z)
Estimating Treatment Effects under Recommender Interference: A Structured Neural Networks Approach [13.208141830901845]
We show that the standard difference-in-means estimator can lead to biased estimates due to recommender interference. We propose a "recommender choice model" that describes which item gets exposed from a pool containing both treated and control items. We show that the proposed estimator yields results comparable to the benchmark, whereas the standard difference-in-means estimator can exhibit significant bias and even produce reversed signs.
arXiv Detail & Related papers (2024-06-20T14:53:26Z)
Uncertainty-Aware Instance Reweighting for Off-Policy Learning [63.31923483172859]
We propose a Uncertainty-aware Inverse Propensity Score estimator (UIPS) for improved off-policy learning. Experiment results on synthetic and three real-world recommendation datasets demonstrate the advantageous sample efficiency of the proposed UIPS estimator.
arXiv Detail & Related papers (2023-03-11T11:42:26Z)
Off-policy evaluation for learning-to-rank via interpolating the item-position model and the position-based model [83.83064559894989]
A critical need for industrial recommender systems is the ability to evaluate recommendation policies offline, before deploying them to production. We develop a new estimator that mitigates the problems of the two most popular off-policy estimators for rankings. In particular, the new estimator, called INTERPOL, addresses the bias of a potentially misspecified position-based model.
arXiv Detail & Related papers (2022-10-15T17:22:30Z)
Reward Imputation with Sketching for Contextual Batched Bandits [48.80803376405073]
Contextual batched bandit (CBB) is a setting where a batch of rewards is observed from the environment at the end of each episode. Existing approaches for CBB often ignore the rewards of the non-executed actions, leading to underutilization of feedback information. We propose Sketched Policy Updating with Imputed Rewards (SPUIR) that completes the unobserved rewards using sketching.
arXiv Detail & Related papers (2022-10-13T04:26:06Z)
Breaking Feedback Loops in Recommender Systems with Causal Inference [99.22185950608838]
Recent work has shown that feedback loops may compromise recommendation quality and homogenize user behavior. We propose the Causal Adjustment for Feedback Loops (CAFL), an algorithm that provably breaks feedback loops using causal inference. We show that CAFL improves recommendation quality when compared to prior correction methods.
arXiv Detail & Related papers (2022-07-04T17:58:39Z)
Correcting the User Feedback-Loop Bias for Recommendation Systems [34.44834423714441]
We propose a systematic and dynamic way to correct user feedback-loop bias in recommendation systems. Our method includes a deep-learning component to learn each user's dynamic rating history embedding. We empirically validated the existence of such user feedback-loop bias in real world recommendation systems.
arXiv Detail & Related papers (2021-09-13T15:02:55Z)
Optimal Mixture Weights for Off-Policy Evaluation with Multiple Behavior Policies [3.855085732184416]
Off-policy evaluation is a key component of reinforcement learning which evaluates a target policy with offline data collected from behavior policies. This paper discusses how to correctly mix estimators produced by different behavior policies. Experiments on simulated recommender systems show that our methods are effective in reducing the Mean-Square Error of estimation.
arXiv Detail & Related papers (2020-11-29T12:57:54Z)

This list is automatically generated from the titles and abstracts of the papers in this site.