Algorithm Adaptation Bias in Recommendation System Online Experiments
- URL: http://arxiv.org/abs/2509.00199v1
- Date: Fri, 29 Aug 2025 19:23:04 GMT
- Title: Algorithm Adaptation Bias in Recommendation System Online Experiments
- Authors: Chen Zheng, Zhenyu Zhao
- Abstract summary: An underexplored but critical bias is the algorithm adaptation effect. Experiment results often favor the production variant with large traffic while underestimating the performance of the test variant with small traffic. We detail the mechanisms of this bias, present empirical evidence from real-world experiments, and discuss potential methods for a more robust online evaluation.
- Score: 4.8862630578310435
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Online experiments (A/B tests) are widely regarded as the gold standard for evaluating recommender system variants and guiding launch decisions. However, a variety of biases can distort experiment results and mislead decision-making. An underexplored but critical bias is the algorithm adaptation effect. This bias arises from the flywheel dynamics among production models, user data, and training pipelines: new models are evaluated on user data whose distributions are shaped by the incumbent system, or are tested only in a small treatment group. As a result, the effect of a new modeling or user-experience change measured in this constrained experimental setting can diverge substantially from its true impact at full deployment. In practice, experiment results often favor the production variant with large traffic while underestimating the performance of the test variant with small traffic, which leads to missed opportunities to launch a truly winning arm or to underestimating its impact. This paper aims to raise awareness of algorithm adaptation bias, situate it within the broader landscape of RecSys evaluation biases, and motivate discussion of solutions that span experiment design, measurement, and adjustment. We detail the mechanisms of this bias, present empirical evidence from real-world experiments, and discuss potential methods for a more robust online evaluation.
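The flywheel mechanism described in the abstract can be illustrated with a toy simulation. The sketch below is not from the paper: the engagement rates, the traffic splits, and the linear "adaptation" model are assumptions chosen only to show how a test arm with small traffic, trained mostly on data shaped by the incumbent policy, can measure a lift far below (or even of the opposite sign of) its true full-deployment effect.

```python
# Minimal toy simulation of algorithm adaptation bias (illustrative only;
# the numbers and the linear "adaptation" model below are assumptions,
# not taken from the paper).
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical fully-adapted engagement rates: how each variant performs
# once it is trained predominantly on data generated by its own policy.
P_PROD_ADAPTED = 0.100   # incumbent production model
P_TEST_ADAPTED = 0.112   # challenger, truly better at full deployment
P_COLD = 0.090           # rate when a model is trained mostly on data
                         # shaped by the *other* policy

def realized_rate(adapted_rate: float, own_data_share: float) -> float:
    """Toy flywheel: realized quality interpolates between the 'cold' rate
    and the fully adapted rate according to the share of training data
    generated under the variant's own policy."""
    return P_COLD + own_data_share * (adapted_rate - P_COLD)

def simulate_ab(test_traffic: float, n_users: int = 2_000_000) -> float:
    """Measured relative lift of test over production when the test arm
    receives only `test_traffic` of user traffic (and hence contributes
    only that share of the pooled training data)."""
    p_prod = realized_rate(P_PROD_ADAPTED, 1.0 - test_traffic)
    p_test = realized_rate(P_TEST_ADAPTED, test_traffic)
    n_test = int(n_users * test_traffic)
    n_prod = n_users - n_test
    y_prod = rng.binomial(n_prod, p_prod) / n_prod
    y_test = rng.binomial(n_test, p_test) / n_test
    return y_test / y_prod - 1.0

print(f"measured lift at  5% traffic: {simulate_ab(0.05):+.2%}")
print(f"measured lift at 50% traffic: {simulate_ab(0.50):+.2%}")
print(f"true lift at full deployment: {P_TEST_ADAPTED / P_PROD_ADAPTED - 1.0:+.2%}")
```

Under these assumed numbers, the 5% experiment can even report a negative lift for a challenger that is roughly 12% better at full deployment, while a 50% split recovers part, but not all, of the true effect.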
Related papers
- Practical Improvements of A/B Testing with Off-Policy Estimation [51.25970890274447]
We introduce a family of unbiased off-policy estimators that achieve lower variance than the standard approach. Our theoretical analysis and experimental results validate the effectiveness and practicality of the proposed method.
arXiv Detail & Related papers (2025-06-12T13:11:01Z)
- Prediction-Powered Causal Inferences [59.98498488132307]
We focus on Prediction-Powered Causal Inferences (PPCI). We first show that conditional calibration guarantees valid PPCI at the population level. We then introduce a sufficient representation constraint that transfers validity across experiments.
arXiv Detail & Related papers (2025-02-10T10:52:17Z)
- Can We Validate Counterfactual Estimations in the Presence of General Network Interference? [13.49152464081862]
We introduce a framework that facilitates the use of machine learning tools for both estimation and validation in causal inference. A new distribution-preserving network bootstrap generates statistically valid subpopulations from a single experiment's data. A counterfactual cross-validation procedure adapts the principles of model validation to the unique constraints of causal settings.
arXiv Detail & Related papers (2025-02-03T06:51:04Z)
- Post Launch Evaluation of Policies in a High-Dimensional Setting [4.710921988115686]
A/B tests, also known as randomized controlled experiments (RCTs), are the gold standard for evaluating the impact of new policies, products, or decisions. This paper explores practical considerations in applying methodologies inspired by "synthetic control". Synthetic control methods leverage data from unaffected units to estimate counterfactual outcomes for treated units (a minimal sketch of this general idea appears after this list).
arXiv Detail & Related papers (2024-12-30T19:35:29Z)
- Adaptive Experimentation When You Can't Experiment [55.86593195947978]
This paper introduces the confounded pure exploration transductive linear bandit (CPET-LB) problem.
Online services can employ a properly randomized encouragement that incentivizes users toward a specific treatment.
arXiv Detail & Related papers (2024-06-15T20:54:48Z)
- Be Aware of the Neighborhood Effect: Modeling Selection Bias under Interference [50.95521705711802]
Previous studies have focused on addressing selection bias to achieve unbiased learning of the prediction model.
This paper formally formulates the neighborhood effect as an interference problem from the perspective of causal inference.
We propose a novel ideal loss that can be used to deal with selection bias in the presence of the neighborhood effect.
arXiv Detail & Related papers (2024-04-30T15:20:41Z)
- Effect Size Estimation for Duration Recommendation in Online Experiments: Leveraging Hierarchical Models and Objective Utility Approaches [13.504353263032359]
The selection of the assumed effect size (AES) critically determines the duration of an experiment, and hence its accuracy and efficiency.
Traditionally, experimenters determine AES based on domain knowledge, but this method becomes impractical for online experimentation services managing numerous experiments.
We propose two solutions for data-driven AES selection for online experimentation services.
arXiv Detail & Related papers (2023-12-20T09:34:28Z)
- Adaptive Instrument Design for Indirect Experiments [48.815194906471405]
Unlike RCTs, indirect experiments estimate treatment effects by leveraging conditional instrumental variables.
In this paper we take the initial steps towards enhancing sample efficiency for indirect experiments by adaptively designing a data collection policy.
Our main contribution is a practical computational procedure that utilizes influence functions to search for an optimal data collection policy.
arXiv Detail & Related papers (2023-12-05T02:38:04Z)
- A Common Misassumption in Online Experiments with Machine Learning Models [1.52292571922932]
We argue that, because variants typically learn using pooled data, a lack of model interference cannot be guaranteed.
We discuss the implications this has for practitioners, and for the research literature.
arXiv Detail & Related papers (2023-04-21T11:36:44Z)
- Systematic Evaluation of Predictive Fairness [60.0947291284978]
Mitigating bias in training on biased datasets is an important open problem.
We examine the performance of various debiasing methods across multiple tasks.
We find that data conditions have a strong influence on relative model performance.
arXiv Detail & Related papers (2022-10-17T05:40:13Z)
- Fair Effect Attribution in Parallel Online Experiments [57.13281584606437]
A/B tests serve the purpose of reliably identifying the effect of changes introduced in online services.
It is common for online platforms to run a large number of simultaneous experiments by splitting incoming user traffic randomly.
Despite perfect randomization between different groups, simultaneous experiments can interact with each other and create a negative impact on average population outcomes.
arXiv Detail & Related papers (2022-10-15T17:15:51Z)
- Demarcating Endogenous and Exogenous Opinion Dynamics: An Experimental Design Approach [27.975266406080152]
In this paper, we design a suite of unsupervised classification methods based on experimental design approaches.
We aim to select the subsets of events which minimize different measures of mean estimation error.
Our experiments range from validating prediction performance on unsanitized and sanitized events to checking the effect of selecting optimal subsets of various sizes.
arXiv Detail & Related papers (2021-02-11T11:38:15Z)
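As referenced in the "Post Launch Evaluation of Policies in a High-Dimensional Setting" entry above, here is a minimal sketch of the basic synthetic control idea: fit weights on unaffected (donor) units so that their weighted combination tracks the treated unit before launch, then reuse that weighted combination as the counterfactual after launch. This is not the methodology of the cited paper; the data, the non-negative least-squares fit, and the assumed launch lift are all illustrative.

```python
# Minimal synthetic-control sketch (illustrative; not the cited paper's method).
# Fit non-negative weights over unaffected "donor" units so their weighted
# combination matches the treated unit pre-launch, then reuse the weights to
# build a post-launch counterfactual.
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(1)

T_PRE, T_POST, N_DONORS = 60, 30, 8              # days before/after launch, donor count
donors_pre = rng.normal(100, 5, (T_PRE, N_DONORS))
treated_pre = donors_pre @ np.full(N_DONORS, 1 / N_DONORS) + rng.normal(0, 1, T_PRE)

# Fit non-negative weights on the pre-period and normalize them to sum to one.
w, _ = nnls(donors_pre, treated_pre)
w = w / w.sum()

# Post-launch: observed treated outcome (with an assumed lift of 3.0)
# versus the synthetic counterfactual built from the donors.
donors_post = rng.normal(100, 5, (T_POST, N_DONORS))
true_effect = 3.0
treated_post = (donors_post @ np.full(N_DONORS, 1 / N_DONORS)
                + true_effect + rng.normal(0, 1, T_POST))
counterfactual = donors_post @ w
print(f"estimated post-launch effect: {np.mean(treated_post - counterfactual):.2f}")
```

With these assumptions the printed estimate should land near the assumed lift of 3.0, illustrating how unaffected units can stand in for the missing counterfactual after a full launch.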