Harnessing the Power of Interleaving and Counterfactual Evaluation for Airbnb Search Ranking
- URL: http://arxiv.org/abs/2508.00751v1
- Date: Fri, 01 Aug 2025 16:28:18 GMT
- Title: Harnessing the Power of Interleaving and Counterfactual Evaluation for Airbnb Search Ranking
- Authors: Qing Zhang, Alex Deng, Michelle Du, Huiji Gao, Liwei He, Sanjeev Katariya,
- Abstract summary: Evaluation plays a crucial role in the development of ranking algorithms on search and recommender systems. The online environment is conducive to applying causal inference techniques. Businesses face unique challenges when it comes to effective A/B testing.
- Score: 14.97060265751423
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Evaluation plays a crucial role in the development of ranking algorithms on search and recommender systems. It enables online platforms to create user-friendly features that drive commercial success in a steady and effective manner. The online environment is particularly conducive to applying causal inference techniques, such as randomized controlled experiments (known as A/B tests), which are often more challenging to implement in fields like medicine and public policy. However, businesses face unique challenges when it comes to effective A/B testing. Specifically, achieving sufficient statistical power for conversion-based metrics can be time-consuming, especially for significant purchases like booking accommodations. While offline evaluations are quicker and more cost-effective, they often lack accuracy and are inadequate for selecting candidates for A/B tests. To address these challenges, we developed interleaving and counterfactual evaluation methods to facilitate rapid online assessments for identifying the most promising candidates for A/B tests. Our approach not only increased the sensitivity of experiments by a factor of up to 100 (depending on the approach and metrics) compared to traditional A/B testing but also streamlined the experimental process. The practical insights gained from usage in production can also benefit organizations with similar interests.
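The abstract does not specify which interleaving variant Airbnb uses, but team-draft interleaving is a standard scheme for this kind of high-sensitivity online comparison: two rankers alternately "draft" their top remaining result into a single list shown to the user, and clicks are credited to whichever ranker contributed the clicked item. The sketch below is illustrative only; the function names and the simple win/loss credit rule are not taken from the paper.

```python
import random

def team_draft_interleave(ranking_a, ranking_b, rng=random):
    """Merge two rankings via team-draft interleaving.

    Each round, the team that has drafted fewer items (ties broken by a
    coin flip) contributes its highest-ranked item not yet drafted.
    Returns the interleaved list and a parallel list of team labels.
    """
    interleaved, team, picked = [], [], set()
    ia = ib = count_a = count_b = 0
    while ia < len(ranking_a) or ib < len(ranking_b):
        # Skip items already drafted by the other team.
        while ia < len(ranking_a) and ranking_a[ia] in picked:
            ia += 1
        while ib < len(ranking_b) and ranking_b[ib] in picked:
            ib += 1
        a_left, b_left = ia < len(ranking_a), ib < len(ranking_b)
        if not a_left and not b_left:
            break
        pick_a = (count_a < count_b) or (count_a == count_b
                                         and rng.random() < 0.5)
        if pick_a and a_left:
            item, label = ranking_a[ia], "A"
            count_a += 1
        elif b_left:
            item, label = ranking_b[ib], "B"
            count_b += 1
        else:
            item, label = ranking_a[ia], "A"
            count_a += 1
        interleaved.append(item)
        team.append(label)
        picked.add(item)
    return interleaved, team

def credit(team, clicked_positions):
    """Per-impression outcome: +1 if A got more clicks, -1 if B did, 0 tie."""
    a = sum(1 for p in clicked_positions if team[p] == "A")
    b = sum(1 for p in clicked_positions if team[p] == "B")
    return (a > b) - (a < b)
```

Because every impression yields a paired within-user comparison, aggregating these per-impression credits typically needs far less traffic than a between-user A/B test to detect the same ranking difference, which is consistent with the sensitivity gains the abstract reports.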
Related papers
- TestAgent: An Adaptive and Intelligent Expert for Human Assessment [62.060118490577366]
We propose TestAgent, a large language model (LLM)-powered agent designed to enhance adaptive testing through interactive engagement. TestAgent supports personalized question selection, captures test-takers' responses and anomalies, and provides precise outcomes through dynamic, conversational interactions.
arXiv Detail & Related papers (2025-06-03T16:07:54Z)
- Take Care of Your Prompt Bias! Investigating and Mitigating Prompt Bias in Factual Knowledge Extraction [56.17020601803071]
Recent research shows that pre-trained language models (PLMs) suffer from "prompt bias" in factual knowledge extraction.
This paper aims to improve the reliability of existing benchmarks by thoroughly investigating and mitigating prompt bias.
arXiv Detail & Related papers (2024-03-15T02:04:35Z)
- Efficient Weighting Schemes for Auditing Instant-Runoff Voting Elections [57.67176250198289]
AWAIRE involves adaptively weighted averages of test statistics, essentially "learning" an effective set of hypotheses to test.
We explore schemes and settings more extensively, to identify and recommend efficient choices for practice.
A limitation of the current AWAIRE implementation is its restriction to a small number of candidates.
arXiv Detail & Related papers (2024-02-18T10:13:01Z)
- Variance Reduction in Ratio Metrics for Efficient Online Experiments [12.036747050794135]
We apply variance reduction techniques to ratio metrics on a large-scale short-video platform: ShareChat.
Our results show that we can either improve A/B-test confidence in 77% of cases, or can retain the same level of confidence with 30% fewer data points.
arXiv Detail & Related papers (2024-01-08T18:01:09Z)
- Better Practices for Domain Adaptation [62.70267990659201]
Domain adaptation (DA) aims to provide frameworks for adapting models to deployment data without using labels.
An unclear validation protocol for DA has led to bad practices in the literature.
We show challenges across all three branches of domain adaptation methodology.
arXiv Detail & Related papers (2023-09-07T17:44:18Z)
- A/B Testing: A Systematic Literature Review [10.222047656342493]
Single classic A/B tests are the dominant type of test.
The dominant uses of test results are feature selection, feature rollout, and continued feature development.
The main reported open problems are the enhancement of proposed approaches and their usability.
arXiv Detail & Related papers (2023-08-09T12:55:51Z)
- Experimentation Platforms Meet Reinforcement Learning: Bayesian Sequential Decision-Making for Continuous Monitoring [13.62951379287041]
In this paper, we introduce a novel framework that we developed in Amazon to maximize customer experience and control opportunity cost.
We formulate the problem as a Bayesian optimal sequential decision making problem that has a unified utility function.
We show the effectiveness of this novel approach compared with existing methods via a large-scale meta-analysis on experiments in Amazon.
arXiv Detail & Related papers (2023-04-02T00:59:10Z)
- Clustering-based Imputation for Dropout Buyers in Large-scale Online Experimentation [4.753069295451989]
In online experimentation, appropriate metrics (e.g., purchase) provide strong evidence to support hypotheses and enhance the decision-making process.
In this work, we introduce the concept of dropout buyers and categorize users with incomplete metric values into two groups: visitors and dropout buyers.
For the analysis of incomplete metrics, we propose a clustering-based imputation method using $k$-nearest neighbors.
arXiv Detail & Related papers (2022-09-09T01:05:53Z)
- Confidence-Aware Active Feedback for Efficient Instance Search [21.8172170825049]
Relevance feedback is widely used in instance search (INS) tasks to further refine imperfect ranking results.
We propose a confidence-aware active feedback (CAAF) method that can efficiently select the most valuable feedback candidates.
In particular, CAAF outperforms the first-place record in the public large-scale video INS evaluation of TRECVID 2021.
arXiv Detail & Related papers (2021-10-23T16:14:03Z)
- Benchmarks for Deep Off-Policy Evaluation [152.28569758144022]
We present a collection of policies that can be used for benchmarking off-policy evaluation.
The goal of our benchmark is to provide a standardized measure of progress that is motivated from a set of principles.
We provide open-source access to our data and code to foster future research in this area.
arXiv Detail & Related papers (2021-03-30T18:09:33Z)
- Dynamic Causal Effects Evaluation in A/B Testing with a Reinforcement Learning Framework [68.96770035057716]
A/B testing is a business strategy used to compare a new product with an old one in the pharmaceutical, technological, and traditional industries.
This paper introduces a reinforcement learning framework for carrying out A/B testing in online experiments.
arXiv Detail & Related papers (2020-02-05T10:25:02Z)
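Several of the entries above (off-policy evaluation benchmarks, dynamic causal effects) concern the same counterfactual-evaluation idea the main paper applies: estimating how a new ranking policy would have performed from logs collected under the current one. A minimal sketch of the standard inverse-propensity-scoring (IPS) estimator is shown below; the function names and the log-tuple layout are assumptions for illustration, not the paper's actual implementation.

```python
def ips_estimate(logs, target_policy):
    """Inverse-propensity-scoring estimate of a target policy's value.

    `logs` is a list of (context, action, reward, logging_propensity)
    tuples recorded under the logging policy. `target_policy(context)`
    returns a dict mapping actions to the target policy's probabilities.
    Each logged reward is reweighted by the ratio of target to logging
    probability for the action actually taken.
    """
    total = 0.0
    for context, action, reward, logged_p in logs:
        target_p = target_policy(context).get(action, 0.0)
        total += reward * target_p / logged_p
    return total / len(logs)
```

Unbiasedness requires that the logging policy give every action the target policy might take a nonzero propensity; in practice, clipped or self-normalized variants are often used to control the variance of the importance weights.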
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.