Off-Policy Evaluation and Learning for Matching Markets
- URL: http://arxiv.org/abs/2507.13608v1
- Date: Fri, 18 Jul 2025 02:23:37 GMT
- Title: Off-Policy Evaluation and Learning for Matching Markets
- Authors: Yudai Hayashi, Shuhei Goda, Yuta Saito
- Abstract summary: Off-Policy Evaluation (OPE) plays a crucial role by enabling the evaluation of recommendation policies using only offline logged data. We propose novel OPE estimators, \textit{DiPS} and \textit{DPR}, specifically designed for matching markets. Our methods combine elements of the Direct Method (DM), Inverse Propensity Score (IPS), and Doubly Robust (DR) estimators.
- Score: 15.585641615174623
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Matching users based on mutual preferences is a fundamental aspect of services driven by reciprocal recommendations, such as job search and dating applications. Although A/B tests remain the gold standard for evaluating new policies in recommender systems for matching markets, they are costly and impractical for frequent policy updates. Off-Policy Evaluation (OPE) thus plays a crucial role by enabling the evaluation of recommendation policies using only offline logged data naturally collected on the platform. However, unlike conventional recommendation settings, the large scale and bidirectional nature of user interactions in matching platforms introduce variance issues and exacerbate reward sparsity, making standard OPE methods unreliable. To address these challenges and facilitate effective offline evaluation, we propose novel OPE estimators, \textit{DiPS} and \textit{DPR}, specifically designed for matching markets. Our methods combine elements of the Direct Method (DM), Inverse Propensity Score (IPS), and Doubly Robust (DR) estimators while incorporating intermediate labels, such as initial engagement signals, to achieve better bias-variance control in matching markets. Theoretically, we derive the bias and variance of the proposed estimators and demonstrate their advantages over conventional methods. Furthermore, we show that these estimators can be seamlessly extended to offline policy learning methods for improving recommendation policies to make more matches. We empirically evaluate our methods through experiments on both synthetic data and A/B testing logs from a real job-matching platform. The empirical results highlight the superiority of our approach over existing methods in off-policy evaluation and learning tasks across a variety of configurations.
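The abstract names the three classical ingredients (DM, IPS, DR) and the intermediate-label idea without giving formulas here. As a hedged reference, below is a minimal Python sketch of those baselines plus a hypothetical intermediate-label variant in the spirit of DiPS; all function names, arguments, and the final estimator form are illustrative assumptions, not the paper's exact definitions. Each function takes per-sample arrays of logged rewards (or signals), evaluation-policy propensities `pi_e`, and logging propensities `pi_0`.

```python
import numpy as np

def dm_value(q_hat_pi_e):
    # Direct Method: average a reward model's predictions under the
    # evaluation policy. Low variance, but biased if the model is wrong.
    return float(np.mean(q_hat_pi_e))

def ips_value(reward, pi_e, pi_0):
    # Inverse Propensity Score: reweight logged rewards by pi_e / pi_0.
    # Unbiased, but high-variance when rewards are sparse, as matches are.
    return float(np.mean(reward * pi_e / pi_0))

def dr_value(reward, pi_e, pi_0, q_hat, q_hat_pi_e):
    # Doubly Robust: DM baseline plus an IPS-weighted residual correction.
    w = pi_e / pi_0
    return float(np.mean(q_hat_pi_e + w * (reward - q_hat)))

def intermediate_signal_value(signal, g_hat, pi_e, pi_0):
    # Hypothetical DiPS-flavored estimator (NOT the paper's exact form):
    # importance-weight a denser intermediate label `signal` (e.g., an
    # initial engagement event) and map it to the sparse match reward via a
    # learned model g_hat ~ E[match | engagement, context, action].
    # Trading the sparse match reward for a denser signal is the
    # bias-variance lever the abstract describes.
    w = pi_e / pi_0
    return float(np.mean(w * signal * g_hat))
```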
Related papers
- Revisiting Reciprocal Recommender Systems: Metrics, Formulation, and Method [60.364834418531366]
We propose five new evaluation metrics that comprehensively and accurately assess the performance of RRS.
We formulate RRS from a causal perspective, modeling recommendations as bilateral interventions.
We introduce a reranking strategy to maximize matching outcomes, as measured by the proposed metrics.
arXiv Detail & Related papers (2024-08-19T07:21:02Z) - $\Delta\text{-OPE}$: Off-Policy Estimation with Pairs of Policies [13.528097424046823]
We introduce $\Delta\text{-OPE}$ methods based on the widely used Inverse Propensity Scoring estimator; a minimal pairwise-IPS sketch appears after this list.
Simulated, offline, and online experiments show that our methods significantly improve performance for both evaluation and learning tasks.
arXiv Detail & Related papers (2024-05-16T12:04:55Z) - Optimal Baseline Corrections for Off-Policy Contextual Bandits [61.740094604552475]
We aim to learn decision policies that optimize an unbiased offline estimate of an online reward metric.
We propose a single framework built on the equivalence of common baseline-correction methods in learning scenarios.
Our framework enables us to characterize the variance-optimal unbiased estimator and provide a closed-form solution for it.
arXiv Detail & Related papers (2024-05-09T12:52:22Z) - Off-Policy Evaluation for Large Action Spaces via Policy Convolution [60.6953713877886]
The Policy Convolution (PC) family of estimators uses latent structure within actions to strategically convolve the logging and target policies.
Experiments on synthetic and benchmark datasets demonstrate remarkable mean squared error (MSE) improvements when using PC.
arXiv Detail & Related papers (2023-10-24T01:00:01Z) - An IPW-based Unbiased Ranking Metric in Two-sided Markets [3.845857580909374]
This paper addresses the complex interaction of biases between users in two-sided markets.
We propose a new estimator, named two-sided IPW, to address position biases in two-sided markets; a sketch of the two-sided weighting idea appears after this list.
arXiv Detail & Related papers (2023-07-14T01:44:03Z) - Uncertainty-Aware Instance Reweighting for Off-Policy Learning [63.31923483172859]
We propose an Uncertainty-aware Inverse Propensity Score estimator (UIPS) for improved off-policy learning.
Experimental results on synthetic and three real-world recommendation datasets demonstrate the advantageous sample efficiency of the proposed UIPS estimator.
arXiv Detail & Related papers (2023-03-11T11:42:26Z) - Off-policy evaluation for learning-to-rank via interpolating the item-position model and the position-based model [83.83064559894989]
A critical need for industrial recommender systems is the ability to evaluate recommendation policies offline, before deploying them to production.
We develop a new estimator that mitigates the problems of the two most popular off-policy estimators for rankings.
In particular, the new estimator, called INTERPOL, addresses the bias of a potentially misspecified position-based model.
arXiv Detail & Related papers (2022-10-15T17:22:30Z) - Evaluating the Robustness of Off-Policy Evaluation [10.760026478889664]
Off-policy Evaluation (OPE) evaluates the performance of hypothetical policies leveraging only offline log data.
It is particularly useful in applications where online interaction is high-stakes and expensive.
We develop Interpretable Evaluation for Offline Evaluation (IEOE), an experimental procedure to evaluate OPE estimators' robustness.
arXiv Detail & Related papers (2021-08-31T09:33:13Z) - Supervised Off-Policy Ranking [145.3039527243585]
Off-policy evaluation (OPE) leverages data generated by other policies to evaluate a target policy.
We propose supervised off-policy ranking that learns a policy scoring model by correctly ranking training policies with known performance.
Our method outperforms strong baseline OPE methods in terms of both rank correlation and performance gap between the truly best and the best of the ranked top three policies.
arXiv Detail & Related papers (2021-07-03T07:01:23Z) - Counterfactual Evaluation of Slate Recommendations with Sequential Reward Interactions [18.90946044396516]
Users of music streaming, video streaming, news recommendation, and e-commerce services often engage with content in a sequential manner.
Providing and evaluating good sequences of recommendations is therefore a central problem for these services.
We propose a new counterfactual estimator that allows for sequential interactions in the rewards with lower variance in an asymptotically unbiased manner.
arXiv Detail & Related papers (2020-07-25T17:58:01Z)
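Referenced from the $\Delta\text{-OPE}$ entry above: a minimal sketch of its core idea, pairwise off-policy estimation with IPS, using assumed argument names (per-sample logged rewards, the two target policies' propensities, and the logging propensities).

```python
import numpy as np

def delta_ope_ips(reward, pi_a, pi_b, pi_0):
    # Estimate V(pi_a) - V(pi_b) from a single logged dataset. Both terms
    # share the same rewards and behavior propensities pi_0, so correlated
    # noise cancels in the difference, typically giving lower variance than
    # estimating each policy's value separately and subtracting.
    return float(np.mean(reward * (pi_a - pi_b) / pi_0))
```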
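Referenced from the two-sided IPW entry above: a hedged sketch of the two-sided weighting intuition. The cited paper's exact ranking metric may differ; the sketch only shows why both sides' examination propensities enter the correction.

```python
import numpy as np

def two_sided_ipw_value(outcome, prop_side_a, prop_side_b):
    # In a two-sided market an interaction can only occur if both parties
    # examine the recommendation, so a single-sided position-bias
    # correction is not enough: weight each logged outcome by the inverse
    # of the product of both sides' examination propensities.
    return float(np.mean(outcome / (prop_side_a * prop_side_b)))
```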