Breaking Determinism: Stochastic Modeling for Reliable Off-Policy Evaluation in Ad Auctions
- URL: http://arxiv.org/abs/2512.03354v1
- Date: Wed, 03 Dec 2025 01:37:42 GMT
- Title: Breaking Determinism: Stochastic Modeling for Reliable Off-Policy Evaluation in Ad Auctions
- Authors: Hongseon Yeom, Jaeyoul Shin, Soojin Min, Jeongmin Yoon, Seunghak Yu, Dongyeop Kang,
- Abstract summary: This work contributes the first practical and validated framework for reliable Off-Policy Evaluation (OPE) in deterministic auction environments.<n>We introduce the first principled framework for OPE in deterministic auctions by repurposing the bid landscape model to approximate the propensity score.<n>We validate our approach on the AuctionNet simulation benchmark and against 2-weeks online A/B test from a large-scale industrial platform.
- Score: 16.315158617837646
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Online A/B testing, the gold standard for evaluating new advertising policies, consumes substantial engineering resources and risks significant revenue loss from deploying underperforming variations. This motivates the use of Off-Policy Evaluation (OPE) for rapid, offline assessment. However, applying OPE to ad auctions is fundamentally more challenging than in domains like recommender systems, where stochastic policies are common. In online ad auctions, it is common for the highest-bidding ad to win the impression, resulting in a deterministic, winner-takes-all setting. This results in zero probability of exposure for non-winning ads, rendering standard OPE estimators inapplicable. We introduce the first principled framework for OPE in deterministic auctions by repurposing the bid landscape model to approximate the propensity score. This model allows us to derive robust approximate propensity scores, enabling the use of stable estimators like Self-Normalized Inverse Propensity Scoring (SNIPS) for counterfactual evaluation. We validate our approach on the AuctionNet simulation benchmark and against 2-weeks online A/B test from a large-scale industrial platform. Our method shows remarkable alignment with online results, achieving a 92\% Mean Directional Accuracy (MDA) in CTR prediction, significantly outperforming the parametric baseline. MDA is the most critical metric for guiding deployment decisions, as it reflects the ability to correctly predict whether a new model will improve or harm performance. This work contributes the first practical and validated framework for reliable OPE in deterministic auction environments, offering an efficient alternative to costly and risky online experiments.
Related papers
- Towards Anytime-Valid Statistical Watermarking [63.02116925616554]
We develop the first e-value-based watermarking framework, Anchored E-Watermarking, that unifies optimal sampling with anytime-valid inference.<n>Our framework can significantly enhance sample efficiency, reducing the average token budget required for detection by 13-15% relative to state-of-the-art baselines.
arXiv Detail & Related papers (2026-02-19T18:32:26Z) - Profit over Proxies: A Scalable Bayesian Decision Framework for Optimizing Multi-Variant Online Experiments [0.0352925259310339]
Online controlled experiments (A/B tests) are fundamental to data-driven decision-making in the digital economy.<n>"p-value" inflates false positive rates, and an over-reliance on proxy metrics like conversion rates can lead to decisions that inadvertently harm core business profitability.<n>This paper introduces a comprehensive and scalable Bayesian decision framework designed for profit optimization in multi-variant (A/B/n) experiments.
arXiv Detail & Related papers (2025-09-16T02:24:20Z) - Off-Policy Evaluation and Learning for Matching Markets [15.585641615174623]
Off-Policy Evaluation (OPE) plays a crucial role by enabling the evaluation of recommendation policies using only offline logged data.<n>We propose novel OPE estimators, textitDiPS and textitDPR, specifically designed for matching markets.<n>Our methods combine elements of the Direct Method (DM), Inverse Propensity Score (IPS), and Doubly Robust (DR) estimators.
arXiv Detail & Related papers (2025-07-18T02:23:37Z) - PredictaBoard: Benchmarking LLM Score Predictability [50.47497036981544]
Large Language Models (LLMs) often fail unpredictably.<n>This poses a significant challenge to ensuring their safe deployment.<n>We present PredictaBoard, a novel collaborative benchmarking framework.
arXiv Detail & Related papers (2025-02-20T10:52:38Z) - Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy
Evaluation [17.319113169622806]
Off-Policy Evaluation (OPE) aims to assess the effectiveness of counterfactual policies using only offline logged data.
Existing evaluation metrics for OPE estimators primarily focus on the "accuracy" of OPE or that of downstream policy selection.
We develop a new metric, called SharpeRatio@k, which measures the risk-return tradeoff of policy portfolios formed by an OPE estimator.
arXiv Detail & Related papers (2023-11-30T02:56:49Z) - Uncertainty-Aware Instance Reweighting for Off-Policy Learning [63.31923483172859]
We propose a Uncertainty-aware Inverse Propensity Score estimator (UIPS) for improved off-policy learning.
Experiment results on synthetic and three real-world recommendation datasets demonstrate the advantageous sample efficiency of the proposed UIPS estimator.
arXiv Detail & Related papers (2023-03-11T11:42:26Z) - When Demonstrations Meet Generative World Models: A Maximum Likelihood
Framework for Offline Inverse Reinforcement Learning [62.00672284480755]
This paper aims to recover the structure of rewards and environment dynamics that underlie observed actions in a fixed, finite set of demonstrations from an expert agent.
Accurate models of expertise in executing a task has applications in safety-sensitive applications such as clinical decision making and autonomous driving.
arXiv Detail & Related papers (2023-02-15T04:14:20Z) - Off-policy evaluation for learning-to-rank via interpolating the
item-position model and the position-based model [83.83064559894989]
A critical need for industrial recommender systems is the ability to evaluate recommendation policies offline, before deploying them to production.
We develop a new estimator that mitigates the problems of the two most popular off-policy estimators for rankings.
In particular, the new estimator, called INTERPOL, addresses the bias of a potentially misspecified position-based model.
arXiv Detail & Related papers (2022-10-15T17:22:30Z) - Delving into Probabilistic Uncertainty for Unsupervised Domain Adaptive
Person Re-Identification [54.174146346387204]
We propose an approach named probabilistic uncertainty guided progressive label refinery (P$2$LR) for domain adaptive person re-identification.
A quantitative criterion is established to measure the uncertainty of pseudo labels and facilitate the network training.
Our method outperforms the baseline by 6.5% mAP on the Duke2Market task, while surpassing the state-of-the-art method by 2.5% mAP on the Market2MSMT task.
arXiv Detail & Related papers (2021-12-28T07:40:12Z) - Arbitrary Distribution Modeling with Censorship in Real-Time Bidding
Advertising [2.562910030418378]
The purpose of Inventory Pricing is to bid the right prices to online ad opportunities, which is crucial for a Demand-Side Platform (DSP) to win auctions in Real-Time Bidding (RTB)
Most of the previous works made strong assumptions on the distribution form of the winning price, which reduced their accuracy and weakened their ability to make generalizations.
We propose a novel loss function, Neighborhood Likelihood Loss (NLL), collaborating with a proposed framework, Arbitrary Distribution Modeling (ADM) to predict the winning price distribution under censorship.
arXiv Detail & Related papers (2021-10-26T11:40:00Z) - Optimal Bidding Strategy without Exploration in Real-time Bidding [14.035270361462576]
maximizing utility with a budget constraint is the primary goal for advertisers in real-time bidding (RTB) systems.
Previous works ignore the losing auctions to alleviate the difficulty with censored states.
We propose a novel practical framework using the maximum entropy principle to imitate the behavior of the true distribution observed in real-time traffic.
arXiv Detail & Related papers (2020-03-31T20:43:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.