Learning Personalized Ad Impact via Contextual Reinforcement Learning under Delayed Rewards
- URL: http://arxiv.org/abs/2510.20055v1
- Date: Wed, 22 Oct 2025 22:08:36 GMT
- Title: Learning Personalized Ad Impact via Contextual Reinforcement Learning under Delayed Rewards
- Authors: Yuwei Cheng, Zifeng Zhao, Haifeng Xu
- Abstract summary: We model ad bidding as a Contextual Markov Decision Process (CMDP) with delayed Poisson rewards. For efficient estimation, we propose a two-stage maximum likelihood estimator combined with data-splitting strategies. We design a reinforcement learning algorithm to derive efficient personalized bidding strategies.
- Score: 36.029144318322686
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Online advertising platforms use automated auctions to connect advertisers with potential customers, requiring effective bidding strategies to maximize profits. Accurate ad impact estimation requires considering three key factors: delayed and long-term effects, cumulative ad impacts such as reinforcement or fatigue, and customer heterogeneity. However, these effects are often not jointly addressed in previous studies. To capture these factors, we model ad bidding as a Contextual Markov Decision Process (CMDP) with delayed Poisson rewards. For efficient estimation, we propose a two-stage maximum likelihood estimator combined with data-splitting strategies, ensuring controlled estimation error based on the first-stage estimator's (in)accuracy. Building on this, we design a reinforcement learning algorithm to derive efficient personalized bidding strategies. This approach achieves a near-optimal regret bound of $\tilde{O}(dH^2\sqrt{T})$, where $d$ is the contextual dimension, $H$ is the number of rounds, and $T$ is the number of customers. Our theoretical findings are validated by simulation experiments.
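To make the model concrete, the sketch below simulates the delayed Poisson reward structure and recovers the rate parameters in the spirit of the paper's two-stage estimator. It is a minimal illustration, not the authors' implementation: the log-linear rate $\exp(\theta^\top x)$, the geometric delay, the observation window W, and the noisy plug-in first stage are all assumptions made for the example.

    import numpy as np

    rng = np.random.default_rng(0)
    d, T, W = 5, 4000, 3              # context dimension, customers, observation window
    delay_p = 0.5                     # assumed geometric delay parameter
    theta_true = rng.normal(size=d) / np.sqrt(d)

    X = rng.normal(size=(T, d))
    lam = np.exp(X @ theta_true)      # Poisson rate per customer, log-linear in context
    n_conv = rng.poisson(lam)         # latent conversions
    # a conversion is seen only if its Geometric(delay_p) delay lands inside the
    # window, so observed counts are a Poisson thinning: Y ~ Poisson(lam * q)
    q_true = 1 - (1 - delay_p) ** W
    Y = rng.binomial(n_conv, q_true)

    # stage 1 (run on a separate data split in the paper): estimate the delay
    # distribution; a noisy plug-in stands in for its maximum likelihood fit
    q_hat = q_true + rng.normal(scale=0.01)

    # stage 2: Poisson MLE for theta, correcting for the delay thinning
    theta = np.zeros(d)
    for _ in range(3000):
        mu = q_hat * np.exp(X @ theta)            # model mean of observed counts
        theta -= 0.2 / T * X.T @ (mu - Y)         # gradient of the Poisson NLL
    print("estimation error:", np.linalg.norm(theta - theta_true))

Because a conversion is observed only when its delay lands inside the window, the observed counts are a thinned Poisson process, which is why the second stage must correct by the first-stage estimate of q.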
Related papers
- Unbiased Dynamic Pruning for Efficient Group-Based Policy Optimization [60.87651283510059]
Group Relative Policy Optimization (GRPO) effectively scales LLM reasoning but incurs prohibitive computational costs.
We propose Dynamic Pruning Policy Optimization (DPPO), a framework that enables dynamic pruning while preserving unbiased gradient estimation.
To mitigate the data sparsity induced by pruning, we introduce Dense Prompt Packing, a window-based greedy strategy.
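The "unbiased gradient estimation" claim has a standard recipe behind it: keep each sample with some probability and reweight survivors by the inverse of that probability. A generic Horvitz-Thompson sketch of that idea follows; it is not DPPO itself, whose keep probabilities and Dense Prompt Packing are more involved.

    import numpy as np

    def pruned_gradient_estimate(per_sample_grads, keep_probs, rng):
        """Keep sample i with probability keep_probs[i] and reweight survivors
        by 1/keep_probs[i], so the estimator's expectation equals the
        full-batch mean gradient (Horvitz-Thompson weighting)."""
        kept = rng.random(len(keep_probs)) < keep_probs
        weights = kept / keep_probs
        return (weights[:, None] * per_sample_grads).mean(axis=0)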
arXiv Detail & Related papers (2026-03-04T14:48:53Z)
- Breaking Determinism: Stochastic Modeling for Reliable Off-Policy Evaluation in Ad Auctions [16.315158617837646]
This work contributes the first practical and validated framework for reliable Off-Policy Evaluation (OPE) in deterministic auction environments.
We introduce a principled framework for OPE in deterministic auctions by repurposing the bid landscape model to approximate the propensity score.
We validate our approach on the AuctionNet simulation benchmark and against a two-week online A/B test on a large-scale industrial platform.
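Once propensity scores are available, the workhorse OPE estimator is inverse propensity scoring. The sketch below is plain clipped IPS under assumed propensities, not the paper's bid-landscape construction:

    import numpy as np

    def ips_value(rewards, logged_propensities, target_propensities, clip=10.0):
        """Clipped inverse-propensity-scoring estimate of the target policy's
        value from logged data; clipping trades bias for variance."""
        w = np.clip(target_propensities / logged_propensities, 0.0, clip)
        return float(np.mean(w * rewards))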
arXiv Detail & Related papers (2025-12-03T01:37:42Z)
- ZIP-RC: Optimizing Test-Time Compute via Zero-Overhead Joint Reward-Cost Prediction [57.799425838564]
We present ZIP-RC, an adaptive inference method that equips models with zero-overhead inference-time predictions of reward and cost.
ZIP-RC improves accuracy by up to 12% over majority voting at equal or lower average cost.
arXiv Detail & Related papers (2025-12-01T09:44:31Z)
- Causal Inference under Threshold Manipulation: Bayesian Mixture Modeling and Heterogeneous Treatment Effects [0.25782420501870296]
We propose a novel framework for estimating the causal effect under threshold manipulation.
The main idea is to model the observed spending distribution as a mixture of two distributions.
We show posterior contraction of the causal effect under large samples.
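A quick way to see the mixture idea: fit a two-component mixture to spending data that bunches just above a coupon threshold. The sketch below uses scikit-learn's EM-based GaussianMixture as a frequentist stand-in for the paper's Bayesian model, on synthetic data invented for the example:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    # synthetic spends: a broad "natural" component plus a bump just above a threshold
    spending = np.concatenate([rng.normal(40, 10, 800), rng.normal(52, 1.5, 200)])
    gm = GaussianMixture(n_components=2, random_state=0).fit(spending.reshape(-1, 1))
    resp = gm.predict_proba(spending.reshape(-1, 1))  # per-customer responsibilities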
arXiv Detail & Related papers (2025-09-24T06:52:53Z)
- Learning Fair And Effective Points-Based Rewards Programs [4.465134753953128]
Points-based rewards programs have come under scrutiny due to accusations of unfair practices in their implementation.
We study the problem of fairly designing points-based rewards programs, with a focus on two obstacles that put fairness at odds with their effectiveness.
We show that an individually fair rewards program that uses the same redemption threshold for all customers suffers a loss in revenue of at most a factor of $1+\ln 2$.
We propose a learning algorithm that limits the risk of point devaluation due to experimentation by only changing the redemption threshold $O(\log T)$ times.
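The $O(\log T)$ count of threshold changes is the signature of a doubling-epoch schedule: update only at rounds that are powers of two, so $T$ rounds contain at most $\lfloor\log_2 T\rfloor + 1$ updates. A generic sketch of that scheduling device (the paper's actual update rule is not given in the blurb):

    import math

    def run_with_doubling_epochs(T, update_threshold):
        """Change the redemption threshold only at rounds 1, 2, 4, 8, ...,
        giving at most floor(log2(T)) + 1 changes over T rounds."""
        threshold, changes = None, 0
        for t in range(1, T + 1):
            if t & (t - 1) == 0:                  # t is a power of two
                threshold = update_threshold(t, threshold)
                changes += 1
        assert changes <= math.floor(math.log2(T)) + 1
        return threshold, changes

    thr, k = run_with_doubling_epochs(10_000, lambda t, old: 100 + t)  # toy rule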
arXiv Detail & Related papers (2025-06-04T13:05:16Z)
- Online Bidding under RoS Constraints without Knowing the Value [22.193658401789033]
We consider the problem of bidding in online advertising, where an advertiser aims to maximize value while adhering to budget and Return-on-Spend constraints.
We propose a novel Upper Confidence Bound (UCB)-style algorithm that carefully manages this trade-off.
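A generic version of the trade-off: maintain optimistic value estimates per candidate bid and screen out bids whose estimated Return-on-Spend falls below the target. This sketch is a plain UCB illustration under assumed bookkeeping arrays, not the paper's algorithm:

    import numpy as np

    def ucb_bid(value_sums, spend_sums, counts, t, ros_target, bids):
        """Pick the bid with the highest optimistic value among bids whose
        optimistic value still clears the Return-on-Spend target."""
        n = np.maximum(counts, 1)
        optimistic = value_sums / n + np.sqrt(2 * np.log(max(t, 2)) / n)
        feasible = optimistic >= ros_target * (spend_sums / n)
        if not feasible.any():                        # nothing clears RoS yet:
            return bids[int(np.argmax(optimistic))]   # fall back to optimism
        idx = np.flatnonzero(feasible)
        return bids[idx[int(np.argmax(optimistic[idx]))]]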
arXiv Detail & Related papers (2025-03-05T05:25:54Z)
- An Offline Learning Approach to Propagator Models [3.1755820123640612]
We consider an offline learning problem for an agent who first estimates an unknown price impact kernel from a static dataset.
We propose a novel approach for a nonparametric estimation of the propagator from a dataset containing correlated price trajectories, trading signals and metaorders.
We show that a trader who tries to minimise her execution costs with a greedy strategy based purely on the estimated propagator incurs a suboptimal outcome.
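In discrete time, a propagator model prices a trade schedule $u$ via $\text{cost} = \sum_{t}\sum_{s\le t} G(t-s)\,u_s\,u_t$. A minimal illustration with an assumed power-law kernel, useful for checking candidate schedules against an estimated propagator:

    import numpy as np

    def execution_cost(trades, kernel):
        """Cost of a schedule when impact decays through the propagator:
        price_t = sum_{s<=t} kernel[t-s] * trades[s]; cost = price . trades."""
        impact = np.array([sum(kernel[t - s] * trades[s] for s in range(t + 1))
                           for t in range(len(trades))])
        return float(impact @ trades)

    decay = 1.0 / np.sqrt(1.0 + np.arange(10))    # assumed power-law kernel
    print(execution_cost(np.full(10, 0.1), decay))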
arXiv Detail & Related papers (2023-09-06T13:36:43Z)
- Optimizing Credit Limit Adjustments Under Adversarial Goals Using Reinforcement Learning [42.303733194571905]
We seek to find and automate an optimal credit card limit adjustment policy by employing reinforcement learning techniques.
Our research establishes a conceptual structure for applying the reinforcement learning framework to credit limit adjustment.
arXiv Detail & Related papers (2023-06-27T16:10:36Z)
- ASPEST: Bridging the Gap Between Active Learning and Selective Prediction [56.001808843574395]
Selective prediction aims to learn a reliable model that abstains from making predictions when uncertain.
Active learning aims to lower the overall labeling effort, and hence human dependence, by querying the most informative examples.
In this work, we introduce a new learning paradigm, active selective prediction, which aims to query more informative samples from the shifted target domain.
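The paradigm combines two rules that are easy to state in code: abstain when the model's confidence is low, and spend the labeling budget on the least confident abstentions. A generic sketch (thresholds and names are assumptions, not ASPEST's procedure):

    import numpy as np

    def active_selective_step(probs, abstain_thresh=0.7, query_budget=10):
        """probs: model softmax outputs on the shifted target pool. Abstain
        where confidence is low; query the least confident abstentions."""
        conf = probs.max(axis=1)
        abstained = np.flatnonzero(conf < abstain_thresh)
        queries = abstained[np.argsort(conf[abstained])[:query_budget]]
        preds = np.where(conf >= abstain_thresh, probs.argmax(axis=1), -1)
        return preds, queries                     # -1 marks an abstention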
arXiv Detail & Related papers (2023-04-07T23:51:07Z)
- Structured Dynamic Pricing: Optimal Regret in a Global Shrinkage Model [50.06663781566795]
We consider a dynamic model in which both consumers' preferences and price sensitivity vary over time.
We measure the performance of a dynamic pricing policy via regret, which is the expected revenue loss compared to a clairvoyant that knows the sequence of model parameters in advance.
Our regret analysis results not only demonstrate optimality of the proposed policy but also show that for policy planning it is essential to incorporate available structural information.
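In symbols, with posted prices $p_t$, time-varying demand functions $d_t(\cdot)$, and clairvoyant prices $p_t^*$, this reads $\mathrm{Regret}(T) = \mathbb{E}\big[\sum_{t=1}^{T}\big(p_t^*\,d_t(p_t^*) - p_t\,d_t(p_t)\big)\big]$; the notation is assumed here, since the blurb does not fix symbols.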
arXiv Detail & Related papers (2023-03-28T00:23:23Z)
- VFed-SSD: Towards Practical Vertical Federated Advertising [53.08038962443853]
We propose VFed-SSD, a semi-supervised split distillation framework, to alleviate two key limitations of vertical federated advertising.
Specifically, we develop a self-supervised task, Matched Pair Detection (MPD), to exploit the vertically partitioned unlabeled data.
Our framework provides an efficient federation-enhanced solution for real-time display advertising with minimal deploying cost and significant performance lift.
arXiv Detail & Related papers (2022-05-31T17:45:30Z)
- Optimal Bidding Strategy without Exploration in Real-time Bidding [14.035270361462576]
Maximizing utility under a budget constraint is the primary goal for advertisers in real-time bidding (RTB) systems.
Previous works ignore losing auctions to sidestep the difficulty of censored states.
We propose a novel practical framework using the maximum entropy principle to imitate the behavior of the true distribution observed in real-time traffic.
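The principle has a clean special case: among distributions on $[0,\infty)$ with a fixed mean, the exponential maximizes entropy, so matching the sample mean of observed winning prices already yields a maximum-entropy market-price model. A toy stand-in for the paper's framework:

    import numpy as np

    def max_entropy_price_model(winning_prices, seed=0):
        """Among distributions on [0, inf) with a fixed mean, the exponential
        maximizes entropy, so matching the sample mean gives the simplest
        max-entropy model of the market price."""
        scale = float(np.mean(winning_prices))
        rng = np.random.default_rng(seed)
        return lambda n: rng.exponential(scale, n)

    simulate = max_entropy_price_model([1.2, 0.8, 2.5, 1.9])
    fake_traffic = simulate(1000)                 # imitated market-price samples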
arXiv Detail & Related papers (2020-03-31T20:43:28Z)
- Cost-Sensitive Portfolio Selection via Deep Reinforcement Learning [100.73223416589596]
We propose a cost-sensitive portfolio selection method with deep reinforcement learning.
Specifically, a novel two-stream portfolio policy network is devised to extract both price series patterns and asset correlations.
A new cost-sensitive reward function is developed to maximize the accumulated return and constrain both costs via reinforcement learning.
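A common shape for such a reward, shown below as a hedged guess at the general form rather than the paper's exact function: log portfolio growth penalized by proportional transaction costs on turnover.

    import numpy as np

    def cost_sensitive_reward(w_prev, w_new, returns, c_rate=0.0025):
        """Log portfolio growth minus proportional transaction costs on the
        turnover incurred by rebalancing from w_prev to w_new."""
        turnover = float(np.abs(w_new - w_prev).sum())
        gross = float(w_new @ (1.0 + returns))    # gross growth this period
        return float(np.log(gross * (1.0 - c_rate * turnover)))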
arXiv Detail & Related papers (2020-03-06T06:28:17Z)