Generative Bid Shading in Real-Time Bidding Advertising
- URL: http://arxiv.org/abs/2508.06550v1
- Date: Wed, 06 Aug 2025 03:34:49 GMT
- Title: Generative Bid Shading in Real-Time Bidding Advertising
- Authors: Yinqiu Huang, Hao Ma, Wenshuai Chen, Shuli Wang, Yongqiang Zhang, Xue Wei, Yinhua Zhu, Haitao Wang, Xingxing Wang,
- Abstract summary: This paper introduces Generative Bid Shading (GBS), an end-to-end generative model. It uses an autoregressive approach to generate shading ratios by stepwise residuals, capturing complex value dependencies. It has been deployed on the Meituan DSP platform, serving billions of bid requests daily.
- Score: 7.7746704524695485
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Bid shading plays a crucial role in Real-Time Bidding (RTB) by adaptively adjusting the bid to avoid advertisers overspending. Existing mainstream two-stage methods, which first model bid landscapes and then optimize surplus using operations research techniques, are constrained by unimodal assumptions that fail to adapt to non-convex surplus curves and are vulnerable to cascading errors in sequential workflows. Additionally, existing discretization models of continuous values ignore the dependence between discrete intervals, reducing the model's error correction ability, while sample selection bias in bidding scenarios presents further challenges for prediction. To address these issues, this paper introduces Generative Bid Shading (GBS), which comprises two primary components: (1) an end-to-end generative model that utilizes an autoregressive approach to generate shading ratios by stepwise residuals, capturing complex value dependencies without relying on predefined priors; and (2) a reward preference alignment system, which incorporates a channel-aware hierarchical dynamic network (CHNet) as the reward model to extract fine-grained features, along with modules for surplus optimization and exploration utility reward alignment, ultimately optimizing both short-term and long-term surplus using group relative policy optimization (GRPO). Extensive experiments on both offline and online A/B tests validate GBS's effectiveness. Moreover, GBS has been deployed on the Meituan DSP platform, serving billions of bid requests daily.
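As a minimal illustration of the group relative policy optimization (GRPO) step named in the abstract, the sketch below computes group-relative advantages for a batch of sampled candidates. The reward values and function name are our own illustrative choices, not taken from the paper.

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages in the style of GRPO: each sampled
    candidate's reward is normalized by its group's mean and standard
    deviation, so no separate value critic is needed."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero-variance groups
    return [(r - mean) / std for r in rewards]

# Illustrative surplus-style rewards for one group of sampled shading ratios.
print(grpo_advantages([0.8, 1.2, 0.5, 1.5]))
```

In a bid-shading setting, each reward would come from scoring a sampled shading ratio with the reward model, and the normalized advantages would weight the policy update.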
Related papers
- Generative Actor Critic [74.04971271003869]
Generative Actor Critic (GAC) is a novel framework that decouples sequential decision-making by reframing policy evaluation as learning a generative model of the joint distribution over trajectories and returns. Experiments on Gym-MuJoCo and Maze2D benchmarks demonstrate GAC's strong offline performance and significantly enhanced offline-to-online improvement compared to state-of-the-art methods.
arXiv Detail & Related papers (2025-12-25T06:31:11Z) - Model-Based Policy Adaptation for Closed-Loop End-to-End Autonomous Driving [54.46325690390831]
We propose Model-based Policy Adaptation (MPA), a general framework that enhances the robustness and safety of pretrained E2E driving agents during deployment. MPA first generates diverse counterfactual trajectories using a geometry-consistent simulation engine. MPA trains a diffusion-based policy adapter to refine the base policy's predictions and a multi-step Q value model to evaluate long-term outcomes.
arXiv Detail & Related papers (2025-11-26T17:01:41Z) - Beyond Reward Margin: Rethinking and Resolving Likelihood Displacement in Diffusion Models via Video Generation [6.597818816347323]
Direct Preference Optimization (DPO) aims to align generative outputs with human preferences by distinguishing between chosen and rejected samples. A critical limitation of DPO is likelihood displacement, where the probabilities of chosen samples paradoxically decrease during training. We introduce a novel solution named Policy-Guided DPO (PG-DPO), combining Adaptive Rejection Scaling (ARS) and Implicit Preference Regularization (IPR). Experiments show that PG-DPO outperforms existing methods in both quantitative metrics and qualitative evaluations.
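To make the likelihood-displacement limitation concrete, the sketch below implements the standard DPO loss (not PG-DPO); the numeric log-probabilities are hypothetical.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO objective: minimize -log sigmoid of the beta-scaled
    reward margin, where each reward is a log-probability ratio against
    a frozen reference policy."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# The loss depends only on the margin: lowering BOTH log-probabilities
# by the same amount leaves the loss unchanged, which is exactly how the
# chosen sample's likelihood can fall while training still "improves".
print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
print(dpo_loss(-2.0, -4.0, -2.0, -2.0))  # same margin, lower chosen likelihood
```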
arXiv Detail & Related papers (2025-11-24T12:37:49Z) - DUET: Dual Model Co-Training for Entire Space CTR Prediction [34.35929309131385]
DUET (DUal Model Co-Training for Entire Space CTR Prediction) is a set-wise pre-ranking framework that achieves expressive modeling under tight computational budgets. It consistently outperforms state-of-the-art baselines and achieves improvements across multiple core business metrics.
arXiv Detail & Related papers (2025-10-28T12:46:33Z) - Adaptive Divergence Regularized Policy Optimization for Fine-tuning Generative Models [31.470613363668672]
Adaptive Divergence Regularized Policy Optimization (ADRPO) automatically adjusts regularization strength based on advantage estimates. Our implementation with Wasserstein-2 regularization for flow matching generative models achieves remarkable results on text-to-image generation. ADRPO generalizes to KL-regularized fine-tuning of both text-only LLMs and multi-modal reasoning models.
arXiv Detail & Related papers (2025-10-20T19:46:02Z) - When Relevance Meets Novelty: Dual-Stable Periodic Optimization for Exploratory Recommendation [6.663356205396985]
Large language models (LLMs) demonstrate potential with their diverse content generation capabilities. Existing LLM-enhanced dual-model frameworks face two major limitations. First, they overlook long-term preferences driven by group identity, leading to biased interest modeling. Second, they suffer from static optimization flaws, as a one-time alignment process fails to leverage incremental user data for closed-loop optimization.
arXiv Detail & Related papers (2025-08-01T09:10:56Z) - Generalized Linear Bandits: Almost Optimal Regret with One-Pass Update [60.414548453838506]
We study the generalized linear bandit (GLB) problem, a contextual multi-armed bandit framework that extends the classical linear model by incorporating a non-linear link function. GLBs are widely applicable to real-world scenarios, but their non-linear nature introduces significant challenges in achieving both computational and statistical efficiency. We propose a jointly efficient algorithm that attains a nearly optimal regret bound with $\mathcal{O}(1)$ time and space complexities per round.
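As a simplified sketch of what a one-pass, O(1)-per-round GLB update looks like (a plain stochastic gradient step for a logistic link, not the paper's jointly efficient algorithm), consider:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def one_pass_update(theta, x, reward, lr=0.1):
    """One O(d) stochastic gradient step for a logistic-link GLB: the
    parameter is updated once from the current (context, reward)
    observation, with no replay over past rounds."""
    pred = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
    grad = pred - reward  # gradient of the logistic loss w.r.t. the score
    return [t - lr * grad * xi for t, xi in zip(theta, x)]

# Repeatedly observing reward 1 for the same context pushes the
# predicted success probability toward 1.
theta = [0.0, 0.0]
for _ in range(200):
    theta = one_pass_update(theta, [1.0, 0.5], 1.0)
```

The per-round cost is linear in the dimension and constant in the number of past rounds, which is the efficiency property the abstract emphasizes.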
arXiv Detail & Related papers (2025-07-16T02:24:21Z) - Advancing Reliable Test-Time Adaptation of Vision-Language Models under Visual Variations [67.35596444651037]
Vision-language models (VLMs) exhibit remarkable zero-shot capabilities but struggle with distribution shifts in downstream tasks when labeled data is unavailable. We propose a Reliable Test-time Adaptation (ReTA) method that enhances reliability from two perspectives.
arXiv Detail & Related papers (2025-07-13T05:37:33Z) - Calibrated Multi-Preference Optimization for Aligning Diffusion Models [92.90660301195396]
Calibrated Preference Optimization (CaPO) is a novel method to align text-to-image (T2I) diffusion models. CaPO incorporates the general preference from multiple reward models without human annotated data. Experimental results show that CaPO consistently outperforms prior methods.
arXiv Detail & Related papers (2025-02-04T18:59:23Z) - Learning While Repositioning in On-Demand Vehicle Sharing Networks [4.724825031148413]
We consider a network inventory problem motivated by one-way, on-demand vehicle sharing services. We show that a natural Lipschitz-bandit approach achieves a regret guarantee of $\widetilde{O}(T^{\frac{n}{n+1}})$, which suffers from the exponential dependence on $n$. Motivated by these challenges, we propose an Online Gradient Repositioning algorithm that relies solely on censored demand.
arXiv Detail & Related papers (2025-01-31T15:16:02Z) - Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences. To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model. Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
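The combined objective described above can be sketched in a few lines: a DPO-style preference loss plus a supervised NLL term on the chosen response. The specific loss form, weights, and inputs here are illustrative assumptions, not the paper's exact formulation.

```python
import math

def regularized_objective(logp_chosen, logp_rejected, ref_chosen, ref_rejected,
                          beta=0.1, sft_weight=0.5):
    """Preference loss plus an SFT (negative log-likelihood) term on the
    chosen response; the SFT term acts as the implicit regularizer
    against reward overoptimization."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    pref_loss = -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid
    sft_loss = -logp_chosen  # supervised NLL on the preferred sample
    return pref_loss + sft_weight * sft_loss
```

Unlike the pure preference loss, this objective penalizes solutions that widen the margin by driving down the chosen sample's own likelihood.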
arXiv Detail & Related papers (2024-05-26T05:38:50Z) - Conditional Denoising Diffusion for Sequential Recommendation [62.127862728308045]
Two prominent generative models, Generative Adversarial Networks (GANs) and Variational AutoEncoders (VAEs), have known limitations: GANs suffer from unstable optimization, while VAEs are prone to posterior collapse and over-smoothed generations.
We present a conditional denoising diffusion model, which includes a sequence encoder, a cross-attentive denoising decoder, and a step-wise diffuser.
arXiv Detail & Related papers (2023-04-22T15:32:59Z) - Constrained Online Two-stage Stochastic Optimization: Near Optimal Algorithms via Adversarial Learning [1.994307489466967]
We consider an online two-stage optimization with long-term constraints over a finite horizon of $T$ periods.
We develop online algorithms for the online two-stage problem from adversarial learning algorithms.
arXiv Detail & Related papers (2023-02-02T10:33:09Z) - Optimal Bidding Strategy without Exploration in Real-time Bidding [14.035270361462576]
Maximizing utility with a budget constraint is the primary goal for advertisers in real-time bidding (RTB) systems.
Previous works ignore the losing auctions to alleviate the difficulty with censored states.
We propose a novel practical framework using the maximum entropy principle to imitate the behavior of the true distribution observed in real-time traffic.
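As a toy illustration of the maximum entropy principle invoked above (not the paper's framework), the maximum-entropy distribution over discrete values subject to a mean constraint has a Gibbs form $p_i \propto \exp(\lambda v_i)$; the sketch below finds $\lambda$ by bisection. All names and numbers are our own.

```python
import math

def maxent_distribution(values, target_mean, lo=-10.0, hi=10.0, iters=60):
    """Maximum-entropy distribution over discrete values subject to a
    mean constraint: the solution is p_i proportional to exp(lam * v_i),
    with lam found by bisection so that E[v] matches the target."""
    def mean_for(lam):
        weights = [math.exp(lam * v) for v in values]
        total = sum(weights)
        return sum(w * v for w, v in zip(weights, values)) / total

    # mean_for is increasing in lam, so bisection converges.
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if mean_for(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    lam = (lo + hi) / 2.0
    weights = [math.exp(lam * v) for v in values]
    total = sum(weights)
    return [w / total for w in weights]

# Fit a distribution over candidate winning prices whose mean matches
# an observed average winning price of 3.0 (illustrative numbers).
print(maxent_distribution([1.0, 2.0, 3.0, 4.0], 3.0))
```

In an RTB setting, matching observed moments of winning-price traffic this way yields the least-committal distribution consistent with the data, which is the spirit of imitating the true distribution without extra assumptions.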
arXiv Detail & Related papers (2020-03-31T20:43:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.