Offline Reinforcement Learning for Optimizing Production Bidding
Policies
- URL: http://arxiv.org/abs/2310.09426v1
- Date: Fri, 13 Oct 2023 22:14:51 GMT
- Title: Offline Reinforcement Learning for Optimizing Production Bidding
Policies
- Authors: Dmytro Korenkevych, Frank Cheng, Artsiom Balakir, Alex Nikulkov,
Lingnan Gao, Zhihao Cen, Zuobing Xu, Zheqing Zhu
- Abstract summary: We propose a generalizable approach to optimizing bidding policies in production environments.
We use a hybrid agent architecture that combines arbitrary base policies with deep neural networks.
We demonstrate that such an architecture achieves statistically significant performance gains in both simulated and at-scale production bidding environments.
- Score: 1.8689461238197953
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The online advertising market, with its thousands of auctions run per second,
presents a daunting challenge for advertisers who wish to optimize their spend
under a budget constraint. Thus, advertising platforms typically provide
automated agents to their customers, which act on their behalf to bid for
impression opportunities in real time at scale. Because these proxy agents are
owned by the platform but use advertiser funds to operate, there is a strong
practical need to balance reliability and explainability of the agent with
optimizing power. We propose a generalizable approach to optimizing bidding
policies in production environments by learning from real data using offline
reinforcement learning. This approach can be used to optimize any
differentiable base policy (practically, a heuristic policy based on principles
which the advertiser can easily understand), and only requires data generated
by the base policy itself. We use a hybrid agent architecture that combines
arbitrary base policies with deep neural networks, where only the optimized
base policy parameters are eventually deployed, and the neural network part is
discarded after training. We demonstrate that such an architecture achieves
statistically significant performance gains in both simulated and at-scale
production bidding environments. Our approach does not incur additional
infrastructure, safety, or explainability costs, as it directly optimizes
parameters of existing production routines without replacing them with
black-box-style models like neural networks.
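As a rough illustration of the hybrid architecture described in the abstract, the sketch below combines an interpretable, differentiable base policy with a neural-network term that is used only during offline training. The class and parameter names (`HybridBidder`, `pacing`, `residual`) are hypothetical and chosen for illustration; this is a minimal sketch under assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a hybrid bidding agent: an interpretable base policy
# (here a single pacing multiplier) plus a neural-network correction used only
# while learning from logged data. Names and dimensions are illustrative.
import torch
import torch.nn as nn


class HybridBidder(nn.Module):
    def __init__(self, state_dim: int):
        super().__init__()
        # Interpretable base-policy parameter(s); only these would be deployed.
        self.pacing = nn.Parameter(torch.tensor(1.0))
        # Neural-network part, discarded after offline training.
        self.residual = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, 1)
        )

    def base_bid(self, value: torch.Tensor) -> torch.Tensor:
        # Heuristic base policy: bid = pacing multiplier * predicted value.
        return self.pacing * value

    def forward(self, state: torch.Tensor, value: torch.Tensor) -> torch.Tensor:
        # Hybrid bid used during offline training on logged data.
        return self.base_bid(value) + self.residual(state).squeeze(-1)
```

Under this reading, the agent would be trained offline on data generated by the existing base policy itself; after training, only the optimized base-policy parameters (here, `pacing`) are exported to production, so the serving path keeps its existing, explainable form while the neural-network component is thrown away.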
Related papers
- Bayesian Design Principles for Offline-to-Online Reinforcement Learning [50.97583504192167]
Offline-to-online fine-tuning is crucial for real-world applications where exploration can be costly or unsafe.
In this paper, we tackle the dilemma of offline-to-online fine-tuning: if the agent remains pessimistic, it may fail to learn a better policy, while if it becomes optimistic directly, performance may suffer from a sudden drop.
We show that Bayesian design principles are crucial in solving such a dilemma.
arXiv Detail & Related papers (2024-05-31T16:31:07Z) - Self-Augmented Preference Optimization: Off-Policy Paradigms for Language Model Alignment [104.18002641195442]
We introduce Self-Augmented Preference Optimization (SAPO), an effective and scalable training paradigm that does not require existing paired data.
Building on the self-play concept, which autonomously generates negative responses, we further incorporate an off-policy learning pipeline to enhance data exploration and exploitation.
arXiv Detail & Related papers (2024-05-31T14:21:04Z) - Maximizing the Success Probability of Policy Allocations in Online
Systems [5.485872703839928]
In this paper we consider the problem at the level of user timelines instead of individual bid requests.
In order to optimally allocate policies to users, typical multiple-treatment allocation methods solve knapsack-like problems.
We introduce the SuccessProMax algorithm that aims at finding the policy allocation which is the most likely to outperform a fixed policy.
arXiv Detail & Related papers (2023-12-26T10:55:33Z) - Insurance pricing on price comparison websites via reinforcement
learning [7.023335262537794]
This paper introduces a reinforcement learning framework that learns an optimal pricing policy by integrating model-based and model-free methods.
The paper also highlights the importance of evaluating pricing policies using an offline dataset in a consistent fashion.
arXiv Detail & Related papers (2023-08-14T04:44:56Z) - Supported Policy Optimization for Offline Reinforcement Learning [74.1011309005488]
Policy constraint methods for offline reinforcement learning (RL) typically utilize parameterization or regularization.
Regularization methods reduce the divergence between the learned policy and the behavior policy.
This paper presents Supported Policy OpTimization (SPOT), which is directly derived from the theoretical formalization of the density-based support constraint.
arXiv Detail & Related papers (2022-02-13T07:38:36Z) - Model-Free Learning of Optimal Deterministic Resource Allocations in
Wireless Systems via Action-Space Exploration [4.721069729610892]
We propose a technically grounded and scalable deterministic-dual gradient policy method for efficiently learning optimal parameterized resource allocation policies.
Our method not only efficiently exploits gradient availability of popular universal representations such as deep networks, but is also truly model-free, as it relies on consistent zeroth-order gradient approximations of associated random network services constructed via low-dimensional perturbations in action space.
arXiv Detail & Related papers (2021-08-23T18:26:16Z) - Building a Foundation for Data-Driven, Interpretable, and Robust Policy
Design using the AI Economist [67.08543240320756]
We show that the AI Economist framework enables effective, flexible, and interpretable policy design using two-level reinforcement learning and data-driven simulations.
We find that log-linear policies trained using RL significantly improve social welfare, based on both public health and economic outcomes, compared to past outcomes.
arXiv Detail & Related papers (2021-08-06T01:30:41Z) - OptiDICE: Offline Policy Optimization via Stationary Distribution
Correction Estimation [59.469401906712555]
We present an offline reinforcement learning algorithm that prevents overestimation in a more principled way.
Our algorithm, OptiDICE, directly estimates the stationary distribution corrections of the optimal policy.
We show that OptiDICE performs competitively with the state-of-the-art methods.
arXiv Detail & Related papers (2021-06-21T00:43:30Z) - Deployment-Efficient Reinforcement Learning via Model-Based Offline
Optimization [46.017212565714175]
We propose a novel concept of deployment efficiency, measuring the number of distinct data-collection policies that are used during policy learning.
We propose a novel model-based algorithm, Behavior-Regularized Model-ENsemble (BREMEN) that can effectively optimize a policy offline using 10-20 times fewer data than prior works.
arXiv Detail & Related papers (2020-06-05T19:33:19Z) - Guided Uncertainty-Aware Policy Optimization: Combining Learning and
Model-Based Strategies for Sample-Efficient Policy Learning [75.56839075060819]
Traditional robotic approaches rely on an accurate model of the environment, a detailed description of how to perform the task, and a robust perception system to keep track of the current state.
In contrast, reinforcement learning approaches can operate directly from raw sensory inputs with only a reward signal to describe the task, but are extremely sample-inefficient and brittle.
In this work, we combine the strengths of model-based methods with the flexibility of learning-based methods to obtain a general method that is able to overcome inaccuracies in the robotics perception/actuation pipeline.
arXiv Detail & Related papers (2020-05-21T19:47:05Z) - Optimal Bidding Strategy without Exploration in Real-time Bidding [14.035270361462576]
Maximizing utility with a budget constraint is the primary goal for advertisers in real-time bidding (RTB) systems.
Previous works ignore the losing auctions to alleviate the difficulty with censored states.
We propose a novel practical framework using the maximum entropy principle to imitate the behavior of the true distribution observed in real-time traffic.
arXiv Detail & Related papers (2020-03-31T20:43:28Z)