Sustainable Online Reinforcement Learning for Auto-bidding
- URL: http://arxiv.org/abs/2210.07006v1
- Date: Thu, 13 Oct 2022 13:17:20 GMT
- Title: Sustainable Online Reinforcement Learning for Auto-bidding
- Authors: Zhiyu Mou, Yusen Huo, Rongquan Bai, Mingzhou Xie, Chuan Yu, Jian Xu,
Bo Zheng
- Abstract summary: State-of-the-art auto-bidding policies usually leverage reinforcement learning (RL) algorithms to generate real-time bids on behalf of the advertisers.
Due to safety concerns, it was believed that the RL training process can only be carried out in an offline virtual advertising system (VAS) that is built based on the historical data generated in the RAS.
In this paper, we argue that there exist significant gaps between the VAS and RAS, making the RL training process suffer from the problem of inconsistency between online and offline.
- Score: 10.72140135793476
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, the auto-bidding technique has become an essential tool to increase the
revenue of advertisers. Facing the complex and ever-changing bidding
environments in the real-world advertising system (RAS), state-of-the-art
auto-bidding policies usually leverage reinforcement learning (RL) algorithms
to generate real-time bids on behalf of the advertisers. Due to safety
concerns, it was believed that the RL training process can only be carried out
in an offline virtual advertising system (VAS) that is built based on the
historical data generated in the RAS. In this paper, we argue that there exist
significant gaps between the VAS and RAS, making the RL training process suffer
from the problem of inconsistency between online and offline (IBOO). Firstly,
we formally define the IBOO and systematically analyze its causes and
influences. Then, to avoid the IBOO, we propose a sustainable online RL (SORL)
framework that trains the auto-bidding policy by directly interacting with the
RAS, instead of learning in the VAS. Specifically, based on our proof of the
Lipschitz smooth property of the Q function, we design a safe and efficient
online exploration (SER) policy for continuously collecting data from the RAS.
Meanwhile, we derive the theoretical lower bound on the safety of the SER
policy. We also develop a variance-suppressed conservative Q-learning (V-CQL)
method to effectively and stably learn the auto-bidding policy with the
collected data. Finally, extensive simulated and real-world experiments
validate the superiority of our approach over the state-of-the-art auto-bidding
algorithm.
Related papers
- OffRIPP: Offline RL-based Informative Path Planning [12.705099730591671]
IPP is a crucial task in robotics, where agents must design paths to gather valuable information about a target environment.
We propose an offline RL-based IPP framework that optimizes information gain without requiring real-time interaction during training.
We validate the framework through extensive simulations and real-world experiments.
arXiv Detail & Related papers (2024-09-25T11:30:59Z)
- Trajectory-wise Iterative Reinforcement Learning Framework for Auto-bidding [16.556934508295456]
In online advertising, advertisers participate in ad auctions to acquire ad opportunities, often by utilizing auto-bidding tools provided by demand-side platforms (DSPs).
Due to safety concerns, most RL-based auto-bidding policies are trained in simulation, leading to a performance degradation when deployed in online environments.
We propose Trajectory-wise Exploration and Exploitation (TEE), which introduces a novel data collection and data utilization method for iterative offline RL.
arXiv Detail & Related papers (2024-02-23T05:20:23Z)
- Safety-aware Causal Representation for Trustworthy Offline Reinforcement Learning in Autonomous Driving [33.672722472758636]
Offline Reinforcement Learning (RL) approaches exhibit notable efficacy in addressing sequential decision-making problems from offline datasets.
We introduce the saFety-aware strUctured Scenario representatION (Fusion) to facilitate the learning of a generalizable end-to-end driving policy.
Empirical evidence in various driving scenarios attests that Fusion significantly enhances the safety and generalizability of autonomous driving agents.
arXiv Detail & Related papers (2023-10-31T18:21:24Z)
- Guided Online Distillation: Promoting Safe Reinforcement Learning by Offline Demonstration [75.51109230296568]
We argue that extracting an expert policy from offline data to guide online exploration is a promising solution to mitigate the conservativeness issue.
We propose Guided Online Distillation (GOLD), an offline-to-online safe RL framework.
GOLD distills an offline DT policy into a lightweight policy network through guided online safe RL training, which outperforms both the offline DT policy and online safe RL algorithms.
arXiv Detail & Related papers (2023-09-18T00:22:59Z)
- FIRE: A Failure-Adaptive Reinforcement Learning Framework for Edge Computing Migrations [52.85536740465277]
FIRE is a framework that adapts to rare events by training an RL policy in an edge computing digital twin environment.
We propose ImRE, an importance sampling-based Q-learning algorithm, which samples rare events proportionally to their impact on the value function.
We show that FIRE reduces costs compared to vanilla RL and the greedy baseline in the event of failures.
arXiv Detail & Related papers (2022-09-28T19:49:39Z)
- UMBRELLA: Uncertainty-Aware Model-Based Offline Reinforcement Learning Leveraging Planning [1.1339580074756188]
Offline reinforcement learning (RL) provides a framework for learning decision-making from offline data.
Self-driving vehicles (SDVs) learn a policy that can potentially even outperform the behavior in the sub-optimal data set.
This motivates the use of model-based offline RL approaches, which leverage planning.
arXiv Detail & Related papers (2021-11-22T10:37:52Z)
- Curriculum Offline Imitation Learning [72.1015201041391]
Offline reinforcement learning tasks require the agent to learn from a pre-collected dataset with no further interactions with the environment.
We propose Curriculum Offline Imitation Learning (COIL), which utilizes an experience picking strategy for imitating from adaptive neighboring policies with a higher return.
On continuous control benchmarks, we compare COIL against both imitation-based and RL-based methods, showing that it not only avoids just learning a mediocre behavior on mixed datasets but is also even competitive with state-of-the-art offline RL methods.
arXiv Detail & Related papers (2021-11-03T08:02:48Z)
- Uncertainty Weighted Actor-Critic for Offline Reinforcement Learning [63.53407136812255]
Offline Reinforcement Learning promises to learn effective policies from previously-collected, static datasets without the need for exploration.
Existing Q-learning and actor-critic based off-policy RL algorithms fail when bootstrapping from out-of-distribution (OOD) actions or states.
We propose Uncertainty Weighted Actor-Critic (UWAC), an algorithm that detects OOD state-action pairs and down-weights their contribution in the training objectives accordingly.
arXiv Detail & Related papers (2021-05-17T20:16:46Z)
- Critic Regularized Regression [70.8487887738354]
We propose a novel offline RL algorithm to learn policies from data using a form of critic-regularized regression (CRR).
We find that CRR performs surprisingly well and scales to tasks with high-dimensional state and action spaces.
arXiv Detail & Related papers (2020-06-26T17:50:26Z)
- MOPO: Model-based Offline Policy Optimization [183.6449600580806]
Offline reinforcement learning (RL) refers to the problem of learning policies entirely from a large batch of previously collected data.
We show that an existing model-based RL algorithm already produces significant gains in the offline setting.
We propose to modify the existing model-based RL methods by applying them with rewards artificially penalized by the uncertainty of the dynamics.
arXiv Detail & Related papers (2020-05-27T08:46:41Z)
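The MOPO entry directly above describes penalizing model-based rewards by the uncertainty of the learned dynamics. A minimal sketch of that idea, using ensemble disagreement as an assumed uncertainty proxy rather than MOPO's exact penalty:

```python
# Minimal sketch of an uncertainty-penalized reward in the spirit of the
# MOPO entry above; using ensemble disagreement as the uncertainty proxy
# is an assumption for illustration, not the method's exact formulation.
import numpy as np

def penalized_reward(reward_pred, next_state_preds, lam=1.0):
    """reward_pred: scalar reward predicted by the learned model.
    next_state_preds: (n_models, state_dim) next-state predictions from an
    ensemble of dynamics models; larger disagreement means more uncertainty."""
    spread = np.linalg.norm(next_state_preds - next_state_preds.mean(axis=0), axis=-1)
    return reward_pred - lam * float(spread.max())
```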