BCRLSP: An Offline Reinforcement Learning Framework for Sequential
Targeted Promotion
- URL: http://arxiv.org/abs/2207.07790v1
- Date: Sat, 16 Jul 2022 00:10:12 GMT
- Title: BCRLSP: An Offline Reinforcement Learning Framework for Sequential
Targeted Promotion
- Authors: Fanglin Chen, Xiao Liu, Bo Tang, Feiyu Xiong, Serim Hwang, and Guomian
Zhuang
- Abstract summary: We propose the Budget Constrained Reinforcement Learning for Sequential Promotion framework to determine the value of cash bonuses to be sent to users.
We show that BCRLSP achieves a higher long-term customer retention rate and a lower cost than various baselines.
- Score: 8.499811428928071
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We utilize an offline reinforcement learning (RL) model for sequential
targeted promotion in the presence of budget constraints in a real-world
business environment. In our application, the mobile app aims to boost customer
retention by sending cash bonuses to customers while controlling the costs of such
cash bonuses during each time period. To achieve the multi-task goal, we
propose the Budget Constrained Reinforcement Learning for Sequential Promotion
(BCRLSP) framework to determine the value of cash bonuses to be sent to users.
We first learn the target policy and the associated Q-values that maximize
the user retention rate using an RL model. A linear programming (LP) model is
then added to satisfy the constraints of promotion costs. We solve the LP
problem by maximizing the Q-values of actions learned from the RL model given
the budget constraints. During deployment, we combine the offline RL model with
the LP model to generate a robust policy under the budget constraints. Using
both online and offline experiments, we demonstrate the efficacy of our
approach by showing that BCRLSP achieves a higher long-term customer retention
rate and a lower cost than various baselines. Taking advantage of the near
real-time cost control method, the proposed framework can easily adapt to data
with a noisy behavioral policy and/or meet flexible budget constraints.
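The two-stage pipeline described in the abstract (learn Q-values with an offline RL model, then allocate cash bonuses by maximizing those Q-values under a spend cap with a linear program) can be illustrated with a minimal sketch. The bonus levels, the random placeholder Q-values, and the use of scipy's linprog below are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the budget-constrained allocation step, assuming Q-values
# have already been produced by an offline RL model (here: random placeholders).
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)

n_users = 100
bonus_levels = np.array([0.0, 1.0, 2.0, 5.0])  # hypothetical cash bonus amounts
n_actions = len(bonus_levels)
budget = 150.0                                 # total spend allowed this period

# Q(s_i, a): stand-in for the Q-values learned offline.
q_values = rng.uniform(0.0, 1.0, size=(n_users, n_actions))

# Decision variables x[i, a] in [0, 1], flattened row-major (user-major).
# Objective: maximize sum_{i,a} Q[i, a] * x[i, a]; linprog minimizes, so negate.
c = -q_values.ravel()

# Budget constraint: sum_{i,a} cost(a) * x[i, a] <= budget.
A_ub = np.tile(bonus_levels, n_users)[None, :]
b_ub = np.array([budget])

# Each user receives exactly one bonus level: sum_a x[i, a] = 1.
A_eq = np.zeros((n_users, n_users * n_actions))
for i in range(n_users):
    A_eq[i, i * n_actions:(i + 1) * n_actions] = 1.0
b_eq = np.ones(n_users)

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=(0.0, 1.0), method="highs")

x = res.x.reshape(n_users, n_actions)
chosen = x.argmax(axis=1)  # round the LP relaxation to one bonus per user
print("total cost:", bonus_levels[chosen].sum(), "budget cap:", budget)
```

Rounding the relaxation per user, as above, is one simple way to recover a discrete assignment; it can drift slightly from the cap, so any production use would re-check spend against the budget.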
Related papers
- VinePPO: Unlocking RL Potential For LLM Reasoning Through Refined Credit Assignment [66.80143024475635]
We propose VinePPO, a straightforward approach to compute unbiased Monte Carlo-based estimates.
We show that VinePPO consistently outperforms PPO and other RL-free baselines across MATH and GSM8K datasets.
arXiv Detail & Related papers (2024-10-02T15:49:30Z)
- Switching the Loss Reduces the Cost in Batch (Offline) Reinforcement Learning [57.154674117714265]
We show that the number of samples needed to learn a near-optimal policy with FQI-log scales with the accumulated cost of the optimal policy.
We empirically verify that FQI-log uses fewer samples than FQI trained with squared loss on problems where the optimal policy reliably achieves the goal.
arXiv Detail & Related papers (2024-03-08T15:30:58Z)
- Imitate the Good and Avoid the Bad: An Incremental Approach to Safe Reinforcement Learning [11.666700714916065]
Constrained RL is a framework for enforcing safe actions in Reinforcement Learning.
Most recent approaches for solving Constrained RL convert the trajectory-based cost constraint into a surrogate problem.
We present an approach that does not modify the trajectory-based cost constraint and instead imitates "good" trajectories.
arXiv Detail & Related papers (2023-12-16T08:48:46Z)
- COptiDICE: Offline Constrained Reinforcement Learning via Stationary Distribution Correction Estimation [73.17078343706909]
We study the offline constrained reinforcement learning (RL) problem, in which the agent aims to compute a policy that maximizes expected return while satisfying given cost constraints, learning only from a pre-collected dataset.
We present an offline constrained RL algorithm that optimizes the policy in the space of the stationary distribution.
Our algorithm, COptiDICE, directly estimates the stationary distribution corrections of the optimal policy with respect to returns, while constraining the cost upper bound, with the goal of yielding a cost-conservative policy for actual constraint satisfaction.
arXiv Detail & Related papers (2022-04-19T15:55:47Z)
- Value Penalized Q-Learning for Recommender Systems [30.704083806571074]
Scaling reinforcement learning to recommender systems (RS) is promising since maximizing the expected cumulative rewards for RL agents meets the objective of RS.
A key approach to this goal is offline RL, which aims to learn policies from logged data.
We propose Value Penalized Q-learning (VPQ), an uncertainty-based offline RL algorithm.
arXiv Detail & Related papers (2021-10-15T08:08:28Z)
- MUSBO: Model-based Uncertainty Regularized and Sample Efficient Batch Optimization for Deployment Constrained Reinforcement Learning [108.79676336281211]
Continuous deployment of new policies for data collection and online learning is either cost-ineffective or impractical.
We propose a new algorithmic learning framework called Model-based Uncertainty regularized and Sample Efficient Batch Optimization.
Our framework discovers novel and high quality samples for each deployment to enable efficient data collection.
arXiv Detail & Related papers (2021-02-23T01:30:55Z)
- Model-Augmented Q-learning [112.86795579978802]
We propose a MFRL framework that is augmented with the components of model-based RL.
Specifically, we propose to estimate not only the $Q$-values but also both the transition and the reward with a shared network.
We show that the proposed scheme, called Model-augmented $Q$-learning (MQL), obtains a policy-invariant solution which is identical to the solution obtained by learning with the true reward.
arXiv Detail & Related papers (2021-02-07T17:56:50Z)
- MOPO: Model-based Offline Policy Optimization [183.6449600580806]
Offline reinforcement learning (RL) refers to the problem of learning policies entirely from a large batch of previously collected data.
We show that an existing model-based RL algorithm already produces significant gains in the offline setting.
We propose to modify existing model-based RL methods by artificially penalizing rewards with the uncertainty of the dynamics (a minimal sketch follows this entry).
arXiv Detail & Related papers (2020-05-27T08:46:41Z)
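A minimal sketch of the reward-penalty idea summarized in the MOPO entry above: model-generated rewards are reduced in proportion to an uncertainty estimate of the learned dynamics. The ensemble-disagreement measure and the penalty weight below are illustrative assumptions, not the paper's exact estimator.

```python
# Sketch: penalize a model-predicted reward by the dynamics model's uncertainty,
# here approximated by the disagreement of an ensemble of next-state predictions.
import numpy as np

def penalized_reward(reward: float, next_state_preds: np.ndarray, lam: float = 1.0) -> float:
    """Return reward - lam * u(s, a), with u(s, a) taken as ensemble disagreement."""
    # The spread of the ensemble's next-state predictions serves as u(s, a):
    # larger disagreement -> larger penalty on the model-generated reward.
    uncertainty = float(np.linalg.norm(next_state_preds.std(axis=0)))
    return reward - lam * uncertainty

# Example: three ensemble members predicting a 2-dimensional next state.
preds = np.array([[0.9, 1.1], [1.0, 1.0], [1.2, 0.8]])
print(penalized_reward(reward=1.0, next_state_preds=preds, lam=0.5))
```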