BCRLSP: An Offline Reinforcement Learning Framework for Sequential
Targeted Promotion
- URL: http://arxiv.org/abs/2207.07790v1
- Date: Sat, 16 Jul 2022 00:10:12 GMT
- Title: BCRLSP: An Offline Reinforcement Learning Framework for Sequential
Targeted Promotion
- Authors: Fanglin Chen, Xiao Liu, Bo Tang, Feiyu Xiong, Serim Hwang, and Guomian
Zhuang
- Abstract summary: We propose the Budget Constrained Reinforcement Learning for Sequential Promotion framework to determine the value of cash bonuses to be sent to users.
We show that BCRLSP achieves a higher long-term customer retention rate and a lower cost than various baselines.
- Score: 8.499811428928071
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We utilize an offline reinforcement learning (RL) model for sequential
targeted promotion in the presence of budget constraints in a real-world
business environment. In our application, the mobile app aims to boost customer
retention by sending cash bonuses to customers while controlling the costs of such
cash bonuses during each time period. To achieve the multi-task goal, we
propose the Budget Constrained Reinforcement Learning for Sequential Promotion
(BCRLSP) framework to determine the value of cash bonuses to be sent to users.
We first learn the target policy and the associated Q-values that maximize
the user retention rate using an RL model. A linear programming (LP) model is
then added to satisfy the constraints of promotion costs. We solve the LP
problem by maximizing the Q-values of actions learned from the RL model given
the budget constraints. During deployment, we combine the offline RL model with
the LP model to generate a robust policy under the budget constraints. Using
both online and offline experiments, we demonstrate the efficacy of our
approach by showing that BCRLSP achieves a higher long-term customer retention
rate and a lower cost than various baselines. Taking advantage of the near
real-time cost control method, the proposed framework can easily adapt to data
with a noisy behavioral policy and/or meet flexible budget constraints.
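The two-stage pipeline described in the abstract (learn Q-values with an offline RL model, then allocate cash bonuses by maximizing those Q-values under a spend cap with a linear program) can be illustrated with a minimal sketch. The bonus levels, the random placeholder Q-values, and the use of scipy's linprog below are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the budget-constrained allocation step, assuming Q-values
# have already been produced by an offline RL model (here: random placeholders).
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)

n_users = 100
bonus_levels = np.array([0.0, 1.0, 2.0, 5.0])  # hypothetical cash bonus amounts
n_actions = len(bonus_levels)
budget = 150.0                                 # total spend allowed this period

# Q(s_i, a): stand-in for the Q-values learned offline.
q_values = rng.uniform(0.0, 1.0, size=(n_users, n_actions))

# Decision variables x[i, a] in [0, 1], flattened row-major (user-major).
# Objective: maximize sum_{i,a} Q[i, a] * x[i, a]; linprog minimizes, so negate.
c = -q_values.ravel()

# Budget constraint: sum_{i,a} cost(a) * x[i, a] <= budget.
A_ub = np.tile(bonus_levels, n_users)[None, :]
b_ub = np.array([budget])

# Each user receives exactly one bonus level: sum_a x[i, a] = 1.
A_eq = np.zeros((n_users, n_users * n_actions))
for i in range(n_users):
    A_eq[i, i * n_actions:(i + 1) * n_actions] = 1.0
b_eq = np.ones(n_users)

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=(0.0, 1.0), method="highs")

x = res.x.reshape(n_users, n_actions)
chosen = x.argmax(axis=1)  # round the LP relaxation to one bonus per user
print("total cost:", bonus_levels[chosen].sum(), "budget cap:", budget)
```

Rounding the relaxation per user, as above, is one simple way to recover a discrete assignment; it can drift slightly from the cap, so any production use would re-check spend against the budget.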
Related papers
- VinePPO: Unlocking RL Potential For LLM Reasoning Through Refined Credit Assignment [66.80143024475635]
We propose VinePPO, a straightforward approach to compute unbiased Monte Carlo-based estimates.
We show that VinePPO consistently outperforms PPO and other RL-free baselines across MATH and GSM8K datasets.
arXiv Detail & Related papers (2024-10-02T15:49:30Z)
- Switching the Loss Reduces the Cost in Batch (Offline) Reinforcement Learning [57.154674117714265]
We show that the number of samples needed to learn a near-optimal policy with FQI-log scales with the accumulated cost of the optimal policy.
We empirically verify that FQI-log uses fewer samples than FQI trained with squared loss on problems where the optimal policy reliably achieves the goal.
arXiv Detail & Related papers (2024-03-08T15:30:58Z)
- Imitate the Good and Avoid the Bad: An Incremental Approach to Safe Reinforcement Learning [11.666700714916065]
Constrained RL is a framework for enforcing safe actions in Reinforcement Learning.
Most recent approaches for solving Constrained RL convert the trajectory-based cost constraint into a surrogate problem.
We present an approach that does not modify the trajectory-based cost constraint and instead imitates "good" trajectories.
arXiv Detail & Related papers (2023-12-16T08:48:46Z)
- COptiDICE: Offline Constrained Reinforcement Learning via Stationary Distribution Correction Estimation [73.17078343706909]
We study the offline constrained reinforcement learning (RL) problem, in which the agent aims to compute a policy that maximizes expected return while satisfying given cost constraints, learning only from a pre-collected dataset.
We present an offline constrained RL algorithm that optimizes the policy in the space of the stationary distribution.
Our algorithm, COptiDICE, directly estimates the stationary distribution corrections of the optimal policy with respect to returns, while constraining the cost upper bound, with the goal of yielding a cost-conservative policy for actual constraint satisfaction.
arXiv Detail & Related papers (2022-04-19T15:55:47Z)
- Value Penalized Q-Learning for Recommender Systems [30.704083806571074]
Scaling reinforcement learning to recommender systems (RS) is promising since maximizing the expected cumulative rewards for RL agents meets the objective of RS.
A key approach to this goal is offline RL, which aims to learn policies from logged data.
We propose Value Penalized Q-learning (VPQ), an uncertainty-based offline RL algorithm.
arXiv Detail & Related papers (2021-10-15T08:08:28Z)
- MUSBO: Model-based Uncertainty Regularized and Sample Efficient Batch Optimization for Deployment Constrained Reinforcement Learning [108.79676336281211]
Continuous deployment of new policies for data collection and online learning is either cost-ineffective or impractical.
We propose a new algorithmic learning framework called Model-based Uncertainty regularized and Sample Efficient Batch Optimization.
Our framework discovers novel and high quality samples for each deployment to enable efficient data collection.
arXiv Detail & Related papers (2021-02-23T01:30:55Z)
- Model-Augmented Q-learning [112.86795579978802]
We propose a MFRL framework that is augmented with the components of model-based RL.
Specifically, we propose to estimate not only the $Q$-values but also both the transition and the reward with a shared network.
We show that the proposed scheme, called Model-augmented $Q$-learning (MQL), obtains a policy-invariant solution which is identical to the solution obtained by learning with the true reward.
arXiv Detail & Related papers (2021-02-07T17:56:50Z)
- MOPO: Model-based Offline Policy Optimization [183.6449600580806]
Offline reinforcement learning (RL) refers to the problem of learning policies entirely from a large batch of previously collected data.
We show that an existing model-based RL algorithm already produces significant gains in the offline setting.
We propose to modify existing model-based RL methods by artificially penalizing rewards with the uncertainty of the dynamics (a minimal sketch follows this entry).
arXiv Detail & Related papers (2020-05-27T08:46:41Z)
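A minimal sketch of the reward-penalty idea summarized in the MOPO entry above: model-generated rewards are reduced in proportion to an uncertainty estimate of the learned dynamics. The ensemble-disagreement measure and the penalty weight below are illustrative assumptions, not the paper's exact estimator.

```python
# Sketch: penalize a model-predicted reward by the dynamics model's uncertainty,
# here approximated by the disagreement of an ensemble of next-state predictions.
import numpy as np

def penalized_reward(reward: float, next_state_preds: np.ndarray, lam: float = 1.0) -> float:
    """Return reward - lam * u(s, a), with u(s, a) taken as ensemble disagreement."""
    # The spread of the ensemble's next-state predictions serves as u(s, a):
    # larger disagreement -> larger penalty on the model-generated reward.
    uncertainty = float(np.linalg.norm(next_state_preds.std(axis=0)))
    return reward - lam * uncertainty

# Example: three ensemble members predicting a 2-dimensional next state.
preds = np.array([[0.9, 1.1], [1.0, 1.0], [1.2, 0.8]])
print(penalized_reward(reward=1.0, next_state_preds=preds, lam=0.5))
```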