Supervised Learning-enhanced Multi-Group Actor Critic for Live Stream Allocation in Feed
- URL: http://arxiv.org/abs/2412.10381v6
- Date: Mon, 26 May 2025 03:49:37 GMT
- Title: Supervised Learning-enhanced Multi-Group Actor Critic for Live Stream Allocation in Feed
- Authors: Jingxin Liu, Xiang Gao, Yisha Li, Xin Li, Haiyang Lu, Ben Wang
- Abstract summary: We propose a novel Supervised Learning-enhanced Multi-Group Actor Critic algorithm (SL-MGAC). We introduce a supervised learning-enhanced actor-critic framework that incorporates variance reduction techniques. We also propose a novel reward function to prevent overly greedy live stream allocation.
- Score: 14.545253604335823
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In a mixed short-video and live-stream recommendation scenario, the live stream recommendation system (RS) decides whether to allocate at most one live stream into the video feed for each user request. To maximize long-term user engagement, it is crucial to determine an optimal live stream allocation policy. An inappropriate policy that ignores the long-term negative impact of live stream allocation can significantly harm app usage duration and user retention. Recently, reinforcement learning (RL) has been widely applied in recommendation systems to capture long-term user engagement. However, traditional RL algorithms often suffer from divergence and instability, which restricts their application and deployment in large-scale industrial recommendation systems, especially in the challenging scenario above. To address these challenges, we propose a novel Supervised Learning-enhanced Multi-Group Actor Critic algorithm (SL-MGAC). Specifically, we introduce a supervised learning-enhanced actor-critic framework that incorporates variance reduction techniques, where multi-task reward learning helps restrict bootstrapping error accumulation during critic learning. Additionally, we design a multi-group state decomposition module for both the actor and critic networks to reduce prediction variance and improve model stability. We also propose a novel reward function to prevent overly greedy live stream allocation. Empirically, we evaluate SL-MGAC using offline policy evaluation (OPE) and online A/B testing. Experimental results demonstrate that the proposed method not only outperforms baseline methods under platform-level constraints but also exhibits enhanced stability in online recommendation scenarios.
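The sketch below illustrates, in PyTorch, two of the mechanisms the abstract names: a critic with an auxiliary supervised reward head (the multi-task term that restricts bootstrapping error accumulation) and per-group sub-networks standing in for the multi-group state decomposition. It is a hypothetical reading of the abstract, not the authors' implementation; all module names, dimensions, and the batch layout are assumptions.

```python
import torch
import torch.nn as nn

class MultiGroupCritic(nn.Module):
    """Critic with per-group Q sub-networks and a supervised reward head."""
    def __init__(self, state_dim, n_actions, n_groups):
        super().__init__()
        # one small sub-network per user group, to reduce prediction variance
        self.q_heads = nn.ModuleList(
            [nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                           nn.Linear(64, n_actions)) for _ in range(n_groups)])
        # auxiliary head that predicts the immediate reward (multi-task term)
        self.reward_head = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                                         nn.Linear(64, n_actions))

    def forward(self, state, group_id):
        q_all = torch.stack([h(state) for h in self.q_heads], dim=1)
        idx = group_id.view(-1, 1, 1).expand(-1, 1, q_all.size(-1))
        q = q_all.gather(1, idx).squeeze(1)   # Q-values for each user's group
        return q, self.reward_head(state)

def critic_loss(critic, batch, gamma=0.99):
    # batch["action"] is a (B, 1) index tensor; target network omitted
    q, r_hat = critic(batch["state"], batch["group"])
    q_a = q.gather(1, batch["action"]).squeeze(1)
    with torch.no_grad():
        q_next, _ = critic(batch["next_state"], batch["group"])
        target = batch["reward"] + gamma * q_next.max(1).values
    td_loss = nn.functional.mse_loss(q_a, target)
    # supervised multi-task term: anchor the critic to observed rewards,
    # restricting the accumulation of bootstrapping error
    r_a = r_hat.gather(1, batch["action"]).squeeze(1)
    return td_loss + nn.functional.mse_loss(r_a, batch["reward"])
```

An actor network and a target critic, which a production system would add, are left out to keep the sketch short.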
Related papers
- Towards Sample-Efficient and Stable Reinforcement Learning for LLM-based Recommendation [56.92367609590823]
Long Chain-of-Thought (Long CoT) reasoning has shown promise in Large Language Models (LLMs). We argue that Long CoT is inherently ill-suited for the sequential recommendation domain. We propose RISER, a novel Reinforced Item Space Exploration framework for Recommendation.
arXiv Detail & Related papers (2026-01-31T10:02:43Z) - Actor-Critic without Actor [4.94481688445056]
We introduce Actor-Critic without Actor (ACA), a lightweight framework that eliminates the explicit actor network and instead generates actions directly from the field of a noise-level critic. ACA achieves more favorable learning curves and competitive performance compared to both standard actor-critic and state-of-the-art diffusion-based methods.
arXiv Detail & Related papers (2025-09-25T11:33:09Z) - Large Language Model-Enhanced Reinforcement Learning for Diverse and Novel Recommendations [6.949170757786365]
We propose LAAC (LLM-guided Adversarial Actor Critic), a novel method that leverages large language models to suggest novel items. We show that LAAC outperforms existing baselines in diversity, novelty, and accuracy, while remaining robust on imbalanced data.
arXiv Detail & Related papers (2025-07-28T19:00:40Z) - Can Prompt Difficulty be Online Predicted for Accelerating RL Finetuning of Reasoning Models? [62.579951798437115]
This work investigates iterative approximate evaluation for arbitrary prompts. It introduces Model Predictive Prompt Selection (MoPPS), a Bayesian risk-predictive framework. MoPPS reliably predicts prompt difficulty and accelerates training with significantly reduced rollouts.
arXiv Detail & Related papers (2025-07-07T03:20:52Z) - Prior-Guided Diffusion Planning for Offline Reinforcement Learning [4.760537994346813]
Prior Guidance (PG) is a novel guided sampling framework that replaces the standard Gaussian prior of a behavior-cloned diffusion model. PG directly generates high-value trajectories without costly reward optimization of the diffusion model itself. We present an efficient training strategy that applies behavior regularization in latent space, and empirically demonstrate that PG outperforms state-of-the-art diffusion policies and planners across diverse long-horizon offline RL benchmarks.
arXiv Detail & Related papers (2025-05-16T05:39:02Z) - Value Function Decomposition in Markov Recommendation Process [19.082512423102855]
We propose an online reinforcement learning framework to improve recommender performance.
We show that these two factors can be separately approximated by decomposing the original temporal difference loss.
The disentangled learning framework can achieve a more accurate estimation with faster learning and improved robustness against action exploration.
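One plausible reading of this decomposition, sketched under stated assumptions (the summary above does not name the two factors): split the TD target r + γV(s') into an immediate-reward model fit by plain regression and a bootstrapped future-value model, so the low-variance and high-variance parts of the loss are learned separately.

```python
import torch
import torch.nn as nn

state_dim = 8                           # illustrative
reward_net = nn.Linear(state_dim, 1)    # R(s): fit by supervised regression
future_net = nn.Linear(state_dim, 1)    # G(s): fit by bootstrapping

def decomposed_td_loss(s, r, s_next, gamma=0.99):
    # value is modeled as V(s) = R(s) + gamma * G(s)
    r_loss = nn.functional.mse_loss(reward_net(s).squeeze(-1), r)
    with torch.no_grad():
        v_next = (reward_net(s_next) + gamma * future_net(s_next)).squeeze(-1)
    # G(s) bootstraps toward V(s'), per the Bellman identity V(s) = r + gamma*V(s')
    g_loss = nn.functional.mse_loss(future_net(s).squeeze(-1), v_next)
    return r_loss + g_loss              # the two factors are trained separately
```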
arXiv Detail & Related papers (2025-01-29T04:22:29Z) - Large Language Model driven Policy Exploration for Recommender Systems [50.70228564385797]
Offline RL policies trained on static user data are vulnerable to distribution shift when deployed in dynamic online environments.
Online RL-based RS also face challenges in production deployment due to the risks of exposing users to untrained or unstable policies.
Large Language Models (LLMs) offer a promising solution to mimic user objectives and preferences for pre-training policies offline.
We propose an Interaction-Augmented Learned Policy (iALP) that utilizes user preferences distilled from an LLM.
arXiv Detail & Related papers (2025-01-23T16:37:44Z) - Constrained Latent Action Policies for Model-Based Offline Reinforcement Learning [5.012314384895537]
In offline reinforcement learning, a policy is learned using a static dataset in the absence of costly feedback from the environment.
We propose Constrained Latent Action Policies (C-LAP) which learns a generative model of the joint distribution of observations and actions.
arXiv Detail & Related papers (2024-11-07T09:35:22Z) - An Efficient Continuous Control Perspective for Reinforcement-Learning-based Sequential Recommendation [14.506332665769746]
We propose an Efficient Continuous Control framework (ECoC).
Based on a statistically tested assumption, we first propose a novel unified action representation abstracted from normalized user and item spaces.
During this process, strategic exploration and directional control in terms of unified actions are carefully designed and crucial to final recommendation decisions.
arXiv Detail & Related papers (2024-08-15T09:26:26Z) - Meta Clustering of Neural Bandits [45.77505279698894]
We study a new problem, Clustering of Neural Bandits, by extending previous work to arbitrary reward functions.
We propose a novel algorithm called M-CNB, which utilizes a meta-learner to represent and rapidly adapt to dynamic clusters.
In extensive experiments conducted in both recommendation and online classification scenarios, M-CNB outperforms SOTA baselines.
arXiv Detail & Related papers (2024-08-10T16:09:51Z) - StreamBench: Towards Benchmarking Continuous Improvement of Language Agents [63.54557575233165]
Large language model (LLM) agents are able to improve themselves from experience, which is an important ability for continuous enhancement post-deployment.
We introduce StreamBench, a benchmark designed to evaluate the continuous improvement of LLM agents over an input-feedback sequence.
Our work serves as a stepping stone towards developing effective online learning strategies for LLMs, paving the way for more adaptive AI systems in streaming scenarios.
arXiv Detail & Related papers (2024-06-13T02:08:28Z) - Leave No One Behind: Online Self-Supervised Self-Distillation for Sequential Recommendation [20.52842524024608]
Sequential recommendation methods play a pivotal role in modern recommendation systems.
Recent methods leverage contrastive learning to derive self-supervision signals.
We introduce a novel learning paradigm, named Online Self-Supervised Self-distillation for Sequential Recommendation.
arXiv Detail & Related papers (2024-03-22T12:27:21Z) - AdaRec: Adaptive Sequential Recommendation for Reinforcing Long-term
User Engagement [25.18963930580529]
We introduce a novel paradigm called Adaptive Sequential Recommendation (AdaRec) to address this issue.
AdaRec proposes a new distance-based representation loss to extract latent information from users' interaction trajectories.
We conduct extensive empirical analyses in both simulator-based and live sequential recommendation tasks.
arXiv Detail & Related papers (2023-10-06T02:45:21Z) - Generative Slate Recommendation with Reinforcement Learning [49.75985313698214]
Reinforcement learning algorithms can be used to optimize user engagement in recommender systems.
However, RL approaches are intractable in the slate recommendation scenario.
In that setting, an action corresponds to a slate that may contain any combination of items.
In this work we propose to encode slates in a continuous, low-dimensional latent space learned by a variational auto-encoder.
We are able to (i) relax assumptions required by previous work, and (ii) improve the quality of the action selection by modeling full slates.
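A minimal sketch of the latent-slate idea under toy assumptions (slate size, dimensions, and names are illustrative, not from the paper): a VAE compresses a slate of item embeddings into a low-dimensional latent vector, so an RL policy can act in that continuous space and decode its action back into a full slate.

```python
import torch
import torch.nn as nn

SLATE, ITEM_DIM, LATENT = 5, 16, 8      # toy sizes

class SlateVAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Linear(SLATE * ITEM_DIM, 2 * LATENT)   # -> (mu, logvar)
        self.dec = nn.Linear(LATENT, SLATE * ITEM_DIM)

    def forward(self, slate):
        mu, logvar = self.enc(slate.flatten(1)).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        recon = self.dec(z).view(-1, SLATE, ITEM_DIM)
        return recon, mu, logvar

vae = SlateVAE()
slates = torch.randn(32, SLATE, ITEM_DIM)       # a batch of logged slates
recon, mu, logvar = vae(slates)
kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
loss = nn.functional.mse_loss(recon, slates) + 1e-3 * kl
# an RL policy can now output an 8-dim latent action; vae.dec maps it to a slate
```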
arXiv Detail & Related papers (2023-01-20T15:28:09Z) - Actor Prioritized Experience Replay [0.0]
Prioritized Experience Replay (PER) allows agents to learn from transitions sampled with non-uniform probability proportional to their temporal-difference (TD) error.
We introduce a novel experience replay sampling framework for actor-critic methods, which also regards issues with stability and recent findings behind the poor empirical performance of PER.
An extensive set of experiments verifies our theoretical claims and demonstrates that the introduced method significantly outperforms the competing approaches.
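For context, a compact sketch of the standard PER mechanism this paper builds on: transitions are sampled with probability proportional to |TD error|^alpha, with importance-sampling weights correcting the induced bias. The class below is a generic illustration, not the paper's actor-critic variant.

```python
import numpy as np

class PrioritizedReplay:
    def __init__(self, capacity, alpha=0.6):
        self.capacity, self.alpha = capacity, alpha
        self.data, self.prios = [], []

    def add(self, transition, td_error):
        if len(self.data) >= self.capacity:       # drop oldest when full
            self.data.pop(0); self.prios.pop(0)
        self.data.append(transition)
        self.prios.append((abs(td_error) + 1e-6) ** self.alpha)

    def sample(self, batch_size, beta=0.4):
        p = np.asarray(self.prios)
        p = p / p.sum()                           # sampling distribution
        idx = np.random.choice(len(self.data), batch_size, p=p)
        # importance-sampling weights undo the non-uniform sampling bias
        w = (len(self.data) * p[idx]) ** (-beta)
        w = w / w.max()
        return [self.data[i] for i in idx], idx, w
```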
arXiv Detail & Related papers (2022-09-01T15:27:46Z) - Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning [70.20191211010847]
Offline reinforcement learning (RL) aims to learn an optimal policy using a previously collected static dataset.
We introduce Diffusion Q-learning (Diffusion-QL) that utilizes a conditional diffusion model to represent the policy.
We show that our method can achieve state-of-the-art performance on the majority of the D4RL benchmark tasks.
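A heavily simplified sketch of the Diffusion-QL objective as described: the policy is a conditional noise predictor trained with a diffusion behavior-cloning loss plus a Q-maximization term. The one-step action estimate below stands in for the full reverse-process sampling the actual method uses; sizes and the critic(s, a) interface are assumptions.

```python
import torch
import torch.nn as nn

T = 5                                   # few diffusion steps, for illustration
alphas_bar = torch.cumprod(1 - torch.linspace(1e-4, 0.1, T), dim=0)

class NoisePredictor(nn.Module):
    def __init__(self, s_dim, a_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(s_dim + a_dim + 1, 64), nn.ReLU(),
                                 nn.Linear(64, a_dim))
    def forward(self, a_t, s, t):
        # t is a normalized scalar timestep feature, shape (B, 1)
        return self.net(torch.cat([a_t, s, t], dim=-1))

def diffusion_ql_loss(policy, critic, s, a, eta=1.0):
    B, a_dim = a.shape
    t = torch.randint(0, T, (B,))
    ab = alphas_bar[t].unsqueeze(-1)
    noise = torch.randn_like(a)
    a_t = ab.sqrt() * a + (1 - ab).sqrt() * noise           # forward diffusion
    bc = ((policy(a_t, s, (t.float() / T).unsqueeze(-1)) - noise) ** 2).mean()
    # crude one-step action sample (stand-in for full reverse sampling)
    a_T = torch.randn(B, a_dim)
    eps = policy(a_T, s, torch.ones(B, 1))
    a_0 = ((a_T - (1 - alphas_bar[-1]).sqrt() * eps)
           / alphas_bar[-1].sqrt()).clamp(-1, 1)
    # critic(s, a) is assumed to return per-sample Q-values
    return bc - eta * critic(s, a_0).mean()                 # BC + Q maximization
```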
arXiv Detail & Related papers (2022-08-12T09:54:11Z) - Imitating, Fast and Slow: Robust learning from demonstrations via decision-time planning [96.72185761508668]
IMitation with PLANning at Test-time (IMPLANT) is a new meta-algorithm for imitation learning.
We demonstrate that IMPLANT significantly outperforms benchmark imitation learning approaches on standard control environments.
arXiv Detail & Related papers (2022-04-07T17:16:52Z) - Off-policy Reinforcement Learning with Optimistic Exploration and Distribution Correction [73.77593805292194]
We train a separate exploration policy to maximize an approximate upper confidence bound of the critics in an off-policy actor-critic framework.
To mitigate the off-policy-ness, we adapt the recently introduced DICE framework to learn a distribution correction ratio for off-policy actor-critic training.
arXiv Detail & Related papers (2021-10-22T22:07:51Z) - Improving Long-Term Metrics in Recommendation Systems using Short-Horizon Offline RL [56.20835219296896]
We study session-based recommendation scenarios where we want to recommend items to users during sequential interactions to improve their long-term utility.
We develop a new batch RL algorithm called Short Horizon Policy Improvement (SHPI) that approximates policy-induced distribution shifts across sessions.
arXiv Detail & Related papers (2021-06-01T15:58:05Z) - Self-Supervised Reinforcement Learning for Recommender Systems [77.38665506495553]
We propose self-supervised reinforcement learning for sequential recommendation tasks.
Our approach augments standard recommendation models with two output layers: one for self-supervised learning and the other for RL.
Based on such an approach, we propose two frameworks, namely Self-Supervised Q-learning (SQN) and Self-Supervised Actor-Critic (SAC).
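A minimal sketch of the two-head idea behind SQN as summarized: a shared sequence encoder feeds a next-item cross-entropy head (self-supervision) and a Q-learning head trained on engagement rewards. Layer sizes, the GRU encoder, and the loss weighting are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SQNModel(nn.Module):
    def __init__(self, n_items, hidden=64):
        super().__init__()
        self.emb = nn.Embedding(n_items, hidden)
        self.gru = nn.GRU(hidden, hidden, batch_first=True)
        self.ce_head = nn.Linear(hidden, n_items)   # self-supervised head
        self.q_head = nn.Linear(hidden, n_items)    # RL head

    def forward(self, item_seq):
        h, _ = self.gru(self.emb(item_seq))
        h_last = h[:, -1]                            # last hidden state
        return self.ce_head(h_last), self.q_head(h_last)

def sqn_loss(model, seq, next_item, reward, next_seq, gamma=0.9):
    logits, q = model(seq)
    ce = nn.functional.cross_entropy(logits, next_item)   # next-item prediction
    with torch.no_grad():
        _, q_next = model(next_seq)
        target = reward + gamma * q_next.max(-1).values
    td = nn.functional.mse_loss(
        q.gather(1, next_item.unsqueeze(1)).squeeze(1), target)
    return ce + td        # joint self-supervised + RL objective
```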
arXiv Detail & Related papers (2020-06-10T11:18:57Z) - Contrastive Learning for Debiased Candidate Generation in Large-Scale Recommender Systems [84.3996727203154]
We show that a popular choice of contrastive loss is equivalent to reducing the exposure bias via inverse propensity weighting.
We further improve upon CLRec and propose Multi-CLRec, for accurate multi-intention aware bias reduction.
Our methods have been successfully deployed in Taobao, where at least four months of online A/B tests and offline analyses demonstrate their substantial improvements.
arXiv Detail & Related papers (2020-05-20T08:15:23Z) - Online Meta-Critic Learning for Off-Policy Actor-Critic Methods [107.98781730288897]
Off-Policy Actor-Critic (Off-PAC) methods have proven successful in a variety of continuous control tasks.
We introduce a novel and flexible meta-critic that observes the learning process and meta-learns an additional loss for the actor.
arXiv Detail & Related papers (2020-03-11T14:39:49Z)