Related papers: Reward Balancing Revisited: Enhancing Offline Reinforcement Learning for Recommender Systems

URL: http://arxiv.org/abs/2506.22112v2
Date: Mon, 30 Jun 2025 06:57:33 GMT
Title: Reward Balancing Revisited: Enhancing Offline Reinforcement Learning for Recommender Systems
Authors: Wenzheng Shu, Yanxiang Zeng, Yongxiang Tang, Teng Sha, Ning Luo, Yanhua Cheng, Xialong Liu, Fan Zhou, Peng Jiang
Abstract summary: We present an innovative offline RL framework termed Reallocated Reward for Recommender Systems (R3S). By integrating inherent model uncertainty to tackle the intrinsic fluctuations in reward predictions, we boost diversity for decision-making to align with a more interactive paradigm. The experimental results demonstrate that R3S improves the accuracy of world models and efficiently harmonizes the heterogeneous preferences of the users.
Score: 10.995830376373801
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Offline reinforcement learning (RL) has emerged as a prevalent and effective methodology for real-world recommender systems, enabling learning policies from historical data and capturing user preferences. In offline RL, reward shaping encounters significant challenges, with past efforts to incorporate prior strategies for uncertainty to improve world models or penalize underexplored state-action pairs. Despite these efforts, a critical gap remains: the simultaneous balancing of intrinsic biases in world models and the diversity of policy recommendations. To address this limitation, we present an innovative offline RL framework termed Reallocated Reward for Recommender Systems (R3S). By integrating inherent model uncertainty to tackle the intrinsic fluctuations in reward predictions, we boost diversity for decision-making to align with a more interactive paradigm, incorporating extra penalizers with decay that deter actions leading to diminished state variety at both local and global scales. The experimental results demonstrate that R3S improves the accuracy of world models and efficiently harmonizes the heterogeneous preferences of the users.

Related papers

Towards Sample-Efficient and Stable Reinforcement Learning for LLM-based Recommendation [56.92367609590823]
Long Chain-of-Thought (Long CoT) reasoning has shown promise in Large Language Models (LLMs). We argue that Long CoT is inherently ill-suited for the sequential recommendation domain. We propose RISER, a novel Reinforced Item Space Exploration framework for Recommendation.
arXiv Detail & Related papers (2026-01-31T10:02:43Z)
Generative Actor Critic [74.04971271003869]
Generative Actor Critic (GAC) is a novel framework that decouples sequential decision-making by reframing policy evaluation as learning a generative model of the joint distribution over trajectories and returns. Experiments on Gym-MuJoCo and Maze2D benchmarks demonstrate GAC's strong offline performance and significantly enhanced offline-to-online improvement compared to state-of-the-art methods.
arXiv Detail & Related papers (2025-12-25T06:31:11Z)
Fine-Tuning Diffusion-Based Recommender Systems via Reinforcement Learning with Reward Function Optimization [21.769717387197943]
Diffusion models offer state-of-the-art performance by modeling the generative process of user-item interactions. We propose ReFiT, a new framework that integrates Reinforcement learning (RL)-based Fine-Tuning into diffusion-based recommender systems.
arXiv Detail & Related papers (2025-11-10T10:38:16Z)
DARLR: Dual-Agent Offline Reinforcement Learning for Recommender Systems with Dynamic Reward [14.323631574821123]
Model-based offline reinforcement learning has emerged as a promising approach for recommender systems. DARLR is proposed to dynamically update world models to enhance recommendation policies. Experiments on four benchmark datasets demonstrate the superior performance of DARLR.
arXiv Detail & Related papers (2025-05-12T06:18:31Z)
Offline Robotic World Model: Learning Robotic Policies without a Physics Simulator [50.191655141020505]
Reinforcement Learning (RL) has demonstrated impressive capabilities in robotic control but remains challenging due to high sample complexity, safety concerns, and the sim-to-real gap. We introduce Offline Robotic World Model (RWM-O), a model-based approach that explicitly estimates uncertainty to improve policy learning without reliance on a physics simulator.
arXiv Detail & Related papers (2025-04-23T12:58:15Z)
Model-Based Offline Reinforcement Learning with Adversarial Data Augmentation [36.9134885948595]
We introduce Model-based Offline Reinforcement learning with AdversariaL data augmentation (MORAL). In MORAL, we replace the fixed-horizon rollout by employing adversarial data augmentation to execute alternating sampling with ensemble models. Experiments on the D4RL benchmark demonstrate that MORAL outperforms other model-based offline RL methods in terms of policy learning and sample efficiency.
arXiv Detail & Related papers (2025-03-26T07:24:34Z)
Efficient and Robust Regularized Federated Recommendation [52.24782464815489]
The recommender system (RSRS) addresses both user preference and privacy concerns. We propose a novel method that incorporates non-uniform gradient descent to improve communication efficiency. RFRecF shows superior robustness compared to diverse baselines.
arXiv Detail & Related papers (2024-11-03T12:10:20Z)
ROLeR: Effective Reward Shaping in Offline Reinforcement Learning for Recommender Systems [14.74207332728742]
Offline reinforcement learning (RL) is an effective tool for real-world recommender systems. This paper proposes a novel model-based method for reward and uncertainty estimation, Reward Shaping in Offline Reinforcement Learning for Recommender Systems (ROLeR).
arXiv Detail & Related papers (2024-07-18T05:07:11Z)
Hybrid Reinforcement Learning for Optimizing Pump Sustainability in Real-World Water Distribution Networks [55.591662978280894]
This article addresses the pump-scheduling optimization problem to enhance real-time control of real-world water distribution networks (WDNs). Our primary objectives are to adhere to physical operational constraints while reducing energy consumption and operational costs. Traditional optimization techniques, such as evolution-based and genetic algorithms, often fall short due to their lack of convergence guarantees.
arXiv Detail & Related papers (2023-10-13T21:26:16Z)
Robust Reinforcement Learning Objectives for Sequential Recommender Systems [7.44049827436013]
We develop recommender systems that incorporate direct user feedback in the form of rewards, enhancing personalization for users. Employing RL algorithms presents challenges, including off-policy training, expansive action spaces, and the scarcity of datasets with sufficient reward signals. We introduce an enhanced methodology aimed at providing a more effective solution to these challenges.
arXiv Detail & Related papers (2023-05-30T08:09:08Z)
When Demonstrations Meet Generative World Models: A Maximum Likelihood Framework for Offline Inverse Reinforcement Learning [62.00672284480755]
This paper aims to recover the structure of rewards and environment dynamics that underlie observed actions in a fixed, finite set of demonstrations from an expert agent. Accurate models of expertise in executing a task have applications in safety-sensitive settings such as clinical decision making and autonomous driving.
arXiv Detail & Related papers (2023-02-15T04:14:20Z)
Reinforcement Learning from Diverse Human Preferences [68.4294547285359]
This paper develops a method for crowd-sourcing preference labels and learning from diverse human preferences. The proposed method is tested on a variety of tasks in DMcontrol and Meta-world. It has shown consistent and significant improvements over existing preference-based RL algorithms when learning from diverse feedback.
arXiv Detail & Related papers (2023-01-27T15:18:54Z)
Offline Reinforcement Learning with Adaptive Behavior Regularization [1.491109220586182]
Offline reinforcement learning (RL) defines a sample-efficient learning paradigm, where a policy is learned from static and previously collected datasets. We propose a novel approach, which we refer to as adaptive behavior regularization (ABR). ABR enables the policy to adaptively adjust its optimization objective between cloning and improving over the policy used to generate the dataset.
arXiv Detail & Related papers (2022-11-15T15:59:11Z)
Choosing the Best of Both Worlds: Diverse and Novel Recommendations through Multi-Objective Reinforcement Learning [68.45370492516531]
We introduce Scalarized Multi-Objective Reinforcement Learning (SMORL) for the Recommender Systems (RS) setting. The SMORL agent augments standard recommendation models with additional RL layers that force it to simultaneously satisfy three principal objectives: accuracy, diversity, and novelty of recommendations. Our experimental results on two real-world datasets reveal a substantial increase in aggregate diversity, a moderate increase in accuracy, and reduced repetitiveness of recommendations, and they demonstrate the importance of reinforcing diversity and novelty as complementary objectives.
arXiv Detail & Related papers (2021-10-28T13:22:45Z)

This list is automatically generated from the titles and abstracts of the papers on this site.