Adversarial Batch Inverse Reinforcement Learning: Learn to Reward from
Imperfect Demonstration for Interactive Recommendation
- URL: http://arxiv.org/abs/2310.19536v1
- Date: Mon, 30 Oct 2023 13:43:20 GMT
- Title: Adversarial Batch Inverse Reinforcement Learning: Learn to Reward from
Imperfect Demonstration for Interactive Recommendation
- Authors: Jialin Liu, Xinyan Su, Zeyu He, Xiangyu Zhao, Jun Li
- Abstract summary: We focus on the problem of learning to reward, which is fundamental to reinforcement learning.
Previous approaches either introduce additional procedures for learning to reward, thereby increasing the complexity of optimization, or assume that user-agent interactions provide perfect demonstrations, which is not feasible in practice.
We propose a novel batch inverse reinforcement learning paradigm that achieves the desired properties.
- Score: 23.048841953423846
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Rewards serve as a measure of user satisfaction and act as a limiting factor
in interactive recommender systems. In this research, we focus on the problem
of learning to reward (LTR), which is fundamental to reinforcement learning.
Previous approaches either introduce additional procedures for learning to
reward, thereby increasing the complexity of optimization, or assume that
user-agent interactions provide perfect demonstrations, which is not feasible
in practice. Ideally, we aim to employ a unified approach that optimizes both
the reward and policy using compositional demonstrations. However, this
requirement presents a challenge since rewards inherently quantify user
feedback on-policy, while recommender agents approximate off-policy future
cumulative valuation. To tackle this challenge, we propose a novel batch
inverse reinforcement learning paradigm that achieves the desired properties.
Our method utilizes discounted stationary distribution correction to combine
LTR and recommender agent evaluation. To fulfill the compositional requirement,
we incorporate the concept of pessimism through conservation. Specifically, we
modify the vanilla correction using Bellman transformation and enforce KL
regularization to constrain consecutive policy updates. We conduct empirical
studies on two real-world datasets representing two levels of compositional
coverage; the results show that the proposed method yields relative
improvements in both effectiveness (2.3%) and efficiency (11.53%).
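To make the ingredients named in the abstract concrete, below is a minimal, generic sketch of how a discounted stationary distribution correction and KL regularization of consecutive policy updates can be combined into a single batch objective. The symbols (reward model r_theta, correction ratio w_pi, coefficient alpha, policies pi_k and pi_{k-1}) are illustrative placeholders; this is not the paper's exact formulation.

```latex
% Generic DICE-style sketch; notation is illustrative, not the paper's.
% d^{\gamma}_{\pi} is the discounted stationary distribution of policy \pi,
% d^{D} the distribution of the logged (batch) data.
\begin{align*}
  w_{\pi}(s,a) &\approx \frac{d^{\gamma}_{\pi}(s,a)}{d^{D}(s,a)}
  \qquad \text{(discounted stationary distribution correction)} \\[6pt]
  \max_{\pi_k} \quad
  & \mathbb{E}_{(s,a) \sim d^{D}}\!\big[\, w_{\pi_k}(s,a)\, r_{\theta}(s,a) \,\big]
    \;-\; \alpha\, \mathbb{E}_{s \sim d^{D}}\!\big[\,
      \mathrm{KL}\big(\pi_k(\cdot \mid s) \,\|\, \pi_{k-1}(\cdot \mid s)\big) \big]
\end{align*}
% In DICE-style estimators the ratio w_{\pi} is itself obtained from a
% Bellman-transformed saddle-point problem rather than computed directly,
% which is where a "Bellman transformation" of the vanilla correction enters.
```

The first term evaluates the learned reward under the target policy's visitation distribution using only batch data, while the KL term constrains consecutive policy updates, acting as the conservative (pessimistic) component.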
Related papers
- Reinforcement Learning From Imperfect Corrective Actions And Proxy Rewards [38.056359612828466]
We propose a novel value-based deep RL algorithm called Iterative learning from Corrective actions and Proxy rewards (ICoPro).
We experimentally validate our proposition on a variety of tasks (Atari games and autonomous driving on a highway).
arXiv Detail & Related papers (2024-10-08T08:04:09Z) - Improving Reinforcement Learning from Human Feedback Using Contrastive Rewards [26.40009657912622]
Reinforcement learning from human feedback (RLHF) is the mainstream paradigm used to align large language models (LLMs) with human preferences.
Yet existing RLHF heavily relies on accurate and informative reward models, which are vulnerable and sensitive to noise from various sources.
In this work, we improve the effectiveness of the reward model by introducing a penalty term on the reward, named contrastive rewards.
arXiv Detail & Related papers (2024-03-12T14:51:57Z) - REBEL: A Regularization-Based Solution for Reward Overoptimization in Robotic Reinforcement Learning from Human Feedback [61.54791065013767]
A misalignment between the reward function and user intentions, values, or social norms can be catastrophic in the real world.
Current methods to mitigate this misalignment work by learning reward functions from human preferences.
We propose a novel concept of reward regularization within the robotic RLHF framework.
arXiv Detail & Related papers (2023-12-22T04:56:37Z) - A State Augmentation based approach to Reinforcement Learning from Human
Preferences [20.13307800821161]
Preference-based Reinforcement Learning attempts to address this issue by utilizing binary feedback on queried trajectory pairs.
We present a state augmentation technique that makes the agent's reward model robust.
arXiv Detail & Related papers (2023-02-17T07:10:50Z) - SURF: Semi-supervised Reward Learning with Data Augmentation for
Feedback-efficient Preference-based Reinforcement Learning [168.89470249446023]
We present SURF, a semi-supervised reward learning framework that utilizes a large amount of unlabeled samples with data augmentation.
In order to leverage unlabeled samples for reward learning, we infer pseudo-labels of the unlabeled samples based on the confidence of the preference predictor (see the sketch after this list).
Our experiments demonstrate that our approach significantly improves the feedback-efficiency of the preference-based method on a variety of locomotion and robotic manipulation tasks.
arXiv Detail & Related papers (2022-03-18T16:50:38Z) - Off-policy Reinforcement Learning with Optimistic Exploration and
Distribution Correction [73.77593805292194]
We train a separate exploration policy to maximize an approximate upper confidence bound of the critics in an off-policy actor-critic framework.
To mitigate the off-policy-ness, we adapt the recently introduced DICE framework to learn a distribution correction ratio for off-policy actor-critic training.
arXiv Detail & Related papers (2021-10-22T22:07:51Z) - PEBBLE: Feedback-Efficient Interactive Reinforcement Learning via
Relabeling Experience and Unsupervised Pre-training [94.87393610927812]
We present an off-policy, interactive reinforcement learning algorithm that capitalizes on the strengths of both feedback and off-policy learning.
We demonstrate that our approach is capable of learning tasks of higher complexity than previously considered by human-in-the-loop methods.
arXiv Detail & Related papers (2021-06-09T14:10:50Z) - Offline Meta-level Model-based Reinforcement Learning Approach for
Cold-Start Recommendation [27.17948754183511]
Reinforcement learning has shown great promise in optimizing long-term user interest in recommender systems.
Existing RL-based recommendation methods need a large number of interactions for each user to learn a robust recommendation policy.
We propose a meta-level model-based reinforcement learning approach for fast user adaptation.
arXiv Detail & Related papers (2020-12-04T08:58:35Z) - Generative Inverse Deep Reinforcement Learning for Online Recommendation [62.09946317831129]
We propose a novel inverse reinforcement learning approach, namely InvRec, for online recommendation.
InvRec automatically extracts the reward function from users' behaviors for online recommendation.
arXiv Detail & Related papers (2020-11-04T12:12:25Z) - Self-Supervised Reinforcement Learning for Recommender Systems [77.38665506495553]
We propose self-supervised reinforcement learning for sequential recommendation tasks.
Our approach augments standard recommendation models with two output layers: one for self-supervised learning and the other for RL (see the sketch after this list).
Based on such an approach, we propose two frameworks, namely Self-Supervised Q-learning (SQN) and Self-Supervised Actor-Critic (SAC).
arXiv Detail & Related papers (2020-06-10T11:18:57Z)
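As referenced in the SURF entry above, here is a minimal Python sketch of confidence-based pseudo-labeling for preference learning. The names (preference_predictor, confidence_threshold) and the threshold value are assumptions for illustration, not SURF's exact procedure.

```python
import torch

def pseudo_label_pairs(preference_predictor, unlabeled_pairs, confidence_threshold=0.95):
    """Assign pseudo preference labels to unlabeled segment pairs whose predicted
    preference is confident enough. A generic sketch; the predictor is assumed to
    return a scalar probability tensor per pair."""
    pseudo_labeled = []
    for seg_a, seg_b in unlabeled_pairs:
        with torch.no_grad():
            # Probability that segment A is preferred over segment B.
            p_a = preference_predictor(seg_a, seg_b)
        confidence = max(p_a.item(), 1.0 - p_a.item())
        if confidence >= confidence_threshold:
            # 1.0 -> A preferred, 0.0 -> B preferred; kept only when confident.
            label = 1.0 if p_a.item() >= 0.5 else 0.0
            pseudo_labeled.append((seg_a, seg_b, label))
    return pseudo_labeled
```

The pseudo-labeled pairs can then be mixed with the human-labeled ones when training the reward model, which is how such semi-supervised schemes stretch a small feedback budget.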
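As referenced in the SQN/SAC entry above, the following is a minimal sketch of a sequential recommendation model with two output heads, one for self-supervised next-item prediction and one for Q-values. The GRU encoder and layer sizes are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class TwoHeadRecommender(nn.Module):
    """Sequence encoder with a self-supervised head (next-item logits) and an
    RL head (Q-values). A generic sketch of the two-head idea."""

    def __init__(self, num_items: int, embed_dim: int = 64, hidden_dim: int = 128):
        super().__init__()
        self.item_embedding = nn.Embedding(num_items, embed_dim)
        self.encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.supervised_head = nn.Linear(hidden_dim, num_items)  # next-item logits
        self.q_head = nn.Linear(hidden_dim, num_items)           # Q-value per item

    def forward(self, item_seq: torch.Tensor):
        # item_seq: (batch, seq_len) of item ids representing the user's history.
        emb = self.item_embedding(item_seq)
        _, h = self.encoder(emb)       # h: (1, batch, hidden_dim)
        state = h.squeeze(0)           # (batch, hidden_dim)
        return self.supervised_head(state), self.q_head(state)

# Usage note: the supervised logits feed a cross-entropy loss on the next item,
# while the Q-values feed a Q-learning loss; the two losses are combined.
```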