Retentive Decision Transformer with Adaptive Masking for Reinforcement Learning based Recommendation Systems
- URL: http://arxiv.org/abs/2403.17634v1
- Date: Tue, 26 Mar 2024 12:08:58 GMT
- Title: Retentive Decision Transformer with Adaptive Masking for Reinforcement Learning based Recommendation Systems
- Authors: Siyu Wang, Xiaocong Chen, Lina Yao
- Abstract summary: Reinforcement Learning-based Recommender Systems (RLRS) have shown promise across a spectrum of applications.
Yet, they grapple with challenges, notably in crafting reward functions and harnessing large pre-existing datasets.
Recent advancements in offline RLRS offer a way to address these two challenges.
- Score: 17.750449033873036
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement Learning-based Recommender Systems (RLRS) have shown promise across a spectrum of applications, from e-commerce platforms to streaming services. Yet, they grapple with challenges, notably in crafting reward functions and harnessing large pre-existing datasets within the RL framework. Recent advancements in offline RLRS offer a way to address these two challenges. However, existing methods mainly rely on the transformer architecture, which, as sequence lengths increase, can introduce challenges associated with computational resources and training costs. Additionally, the prevalent methods employ fixed-length input trajectories, restricting their capacity to capture evolving user preferences. In this study, we introduce a new offline RLRS method to deal with the above problems. We reinterpret the RLRS challenge by modeling sequential decision-making as an inference task, leveraging adaptive masking configurations. This adaptive approach selectively masks input tokens, transforming the recommendation task into an inference challenge based on varying token subsets, thereby enhancing the agent's ability to infer across diverse trajectory lengths. Furthermore, we incorporate a multi-scale segmented retention mechanism that facilitates efficient modeling of long sequences, significantly enhancing computational efficiency. Our experimental analysis, conducted on both an online simulator and offline datasets, clearly demonstrates the advantages of our proposed method.
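The abstract's two key mechanisms lend themselves to a compact illustration. Below is a minimal Python sketch, not the paper's implementation: `sample_adaptive_mask` hides a randomly sized subset of trajectory tokens so the model must infer from varying token subsets, and `recurrent_retention` implements the O(T) recurrent form of a single retention head, with one decay rate per head standing in for the multi-scale segmented mechanism. All function names, shapes, and hyperparameters here are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_adaptive_mask(seq_len, rng):
    """Hypothetical adaptive masking: hide a randomly sized subset of
    positions so the model learns to infer from varying token subsets."""
    n_masked = rng.integers(0, seq_len)              # mask size varies per sample
    mask = np.zeros(seq_len, dtype=bool)
    mask[rng.choice(seq_len, size=n_masked, replace=False)] = True
    return mask                                      # True = token is masked out

def recurrent_retention(q, k, v, gamma):
    """O(T) recurrent form of one retention head:
    S_t = gamma * S_{t-1} + k_t^T v_t,   o_t = q_t S_t."""
    T, d = q.shape
    S = np.zeros((d, v.shape[1]))
    out = np.zeros((T, v.shape[1]))
    for t in range(T):
        S = gamma * S + np.outer(k[t], v[t])         # decayed state update
        out[t] = q[t] @ S
    return out

# Toy usage: mask a trajectory, then run two retention "scales" (decay rates).
T, d = 16, 8
q, k, v = (rng.standard_normal((T, d)) for _ in range(3))
mask = sample_adaptive_mask(T, rng)
q, k, v = q * ~mask[:, None], k * ~mask[:, None], v * ~mask[:, None]
for gamma in (0.9, 0.99):
    print(gamma, recurrent_retention(q, k, v, gamma).shape)
```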
Related papers
- Enhancing Multi-Step Reasoning Abilities of Language Models through Direct Q-Function Optimization [50.485788083202124]
Reinforcement Learning (RL) plays a crucial role in aligning large language models with human preferences and improving their ability to perform complex tasks.
We introduce Direct Q-function Optimization (DQO), which formulates the response generation process as a Markov Decision Process (MDP) and utilizes the soft actor-critic (SAC) framework to optimize a Q-function directly parameterized by the language model.
Experimental results on two math problem-solving datasets, GSM8K and MATH, demonstrate that DQO outperforms previous methods, establishing it as a promising offline reinforcement learning approach for aligning language models.
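For intuition, the following is a schematic PyTorch sketch of a token-level soft Bellman loss in the spirit described above, where per-token Q-values are read off the language-model head; the tensor shapes, temperature `alpha`, and discount `gamma` are assumptions, not DQO's published objective.

```python
import torch
import torch.nn.functional as F

def token_soft_q_loss(q_logits, next_q_logits, actions, rewards, done,
                      alpha=0.1, gamma=1.0):
    """Schematic token-level soft Bellman error.
    q_logits, next_q_logits: (B, T, V) per-token Q-values from the LM head;
    actions: (B, T) chosen token ids; rewards, done: (B, T) floats."""
    q_sa = q_logits.gather(-1, actions.unsqueeze(-1)).squeeze(-1)       # Q(s_t, a_t)
    soft_v = alpha * torch.logsumexp(next_q_logits / alpha, dim=-1)     # soft value of s_{t+1}
    target = rewards + gamma * (1.0 - done) * soft_v
    return F.mse_loss(q_sa, target.detach())

B, T, V = 2, 5, 11
loss = token_soft_q_loss(torch.randn(B, T, V), torch.randn(B, T, V),
                         torch.randint(V, (B, T)), torch.zeros(B, T),
                         torch.zeros(B, T))
```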
arXiv Detail & Related papers (2024-10-11T23:29:20Z) - Causal prompting model-based offline reinforcement learning [16.95292725275873]
Model-based offline RL allows agents to fully utilise pre-collected datasets without requiring additional or unethical explorations.
Applying model-based offline RL to online systems presents challenges due to the highly suboptimal (noise-filled) and diverse nature of the datasets these systems generate.
We introduce the Causal Prompting Reinforcement Learning framework, designed for highly suboptimal and resource-constrained online scenarios.
arXiv Detail & Related papers (2024-06-03T07:28:57Z) - Entropy-Regularized Token-Level Policy Optimization for Language Agent Reinforcement [67.1393112206885]
Large Language Models (LLMs) have shown promise as intelligent agents in interactive decision-making tasks.
We introduce Entropy-Regularized Token-level Policy Optimization (ETPO), an entropy-augmented RL method tailored for optimizing LLMs at the token level.
We assess the effectiveness of ETPO within a simulated environment that models data science code generation as a series of multi-step interactive tasks.
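A generic entropy-regularized, token-level policy-gradient loss conveys the flavour of this approach; the sketch below is not ETPO's exact per-token objective, and the shapes and entropy weight `beta` are assumptions.

```python
import torch
import torch.nn.functional as F

def entropy_regularized_token_loss(logits, actions, advantages, beta=0.01):
    """Generic token-level policy gradient with an entropy bonus.
    logits: (B, T, V); actions: (B, T) token ids; advantages: (B, T)."""
    logp = F.log_softmax(logits, dim=-1)
    logp_a = logp.gather(-1, actions.unsqueeze(-1)).squeeze(-1)   # log pi(a_t | s_t)
    entropy = -(logp.exp() * logp).sum(-1)                        # per-token entropy
    return -(advantages * logp_a + beta * entropy).mean()

B, T, V = 2, 5, 11
loss = entropy_regularized_token_loss(torch.randn(B, T, V),
                                      torch.randint(V, (B, T)),
                                      torch.randn(B, T))
```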
arXiv Detail & Related papers (2024-02-09T07:45:26Z) - Robust Reinforcement Learning Objectives for Sequential Recommender Systems [7.44049827436013]
We develop recommender systems that incorporate direct user feedback in the form of rewards, enhancing personalization for users.
However, employing RL algorithms presents challenges, including off-policy training, expansive action spaces, and the scarcity of datasets with sufficient reward signals.
We introduce an enhanced methodology aimed at providing a more effective solution to these challenges.
arXiv Detail & Related papers (2023-05-30T08:09:08Z) - End-to-End Meta-Bayesian Optimisation with Transformer Neural Processes [52.818579746354665]
This paper proposes the first end-to-end differentiable meta-BO framework that generalises neural processes to learn acquisition functions via transformer architectures.
We enable this end-to-end framework with reinforcement learning (RL) to tackle the lack of labelled acquisition data.
arXiv Detail & Related papers (2023-05-25T10:58:46Z) - Efficient Online Reinforcement Learning with Offline Data [78.92501185886569]
We show that we can simply apply existing off-policy methods to leverage offline data when learning online.
We extensively ablate these design choices, demonstrating the key factors that most affect performance.
We see that correct application of these simple recommendations can provide a $\mathbf{2.5\times}$ improvement over existing approaches.
arXiv Detail & Related papers (2023-02-06T17:30:22Z) - Value Penalized Q-Learning for Recommender Systems [30.704083806571074]
Scaling reinforcement learning to recommender systems (RS) is promising since maximizing the expected cumulative rewards for RL agents meets the objective of RS.
A key approach to this goal is offline RL, which aims to learn policies from logged data.
We propose Value Penalized Q-learning (VPQ), an uncertainty-based offline RL algorithm.
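The underlying idea, penalizing bootstrapped values where an ensemble of Q-estimates disagrees, can be sketched as follows; the ensemble size, penalty weight `lam`, and exact penalty form are assumptions rather than VPQ's published formulation.

```python
import torch

def penalized_td_target(rewards, next_q_ensemble, dones, gamma=0.99, lam=1.0):
    """Uncertainty-penalized bootstrap target for offline Q-learning.
    next_q_ensemble: (E, B) next-state Q estimates from E ensemble members."""
    mean_q = next_q_ensemble.mean(dim=0)
    std_q = next_q_ensemble.std(dim=0)          # disagreement as an uncertainty proxy
    return rewards + gamma * (1.0 - dones) * (mean_q - lam * std_q)

E, B = 5, 4
target = penalized_td_target(torch.zeros(B), torch.randn(E, B), torch.zeros(B))
```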
arXiv Detail & Related papers (2021-10-15T08:08:28Z) - Reconfigurable Intelligent Surface Assisted Mobile Edge Computing with Heterogeneous Learning Tasks [53.1636151439562]
Mobile edge computing (MEC) provides a natural platform for AI applications.
We present an infrastructure to perform machine learning tasks at an MEC with the assistance of a reconfigurable intelligent surface (RIS).
Specifically, we minimize the learning error of all participating users by jointly optimizing transmit power of mobile users, beamforming vectors of the base station, and the phase-shift matrix of the RIS.
arXiv Detail & Related papers (2020-12-25T07:08:50Z) - Optimization-driven Machine Learning for Intelligent Reflecting Surfaces Assisted Wireless Networks [82.33619654835348]
Intelligent reflecting surface (IRS) has been employed to reshape wireless channels by controlling the phase shifts of individual scattering elements.
Due to the large number of scattering elements, passive beamforming is typically challenged by high computational complexity.
In this article, we focus on machine learning (ML) approaches for improving performance in IRS-assisted wireless networks.
arXiv Detail & Related papers (2020-08-29T08:39:43Z) - The reinforcement learning-based multi-agent cooperative approach for the adaptive speed regulation on a metallurgical pickling line [0.0]
The proposed approach combines mathematical modeling as a base algorithm and a cooperative Multi-Agent Reinforcement Learning system.
We demonstrate how Deep Q-Learning can be applied to a real-life task in heavy industry, resulting in a significant improvement over previously existing automation systems.
arXiv Detail & Related papers (2020-08-16T15:10:39Z)