Towards Sample-Efficient and Stable Reinforcement Learning for LLM-based Recommendation
- URL: http://arxiv.org/abs/2602.00632v1
- Date: Sat, 31 Jan 2026 10:02:43 GMT
- Title: Towards Sample-Efficient and Stable Reinforcement Learning for LLM-based Recommendation
- Authors: Hongxun Ding, Keqin Bao, Jizhi Zhang, Yi Fang, Wenxin Xu, Fuli Feng, Xiangnan He
- Abstract summary: Long Chain-of-Thought (Long CoT) reasoning has shown promise in Large Language Models (LLMs). We argue that Long CoT is inherently ill-suited for the sequential recommendation domain. We propose RISER, a novel Reinforced Item Space Exploration framework for Recommendation.
- Score: 56.92367609590823
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While Long Chain-of-Thought (Long CoT) reasoning has shown promise in Large Language Models (LLMs), its adoption for enhancing recommendation quality is growing rapidly. In this work, we critically examine this trend and argue that Long CoT is inherently ill-suited for the sequential recommendation domain. We attribute this misalignment to two primary factors: excessive inference latency and the lack of explicit cognitive reasoning patterns in user behavioral data. Driven by these observations, we propose pivoting away from the CoT structure and directly leveraging its underlying mechanism, Reinforcement Learning (RL), to explore the item space. However, applying RL directly faces significant obstacles, notably low sample efficiency (most actions fail to provide learning signals) and training instability. To overcome these limitations, we propose RISER, a novel Reinforced Item Space Exploration framework for Recommendation. RISER is designed to transform non-learnable trajectories into effective pairwise preference data for optimization. Furthermore, it incorporates specific strategies to ensure stability, including the prevention of redundant rollouts and the constraint of token-level update magnitudes. Extensive experiments on three real-world datasets show that RISER significantly outperforms competitive baselines, establishing a robust paradigm for RL-enhanced LLM recommendation. Our code will be available at https://anonymous.4open.science/r/RISER/.
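The abstract names two mechanisms: recycling non-learnable trajectories as pairwise preference data, and constraining token-level update magnitudes. The sketch below is an illustrative reading of those ideas, not the paper's exact construction: the pairing rule, the DPO-style loss, and the PPO-style clip are all assumptions on my part.

```python
import math

def rollouts_to_preference_pairs(rollouts, rewards, target_item):
    """If every rollout in a group earns the same reward (e.g. all sampled
    recommendations miss the target), group-relative advantages vanish and
    the group teaches nothing. Recycle it as (chosen, rejected) pairs
    against the ground-truth item."""
    if len(set(rewards)) > 1:
        return []  # group already carries a learning signal; leave it to RL
    return [(target_item, item) for item in rollouts if item != target_item]

def pairwise_preference_loss(logp_chosen, logp_rejected, beta=0.1):
    """DPO-style log-sigmoid loss on one preference pair (reference model omitted)."""
    margin = beta * (logp_chosen - logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def clip_token_ratio(ratio, eps=0.2):
    """Bound the per-token importance ratio, constraining the token-level
    update magnitude in the style of PPO clipping."""
    return max(1.0 - eps, min(1.0 + eps, ratio))
```

On this reading, a rollout group that would otherwise be discarded still contributes gradient through the preference loss, which is one way to raise sample efficiency without changing the sampler.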
Related papers
- Memory-Based Advantage Shaping for LLM-Guided Reinforcement Learning [18.215893951726166]
In environments with sparse or delayed rewards, reinforcement learning incurs high sample complexity. This limitation has motivated the use of large language models (LLMs) for subgoal discovery and trajectory guidance. We address these challenges by constructing a memory graph that encodes subgoals and trajectories from both LLM guidance and the agent's own successful rollouts.
arXiv Detail & Related papers (2026-02-20T01:44:35Z)
- Learning in Context, Guided by Choice: A Reward-Free Paradigm for Reinforcement Learning with Transformers [55.33468902405567]
We propose a new learning paradigm, In-Context Preference-based Reinforcement Learning (ICPRL), in which both pretraining and deployment rely solely on preference feedback. ICPRL enables strong in-context generalization to unseen tasks, achieving performance comparable to ICRL methods trained with full reward supervision.
arXiv Detail & Related papers (2026-02-09T03:42:16Z)
- Latent-Space Contrastive Reinforcement Learning for Stable and Efficient LLM Reasoning [16.244366307890832]
We propose DeepLatent Reasoning (DLR), a latent-space bidirectional contrastive reinforcement learning framework. This framework shifts the trial-and-error cost from expensive token-level full-sequence generation to the continuous latent manifold. Experiments demonstrate that DLR achieves more stable training convergence, supports longer-horizon reasoning chains, and facilitates the sustainable accumulation of reasoning capabilities.
arXiv Detail & Related papers (2026-01-24T03:18:22Z)
- Beyond Static LLM Policies: Imitation-Enhanced Reinforcement Learning for Recommendation [23.945049006150555]
Large language models (LLMs) have become critical tools for enhancing user engagement by delivering personalized content across diverse digital platforms. Direct deployment of LLMs as primary recommendation policies presents notable challenges, including persistent latency issues. This paper proposes a novel offline reinforcement learning framework that leverages imitation learning from LLM-generated trajectories.
arXiv Detail & Related papers (2025-10-15T07:28:29Z)
- Reinforced Preference Optimization for Recommendation [28.87206911186567]
We propose Reinforced Preference Optimization for Recommendation (ReRe) for generative recommenders. ReRe incorporates constrained beam search to improve sampling efficiency and diversify hard negatives. We show that ReRe consistently outperforms both traditional and LLM-based recommenders in ranking performance.
arXiv Detail & Related papers (2025-10-14T07:04:33Z)
- Shuffle-R1: Efficient RL framework for Multimodal Large Language Models via Data-centric Dynamic Shuffle [65.14124923451077]
Reinforcement learning (RL) has emerged as an effective post-training paradigm for enhancing the reasoning capabilities of multimodal large language models (MLLMs). However, current RL pipelines often suffer from training inefficiencies caused by two underexplored issues: Advantage Collapsing and Rollout Silencing. We propose Shuffle-R1, a simple yet principled framework that improves RL fine-tuning efficiency by dynamically restructuring trajectory sampling and batch composition.
arXiv Detail & Related papers (2025-08-07T17:53:47Z)
- RL-PLUS: Countering Capability Boundary Collapse of LLMs in Reinforcement Learning with Hybrid-policy Optimization [111.1749164063616]
We propose RL-PLUS, a novel hybrid-policy optimization approach for Large Language Models (LLMs). RL-PLUS synergizes internal exploitation with external data to achieve stronger reasoning capabilities and surpass the boundaries of base models. We provide both theoretical analysis and extensive experiments to demonstrate the superiority and generalizability of our approach.
arXiv Detail & Related papers (2025-07-31T23:55:29Z)
- Reinforced Latent Reasoning for LLM-based Recommendation [92.56166822197919]
Large Language Models (LLMs) have demonstrated impressive reasoning capabilities in complex problem-solving tasks. Existing methods typically rely on fine-tuning with explicit chain-of-thought (CoT) data. In this work, we explore an alternative approach that shifts from explicit CoT reasoning to compact, information-dense latent reasoning.
arXiv Detail & Related papers (2025-05-25T11:03:45Z)
- Inverse Reinforcement Learning with Dynamic Reward Scaling for LLM Alignment [51.10604883057508]
We propose DR-IRL (Dynamically adjusting Rewards through Inverse Reinforcement Learning). We first train category-specific reward models via IRL using a balanced safety dataset covering seven harmful categories. We then enhance Group Relative Policy Optimization (GRPO) by introducing rewards scaled by task difficulty: data-level hardness measured by text-encoder cosine similarity, and model-level responsiveness measured by reward gaps.
arXiv Detail & Related papers (2025-03-23T16:40:29Z)
- A Thorough Performance Benchmarking on Lightweight Embedding-based Recommender Systems [67.52782366565658]
State-of-the-art recommender systems (RSs) depend on categorical features, which are encoded by embedding vectors, resulting in excessively large embedding tables. Despite the prosperity of lightweight embedding-based RSs (LERSs), a wide diversity is seen in their evaluation protocols. This study investigates various LERSs' performance, efficiency, and cross-task transferability via a thorough benchmarking process.
arXiv Detail & Related papers (2024-06-25T07:45:00Z)
- Robust Reinforcement Learning Objectives for Sequential Recommender Systems [7.44049827436013]
We develop recommender systems that incorporate direct user feedback in the form of rewards, enhancing personalization for users. However, employing RL algorithms presents challenges, including off-policy training, expansive action spaces, and the scarcity of datasets with sufficient reward signals. We introduce an enhanced methodology aimed at providing a more effective solution to these challenges.
arXiv Detail & Related papers (2023-05-30T08:09:08Z)
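Several entries above revolve around the same failure mode of group-relative advantage estimation: RISER's "non-learnable trajectories", Shuffle-R1's "Advantage Collapsing", and DR-IRL's GRPO variant all respond to groups whose rewards carry no within-group contrast. A minimal sketch of GRPO-style advantages (the normalization below is the commonly used form, not taken from any one of these papers) shows why identical rewards yield zero learning signal:

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize each rollout's reward within its
    group. A group with identical rewards produces all-zero advantages,
    i.e. no gradient for that group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

For example, a group where every sampled recommendation misses the target gets rewards `[0, 0, 0]` and hence advantages `[0, 0, 0]`; the strategies surveyed above differ mainly in how they recover signal from, or avoid sampling, such groups.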