MiniRec: Data-Efficient Reinforcement Learning for LLM-based Recommendation
- URL: http://arxiv.org/abs/2602.04278v1
- Date: Wed, 04 Feb 2026 07:15:49 GMT
- Title: MiniRec: Data-Efficient Reinforcement Learning for LLM-based Recommendation
- Authors: Lin Wang, Yang Zhang, Jingfan Chen, Xiaoyan Zhao, Fengbin Zhu, Qing Li, Tat-Seng Chua
- Abstract summary: MiniRec is a data selection framework tailored for RL-based large language model (LLM) recommendation. It evaluates sample learnability using key RL signals -- rewards -- pruning samples that are too easy (too high reward) or too difficult (consistently low reward).
- Score: 50.417769112326546
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The integration of reinforcement learning (RL) into large language models (LLMs) has opened new opportunities for recommender systems by eliciting reasoning and improving user preference modeling. However, RL-based LLM recommendation faces significant efficiency challenges, making full-data training costly. Existing data selection methods define sample value based on learnability or representativeness, yet their loss- or gradient-driven or dataset coverage-driven criteria often misalign with RL learning dynamics, resulting in suboptimal performance. To address this, we propose MiniRec, a data selection framework tailored for RL-based LLM recommendation. MiniRec evaluates sample learnability using key RL signals -- rewards -- pruning samples that are too easy (too high reward) or too difficult (consistently low reward). It assesses representativeness by aligning sample gradients with the approximated "ideal" global RL optimization trajectory, selecting samples that mainly drive model updates, and it also enforces diversity to reduce redundancy. Combined with a curriculum learning strategy from easy to hard samples, MiniRec significantly reduces training cost while largely preserving performance. Extensive experiments demonstrate MiniRec's effectiveness, highlighting the importance of reward-aligned, trajectory-informed data selection in RL-based LLM recommendation.
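The reward-based pruning and easy-to-hard ordering described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `select_by_reward`, the thresholds `low` and `high`, the use of the mean reward over several rollouts as the learnability signal, and the toy data are all assumptions made for illustration. The gradient-alignment and diversity steps are omitted.

```python
from statistics import mean


def select_by_reward(rollout_rewards, low=0.1, high=0.9):
    """Keep samples whose mean rollout reward lies in [low, high].

    rollout_rewards maps a sample id to a list of rewards in [0, 1]
    collected from several rollouts (hypothetical input format).
    The thresholds are illustrative, not values from the paper.
    """
    kept = []
    for sample_id, rewards in rollout_rewards.items():
        avg = mean(rewards)
        # Drop samples that are too hard (avg < low) or too easy (avg > high).
        if low <= avg <= high:
            kept.append((sample_id, avg))
    # Curriculum ordering: easy samples (higher mean reward) come first.
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [sample_id for sample_id, _ in kept]


# Toy data: four samples with four rollouts each.
rollout_rewards = {
    "u1": [1.0, 1.0, 1.0, 1.0],  # always solved: pruned as too easy
    "u2": [0.0, 0.0, 0.0, 0.0],  # never solved: pruned as too hard
    "u3": [1.0, 0.0, 1.0, 1.0],  # mean 0.75: kept
    "u4": [0.0, 0.0, 1.0, 0.0],  # mean 0.25: kept
}
curriculum = select_by_reward(rollout_rewards)
print(curriculum)  # ['u3', 'u4']
```

In this sketch the kept samples are ordered from easy to hard, so `u3` is scheduled before `u4`. A fuller pipeline would re-rank the survivors by representativeness and diversity before training.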
Related papers
- Learning in Context, Guided by Choice: A Reward-Free Paradigm for Reinforcement Learning with Transformers [55.33468902405567]
We propose a new learning paradigm, In-Context Preference-based Reinforcement Learning (ICPRL), in which both pretraining and deployment rely solely on preference feedback. ICPRL enables strong in-context generalization to unseen tasks, achieving performance comparable to ICRL methods trained with full reward supervision.
arXiv Detail & Related papers (2026-02-09T03:42:16Z)
- Beyond Static LLM Policies: Imitation-Enhanced Reinforcement Learning for Recommendation [23.945049006150555]
Large language models (LLMs) have become critical tools for enhancing user engagement by delivering personalized content across diverse digital platforms. Direct deployment of LLMs as primary recommendation policies presents notable challenges, including persistent latency issues. This paper proposes a novel offline reinforcement learning framework that leverages imitation learning from LLM-generated trajectories.
arXiv Detail & Related papers (2025-10-15T07:28:29Z)
- Sample-efficient LLM Optimization with Reset Replay [13.739451157239756]
We introduce Reset Replay (LoRR), a plugin designed to enhance sample efficiency in any preference-based optimization framework. LoRR incorporates a periodic reset strategy with reusing initial data, which preserves network plasticity. Our experiments demonstrate that LoRR significantly boosts the performance of various preference optimization methods on both mathematical and general reasoning benchmarks.
arXiv Detail & Related papers (2025-08-08T15:56:49Z)
- SPaRFT: Self-Paced Reinforcement Fine-Tuning for Large Language Models [51.74498855100541]
Large language models (LLMs) have shown strong reasoning capabilities when fine-tuned with reinforcement learning (RL). We propose SPaRFT, a self-paced learning framework that enables efficient learning based on the capability of the model being trained.
arXiv Detail & Related papers (2025-08-07T03:50:48Z)
- Evaluating Position Bias in Large Language Model Recommendations [3.430780143519032]
Large Language Models (LLMs) are being increasingly explored as general-purpose tools for recommendation tasks. We show that LLM-based recommendation models suffer from position bias, where the order of candidate items in a prompt can disproportionately influence the recommendations produced by LLMs. We introduce a new prompting strategy to mitigate the position bias of LLM recommendation models called Ranking via Iterative SElection.
arXiv Detail & Related papers (2025-08-04T03:30:26Z)
- Direct Preference Optimization for LLM-Enhanced Recommendation Systems [33.54698201942643]
Large Language Models (LLMs) have exhibited remarkable performance across a wide range of domains. We propose DPO4Rec, a framework that integrates DPO into LLM-enhanced recommendation systems. Extensive experiments show that DPO4Rec significantly improves re-ranking performance over strong baselines.
arXiv Detail & Related papers (2024-10-08T11:42:37Z)
- VinePPO: Refining Credit Assignment in RL Training of LLMs [66.80143024475635]
We propose VinePPO, a straightforward approach that leverages the flexibility of language environments to compute unbiased Monte Carlo-based estimates. Our method consistently outperforms PPO and other baselines across MATH and GSM8K datasets in less wall-clock time.
arXiv Detail & Related papers (2024-10-02T15:49:30Z)
- Fine-Tuning Language Models with Reward Learning on Policy [68.70065254564642]
Reinforcement learning from human feedback (RLHF) has emerged as an effective approach to aligning large language models (LLMs) to human preferences.
Despite its popularity, (fixed) reward models may become inaccurate off-distribution.
We propose reward learning on policy (RLP), an unsupervised framework that refines a reward model using policy samples to keep it on-distribution.
arXiv Detail & Related papers (2024-03-28T10:02:10Z)
- Take the Bull by the Horns: Hard Sample-Reweighted Continual Training Improves LLM Generalization [165.98557106089777]
A key challenge is to enhance the capabilities of large language models (LLMs) amid a looming shortage of high-quality training data.
Our study starts from an empirical strategy for the light continual training of LLMs using their original pre-training data sets.
We then formalize this strategy into a principled framework of Instance-Reweighted Distributionally Robust Optimization.
arXiv Detail & Related papers (2024-02-22T04:10:57Z)
- On Learning to Summarize with Large Language Models as References [101.79795027550959]
Large language models (LLMs) are favored by human annotators over the original reference summaries in commonly used summarization datasets.
We study an LLM-as-reference learning setting for smaller text summarization models to investigate whether their performance can be substantially improved.
arXiv Detail & Related papers (2023-05-23T16:56:04Z)
- Reinforcement Learning to Rank Using Coarse-grained Rewards [17.09775943683446]
Coarse-grained feedback signals are more accessible and affordable. Existing Reinforcement Learning to Rank approaches suffer from high variance and low sample efficiency. We propose new Reinforcement Learning to Rank methods based on widely used RL algorithms for large language models.
arXiv Detail & Related papers (2022-08-16T06:55:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.