Learning in Context, Guided by Choice: A Reward-Free Paradigm for Reinforcement Learning with Transformers
- URL: http://arxiv.org/abs/2602.08244v1
- Date: Mon, 09 Feb 2026 03:42:16 GMT
- Title: Learning in Context, Guided by Choice: A Reward-Free Paradigm for Reinforcement Learning with Transformers
- Authors: Juncheng Dong, Bowen He, Moyang Guo, Ethan X. Fang, Zhuoran Yang, Vahid Tarokh
- Abstract summary: We propose a new learning paradigm, In-Context Preference-based Reinforcement Learning (ICPRL), in which both pretraining and deployment rely solely on preference feedback. ICPRL enables strong in-context generalization to unseen tasks, achieving performance comparable to ICRL methods trained with full reward supervision.
- Score: 55.33468902405567
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In-context reinforcement learning (ICRL) leverages the in-context learning capabilities of transformer models (TMs) to efficiently generalize to unseen sequential decision-making tasks without parameter updates. However, existing ICRL methods rely on explicit reward signals during pretraining, which limits their applicability when rewards are ambiguous, hard to specify, or costly to obtain. To overcome this limitation, we propose a new learning paradigm, In-Context Preference-based Reinforcement Learning (ICPRL), in which both pretraining and deployment rely solely on preference feedback, eliminating the need for reward supervision. We study two variants that differ in the granularity of feedback: Immediate Preference-based RL (I-PRL) with per-step preferences, and Trajectory Preference-based RL (T-PRL) with trajectory-level comparisons. We first show that supervised pretraining, a standard approach in ICRL, remains effective under preference-only context datasets, demonstrating the feasibility of in-context reinforcement learning using only preference signals. To further improve data efficiency, we introduce alternative preference-native frameworks for I-PRL and T-PRL that directly optimize TM policies from preference data without requiring reward signals or optimal action labels. Experiments on dueling bandits, navigation, and continuous control tasks demonstrate that ICPRL enables strong in-context generalization to unseen tasks, achieving performance comparable to ICRL methods trained with full reward supervision.
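As a concrete illustration of the preference-native setting described in the abstract, the sketch below pretrains a toy transformer policy from per-step pairwise preferences (the I-PRL setting) with a Bradley-Terry objective. The architecture, dimensions, and the `bradley_terry_loss` helper are illustrative assumptions; the abstract does not specify the paper's actual model or loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PreferencePolicy(nn.Module):
    """Toy transformer policy scored on in-context interaction histories.

    A minimal stand-in for the paper's TM policy: the real architecture,
    featurization, and context format are not given in the abstract.
    """
    def __init__(self, dim=32, n_actions=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, n_actions)

    def forward(self, context):              # context: (batch, T, dim)
        h = self.encoder(context)[:, -1]     # summary of the in-context history
        return self.head(h)                  # scores over candidate actions

def bradley_terry_loss(scores, a_pref, a_rej):
    """-log P(a_pref preferred over a_rej) under a Bradley-Terry model."""
    s_pref = scores.gather(1, a_pref.unsqueeze(1)).squeeze(1)
    s_rej = scores.gather(1, a_rej.unsqueeze(1)).squeeze(1)
    return -F.logsigmoid(s_pref - s_rej).mean()

# One hypothetical pretraining step on per-step (I-PRL-style) preferences.
policy = PreferencePolicy()
opt = torch.optim.Adam(policy.parameters(), lr=1e-4)
context = torch.randn(8, 16, 32)             # batch of interaction histories
a_pref = torch.randint(0, 4, (8,))           # preferred action per comparison
a_rej = torch.randint(0, 4, (8,))            # rejected action per comparison
loss = bradley_terry_loss(policy(context), a_pref, a_rej)
opt.zero_grad()
loss.backward()
opt.step()
```

Note that nothing in this objective touches a reward: the policy is shaped purely by which action the annotator (or simulated teacher) preferred, which is the point of the paradigm.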
Related papers
- MiniRec: Data-Efficient Reinforcement Learning for LLM-based Recommendation [50.417769112326546]
MiniRec is a data selection framework tailored for RL-based large language model (LLM) recommendation. It evaluates sample learnability using key RL signals -- rewards -- pruning samples that are too easy (reward too high) or too difficult (reward consistently low).
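A minimal sketch of the reward-band pruning idea this summary describes, assuming samples carry a mean rollout reward; the `low`/`high` thresholds and field names are hypothetical, not MiniRec's actual interface:

```python
def prune_by_learnability(samples, low=0.1, high=0.9):
    """Keep samples whose mean rollout reward falls in a learnable band.

    Samples near 0 or 1 carry little training signal: the summary says
    MiniRec drops those that are too easy or consistently too hard.
    """
    return [s for s in samples if low < s["mean_reward"] < high]

batch = [{"prompt": "p1", "mean_reward": 0.95},   # too easy: pruned
         {"prompt": "p2", "mean_reward": 0.05},   # too hard: pruned
         {"prompt": "p3", "mean_reward": 0.55}]   # informative: kept
print(prune_by_learnability(batch))
```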
arXiv Detail & Related papers (2026-02-04T07:15:49Z)
- Towards Sample-Efficient and Stable Reinforcement Learning for LLM-based Recommendation [56.92367609590823]
Long Chain-of-Thought (Long CoT) reasoning has shown promise in Large Language Models (LLMs). We argue that Long CoT is inherently ill-suited for the sequential recommendation domain. We propose RISER, a novel Reinforced Item Space Exploration framework for Recommendation.
arXiv Detail & Related papers (2026-01-31T10:02:43Z)
- In-Context Reinforcement Learning through Bayesian Fusion of Context and Value Prior [53.21550098214227]
In-context reinforcement learning promises fast adaptation to unseen environments without parameter updates. We introduce SPICE, a Bayesian ICRL method that learns a prior over Q-values via a deep ensemble and updates this prior at test time. We prove that SPICE achieves regret-optimal behaviour in both bandits and finite-horizon MDPs, even when pretrained only on suboptimal trajectories.
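One plausible reading of "fusing" an ensemble Q-prior with test-time context is a Gaussian precision-weighted update, sketched below; SPICE's exact posterior computation is not given in the summary, so treat the whole construction as an assumption:

```python
import numpy as np

def fuse_q(prior_samples, context_mean, context_var):
    """Precision-weighted fusion of an ensemble Q-prior with an
    in-context value estimate (illustrative, not SPICE's actual update)."""
    mu0 = prior_samples.mean()
    var0 = prior_samples.var() + 1e-8          # avoid a degenerate prior
    prec = 1.0 / var0 + 1.0 / context_var      # precisions add under Gaussians
    mu = (mu0 / var0 + context_mean / context_var) / prec
    return mu, 1.0 / prec

# Five ensemble heads disagree about Q(s, a); test-time context refines them.
heads = np.array([0.8, 1.1, 0.9, 1.3, 1.0])    # pretrained prior over Q-values
mu, var = fuse_q(heads, context_mean=0.4, context_var=0.05)
print(f"posterior Q estimate: {mu:.3f} +/- {var ** 0.5:.3f}")
```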
arXiv Detail & Related papers (2026-01-06T13:41:31Z)
- Beyond Static LLM Policies: Imitation-Enhanced Reinforcement Learning for Recommendation [23.945049006150555]
Large language models (LLMs) have become critical tools for enhancing user engagement by delivering personalized content across diverse digital platforms. Direct deployment of LLMs as primary recommendation policies presents notable challenges, including persistent latency issues. This paper proposes a novel offline reinforcement learning framework that leverages imitation learning from LLM-generated trajectories.
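A common way to combine offline RL with imitation of logged demonstrations is a behavior-cloning-regularized actor loss (as in TD3+BC); the sketch below applies that pattern to LLM-generated actions. The paper's actual objective may differ, and the critic here is a placeholder:

```python
import torch

critic = torch.nn.Linear(6, 1)                      # stand-in Q(s, a) head
policy_action = torch.randn(32, 6, requires_grad=True)
llm_action = torch.randn(32, 6)                     # actions from LLM rollouts

def imitation_regularized_loss(q, pi_a, demo_a, lam=1.0):
    """Actor objective: maximize the critic's value while staying close
    to the LLM demonstrations (TD3+BC-style trade-off, assumed here)."""
    bc = ((pi_a - demo_a) ** 2).sum(dim=-1).mean()  # behavior-cloning term
    return -q.mean() + lam * bc

loss = imitation_regularized_loss(critic(policy_action), policy_action, llm_action)
loss.backward()                                     # gradients reach the actor
```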
arXiv Detail & Related papers (2025-10-15T07:28:29Z)
- DiFFPO: Training Diffusion LLMs to Reason Fast and Furious via Reinforcement Learning [37.20873499361773]
We propose a unified framework for training masked diffusion large language models (dLLMs) to reason better (furious) and faster (fast). We first unify the existing baseline approach by proposing to train surrogate policies via off-policy RL, whose likelihood is much more tractable as an approximation to the true dLLM policy. We also propose a new direction of jointly training efficient samplers/controllers of the dLLM policy. Via RL, we incentivize dLLMs' natural multi-token prediction capabilities by letting the model learn to adaptively allocate an inference threshold for each prompt.
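To make the "per-prompt inference threshold" concrete, here is one step of a hypothetical confidence-thresholded parallel decoder for a masked-diffusion LM: positions whose top token probability clears `tau` are committed in parallel. This only sketches the mechanism the summary gestures at; DiFFPO's actual RL-trained sampler/controller is not specified here.

```python
import torch

def threshold_unmask(token_probs, masked, tau):
    """Commit every masked position whose max token probability clears
    the (per-prompt) threshold tau; the rest stay masked for later steps."""
    conf, tokens = token_probs.max(dim=-1)           # per-position confidence
    commit = masked & (conf >= tau)
    return tokens, commit

T, V = 8, 100
probs = torch.softmax(torch.randn(T, V), dim=-1)     # toy denoiser output
masked = torch.ones(T, dtype=torch.bool)             # everything starts masked
tokens, commit = threshold_unmask(probs, masked, tau=0.05)
print(f"committed {int(commit.sum())} of {T} masked positions this step")
```

A lower `tau` commits more tokens per step (faster, riskier); learning `tau` per prompt is the adaptive allocation the summary describes.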
arXiv Detail & Related papers (2025-10-02T16:57:24Z)
- Reinforcement Learning on Pre-Training Data [55.570379963147424]
We introduce Reinforcement Learning on Pre-Training data (RLPT), a new training-time scaling paradigm for optimizing large language models (LLMs). RLPT enables the policy to autonomously explore meaningful trajectories to learn from pre-training data and improve its capability through reinforcement learning (RL). Extensive experiments on both general-domain and mathematical reasoning benchmarks across multiple models validate the effectiveness of RLPT.
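One self-supervised reward that fits this description is to score a sampled continuation by its overlap with the corpus's actual next segment. This similarity reward is purely an illustrative assumption, not the paper's stated objective:

```python
import difflib

def next_segment_reward(generated, reference):
    """Hypothetical reward for RL on pre-training data: similarity
    between a rollout and the corpus's true next segment, in [0, 1]."""
    return difflib.SequenceMatcher(None, generated, reference).ratio()

corpus_text = "Gradient descent updates parameters in the direction of steepest descent."
prefix, next_segment = corpus_text[:30], corpus_text[30:]
rollout = "parameters along the steepest descent direction."   # model sample
print(next_segment_reward(rollout, next_segment))
```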
arXiv Detail & Related papers (2025-09-23T17:10:40Z)
- Agentic Reinforcement Learning with Implicit Step Rewards [92.26560379363492]
Large language models (LLMs) are increasingly developed as autonomous agents using reinforcement learning (agentic RL). We introduce implicit step rewards for agentic RL (iStar), a general credit-assignment strategy that integrates seamlessly with standard RL algorithms. We evaluate our method on three challenging agent benchmarks, including WebShop and VisualSokoban, as well as open-ended social interactions with unverifiable rewards in SOTOPIA.
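A common way to obtain "implicit" per-step rewards without a reward model is the policy/reference log-likelihood ratio (as in DPO-style derivations); the sketch below assumes that construction, since iStar's precise formula is not given in the summary:

```python
import torch

def implicit_step_rewards(logp_policy, logp_ref, beta=0.1):
    """Dense per-step rewards from the log-likelihood ratio between the
    learning policy and a frozen reference (assumed, DPO-style)."""
    return beta * (logp_policy - logp_ref)

logp_policy = torch.tensor([-1.2, -0.4, -2.0, -0.9])  # agent's step log-probs
logp_ref = torch.tensor([-1.0, -1.0, -1.0, -1.0])     # frozen reference model
print(implicit_step_rewards(logp_policy, logp_ref))    # credit for each step
```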
arXiv Detail & Related papers (2025-09-23T16:15:42Z)
- Diffusion Guidance Is a Controllable Policy Improvement Operator [98.11511661904618]
CFGRL is trained with the simplicity of supervised learning, yet can further improve on the policies in the data. On offline RL tasks, we observe a reliable trend -- increased guidance weighting leads to increased performance.
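The guidance weighting mentioned here is the standard classifier-free guidance combination; the sketch below shows that operator applied to a diffusion policy's noise predictions, with the conditioning-on-optimality framing as an assumption about CFGRL's setup:

```python
import torch

def guided_noise_pred(eps_uncond, eps_cond, w):
    """Classifier-free guidance: w = 1 recovers the conditional policy,
    and w > 1 extrapolates further in the 'improved' direction."""
    return eps_uncond + w * (eps_cond - eps_uncond)

eps_u = torch.randn(4, 6)    # denoiser output without the optimality condition
eps_c = torch.randn(4, 6)    # denoiser output conditioned on high return
for w in (1.0, 1.5, 3.0):    # sweeping the guidance weight
    print(w, guided_noise_pred(eps_u, eps_c, w).norm().item())
```

Because `w` is just a scalar at sampling time, policy improvement becomes a knob one can turn without retraining, which is what makes the operator "controllable".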
arXiv Detail & Related papers (2025-05-29T14:06:50Z)
- A Snapshot of Influence: A Local Data Attribution Framework for Online Reinforcement Learning [45.19254609437857]
Online reinforcement learning (RL) excels in complex, safety-critical domains but suffers from sample inefficiency, training instability, and limited interpretability. Data attribution provides a principled way to trace model behavior back to training samples. We propose an algorithm, iterative influence-based filtering (IIF), for online RL training that iteratively performs experience filtering to refine policy updates.
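A first-order stand-in for influence-based filtering scores each transition by the inner product of its gradient with a target (e.g., evaluation) gradient and keeps the most helpful samples; IIF's exact estimator and filtering schedule are not described in the summary, so the details below are assumptions:

```python
import numpy as np

def influence_filter(sample_grads, target_grad, keep_frac=0.8):
    """Rank transitions by gradient alignment with a target objective
    and keep the top fraction (illustrative influence approximation)."""
    scores = sample_grads @ target_grad
    k = max(1, int(keep_frac * len(scores)))
    keep = np.argsort(scores)[-k:]           # highest-influence samples
    return keep, scores

grads = np.random.randn(10, 5)               # per-sample policy gradients
target = np.random.randn(5)                  # gradient of the evaluation objective
keep, scores = influence_filter(grads, target)
print("kept sample indices:", sorted(keep.tolist()))
```

Iterating this filter-then-update loop over the replay stream is the "iterative" part of IIF.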
arXiv Detail & Related papers (2025-05-25T19:25:57Z)
- Policy-labeled Preference Learning: Is Preference Enough for RLHF? [8.378137704007038]
We propose policy-labeled preference learning (PPL) to resolve likelihood mismatch issues by modeling human preferences with regret, which reflects behavior policy information. Experiments in high-dimensional continuous control tasks demonstrate PPL's significant improvements in offline RLHF performance and its effectiveness in online settings.
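Modeling preferences "with regret" plausibly means a Bradley-Terry model whose score is negative regret, so the lower-regret segment is more likely preferred; this one-liner ignores PPL's policy labels and is only a sketch of that reading:

```python
import math

def preference_prob(regret_a, regret_b):
    """P(segment a preferred over b) with negative regret as the
    Bradley-Terry score (assumed reading of 'preferences via regret')."""
    return 1.0 / (1.0 + math.exp(-(regret_b - regret_a)))

print(preference_prob(regret_a=0.2, regret_b=1.5))  # ~0.79: prefer segment a
```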
arXiv Detail & Related papers (2025-05-06T15:09:55Z)