In-Context Reinforcement Learning through Bayesian Fusion of Context and Value Prior
- URL: http://arxiv.org/abs/2601.03015v1
- Date: Tue, 06 Jan 2026 13:41:31 GMT
- Title: In-Context Reinforcement Learning through Bayesian Fusion of Context and Value Prior
- Authors: Anaïs Berkes, Vincent Taboga, Donna Vakalis, David Rolnick, Yoshua Bengio
- Abstract summary: In-context reinforcement learning promises fast adaptation to unseen environments without parameter updates. We introduce SPICE, a Bayesian ICRL method that learns a prior over Q-values via a deep ensemble and updates this prior at test time. We prove that SPICE achieves regret-optimal behaviour in both bandits and finite-horizon MDPs, even when pretrained only on suboptimal trajectories.
- Score: 53.21550098214227
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In-context reinforcement learning (ICRL) promises fast adaptation to unseen environments without parameter updates, but current methods either cannot improve beyond the training distribution or require near-optimal data, limiting practical adoption. We introduce SPICE, a Bayesian ICRL method that learns a prior over Q-values via a deep ensemble and updates this prior at test time using in-context information through Bayesian updates. To recover from poor priors resulting from training on suboptimal data, our online inference follows an Upper-Confidence Bound rule that favours exploration and adaptation. We prove that SPICE achieves regret-optimal behaviour in both stochastic bandits and finite-horizon MDPs, even when pretrained only on suboptimal trajectories. We validate these findings empirically across bandit and control benchmarks. SPICE makes near-optimal decisions on unseen tasks and substantially reduces regret compared to prior ICRL and meta-RL approaches, while adapting rapidly and remaining robust under distribution shift.
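To make the mechanism concrete, here is a minimal sketch of the fuse-then-explore loop for a Gaussian stochastic bandit, assuming the deep-ensemble prior is summarized by per-arm means and standard deviations; the conjugate Gaussian update and the names (`obs_noise`, `beta`) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

class BayesianUCBBandit:
    """Fuse a pretrained value prior with in-context rewards, act by UCB."""

    def __init__(self, prior_mean, prior_std, obs_noise=1.0, beta=2.0):
        # Prior over Q-values per arm, e.g. mean/std across a deep ensemble.
        self.mu = np.array(prior_mean, dtype=float)
        self.var = np.array(prior_std, dtype=float) ** 2
        self.obs_var = obs_noise ** 2  # assumed reward-noise variance
        self.beta = beta               # exploration weight in the UCB rule

    def act(self):
        # Upper-Confidence Bound on the posterior: prefers arms whose value
        # is either high or still uncertain, which lets the agent escape a
        # poor prior learned from suboptimal pretraining data.
        return int(np.argmax(self.mu + self.beta * np.sqrt(self.var)))

    def update(self, arm, reward):
        # Conjugate Gaussian update fusing the prior with in-context evidence.
        prior_prec = 1.0 / self.var[arm]
        post_prec = prior_prec + 1.0 / self.obs_var
        self.mu[arm] = (self.mu[arm] * prior_prec + reward / self.obs_var) / post_prec
        self.var[arm] = 1.0 / post_prec

# Toy run: the prior is deliberately wrong about which arm is best, mimicking
# pretraining on suboptimal trajectories; UCB exploration corrects it.
rng = np.random.default_rng(0)
agent = BayesianUCBBandit(prior_mean=[1.0, 0.0], prior_std=[0.3, 0.3])
true_means = [0.2, 1.0]
for _ in range(200):
    a = agent.act()
    agent.update(a, rng.normal(true_means[a], 1.0))
```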
Related papers
- Learning in Context, Guided by Choice: A Reward-Free Paradigm for Reinforcement Learning with Transformers [55.33468902405567]
We propose a new learning paradigm, In-Context Preference-based Reinforcement Learning (ICPRL), in which both pretraining and deployment rely solely on preference feedback. ICPRL enables strong in-context generalization to unseen tasks, achieving performance comparable to ICRL methods trained with full reward supervision.
arXiv Detail & Related papers (2026-02-09T03:42:16Z) - Zero-Shot Off-Policy Learning [9.729890516322781]
Off-policy learning methods seek to derive an optimal policy directly from a fixed dataset of prior interactions. In this work, we address the off-policy problem in a zero-shot setting by establishing a theoretical connection between successor measures and stationary density ratios. Our algorithm can infer optimal importance sampling ratios, effectively performing a stationary distribution correction with an optimal policy for any task on the fly.
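As a hedged sketch of the correction step only: given stationary density ratios w(s,a) = d_pi(s,a) / d_D(s,a), assumed already inferred, a policy's value is a reweighted average of dataset rewards. How the ratios are obtained from successor measures is the paper's contribution and is not reproduced here.

```python
import numpy as np

def corrected_value(rewards, ratios):
    """Estimate J(pi) from off-policy data via stationary density ratios.

    rewards: r_i observed in the fixed dataset
    ratios:  w(s_i, a_i) = d_pi(s_i, a_i) / d_D(s_i, a_i), assumed given
    """
    rewards, ratios = np.asarray(rewards, float), np.asarray(ratios, float)
    # Self-normalized importance sampling for lower variance.
    return float(np.sum(ratios * rewards) / np.sum(ratios))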
arXiv Detail & Related papers (2026-02-02T11:06:31Z) - A Step Back: Prefix Importance Ratio Stabilizes Policy Optimization [58.116300485427764]
Reinforcement learning post-training can elicit reasoning behaviors in large language models, but token-level correction often leads to unstable training dynamics when the degree of off-policyness is large. We propose a simple yet effective objective, Minimum Prefix Ratio (MinPRO).
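A speculative sketch of the core quantity, assuming the prefix ratio at step t is the cumulative product of token-level importance ratios and that MinPRO takes its minimum over all prefixes; the paper's actual objective may differ.

```python
import numpy as np

def min_prefix_ratio(logp_new, logp_old):
    # Token-level ratios r_t = pi_new(a_t|s_t) / pi_old(a_t|s_t); the prefix
    # ratio at t is prod_{i<=t} r_i. Working in log space, the minimum prefix
    # product is exp of the minimum cumulative sum.
    log_ratios = np.asarray(logp_new, float) - np.asarray(logp_old, float)
    return float(np.exp(np.cumsum(log_ratios).min()))
```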
arXiv Detail & Related papers (2026-01-30T08:47:19Z) - Q-learning with Adjoint Matching [58.78551025170267]
We propose Q-learning with Adjoint Matching (QAM), a novel TD-based reinforcement learning (RL) algorithm. QAM sidesteps two challenges by leveraging adjoint matching, a recently proposed technique in generative modeling. It consistently outperforms prior approaches on hard, sparse-reward tasks in both offline and offline-to-online RL.
arXiv Detail & Related papers (2026-01-20T18:45:34Z) - Stabilizing Reinforcement Learning with LLMs: Formulation and Practices [61.361819972410046]
We show why and under what conditions the true sequence-level reward can be optimized via a surrogate token-level objective in policy gradient methods such as REINFORCE. This insight provides a principled explanation for the crucial role of several widely adopted techniques in stabilizing RL training.
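For reference, a minimal sketch of the token-level surrogate the analysis concerns: broadcasting the (constant) sequence-level reward to every sampled token yields a loss whose gradient matches REINFORCE on the true sequence-level objective. The `baseline` argument is an illustrative addition.

```python
import torch

def reinforce_surrogate_loss(token_logps, seq_reward, baseline=0.0):
    """token_logps: tensor of log pi(a_t | s_<t) for one sampled sequence."""
    advantage = seq_reward - baseline        # constant across all tokens
    return -(token_logps * advantage).sum()  # grad equals the REINFORCE grad
```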
arXiv Detail & Related papers (2025-12-01T07:45:39Z) - Towards Monotonic Improvement in In-Context Reinforcement Learning [18.67894044930047]
In-Context Reinforcement Learning (ICRL) has emerged as a promising paradigm for developing agents that can rapidly adapt to new tasks. Recent approaches train large sequence models on monotonic policy improvement data from online RL, aiming for continued improvement in test-time performance. We propose two methods for estimating Context Value at both training and testing time.
arXiv Detail & Related papers (2025-09-27T09:42:19Z) - Adaptive Policy Backbone via Shared Network [7.589048964013273]
Reinforcement learning (RL) has achieved impressive results across domains, yet learning an optimal policy typically requires extensive interaction data. We propose Adaptive Policy Backbone (APB), a meta-transfer RL method that inserts lightweight linear layers before and after a shared backbone.
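A minimal sketch of the adapter idea, assuming a frozen shared backbone wrapped by task-specific linear layers; all names and dimensions below are illustrative placeholders, not the paper's architecture.

```python
import torch.nn as nn

class AdapterPolicy(nn.Module):
    """Task-specific linear adapters around a frozen shared backbone."""

    def __init__(self, backbone, obs_dim, hid_in, hid_out, act_dim):
        super().__init__()
        self.pre = nn.Linear(obs_dim, hid_in)    # lightweight input adapter
        self.backbone = backbone                 # shared across tasks
        for p in self.backbone.parameters():     # only adapters are trained
            p.requires_grad = False
        self.post = nn.Linear(hid_out, act_dim)  # lightweight output adapter

    def forward(self, obs):
        return self.post(self.backbone(self.pre(obs)))

# Example wiring; dimensions are arbitrary.
backbone = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64))
policy = AdapterPolicy(backbone, obs_dim=17, hid_in=64, hid_out=64, act_dim=6)
```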
arXiv Detail & Related papers (2025-09-26T13:14:03Z) - Value Function Initialization for Knowledge Transfer and Jump-start in Deep Reinforcement Learning [0.0]
We introduce DQInit, a method that adapts value function initialization to deep reinforcement learning. DQInit reuses compact Q-values extracted from previously solved tasks as a transferable knowledge base. It employs a knownness-based mechanism to softly integrate these transferred values into underexplored regions and gradually shift toward the agent's learned estimates.
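A minimal sketch of the knownness-based blend, assuming knownness is a score in [0, 1] that is low in underexplored regions; the visit-count proxy shown is an illustrative assumption, not the paper's mechanism.

```python
import numpy as np

def knownness(visit_counts, n0=10.0):
    # Illustrative proxy: saturating visit-count ratio, 0 when unexplored.
    return visit_counts / (visit_counts + n0)

def blended_q(q_transfer, q_learned, k):
    # k in [0, 1]: low in underexplored regions (trust the transferred
    # values), high once the agent's own estimates become reliable.
    k = np.clip(k, 0.0, 1.0)
    return (1.0 - k) * q_transfer + k * q_learned
```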
arXiv Detail & Related papers (2025-08-12T18:32:08Z) - Fast Adaptation with Behavioral Foundation Models [82.34700481726951]
Unsupervised zero-shot reinforcement learning has emerged as a powerful paradigm for pretraining behavioral foundation models (BFMs). Despite promising results, zero-shot policies are often suboptimal due to errors induced by the unsupervised training process. We propose fast adaptation strategies that search in the low-dimensional task-embedding space of the pre-trained BFM to rapidly improve the performance of its zero-shot policies.
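A hedged sketch of adaptation by search in the task-embedding space, using simple random search as a stand-in for whatever strategy the paper employs; `policy_from_z` and `estimate_return` are hypothetical callables.

```python
import numpy as np

def adapt_embedding(policy_from_z, estimate_return, z0, steps=50, sigma=0.1):
    """Hill-climb in the task-embedding space around the zero-shot guess z0."""
    rng = np.random.default_rng(0)
    best_z = np.asarray(z0, dtype=float)
    best_ret = estimate_return(policy_from_z(best_z))
    for _ in range(steps):
        z = best_z + sigma * rng.standard_normal(best_z.shape)
        ret = estimate_return(policy_from_z(z))
        if ret > best_ret:           # keep the embedding that performs best
            best_z, best_ret = z, ret
    return best_z
```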
arXiv Detail & Related papers (2025-04-10T16:14:17Z) - Scalable PAC-Bayesian Meta-Learning via the PAC-Optimal Hyper-Posterior: From Theory to Practice [54.03076395748459]
A central question in the meta-learning literature is how to regularize to ensure generalization to unseen tasks.
We present a generalization bound for meta-learning, which was first derived by Rothfuss et al.
We provide a theoretical analysis and empirical case study under which conditions and to what extent these guarantees for meta-learning improve upon PAC-Bayesian per-task learning bounds.
arXiv Detail & Related papers (2022-11-14T08:51:04Z) - BRAC+: Improved Behavior Regularized Actor Critic for Offline Reinforcement Learning [14.432131909590824]
Offline Reinforcement Learning aims to train effective policies using previously collected datasets.
Standard off-policy RL algorithms are prone to overestimating the values of out-of-distribution (less explored) actions. We improve behavior-regularized offline reinforcement learning and propose BRAC+.
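A generic sketch of a behavior-regularized actor loss in the BRAC family: maximize the critic's value while penalizing divergence from the behavior policy, which curbs overestimation on out-of-distribution actions. BRAC+'s specific improvements are not reproduced here.

```python
import torch

def regularized_actor_loss(q_values, log_pi_a, log_behavior_a, alpha=1.0):
    """BRAC-style actor loss: maximize Q, stay close to the behavior policy.

    q_values:       Q(s, a) for actions a ~ pi(.|s)
    log_pi_a:       log pi(a|s) under the learned policy
    log_behavior_a: log beta(a|s) under (an estimate of) the behavior policy
    """
    kl_term = log_pi_a - log_behavior_a  # per-sample KL surrogate
    return (-q_values + alpha * kl_term).mean()
```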
arXiv Detail & Related papers (2021-10-02T23:55:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.