Just-In-Time Reinforcement Learning: Continual Learning in LLM Agents Without Gradient Updates
- URL: http://arxiv.org/abs/2601.18510v1
- Date: Mon, 26 Jan 2026 14:16:51 GMT
- Title: Just-In-Time Reinforcement Learning: Continual Learning in LLM Agents Without Gradient Updates
- Authors: Yibo Li, Zijie Lin, Ailin Deng, Xuan Zhang, Yufei He, Shuo Ji, Tri Cao, Bryan Hooi
- Abstract summary: We introduce Just-In-Time Reinforcement Learning (JitRL), a training-free framework that enables test-time policy optimization without any gradient updates. JitRL maintains a dynamic, non-parametric memory of experiences and retrieves relevant trajectories to estimate action advantages on-the-fly. Experiments on WebArena and Jericho demonstrate that JitRL establishes a new state-of-the-art among training-free methods.
- Score: 53.3717573880076
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While Large Language Model (LLM) agents excel at general tasks, they inherently struggle with continual adaptation because their weights are frozen after deployment. Conventional reinforcement learning (RL) offers a solution but incurs prohibitive computational costs and risks catastrophic forgetting. We introduce Just-In-Time Reinforcement Learning (JitRL), a training-free framework that enables test-time policy optimization without any gradient updates. JitRL maintains a dynamic, non-parametric memory of experiences and retrieves relevant trajectories to estimate action advantages on-the-fly. These estimates are then used to directly modulate the LLM's output logits. We theoretically prove that this additive update rule is the exact closed-form solution to the KL-constrained policy optimization objective. Extensive experiments on WebArena and Jericho demonstrate that JitRL establishes a new state-of-the-art among training-free methods. Crucially, JitRL outperforms computationally expensive fine-tuning methods (e.g., WebRL) while reducing monetary costs by over 30 times, offering a scalable path for continual learning agents. The code is available at https://github.com/liushiliushi/JitRL.
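The additive logit update described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the memory layout, the Jaccard similarity, the weighted-mean advantage estimator, and the names `estimate_advantages`, `adjust_logits`, and `beta` are all assumptions made for illustration. It relies only on the standard result that the maximizer of expected advantage minus a KL penalty of strength `beta` against a reference policy is proportional to the reference policy times exp(advantage / beta), which in logit space is an additive shift.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a dict of action -> logit."""
    m = max(logits.values())
    exps = {a: math.exp(x - m) for a, x in logits.items()}
    total = sum(exps.values())
    return {a: e / total for a, e in exps.items()}


def jaccard(a, b):
    """Hypothetical state similarity: word-set overlap of two observations."""
    sa, sb = set(a.split()), set(b.split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def estimate_advantages(memory, state, actions, similarity=jaccard):
    """Similarity-weighted mean return per action, minus a weighted baseline.

    `memory` is a list of (state, action, return) tuples; this estimator is
    an assumed stand-in for whatever the paper actually uses.
    """
    weight_sum = {a: 0.0 for a in actions}
    return_sum = {a: 0.0 for a in actions}
    for past_state, past_action, ret in memory:
        w = similarity(state, past_state)
        if past_action in weight_sum and w > 0.0:
            weight_sum[past_action] += w
            return_sum[past_action] += w * ret
    total_w = sum(weight_sum.values())
    if total_w == 0.0:
        # Nothing relevant retrieved: leave the base policy unchanged.
        return {a: 0.0 for a in actions}
    baseline = sum(return_sum.values()) / total_w
    return {
        a: (return_sum[a] / weight_sum[a] - baseline) if weight_sum[a] > 0.0 else 0.0
        for a in actions
    }


def adjust_logits(logits, advantages, beta=1.0):
    """Additive update: logit'(a) = logit(a) + A(a) / beta."""
    return {a: logits[a] + advantages[a] / beta for a in logits}


# Toy episode memory for a text-game style environment (made-up data).
memory = [
    ("room with a locked door", "open door", 0.0),
    ("room with a locked door", "use key", 1.0),
]
actions = ["open door", "use key", "wait"]
base_logits = {a: 0.0 for a in actions}  # stand-in for frozen LLM logits

adv = estimate_advantages(memory, "room with a locked door", actions)
policy = softmax(adjust_logits(base_logits, adv, beta=0.5))
print(adv)
print(policy)
```

A smaller `beta` trusts the retrieved experience more and moves further from the frozen model's distribution; a larger `beta` keeps the adjusted policy close to it. Actions with no matching experience keep their original logit.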
Related papers
- MIRA: Memory-Integrated Reinforcement Learning Agent with Limited LLM Guidance [18.215893951726166]
Large language models (LLMs) can provide subgoal decompositions, plausible trajectories, and abstract priors that facilitate early learning. We propose MIRA (Memory-Integrated Reinforcement Learning Agent), which incorporates a structured, evolving memory graph to guide early training.
arXiv Detail & Related papers (2026-02-20T01:43:30Z)
- LACONIC: Length-Aware Constrained Reinforcement Learning for LLM [29.383977698780374]
LACONIC is a reinforcement learning method that enforces a target token budget during training. It preserves or improves pass@1 while reducing output length by over 50%. It maintains out-of-domain performance on general knowledge and multilingual benchmarks with 44% fewer tokens.
arXiv Detail & Related papers (2026-02-16T05:09:40Z)
- Training-Free Group Relative Policy Optimization [34.73950078782136]
We argue that Large Language Model (LLM) agents can achieve a similar effect on the output distribution by learning experiential knowledge as a token prior. We propose Training-Free Group Relative Policy Optimization (Training-Free GRPO), a cost-effective solution that enhances LLM agent performance without any parameter updates. Experiments on mathematical reasoning and web searching tasks demonstrate that Training-Free GRPO, when applied to DeepSeek-V3.1-Terminus, significantly improves out-of-domain performance.
arXiv Detail & Related papers (2025-10-09T13:18:17Z)
- Agentic Reinforcement Learning with Implicit Step Rewards [92.26560379363492]
Large language models (LLMs) are increasingly developed as autonomous agents using reinforcement learning (agentic RL). We introduce implicit step rewards for agentic RL (iStar), a general credit-assignment strategy that integrates seamlessly with standard RL algorithms. We evaluate our method on three challenging agent benchmarks, including WebShop and VisualSokoban, as well as open-ended social interactions with unverifiable rewards in SOTOPIA.
arXiv Detail & Related papers (2025-09-23T16:15:42Z)
- Memento: Fine-tuning LLM Agents without Fine-tuning LLMs [36.3424780932712]
We introduce a novel learning paradigm for Adaptive Large Language Model (LLM) agents. Our method enables low-cost continual adaptation via memory-based online reinforcement learning. We instantiate our agent model in the deep research setting, namely Memento, which attains top-1 on GAIA validation.
arXiv Detail & Related papers (2025-08-22T07:25:30Z)
- Accelerating RL for LLM Reasoning with Optimal Advantage Regression [52.0792918455501]
We propose a novel two-stage policy optimization framework that directly approximates the optimal advantage function. $A^*$-PO achieves competitive performance across a wide range of mathematical reasoning benchmarks. It reduces training time by up to 2$\times$ and peak memory usage by over 30% compared to PPO, GRPO, and REBEL.
arXiv Detail & Related papers (2025-05-27T03:58:50Z)
- $Q\sharp$: Provably Optimal Distributional RL for LLM Post-Training [63.602824642605775]
$Q\sharp$ is a value-based algorithm for KL-regularized RL that guides the reference policy using the optimal regularized $Q$ function. Our results highlight $Q\sharp$ as an effective approach for post-training LLMs, offering both improved performance and theoretical guarantees.
arXiv Detail & Related papers (2025-02-27T21:43:00Z)
- VinePPO: Refining Credit Assignment in RL Training of LLMs [66.80143024475635]
We propose VinePPO, a straightforward approach that leverages the flexibility of language environments to compute unbiased Monte Carlo-based estimates. Our method consistently outperforms PPO and other baselines across MATH and GSM8K datasets in less wall-clock time.
arXiv Detail & Related papers (2024-10-02T15:49:30Z)
- Reward Guidance for Reinforcement Learning Tasks Based on Large Language Models: The LMGT Framework [1.5802986215292307]
Language Model Guided reward Tuning (LMGT) is a novel, sample-efficient framework for Reinforcement Learning. We show that LMGT adeptly balances exploration and exploitation, thereby guiding the agent's exploratory behavior. Our results suggest that LMGT can substantially reduce the computational resources required during the RL training phase.
arXiv Detail & Related papers (2024-09-07T07:40:43Z)
- How Can LLM Guide RL? A Value-Based Approach [68.55316627400683]
Reinforcement learning (RL) has become the de facto standard practice for sequential decision-making problems by improving future acting policies with feedback.
Recent developments in large language models (LLMs) have showcased impressive capabilities in language understanding and generation, yet they fall short in exploration and self-improvement capabilities.
We develop an algorithm named LINVIT that incorporates LLM guidance as a regularization factor in value-based RL, leading to significant reductions in the amount of data needed for learning.
arXiv Detail & Related papers (2024-02-25T20:07:13Z)
- Unleashing the Power of Pre-trained Language Models for Offline Reinforcement Learning [50.9692060692705]
This paper introduces Language Models for Motion Control (LaMo), a general framework based on Decision Transformers for offline RL. Our framework highlights four crucial components, including (1) initializing Decision Transformers with sequentially pre-trained LMs and (2) employing the LoRA fine-tuning method. In particular, our method demonstrates superior performance in scenarios with limited data samples.
arXiv Detail & Related papers (2023-10-31T16:24:17Z)
- RL$^3$: Boosting Meta Reinforcement Learning via RL inside RL$^2$ [12.111848705677142]
We propose RL$^3$, a hybrid approach that incorporates action-values, learned per task via traditional RL, in the inputs to Meta-RL. We show that RL$^3$ earns a greater cumulative reward in the long term compared to RL$^2$ while drastically reducing meta-training time and generalizes better to out-of-distribution tasks.
arXiv Detail & Related papers (2023-06-28T04:16:16Z)
- Off-Policy Meta-Reinforcement Learning Based on Feature Embedding Spaces [14.029933823101084]
We propose a novel off-policy meta-RL method, embedding learning and evaluation of uncertainty (ELUE).
ELUE learns a belief model over the embedding space and a belief-conditional policy and Q-function.
We demonstrate that ELUE outperforms state-of-the-art meta RL methods through experiments on meta-RL benchmarks.
arXiv Detail & Related papers (2021-01-06T05:51:38Z)