Scaling In-Context Online Learning Capability of LLMs via Cross-Episode Meta-RL
- URL: http://arxiv.org/abs/2602.04089v1
- Date: Tue, 03 Feb 2026 23:53:05 GMT
- Title: Scaling In-Context Online Learning Capability of LLMs via Cross-Episode Meta-RL
- Authors: Xiaofeng Lin, Sirou Zhu, Yilei Chen, Mingyu Chen, Hejian Sang, Ioannis Paschalidis, Zhipeng Wang, Aldo Pacchiano, Xuezhou Zhang
- Abstract summary: Large language models (LLMs) achieve strong performance when all task-relevant information is available upfront. We introduce ORBIT, a multi-task, multi-episode meta-reinforcement learning framework that trains LLMs to learn from interaction in context. After meta-training, a relatively small open-source model (Qwen3-14B) demonstrates substantially improved in-context online learning on entirely unseen environments.
- Score: 28.82521610729606
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) achieve strong performance when all task-relevant information is available upfront, as in static prediction and instruction-following problems. However, many real-world decision-making tasks are inherently online: crucial information must be acquired through interaction, feedback is delayed, and effective behavior requires balancing information collection and exploitation over time. While in-context learning enables adaptation without weight updates, existing LLMs often struggle to reliably leverage in-context interaction experience in such settings. In this work, we show that this limitation can be addressed through training. We introduce ORBIT, a multi-task, multi-episode meta-reinforcement learning framework that trains LLMs to learn from interaction in context. After meta-training, a relatively small open-source model (Qwen3-14B) demonstrates substantially improved in-context online learning on entirely unseen environments, matching the performance of GPT-5.2 and outperforming standard RL fine-tuning by a large margin. Scaling experiments further reveal consistent gains with model size, suggesting significant headroom for learn-at-inference-time decision-making agents. Code reproducing the results in the paper can be found at https://github.com/XiaofengLin7/ORBIT.
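To make the recipe concrete, here is a minimal sketch of cross-episode in-context meta-RL as the abstract describes it: the transcript of earlier episodes stays in the model's context, and the RL objective is the return accumulated over the whole multi-episode rollout. All interfaces below (`env`, `llm_policy`, `rl_update`, `env.description`) are hypothetical stand-ins, not ORBIT's actual API.

```python
# Minimal sketch of cross-episode in-context meta-RL (hypothetical
# interfaces, not ORBIT's actual API). The transcript of earlier
# episodes stays in the prompt, so the policy can adapt within the
# context window; the RL signal is the multi-episode return.

def rollout_task(env, llm_policy, num_episodes=4):
    """Run several episodes of one task, carrying the full transcript."""
    transcript = [f"Task: {env.description}"]     # shared across episodes
    total_return = 0.0
    for ep in range(num_episodes):
        obs = env.reset()
        done = False
        while not done:
            transcript.append(f"Observation: {obs}")
            # Condition on ALL past episodes, not just the current one.
            action = llm_policy.act("\n".join(transcript))
            obs, reward, done = env.step(action)
            transcript.append(f"Action: {action} -> Reward: {reward}")
            total_return += reward
        transcript.append(f"--- Episode {ep + 1} finished ---")
    return transcript, total_return

def meta_train(sample_task, llm_policy, rl_update, steps=1000):
    """Outer loop: sample a task, roll out across episodes, update."""
    for _ in range(steps):
        env = sample_task()                        # a fresh task each step
        transcript, ret = rollout_task(env, llm_policy)
        rl_update(llm_policy, transcript, ret)     # e.g. a policy-gradient step
```

Because the reward covers the whole rollout, the model is paid for improving across episodes, not merely for succeeding within one, which is what "in-context online learning" demands.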
Related papers
- Memory-Based Advantage Shaping for LLM-Guided Reinforcement Learning [18.215893951726166]
In environments with sparse or delayed rewards, reinforcement learning incurs high sample complexity. This limitation has motivated the use of large language models (LLMs) for subgoal discovery and trajectory guidance. We address these challenges by constructing a memory graph that encodes subgoals and trajectories from both LLM guidance and the agent's own successful rollouts.
arXiv Detail & Related papers (2026-02-20T01:44:35Z)
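As a rough illustration of the shaping idea (the paper's memory-graph details are not reproduced here; the `networkx` graph and the bonus scheme are assumptions):

```python
import networkx as nx

# Illustrative sketch of memory-based advantage shaping. Subgoals
# proposed by an LLM and states from the agent's own successful
# rollouts become graph nodes; transitions that reach a stored subgoal
# receive an advantage bonus during policy optimization.

memory = nx.DiGraph()

def add_successful_rollout(states):
    """Record a successful trajectory as a path of subgoal nodes."""
    for a, b in zip(states, states[1:]):
        memory.add_edge(a, b)

def shaped_advantage(base_advantage, state, bonus=0.5):
    """Add a shaping bonus when the visited state is a known subgoal."""
    return base_advantage + (bonus if state in memory else 0.0)
```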
- Experience Scaling: Post-Deployment Evolution For Large Language Models [44.48142891798125]
We propose experience scaling, a framework for continuous post-deployment evolution of large language models (LLMs). We validate the framework in simulated real-world scenarios involving generalization to previously unseen but related tasks, repetitive queries, and over-saturated knowledge stores. Results demonstrate that structured post-deployment learning can extend LLM capabilities beyond the limits of static human-generated data.
arXiv Detail & Related papers (2025-09-23T08:04:58Z)
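A minimal sketch of what a post-deployment experience store might look like, assuming a simple distill-and-retrieve scheme with bounded capacity; none of these names come from the paper:

```python
from collections import OrderedDict

# Illustrative experience store: deployed interactions are distilled
# into compact (query -> lesson) entries, old entries are evicted to
# avoid over-saturation, and related queries retrieve nearby lessons.

class ExperienceStore:
    def __init__(self, max_items=10_000):
        self.items = OrderedDict()
        self.max_items = max_items

    def distill(self, query, lesson):
        """Keep one lesson per query; evict the oldest when full."""
        self.items[query] = lesson
        self.items.move_to_end(query)
        if len(self.items) > self.max_items:
            self.items.popitem(last=False)

    def retrieve(self, query, k=3):
        """Naive relevance: word overlap with stored queries."""
        words = set(query.lower().split())
        scored = sorted(
            self.items.items(),
            key=lambda kv: len(words & set(kv[0].lower().split())),
            reverse=True,
        )
        return [lesson for _, lesson in scored[:k]]
```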
- Omni-Thinker: Scaling Multi-Task RL in LLMs with Hybrid Reward and Task Scheduling [66.0871543682453]
We present Omni-Thinker, a unified reinforcement learning framework that scales large language models across diverse tasks. Our scheduler orders tasks according to accuracy backward transfer (BWT), reducing forgetting and improving multi-task performance.
arXiv Detail & Related papers (2025-07-20T01:50:16Z)
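The BWT quantity is standard in continual learning; below is a small sketch of computing it from an accuracy matrix and ordering tasks with it. The scheduling rule shown (train the most-forgotten tasks later) is an assumption for illustration, not necessarily Omni-Thinker's exact policy.

```python
import numpy as np

# acc[i, j] = accuracy on task j measured after finishing training
# stage i (square matrix over n tasks, trained in index order).

def backward_transfer(acc: np.ndarray) -> np.ndarray:
    """BWT_j = final accuracy on task j minus accuracy right after task j."""
    return acc[-1, :] - np.diag(acc)   # length-n; last entry is 0 by construction

def schedule_by_bwt(tasks, bwt):
    """Heuristic: place tasks that forget the most (lowest BWT) later."""
    order = np.argsort(bwt)            # ascending: most negative BWT first
    return [tasks[i] for i in order[::-1]]
```

Tasks with strongly negative BWT lose accuracy as later tasks are trained, so training them late in the curriculum leaves less opportunity for that forgetting to occur.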
- Planning without Search: Refining Frontier LLMs with Offline Goal-Conditioned RL [62.984693936073974]
Large language models (LLMs) excel in tasks like question answering and dialogue. Complex interactive tasks, such as negotiation and persuasion, additionally demand long-horizon reasoning and planning. We propose a novel approach that uses goal-conditioned value functions to guide the reasoning of LLM agents.
arXiv Detail & Related papers (2025-05-23T16:51:54Z)
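The guidance mechanism can be pictured as best-of-N reranking with a learned value function, which avoids an explicit search tree at inference time. The interfaces below are hypothetical, not the paper's actual architecture:

```python
# Sketch of value-guided action selection: sample candidate utterances
# from the LLM, score each with an offline-trained goal-conditioned
# value function Q(state, action, goal), and act on the best one.

def act(llm, q_value, dialogue_state, goal, num_candidates=8):
    candidates = [llm.sample(dialogue_state) for _ in range(num_candidates)]
    scores = [q_value(dialogue_state, a, goal) for a in candidates]
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best]
```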
- MLE-Dojo: Interactive Environments for Empowering LLM Agents in Machine Learning Engineering [57.156093929365255]
A Gym-style framework for systematically training, evaluating, and improving autonomous large language model (LLM) agents via reinforcement learning. MLE-Dojo covers diverse, open-ended MLE tasks carefully curated to reflect realistic engineering scenarios. Its fully executable environment supports comprehensive agent training via both supervised fine-tuning and reinforcement learning.
arXiv Detail & Related papers (2025-05-12T17:35:43Z)
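The Gym-style loop the abstract refers to follows the familiar reset/step contract; here is a generic sketch (the environment and agent names are illustrative, not MLE-Dojo's real API):

```python
# Generic Gym-style episode: the agent iteratively edits or runs ML
# code, and the executable environment returns feedback as the next
# observation plus a scalar reward.

def run_episode(env, agent, max_steps=50):
    obs = env.reset()                  # e.g. task spec + starter code
    total = 0.0
    for _ in range(max_steps):
        action = agent.act(obs)        # e.g. a code edit or shell command
        obs, reward, done, info = env.step(action)   # executes the action
        total += reward                # e.g. validation-metric improvement
        if done:
            break
    return total
```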
- LLM Post-Training: A Deep Dive into Reasoning Large Language Models [131.10969986056]
Large Language Models (LLMs) have transformed the natural language processing landscape and brought to life diverse applications. Post-training methods enable LLMs to refine their knowledge, improve reasoning, enhance factual accuracy, and align more effectively with user intents and ethical considerations.
arXiv Detail & Related papers (2025-02-28T18:59:54Z)
- TL-Training: A Task-Feature-Based Framework for Training Large Language Models in Tool Use [72.32614703504122]
Large language models (LLMs) achieve remarkable advancements by leveraging tools to interact with environments. The standard supervised fine-tuning approach, which relies on large-scale datasets, often overlooks task-specific characteristics in tool use. We propose TL-Training, a task-feature-based framework that mitigates the effects of suboptimal training data.
arXiv Detail & Related papers (2024-12-20T02:21:36Z)
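Since the abstract does not spell out the mechanism, here is one generic way to "mitigate suboptimal training data" during fine-tuning, shown purely for illustration and not as TL-Training's actual method: down-weight tokens that a task-feature heuristic flags as unreliable.

```python
import torch
import torch.nn.functional as F

# Generic illustration: per-token loss weighting during SFT. `weights`
# marks tokens judged unreliable (0.0) vs. trustworthy (1.0) by some
# external heuristic; flagged tokens contribute less to the gradient.

def weighted_sft_loss(logits, targets, weights):
    """Cross-entropy where flagged tokens are down-weighted."""
    per_token = F.cross_entropy(
        logits.view(-1, logits.size(-1)), targets.view(-1), reduction="none"
    )
    per_token = per_token * weights.view(-1)
    return per_token.sum() / weights.sum().clamp(min=1.0)
```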
- Teaching Models to Improve on Tape [30.330699770714165]
Large Language Models (LLMs) often struggle when prompted to generate content under specific constraints.
Recent works have shown that LLMs can benefit from such "corrective feedback". We introduce an RL framework for teaching models to use such feedback: interaction sessions are simulated, and the model is rewarded according to its ability to satisfy the constraints.
arXiv Detail & Related papers (2024-11-03T08:49:55Z)
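A compact sketch of such a simulated correction session and its constraint-based reward; all function names here are assumptions, not the paper's implementation:

```python
# The reward each turn is the fraction of constraints the latest
# attempt satisfies; unmet constraints are fed back so the model can
# revise, and the final reward drives the RL update.

def constraint_reward(output, constraints):
    satisfied = sum(check(output) for check in constraints)
    return satisfied / len(constraints)

def simulate_session(model, prompt, constraints, turns=3):
    history, reward = prompt, 0.0
    for _ in range(turns):
        output = model.generate(history)
        reward = constraint_reward(output, constraints)
        if reward == 1.0:
            break
        failed = [c.__name__ for c in constraints if not c(output)]
        history += f"\nAttempt: {output}\nUnmet constraints: {failed}\nRevise:"
    return reward      # used as the RL training signal
```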
- LMRL Gym: Benchmarks for Multi-Turn Reinforcement Learning with Language Models [56.25156596019168]
This paper introduces the LMRL-Gym benchmark for evaluating multi-turn RL for large language models (LLMs).
Our benchmark consists of 8 different language tasks that require multiple rounds of interaction and span open-ended dialogue and text games.
arXiv Detail & Related papers (2023-11-30T03:59:31Z)
- Model-Based Reinforcement Learning with Multi-Task Offline Pretraining [59.82457030180094]
We present a model-based RL method that learns to transfer potentially useful dynamics and action demonstrations from offline data to a novel task.
The main idea is to use the world models not only as simulators for behavior learning but also as tools to measure task relevance.
We demonstrate the advantages of our approach compared with the state-of-the-art methods in Meta-World and DeepMind Control Suite.
arXiv Detail & Related papers (2023-06-06T02:24:41Z)
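One simple way to use a world model as a task-relevance measure, sketched under the assumption that relevance is inversely related to dynamics-prediction error on the source-task data (the paper's actual measure may differ):

```python
import numpy as np

# Lower one-step prediction error of the target-task world model on a
# source task's transitions suggests more transferable dynamics, so
# that source task gets a larger sampling weight.

def task_relevance(world_model, source_batch):
    """Negative mean dynamics-prediction error on source transitions."""
    errors = []
    for s, a, s_next in source_batch:
        pred = world_model.predict(s, a)           # predicted next state
        errors.append(np.mean((pred - s_next) ** 2))
    return -float(np.mean(errors))                 # higher = more relevant

def weight_sources(world_model, batches_by_task):
    """Softmax over relevance scores -> per-task sampling weights."""
    scores = {t: task_relevance(world_model, b)
              for t, b in batches_by_task.items()}
    z = sum(np.exp(list(scores.values())))
    return {t: float(np.exp(s)) / z for t, s in scores.items()}
```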
- On Context Distribution Shift in Task Representation Learning for Offline Meta RL [7.8317653074640186]
We focus on context-based OMRL, specifically on the challenge of learning task representations. To overcome context distribution shift, we present a hard-sampling-based strategy to train a robust task context encoder.
arXiv Detail & Related papers (2023-04-01T16:21:55Z)
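A minimal sketch of hard-negative sampling for a contrastive task-context encoder, assuming an InfoNCE-style objective; this illustrates the idea rather than the paper's exact procedure:

```python
import torch
import torch.nn.functional as F

# Among negative contexts (drawn from other tasks), keep only the
# hardest: those whose embeddings lie closest to the anchor. Training
# against these confusable negatives makes the encoder more robust to
# context distribution shift.

def hard_negative_loss(encoder, anchor_ctx, positive_ctx, negative_ctxs, k=4):
    a = encoder(anchor_ctx)                                   # (d,)
    p = encoder(positive_ctx)                                 # same task, (d,)
    negs = torch.stack([encoder(c) for c in negative_ctxs])   # (n, d)
    sims = F.cosine_similarity(a.unsqueeze(0), negs)          # (n,)
    hard = negs[sims.topk(k).indices]                         # k most confusable
    pos_sim = F.cosine_similarity(a, p, dim=0)
    logits = torch.cat(
        [pos_sim.view(1), F.cosine_similarity(a.unsqueeze(0), hard)]
    )
    # InfoNCE with temperature 0.1; the positive is class 0.
    return F.cross_entropy(logits.unsqueeze(0) / 0.1,
                           torch.zeros(1, dtype=torch.long))
```

Selecting only the top-k most similar negatives is what makes this a hard-sampling strategy: easy negatives contribute little gradient, while confusable ones force the encoder to separate tasks that look alike in context.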