Related papers: Filtering Learning Histories Enhances In-Context Reinforcement Learning

URL: http://arxiv.org/abs/2505.15143v1
Date: Wed, 21 May 2025 06:00:41 GMT
Title: Filtering Learning Histories Enhances In-Context Reinforcement Learning
Authors: Weiqin Chen, Xinjie Zhang, Dharmashankar Subramanian, Santiago Paternain
Abstract summary: Transformer models (TMs) have exhibited remarkable in-context reinforcement learning capabilities. We propose a simple yet effective approach, learning history filtering (LHF), to enhance ICRL. LHF is the first approach to avoid source suboptimality by dataset preprocessing.
Score: 12.697029805927398
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Transformer models (TMs) have exhibited remarkable in-context reinforcement learning (ICRL) capabilities, allowing them to generalize to and improve in previously unseen environments without re-training or fine-tuning. This is typically accomplished by imitating the complete learning histories of a source RL algorithm over a substantial amount of pretraining environments, which, however, may transfer suboptimal behaviors inherited from the source algorithm/dataset. Therefore, in this work, we address the issue of inheriting suboptimality from the perspective of dataset preprocessing. Motivated by the success of the weighted empirical risk minimization, we propose a simple yet effective approach, learning history filtering (LHF), to enhance ICRL by reweighting and filtering the learning histories based on their improvement and stability characteristics. To the best of our knowledge, LHF is the first approach to avoid source suboptimality by dataset preprocessing, and can be combined with the current state-of-the-art (SOTA) ICRL algorithms. We substantiate the effectiveness of LHF through a series of experiments conducted on the well-known ICRL benchmarks, encompassing both discrete environments and continuous robotic manipulation tasks, with three SOTA ICRL algorithms (AD, DPT, DICP) as the backbones. LHF exhibits robust performance across a variety of suboptimal scenarios, as well as under varying hyperparameters and sampling strategies. Notably, the superior performance of LHF becomes more pronounced in the presence of noisy data, indicating the significance of filtering learning histories.
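The abstract describes LHF only at a high level: learning histories are reweighted and filtered according to their improvement and stability. The following is a minimal sketch of that idea, not the authors' method. The metric definitions (last-minus-first return for improvement, spread of step-to-step changes for instability), the trade-off parameter `lam`, the `keep_fraction` cutoff, and the function names are all assumptions made for illustration.

```python
from statistics import pstdev


def history_score(returns, lam=1.0):
    """Score one learning history given its per-episode returns.

    Assumed metrics (not taken from the paper):
      improvement = final return minus initial return
      instability = population std. dev. of episode-to-episode changes
    """
    improvement = returns[-1] - returns[0]
    diffs = [b - a for a, b in zip(returns, returns[1:])]
    instability = pstdev(diffs) if diffs else 0.0
    return improvement - lam * instability


def filter_histories(histories, keep_fraction=0.5, lam=1.0):
    """Return (indices of kept histories, min-max normalized weights)."""
    scores = [history_score(h, lam) for h in histories]
    lo, hi = min(scores), max(scores)
    # Min-max normalize scores into [0, 1]; all-equal scores get weight 1.
    weights = [(s - lo) / (hi - lo) if hi > lo else 1.0 for s in scores]
    n_keep = max(1, round(keep_fraction * len(histories)))
    # Keep the highest-weighted histories, reported in original order.
    ranked = sorted(range(len(histories)), key=lambda i: -weights[i])
    return sorted(ranked[:n_keep]), weights


# Toy per-episode returns for four hypothetical learning histories.
histories = [
    [0, 1, 2, 3, 4],  # steady improvement
    [0, 3, 1, 4, 2],  # improves, but erratically
    [2, 2, 2, 2, 2],  # flat
    [4, 3, 2, 1, 0],  # degrades
]
kept, weights = filter_histories(histories, keep_fraction=0.5)
print(kept, weights)
```

In this sketch the kept histories would then form the pretraining dataset for an ICRL backbone; the weights could alternatively serve as sampling probabilities rather than a hard cutoff.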

Related papers

Adaptive Replay Buffer for Offline-to-Online Reinforcement Learning [29.513882808306406]
We introduce the Adaptive Replay Buffer (ARB), a novel approach that prioritizes data sampling based on a lightweight metric we call 'on-policyness'. ARB is designed to be learning-free and simple to implement, seamlessly integrating into existing Offline-to-Online Reinforcement Learning algorithms. Our experiments on D4RL benchmarks demonstrate that ARB consistently mitigates early performance degradation and significantly improves the final performance of various O2O RL algorithms.
arXiv Detail & Related papers (2025-12-11T10:30:04Z)
Data-Efficient RLVR via Off-Policy Influence Guidance [84.60336960383867]
This work proposes a theoretically grounded approach using influence functions to estimate the contribution of each data point to the learning objective. We develop Curriculum RL with Off-Policy Influence guidance (CROPI), a multi-stage RL framework that iteratively selects the most influential data for the current policy.
arXiv Detail & Related papers (2025-10-30T13:40:52Z)
Reinforcement Learning on Pre-Training Data [55.570379963147424]
We introduce Reinforcement Learning on Pre-Training data (R), a new training-time scaling paradigm for optimizing large language models (LLMs). R enables the policy to autonomously explore meaningful trajectories to learn from pre-training data and improve its capability through reinforcement learning (RL). Extensive experiments on both general-domain and mathematical reasoning benchmarks across multiple models validate the effectiveness of R.
arXiv Detail & Related papers (2025-09-23T17:10:40Z)
Sample-efficient LLM Optimization with Reset Replay [13.739451157239756]
We introduce Reset Replay (LoRR), a plugin designed to enhance sample efficiency in any preference-based optimization framework. LoRR incorporates a periodic reset strategy with reusing initial data, which preserves network plasticity. Our experiments demonstrate that LoRR significantly boosts the performance of various preference optimization methods on both mathematical and general reasoning benchmarks.
arXiv Detail & Related papers (2025-08-08T15:56:49Z)
KARE-RAG: Knowledge-Aware Refinement and Enhancement for RAG [63.82127103851471]
Retrieval-Augmented Generation (RAG) enables large language models to access broader knowledge sources. We demonstrate that enhancing generative models' capacity to process noisy content is equally critical for robust performance. We present KARE-RAG, which improves knowledge utilization through three key innovations.
arXiv Detail & Related papers (2025-06-03T06:31:17Z)
A Snapshot of Influence: A Local Data Attribution Framework for Online Reinforcement Learning [45.19254609437857]
Online reinforcement learning (RL) excels in complex, safety-critical domains but suffers from sample inefficiency, training instability, and limited interpretability. Data attribution provides a principled way to trace model behavior back to training samples. We propose an algorithm, iterative influence-based filtering (IIF), for online RL training that iteratively performs experience filtering to refine policy updates.
arXiv Detail & Related papers (2025-05-25T19:25:57Z)
Provably Efficient Online RLHF with One-Pass Reward Modeling [70.82499103200402]
Reinforcement Learning from Human Feedback has shown remarkable success in aligning Large Language Models with human preferences. Online RLHF has emerged as a promising direction, enabling iterative data collection and refinement. We propose a one-pass reward modeling method that eliminates the need to store historical data and achieves constant-time updates per iteration.
arXiv Detail & Related papers (2025-02-11T02:36:01Z)
Dynamic Loss-Based Sample Reweighting for Improved Large Language Model Pretraining [55.262510814326035]
Existing reweighting strategies primarily focus on group-level data importance. We introduce novel algorithms for dynamic, instance-level data reweighting. Our framework allows us to devise reweighting strategies deprioritizing redundant or uninformative data.
arXiv Detail & Related papers (2025-02-10T17:57:15Z)
Q-SFT: Q-Learning for Language Models via Supervised Fine-Tuning [62.984693936073974]
Value-based reinforcement learning can learn effective policies for a wide range of multi-turn problems. Current value-based RL methods have proven particularly challenging to scale to the setting of large language models. We propose a novel offline RL algorithm that addresses these drawbacks, casting Q-learning as a modified supervised fine-tuning problem.
arXiv Detail & Related papers (2024-11-07T21:36:52Z)
Dynamic Learning Rate for Deep Reinforcement Learning: A Bandit Approach [0.9549646359252346]
In deep Reinforcement Learning (RL) models trained using gradient-based techniques, the choice of gradient and its learning rate are crucial to achieving good performance. We propose dynamic Learning Rate for deep Reinforcement Learning (LRRL), a meta-learning approach that selects the learning rate based on the agent's performance during training.
arXiv Detail & Related papers (2024-10-16T14:15:28Z)
A Distribution-Aware Flow-Matching for Generating Unstructured Data for Few-Shot Reinforcement Learning [1.0709300917082865]
We introduce a distribution-aware flow matching approach to generate synthetic unstructured data for few-shot reinforcement learning. Our approach addresses key challenges in traditional model-based RL, such as overfitting and data correlation. Results demonstrate that our method achieves stable convergence in terms of maximum Q-value while enhancing frame rates by 30% in the initial timestamps.
arXiv Detail & Related papers (2024-09-21T15:50:59Z)
CDSA: Conservative Denoising Score-based Algorithm for Offline Reinforcement Learning [25.071018803326254]
Distribution shift is a major obstacle in offline reinforcement learning. Previous conservative offline RL algorithms struggle to generalize to unseen actions. We propose to use the gradient fields of the dataset density generated from a pre-trained offline RL algorithm to adjust the original actions.
arXiv Detail & Related papers (2024-06-11T17:59:29Z)
How Can LLM Guide RL? A Value-Based Approach [68.55316627400683]
Reinforcement learning (RL) has become the de facto standard practice for sequential decision-making problems by improving future acting policies with feedback. Recent developments in large language models (LLMs) have showcased impressive capabilities in language understanding and generation, yet they fall short in exploration and self-improvement capabilities. We develop an algorithm named LINVIT that incorporates LLM guidance as a regularization factor in value-based RL, leading to significant reductions in the amount of data needed for learning.
arXiv Detail & Related papers (2024-02-25T20:07:13Z)
Take the Bull by the Horns: Hard Sample-Reweighted Continual Training Improves LLM Generalization [165.98557106089777]
A key challenge is to enhance the capabilities of large language models (LLMs) amid a looming shortage of high-quality training data. Our study starts from an empirical strategy for the light continual training of LLMs using their original pre-training data sets. We then formalize this strategy into a principled framework of Instance-Reweighted Distributionally Robust Optimization.
arXiv Detail & Related papers (2024-02-22T04:10:57Z)
Task Aware Modulation using Representation Learning: An Approach for Few Shot Learning in Environmental Systems [15.40286222692196]
TAM-RL is a novel framework for few-shot learning in heterogeneous systems. We evaluate TAM-RL on two real-world environmental datasets.
arXiv Detail & Related papers (2023-10-07T07:55:22Z)
Provable Reward-Agnostic Preference-Based Reinforcement Learning [61.39541986848391]
Preference-based Reinforcement Learning (PbRL) is a paradigm in which an RL agent learns to optimize a task using pair-wise preference-based feedback over trajectories. We propose a theoretical reward-agnostic PbRL framework where exploratory trajectories that enable accurate learning of hidden reward functions are acquired.
arXiv Detail & Related papers (2023-05-29T15:00:09Z)
Critic Regularized Regression [70.8487887738354]
We propose a novel offline RL algorithm to learn policies from data using a form of critic-regularized regression (CRR). We find that CRR performs surprisingly well and scales to tasks with high-dimensional state and action spaces.
arXiv Detail & Related papers (2020-06-26T17:50:26Z)

This list is automatically generated from the titles and abstracts of the papers on this site.