Blending Imitation and Reinforcement Learning for Robust Policy
Improvement
- URL: http://arxiv.org/abs/2310.01737v2
- Date: Wed, 4 Oct 2023 07:28:17 GMT
- Title: Blending Imitation and Reinforcement Learning for Robust Policy
Improvement
- Authors: Xuefeng Liu, Takuma Yoneda, Rick L. Stevens, Matthew R. Walter, Yuxin
Chen
- Abstract summary: Imitation learning (IL) utilizes oracles to improve sample efficiency.
RPI draws on the strengths of IL, using oracle queries to facilitate exploration.
RPI is capable of learning from and improving upon a diverse set of black-box oracles.
- Score: 16.588397203235296
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While reinforcement learning (RL) has shown promising performance, its sample
complexity continues to be a substantial hurdle, restricting its broader
application across a variety of domains. Imitation learning (IL) utilizes
oracles to improve sample efficiency, yet it is often constrained by the
quality of the oracles deployed. To combine the strengths of both, the paper
introduces Robust Policy Improvement (RPI), which actively interleaves between
IL and RL based on an online estimate of their performance. RPI draws on the strengths of
IL, using oracle queries to facilitate exploration, an aspect that is notably
challenging in sparse-reward RL, particularly during the early stages of
learning. As learning unfolds, RPI gradually transitions to RL, effectively
treating the learned policy as an improved oracle. This algorithm is capable of
learning from and improving upon a diverse set of black-box oracles. Integral
to RPI are Robust Active Policy Selection (RAPS) and Robust Policy Gradient
(RPG), both of which reason over whether to perform state-wise imitation from
the oracles or learn from its own value function when the learner's performance
surpasses that of the oracles in a specific state. Empirical evaluations and
theoretical analysis validate that RPI excels in comparison to existing
state-of-the-art methodologies, demonstrating superior performance across
various benchmark domains.
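The state-wise IL/RL switch described in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the callables `learner`, `oracles`, and `value_est` are hypothetical stand-ins for the actual RAPS/RPG machinery.

```python
def rpi_action_sketch(state, learner, oracles, value_est):
    """Sketch of RPI's per-state choice: imitate the strongest oracle,
    or act with the learner's own policy once it surpasses every oracle
    at this state (hypothetical helper names, not the paper's API)."""
    # Online value estimates for each black-box oracle at this state.
    oracle_values = [value_est(pi, state) for pi in oracles]
    best = max(range(len(oracles)), key=lambda i: oracle_values[i])
    if value_est(learner, state) >= oracle_values[best]:
        # Learner has surpassed all oracles here: follow its own policy (RL).
        return learner(state)
    # Otherwise perform state-wise imitation of the best oracle (IL).
    return oracles[best](state)
```

As the learner's estimated value overtakes the oracles' in more states, this rule naturally shifts from imitation to reinforcement learning, matching the gradual transition the abstract describes.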
Related papers
- Retrospective In-Context Learning for Temporal Credit Assignment with Large Language Models [41.59440766417004]
Temporal credit assignment mitigates this issue by transforming sparse feedback into dense supervision signals. We propose to leverage pretrained knowledge from large language models (LLMs) to transform sparse rewards into dense training signals. An online learning framework, RICOL, iteratively refines the policy based on the credit assignment results.
arXiv Detail & Related papers (2026-02-19T16:13:28Z) - Sample-Efficient Neurosymbolic Deep Reinforcement Learning [49.60927398960061]
We propose a neuro-symbolic Deep RL approach that integrates background symbolic knowledge to improve sample efficiency. Online reasoning is performed to guide the training process through two mechanisms. We show improved performance over a state-of-the-art reward machine baseline.
arXiv Detail & Related papers (2026-01-06T09:28:53Z) - CurES: From Gradient Analysis to Efficient Curriculum Learning for Reasoning LLMs [53.749193998004166]
Curriculum learning plays a crucial role in enhancing the training efficiency of large language models. We propose CurES, an efficient training method that accelerates convergence and employs Bayesian posterior estimation to minimize computational overhead.
arXiv Detail & Related papers (2025-10-01T15:41:27Z) - On Discovering Algorithms for Adversarial Imitation Learning [28.812210809286086]
We present Discovered Adversarial Imitation Learning (DAIL), the first meta-learnt AIL algorithm. We show that DAIL generalises across unseen environments and policy optimization algorithms. We also analyse why DAIL leads to more stable training, offering novel insights into the role of RA functions in the stability of AIL.
arXiv Detail & Related papers (2025-10-01T14:02:05Z) - Agentic Reinforced Policy Optimization [66.96989268893932]
Large-scale reinforcement learning with verifiable rewards (RLVR) has demonstrated its effectiveness in harnessing the potential of large language models (LLMs) for single-turn reasoning tasks. Current RL algorithms inadequately balance the models' intrinsic long-horizon reasoning capabilities and their proficiency in multi-turn tool interactions. We propose Agentic Reinforced Policy Optimization (ARPO), a novel agentic RL algorithm tailored for training multi-turn LLM-based agents.
arXiv Detail & Related papers (2025-07-26T07:53:11Z) - Scaling Up RL: Unlocking Diverse Reasoning in LLMs via Prolonged Training [121.5858973157225]
We investigate the effects of prolonged reinforcement learning on a small language model across a diverse set of reasoning domains. We introduce controlled KL regularization, clipping ratio, and periodic reference policy resets as critical components for unlocking long-term performance gains. Our model achieves significant improvements over strong baselines, including +14.7% on math, +13.9% on coding, and +54.8% on logic puzzle tasks.
arXiv Detail & Related papers (2025-07-16T17:59:24Z) - What Matters for Batch Online Reinforcement Learning in Robotics? [65.06558240091758]
The ability to learn from large batches of autonomously collected data for policy improvement holds the promise of enabling truly scalable robot learning. Previous works have applied imitation learning and filtered imitation learning methods to the batch online RL problem. We analyze how these axes affect performance and scaling with the amount of autonomous data.
arXiv Detail & Related papers (2025-05-12T21:24:22Z) - Effective Inference-Free Retrieval for Learned Sparse Representations [19.54810957623511]
Learned Sparse Retrieval (LSR) is an effective IR approach that exploits pre-trained language models for encoding text into a learned bag of words. Recently, new efficient, inverted index-based retrieval engines have been proposed, leading to a natural question: has the role of regularization changed in training LSR models? We show that regularization can be relaxed to produce more effective LSR encoders.
arXiv Detail & Related papers (2025-04-30T09:10:46Z) - Improving Multi-Step Reasoning Abilities of Large Language Models with Direct Advantage Policy Optimization [22.67700436936984]
We introduce Direct Advantage Policy Optimization (DAPO), a novel step-level offline reinforcement learning algorithm.
DAPO employs a critic function to predict the reasoning accuracy at each step, thereby generating dense signals to refine the generation strategy.
Our results show that DAPO can effectively enhance the mathematical and code capabilities on both SFT models and RL models, demonstrating the effectiveness of DAPO.
arXiv Detail & Related papers (2024-12-24T08:39:35Z) - Dynamic Learning Rate for Deep Reinforcement Learning: A Bandit Approach [0.9549646359252346]
We propose dynamic Learning Rate for deep Reinforcement Learning (LRRL).
LRRL is a meta-learning approach that selects the learning rate based on the agent's performance during training.
Our empirical results demonstrate that LRRL can substantially improve the performance of deep RL algorithms.
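LRRL frames learning-rate selection as a multi-armed bandit over the agent's training returns. The sketch below illustrates that framing under stated assumptions: the class name and the choice of UCB1 as the bandit rule are illustrative, not the paper's exact method.

```python
import math

class LrBandit:
    """Sketch of a bandit over candidate learning rates, in the spirit
    of LRRL: each arm is a learning rate, rewarded by the agent's
    recent return (UCB1 selection; illustrative, not the paper's API)."""
    def __init__(self, rates):
        self.rates = rates
        self.counts = [0] * len(rates)
        self.means = [0.0] * len(rates)

    def select(self):
        # Play each arm once first, then pick by the UCB1 score.
        for i, c in enumerate(self.counts):
            if c == 0:
                return i
        total = sum(self.counts)
        ucb = [m + math.sqrt(2 * math.log(total) / c)
               for m, c in zip(self.means, self.counts)]
        return max(range(len(ucb)), key=lambda i: ucb[i])

    def update(self, arm, reward):
        # Incremental mean update for the chosen learning rate.
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]
```

Between training intervals, the agent would call `select()` to pick the next learning rate and `update()` with the return observed under it.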
arXiv Detail & Related papers (2024-10-16T14:15:28Z) - LLMs Are In-Context Reinforcement Learners [30.192422586838997]
Large Language Models (LLMs) can learn new tasks through in-context supervised learning (i.e., ICL).
This work studies if this ability extends to in-context reinforcement learning (ICRL), where models are not given gold labels in context, but only their past predictions and rewards.
We propose an algorithm to address this deficiency by increasing test-time compute, as well as a compute-bound approximation.
arXiv Detail & Related papers (2024-10-07T17:45:00Z) - VinePPO: Refining Credit Assignment in RL Training of LLMs [66.80143024475635]
We propose VinePPO, a straightforward approach that leverages the flexibility of language environments to compute unbiased Monte Carlo-based estimates. Our method consistently outperforms PPO and other baselines across MATH and GSM8K datasets in less wall-clock time.
arXiv Detail & Related papers (2024-10-02T15:49:30Z) - Is Value Learning Really the Main Bottleneck in Offline RL? [70.54708989409409]
We show that the choice of a policy extraction algorithm significantly affects the performance and scalability of offline RL.
We propose two simple test-time policy improvement methods and show that these methods lead to better performance.
arXiv Detail & Related papers (2024-06-13T17:07:49Z) - Efficient Recurrent Off-Policy RL Requires a Context-Encoder-Specific Learning Rate [4.6659670917171825]
Recurrent reinforcement learning (RL) consists of a context encoder based on recurrent neural networks (RNNs) for unobservable state prediction.
Previous RL methods face training stability issues due to the gradient instability of RNNs.
We propose Recurrent Off-policy RL with Context-Encoder-Specific Learning Rate (RESeL) to tackle this issue.
arXiv Detail & Related papers (2024-05-24T09:33:47Z) - REBEL: Reinforcement Learning via Regressing Relative Rewards [59.68420022466047]
We propose REBEL, a minimalist RL algorithm for the era of generative models.
In theory, we prove that fundamental RL algorithms like Natural Policy Gradient can be seen as variants of REBEL.
We find that REBEL provides a unified approach to language modeling and image generation with stronger or similar performance as PPO and DPO.
arXiv Detail & Related papers (2024-04-25T17:20:45Z) - How Can LLM Guide RL? A Value-Based Approach [68.55316627400683]
Reinforcement learning (RL) has become the de facto standard practice for sequential decision-making problems by improving future acting policies with feedback.
Recent developments in large language models (LLMs) have showcased impressive capabilities in language understanding and generation, yet they fall short in exploration and self-improvement capabilities.
We develop an algorithm named LINVIT that incorporates LLM guidance as a regularization factor in value-based RL, leading to significant reductions in the amount of data needed for learning.
arXiv Detail & Related papers (2024-02-25T20:07:13Z) - Reinforcement Learning-assisted Evolutionary Algorithm: A Survey and
Research Opportunities [63.258517066104446]
Reinforcement learning integrated as a component in the evolutionary algorithm has demonstrated superior performance in recent years.
We discuss the RL-EA integration method, the RL-assisted strategy adopted by RL-EA, and its applications according to the existing literature.
In the applications of RL-EA section, we also demonstrate the excellent performance of RL-EA on several benchmarks and a range of public datasets.
arXiv Detail & Related papers (2023-08-25T15:06:05Z) - Reinforcement Learning with Stepwise Fairness Constraints [50.538878453547966]
We introduce the study of reinforcement learning with stepwise fairness constraints.
We provide learning algorithms with strong theoretical guarantees in regard to policy optimality and fairness violation.
arXiv Detail & Related papers (2022-11-08T04:06:23Z) - Reinforcement Learning to Rank Using Coarse-grained Rewards [17.09775943683446]
Coarse-grained feedback signals are more accessible and affordable. Existing Reinforcement Learning to Rank approaches suffer from high variance and low sample efficiency. We propose new Reinforcement Learning to Rank methods based on widely used RL algorithms for large language models.
arXiv Detail & Related papers (2022-08-16T06:55:19Z) - Jump-Start Reinforcement Learning [68.82380421479675]
We present a meta algorithm that can use offline data, demonstrations, or a pre-existing policy to initialize an RL policy.
In particular, we propose Jump-Start Reinforcement Learning (JSRL), an algorithm that employs two policies to solve tasks.
We show via experiments that JSRL is able to significantly outperform existing imitation and reinforcement learning algorithms.
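JSRL's two-policy structure, a guide policy that "jump-starts" the first part of each episode and a learning policy that finishes it, can be sketched as follows. The environment and policy callables here are hypothetical, and the curriculum that shrinks `guide_steps` as the learner improves is omitted.

```python
def jsrl_rollout(env_reset, env_step, guide, learner, horizon, guide_steps):
    """Sketch of a JSRL-style rollout: a pre-existing guide policy acts
    for the first `guide_steps` steps, then the learning policy takes
    over for the rest of the episode (hypothetical interfaces)."""
    state = env_reset()
    trajectory = []
    for t in range(horizon):
        # Hand off from guide to learner once the jump-start prefix ends.
        policy = guide if t < guide_steps else learner
        action = policy(state)
        state, reward, done = env_step(state, action)
        trajectory.append((action, reward))
        if done:
            break
    return trajectory
```

Starting the learner from states the guide reaches, rather than from the initial state, is what lets JSRL sidestep the hard exploration problem early in training.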
arXiv Detail & Related papers (2022-04-05T17:25:22Z) - CARL: A Benchmark for Contextual and Adaptive Reinforcement Learning [45.52724876199729]
We present CARL, a collection of well-known RL environments extended to contextual RL problems.
We provide first evidence that disentangling representation learning of the states from the policy learning with the context facilitates better generalization.
arXiv Detail & Related papers (2021-10-05T15:04:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.