RAPID: An Efficient Reinforcement Learning Algorithm for Small Language Models
- URL: http://arxiv.org/abs/2510.03515v1
- Date: Fri, 03 Oct 2025 20:58:49 GMT
- Title: RAPID: An Efficient Reinforcement Learning Algorithm for Small Language Models
- Authors: Lianghuan Huang, Sagnik Anupam, Insup Lee, Shuo Li, Osbert Bastani
- Abstract summary: Reinforcement learning (RL) has emerged as a promising strategy for finetuning small language models (SLMs) to solve targeted tasks such as math and coding. However, RL algorithms tend to be resource-intensive, taking a significant amount of time to train. We propose a novel RL algorithm that can substantially reduce the running time of RL.
- Score: 27.643632808936403
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement learning (RL) has emerged as a promising strategy for finetuning small language models (SLMs) to solve targeted tasks such as math and coding. However, RL algorithms tend to be resource-intensive, taking a significant amount of time to train. We propose RAPID, a novel RL algorithm that can substantially reduce the running time of RL. Our key insight is that RL tends to be costly due to the need to perform both inference and backpropagation during training. To maximize use of computational resources, our algorithm performs inference in large batches, and then performs off-policy policy gradient updates in mini-batches. For off-policy updates, we incorporate group advantage estimation into the policy gradient algorithm, and derive an importance weighted estimator to correct for the bias arising from off-policy learning. Our experiments demonstrate that our algorithm can reduce running time by 11%-34% on three benchmarks compared to state-of-the-art RL algorithms while maintaining similar or better accuracy.
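Since the abstract only outlines the method, the toy example below illustrates the batching structure it describes: one large on-policy inference batch, followed by several off-policy mini-batch policy-gradient steps that use group advantage estimation and an importance weight to correct for the resulting bias. A bandit-style categorical policy stands in for an SLM; all names, hyperparameters, and the exact form of RAPID's bias-corrected estimator are assumptions, not the paper's implementation.

```python
import torch

torch.manual_seed(0)
NUM_ACTIONS, NUM_PROMPTS, GROUP_SIZE = 16, 32, 8
logits = torch.zeros(NUM_ACTIONS, requires_grad=True)  # toy stand-in for an SLM policy
optimizer = torch.optim.Adam([logits], lr=0.05)
reward = torch.randn(NUM_ACTIONS)                      # fixed reward per action

def group_advantages(r):
    # Group advantage estimation: normalize each sample's reward against
    # the other responses to the same prompt (GRPO-style, an assumption here).
    return (r - r.mean(dim=1, keepdim=True)) / (r.std(dim=1, keepdim=True) + 1e-6)

for step in range(20):
    # Inference phase: sample one large batch from the current policy.
    with torch.no_grad():
        probs = torch.softmax(logits, dim=-1)
        actions = torch.multinomial(probs.repeat(NUM_PROMPTS, 1),
                                    GROUP_SIZE, replacement=True)  # (prompts, group)
        old_logp = torch.log(probs[actions])
        adv = group_advantages(reward[actions])

    # Training phase: several off-policy mini-batch policy-gradient steps over
    # the same batch; the weight w corrects for the policy drifting away from
    # the one that generated the samples.
    a, lp, ad = actions.flatten(), old_logp.flatten(), adv.flatten()
    for mb in torch.randperm(a.numel()).split(64):
        new_logp = torch.log_softmax(logits, dim=-1)[a[mb]]
        w = torch.exp(new_logp - lp[mb]).detach()  # importance weight
        loss = -(w * new_logp * ad[mb]).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

Note how, after the first optimizer step, later mini-batches are trained off-policy with respect to the sampling policy; the detached ratio w plays the role of the importance-weighted correction the abstract describes.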
Related papers
- Transitive RL: Value Learning via Divide and Conquer [54.190627631246166]
Transitive Reinforcement Learning (TRL) is a new value learning algorithm based on a divide-and-conquer paradigm. Unlike Monte Carlo methods, TRL suffers less from high variance as it performs dynamic programming.
arXiv Detail & Related papers (2025-10-26T03:32:31Z) - Learning to Reason as Action Abstractions with Scalable Mid-Training RL [55.24192942739207]
An effective mid-training phase should identify a compact set of useful actions and enable fast selection. We propose Reasoning as Action Abstractions (RA3), a scalable mid-training algorithm.
arXiv Detail & Related papers (2025-09-30T05:34:20Z) - How Should We Meta-Learn Reinforcement Learning Algorithms? [69.12853522797188]
We carry out an empirical comparison of the different approaches when applied to a range of meta-learned algorithms. In addition to meta-train and meta-test performance, we also investigate factors including the interpretability, sample cost and train time. We propose several guidelines for meta-learning new RL algorithms which will help ensure that future learned algorithms are as performant as possible.
arXiv Detail & Related papers (2025-07-23T16:31:38Z) - Effective Reinforcement Learning for Reasoning in Language Models [30.994610715391776]
Reinforcement learning (RL) has emerged as a promising strategy for improving the reasoning capabilities of language models (LMs) in domains such as mathematics and coding. We analyze RL algorithm design decisions for LM reasoning, focusing on relatively small models due to computational constraints. Our findings are: (i) on-policy RL significantly outperforms supervised fine-tuning (SFT), (ii) PPO-based off-policy updates increase accuracy instead of reducing variance, and (iii) removing KL divergence can lead to more concise generations and higher accuracy.
arXiv Detail & Related papers (2025-05-22T18:48:09Z) - Snapshot Reinforcement Learning: Leveraging Prior Trajectories for Efficiency [6.267119107674013]
Deep reinforcement learning (DRL) algorithms require substantial samples and computational resources to achieve higher performance.
We present the Snapshot Reinforcement Learning framework, which enhances sample efficiency by simply altering environments.
We propose a simple and effective SnapshotRL baseline algorithm, S3RL, which integrates well with existing DRL algorithms.
arXiv Detail & Related papers (2024-03-01T17:05:22Z) - How Can LLM Guide RL? A Value-Based Approach [68.55316627400683]
Reinforcement learning (RL) has become the de facto standard practice for sequential decision-making problems by improving future acting policies with feedback.
Recent developments in large language models (LLMs) have showcased impressive capabilities in language understanding and generation, yet they fall short in exploration and self-improvement capabilities.
We develop an algorithm named LINVIT that incorporates LLM guidance as a regularization factor in value-based RL, leading to significant reductions in the amount of data needed for learning.
arXiv Detail & Related papers (2024-02-25T20:07:13Z) - Jump-Start Reinforcement Learning [68.82380421479675]
We present a meta algorithm that can use offline data, demonstrations, or a pre-existing policy to initialize an RL policy.
In particular, we propose Jump-Start Reinforcement Learning (JSRL), an algorithm that employs two policies to solve tasks.
We show via experiments that JSRL is able to significantly outperform existing imitation and reinforcement learning algorithms.
arXiv Detail & Related papers (2022-04-05T17:25:22Z) - Constraint Sampling Reinforcement Learning: Incorporating Expertise For Faster Learning [43.562783189118]
We introduce a practical algorithm for incorporating human insight to speed learning.
Our algorithm, Constraint Sampling Reinforcement Learning (CSRL), incorporates prior domain knowledge as constraints/restrictions on the RL policy.
In all cases, CSRL learns a good policy faster than baselines.
arXiv Detail & Related papers (2021-12-30T22:02:42Z) - Deep Reinforcement Learning with Adjustments [10.244120641608447]
We propose a new Q-learning algorithm for continuous action spaces, which can bridge control and RL algorithms.
Our method can learn complex policies to achieve long-term goals, while at the same time being easy to adjust to address short-term requirements.
arXiv Detail & Related papers (2021-09-28T03:35:09Z) - A Minimalist Approach to Offline Reinforcement Learning [10.904148149681932]
Offline reinforcement learning defines the task of learning from a fixed batch of data.
In this paper we aim to make a deep RL algorithm work while making minimal changes.
We find that we can match the performance of state-of-the-art offline RL algorithms by simply adding a behavior cloning term to the policy update of an online RL algorithm (a sketch of such a term appears after this list).
arXiv Detail & Related papers (2021-06-12T20:38:59Z) - Evolving Reinforcement Learning Algorithms [186.62294652057062]
We propose a method for meta-learning reinforcement learning algorithms.
The learned algorithms are domain-agnostic and can generalize to new environments not seen during training.
We highlight two learned algorithms which obtain good generalization performance over other classical control tasks, gridworld-type tasks, and Atari games.
arXiv Detail & Related papers (2021-01-08T18:55:07Z)
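As a concrete reading of the "minimalist" recipe in the offline RL entry above (the sketch referenced after that item), the snippet below shows a behavior-cloning-regularized policy loss in a TD3+BC-style form: the actor is pushed toward high critic values while an MSE term anchors it to the actions in the fixed batch. The critic interface, the Q-normalization, and the constant lam are assumptions for illustration, not that paper's exact formulation.

```python
import torch

def bc_regularized_policy_loss(actor, critic, states, batch_actions, lam=2.5):
    # Policy update of an online actor-critic algorithm plus a behavior
    # cloning term that keeps pi(s) close to the dataset actions; lam
    # trades off the two objectives.
    pi_actions = actor(states)
    q = critic(states, pi_actions)
    scale = lam / q.abs().mean().detach()  # keeps both terms on a similar scale
    rl_term = -(scale * q).mean()          # maximize the critic's value estimate
    bc_term = torch.nn.functional.mse_loss(pi_actions, batch_actions)
    return rl_term + bc_term

# Example with throwaway networks (shapes only; no real offline dataset):
actor = torch.nn.Sequential(torch.nn.Linear(4, 2), torch.nn.Tanh())
q_net = torch.nn.Linear(6, 1)
critic = lambda s, a: q_net(torch.cat([s, a], dim=-1))
loss = bc_regularized_policy_loss(actor, critic, torch.randn(8, 4), torch.randn(8, 2))
loss.backward()
```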