Related papers: FastCuRL: Curriculum Reinforcement Learning with Stage-wise Context Scaling for Efficient Training R1-like Reasoning Models

Related papers

The Art of Efficient Reasoning: Data, Reward, and Optimization [20.542546956993363]
Large Language Models (LLMs) benefit from scaled Chain-of-Thought (CoT) reasoning, but also suffer from heavy computational overhead.<n> efficient reasoning aims to incentivize short yet accurate thinking trajectories, typically through reward shaping with Reinforcement Learning (RL)<n>We conduct extensive experiments (about 0.2 million GPU hours) in a unified protocol, deconstructing training prompts and rollouts, reward shaping, and optimization strategies.
arXiv Detail & Related papers (2026-02-24T14:28:16Z)
Boosting Accuracy and Efficiency of Budget Forcing in LLMs via Reinforcement Learning for Mathematical Reasoning [1.4348015996689416]
We propose a framework integrating reinforcement learning (RL) to improve token efficiency and boost the performance of a 1.5B model for mathematical reasoning.<n>Our main findings showed an overall higher accuracy while significantly reducing its token usage by over 40% compared to the SFT model.
arXiv Detail & Related papers (2025-10-24T12:39:15Z)
The Art of Scaling Reinforcement Learning Compute for LLMs [52.71086085139566]
Reinforcement learning (RL) has become central to training large language models.<n>Despite rapidly rising compute budgets, there is no principled understanding of how to evaluate algorithmic improvements for scaling RL compute.<n>We present the first large-scale systematic study, amounting to more than 400,000 GPU-hours.
arXiv Detail & Related papers (2025-10-15T17:43:03Z)
CurES: From Gradient Analysis to Efficient Curriculum Learning for Reasoning LLMs [53.749193998004166]
Curriculum learning plays a crucial role in enhancing the training efficiency of large language models.<n>We propose CurES, an efficient training method that accelerates convergence and employs Bayesian posterior estimation to minimize computational overhead.
arXiv Detail & Related papers (2025-10-01T15:41:27Z)
Learning to Reason as Action Abstractions with Scalable Mid-Training RL [55.24192942739207]
An effective mid-training phase should identify a compact set of useful actions and enable fast selection.<n>We propose Reasoning as Action Abstractions (RA3), a scalable mid-training algorithm.
arXiv Detail & Related papers (2025-09-30T05:34:20Z)
Scaling Behaviors of LLM Reinforcement Learning Post-Training: An Empirical Study in Mathematical Reasoning [42.80470927369973]
We study how model scale, data volume, and computational budget interact to shape performance.<n>We find that larger models trained for fewer steps consistently outperform smaller models trained for more steps.<n>In data-constrained regimes, repeated reuse of high-quality data proves highly effective.
arXiv Detail & Related papers (2025-09-29T17:10:35Z)
Reinforcement Learning on Pre-Training Data [55.570379963147424]
We introduce Reinforcement Learning on Pre-Training data (R), a new training-time scaling paradigm for optimizing large language models (LLMs)<n>R enables the policy to autonomously explore meaningful trajectories to learn from pre-training data and improve its capability through reinforcement learning (RL)<n>Extensive experiments on both general-domain and mathematical reasoning benchmarks across multiple models validate the effectiveness of R.
arXiv Detail & Related papers (2025-09-23T17:10:40Z)
Towards High Data Efficiency in Reinforcement Learning with Verifiable Reward [54.708851958671794]
We propose a Data-Efficient Policy Optimization pipeline that combines optimized strategies for both offline and online data selection.<n>In offline phase, we curate a high-quality subset of training samples based on diversity, influence, and appropriate difficulty.<n>During online RLVR training, we introduce a sample-level explorability metric to dynamically filter samples with low exploration potential.
arXiv Detail & Related papers (2025-09-01T10:04:20Z)
Scaling Up RL: Unlocking Diverse Reasoning in LLMs via Prolonged Training [121.5858973157225]
We investigate the effects of prolonged reinforcement learning on a small language model across a diverse set of reasoning domains.<n>We introduce controlled KL regularization, clipping ratio, and periodic reference policy resets as critical components for unlocking long-term performance gains.<n>Our model achieves significant improvements over strong baselines, including +14.7% on math, +13.9% on coding, and +54.8% on logic puzzle tasks.
arXiv Detail & Related papers (2025-07-16T17:59:24Z)
Ring-lite: Scalable Reasoning via C3PO-Stabilized Reinforcement Learning for LLMs [51.21041884010009]
Ring-lite is a Mixture-of-Experts (MoE)-based large language model optimized via reinforcement learning (RL)<n>Our approach matches the performance of state-of-the-art (SOTA) small-scale reasoning models on challenging benchmarks.
arXiv Detail & Related papers (2025-06-17T17:12:34Z)
AmorLIP: Efficient Language-Image Pretraining via Amortization [52.533088120633785]
Contrastive Language-Image Pretraining (CLIP) has demonstrated strong zero-shot performance across diverse downstream text-image tasks.<n>We propose AmorLIP, an efficient CLIP pretraining framework that amortizes expensive computations involved in contrastive learning through lightweight neural networks.
arXiv Detail & Related papers (2025-05-25T05:30:37Z)
AceReason-Nemotron: Advancing Math and Code Reasoning through Reinforcement Learning [50.02117478165099]
We show that large-scale reinforcement learning can significantly enhance the reasoning capabilities of strong, small- and mid-sized models.<n>We propose a simple yet effective approach: first training on math-only prompts, then on code-only prompts.
arXiv Detail & Related papers (2025-05-22T08:50:47Z)
Efficient RL Training for Reasoning Models via Length-Aware Optimization [108.88337262486819]
We propose three critical reward designs integrated directly into the reinforcement learning process of large reasoning models.<n>Our method significantly decreases response length while maintaining or even improving performance.
arXiv Detail & Related papers (2025-05-18T07:46:43Z)
LEAD: Iterative Data Selection for Efficient LLM Instruction Tuning [22.242445543184264]
We propose LEAD, an efficient iterative data selection framework that accurately estimates sample utility entirely within the standard training loop.<n>Experiments show that LEAD significantly outperforms state-of-the-art methods, improving average model performance by 6.1%-10.8% while using only 2.5% of the training data and reducing overall training time by 5-10x.
arXiv Detail & Related papers (2025-05-12T10:57:51Z)
From 128K to 4M: Efficient Training of Ultra-Long Context Large Language Models [54.44375226381814]
Long-context capabilities are essential for a wide range of applications, including document and video understanding, in-context learning, and inference-time scaling. We introduce a efficient training recipe for building ultra-long context LLMs from aligned instruct model, pushing the boundaries of context lengths from 128K to 1M, 2M, and 4M tokens. Our approach achieves state-of-the-art performance across a diverse set of long-context benchmarks.
arXiv Detail & Related papers (2025-04-08T16:58:58Z)
Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model [47.108822717757945]
We introduce Open-Reasoner-Zero, the first open source implementation of large-scale reasoning-oriented RL training on the base model.<n>We demonstrate that PPO with GAE and straightforward rule-based rewards, without any KL regularization, is sufficient to scale up both benchmark performance and response length.
arXiv Detail & Related papers (2025-03-31T16:36:05Z)
LESA: Learnable LLM Layer Scaling-Up [57.0510934286449]
Training Large Language Models (LLMs) from scratch requires immense computational resources, making it prohibitively expensive.<n>Model scaling-up offers a promising solution by leveraging the parameters of smaller models to create larger ones.<n>We propose textbfLESA, a novel learnable method for depth scaling-up.
arXiv Detail & Related papers (2025-02-19T14:58:48Z)
Kimi k1.5: Scaling Reinforcement Learning with LLMs [84.95584393629998]
We report on the training practice of Kimi k1.5, our latest multi-modal language model trained with reinforcement learning.<n>Long context scaling and improved policy optimization methods are key ingredients of our approach.<n>Our system achieves state-of-the-art reasoning performance across multiple benchmarks and modalities.
arXiv Detail & Related papers (2025-01-22T02:48:14Z)
Adjoint sharding for very long context training of state space models [7.723642550918118]
Adjoint sharding is a technique that comprises sharding gradient calculation during training to reduce memory requirements by orders of magnitude.<n>We show the proposed adjoint sharding algorithm reduces memory usage by up to 3X with a 1.27B parameter large language model on 1M context length training.<n>This allows to increase the maximum context length during training or fine-tuning of a 1.27B parameter model from 35K tokens to above 100K tokens on a training infrastructure composed of five AWS P4 instances.
arXiv Detail & Related papers (2025-01-01T01:10:59Z)
Does RLHF Scale? Exploring the Impacts From Data, Model, and Method [83.53178716807776]
This study explores the scaling properties of Reinforcement Learning from Human Feedback in Large Language Models. We analyze key components in the RLHF framework--model size, data composition, and inference budget--and their impacts on performance.
arXiv Detail & Related papers (2024-12-08T17:19:48Z)
How to Train Long-Context Language Models (Effectively) [75.5418485597276]
We study continued training and supervised fine-tuning (SFT) of a language model (LM) to make effective use of long-context information. ProLong-8B, which is from Llama-3 and trained on 40B tokens, demonstrates state-of-the-art long-context performance among similarly sized models at a length of 128K.
arXiv Detail & Related papers (2024-10-03T16:46:52Z)
Model-Based Transfer Learning for Contextual Reinforcement Learning [5.5597941107270215]
We introduce Model-Based Transfer Learning to solve contextual RL problems. We show theoretically that the method exhibits sublinear regret in the number of training tasks. We experimentally validate our methods using urban traffic and standard continuous control benchmarks.
arXiv Detail & Related papers (2024-08-08T14:46:01Z)
E^2-LLM: Efficient and Extreme Length Extension of Large Language Models [74.1254067728251]
We propose an Efficient and Extreme length extension method for Large Language Models, called E 2 -LLM, with only one training procedure and dramatically reduced cost. Comprehensive experimental results on multiple benchmark datasets demonstrate the effectiveness of our E 2 -LLM on challenging long-context tasks.
arXiv Detail & Related papers (2024-01-13T02:11:20Z)
Q-Ensemble for Offline RL: Don't Scale the Ensemble, Scale the Batch Size [58.762959061522736]
We show that scaling mini-batch sizes with appropriate learning rate adjustments can speed up the training process by orders of magnitude. We show that scaling the mini-batch size and naively adjusting the learning rate allows for (1) a reduced size of the Q-ensemble, (2) stronger penalization of out-of-distribution actions, and (3) improved convergence time.
arXiv Detail & Related papers (2022-11-20T21:48:25Z)
Dataset Pruning: Reducing Training Data by Examining Generalization Influence [30.30255670341501]
Do all training data contribute to model's performance? How to construct a smallest subset from the entire training data as a proxy training set without significantly sacrificing the model's performance?
arXiv Detail & Related papers (2022-05-19T05:36:35Z)
Online Convolutional Re-parameterization [51.97831675242173]
We present online convolutional re- parameterization (OREPA), a two-stage pipeline, aiming to reduce the huge training overhead by squeezing the complex training-time block into a single convolution. Compared with the state-of-the-art re-param models, OREPA is able to save the training-time memory cost by about 70% and accelerate the training speed by around 2x. We also conduct experiments on object detection and semantic segmentation and show consistent improvements on the downstream tasks.
arXiv Detail & Related papers (2022-04-02T09:50:19Z)
Pretraining & Reinforcement Learning: Sharpening the Axe Before Cutting the Tree [2.0142516017086165]
Pretraining is a common technique in deep learning for increasing performance and reducing training time. We evaluate the effectiveness of pretraining for RL tasks, with and without distracting backgrounds, using both large, publicly available datasets and case-by-case generated datasets. Results suggest filters learned during training on less relevant datasets render pretraining ineffective, while filters learned during training on the in-distribution datasets reliably reduce RL training time and improve performance after 80k RL training steps.
arXiv Detail & Related papers (2021-10-06T04:25:14Z)
Curriculum Learning: A Regularization Method for Efficient and Stable Billion-Scale GPT Model Pre-Training [18.640076155697415]
We present a study of a curriculum learning based approach, which helps improve the pre-training convergence speed of autoregressive models. Our evaluations demonstrate that curriculum learning enables training GPT-2 models with 8x larger batch size and 4x larger learning rate.
arXiv Detail & Related papers (2021-08-13T06:32:53Z)
Jigsaw Clustering for Unsupervised Visual Representation Learning [68.09280490213399]
We propose a new jigsaw clustering pretext task in this paper. Our method makes use of information from both intra- and inter-images. It is even comparable to the contrastive learning methods when only half of training batches are used.
arXiv Detail & Related papers (2021-04-01T08:09:26Z)
Learning to Prune Deep Neural Networks via Reinforcement Learning [64.85939668308966]
PuRL is a deep reinforcement learning based algorithm for pruning neural networks. It achieves sparsity and accuracy comparable to current state-of-the-art methods.
arXiv Detail & Related papers (2020-07-09T13:06:07Z)

This list is automatically generated from the titles and abstracts of the papers in this site.