RLP: Reinforcement as a Pretraining Objective
- URL: http://arxiv.org/abs/2510.01265v1
- Date: Fri, 26 Sep 2025 17:53:54 GMT
- Title: RLP: Reinforcement as a Pretraining Objective
- Authors: Ali Hatamizadeh, Syeda Nahida Akter, Shrimai Prabhumoye, Jan Kautz, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Yejin Choi
- Abstract summary: We present an information-driven reinforcement pretraining objective that brings the core spirit of reinforcement learning -- exploration -- to the last phase of pretraining. This objective encourages the model to think for itself before predicting what comes next, teaching independent thinking behavior earlier in pretraining. Specifically, RLP reframes reinforcement learning for reasoning as a pretraining objective on ordinary text, bridging the gap between next-token prediction and the emergence of useful chain-of-thought reasoning.
- Score: 103.45068938532923
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The dominant paradigm for training large reasoning models starts with pre-training using a next-token prediction loss on vast amounts of data. Reinforcement learning, while powerful for scaling reasoning, is introduced only as the very last phase of post-training, preceded by supervised fine-tuning. While dominant, is this an optimal way of training? In this paper, we present RLP, an information-driven reinforcement pretraining objective that brings the core spirit of reinforcement learning -- exploration -- to the last phase of pretraining. The key idea is to treat chain-of-thought as an exploratory action, with rewards computed based on the information gain it provides for predicting future tokens. This training objective essentially encourages the model to think for itself before predicting what comes next, teaching independent thinking behavior earlier in pretraining. More concretely, the reward signal measures the increase in log-likelihood of the next token when conditioning on both the context and a sampled reasoning chain, compared to conditioning on the context alone. This approach yields a verifier-free, dense reward signal, allowing efficient training over the full document stream during pretraining. Specifically, RLP reframes reinforcement learning for reasoning as a pretraining objective on ordinary text, bridging the gap between next-token prediction and the emergence of useful chain-of-thought reasoning. Pretraining with RLP on Qwen3-1.7B-Base lifts the overall average across an eight-benchmark math-and-science suite by 19%. With identical post-training, the gains compound, with the largest improvements on reasoning-heavy tasks such as AIME25 and MMLU-Pro. Applying RLP to the hybrid Nemotron-Nano-12B-v2 increases the overall average from 42.81% to 61.32% and raises the average on scientific reasoning by 23%, demonstrating scalability across architectures and model sizes.
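The abstract's reward definition can be sketched in a few lines: score the next token once with the sampled chain-of-thought appended to the context and once without it, and reward the difference in log-likelihood. This is a minimal illustrative sketch, not the paper's implementation; the toy scorer below stands in for the language model, and all function and variable names are assumptions.

```python
import math

def log_prob_next_token(context: str, next_token: str) -> float:
    """Toy stand-in for model log P(next_token | context).

    Scores the token higher when the context mentions it, loosely mimicking
    how a sampled reasoning chain can make the next token more predictable.
    In practice this would be the log-probability from the model being
    pretrained.
    """
    base = 0.05  # low prior probability for any token
    boost = 0.40 if next_token.lower() in context.lower() else 0.0
    return math.log(base + boost)

def rlp_reward(context: str, reasoning_chain: str, next_token: str) -> float:
    """Dense, verifier-free reward: the increase in next-token log-likelihood
    from conditioning on the sampled chain-of-thought as well as the context.
    """
    with_cot = log_prob_next_token(context + " " + reasoning_chain, next_token)
    without_cot = log_prob_next_token(context, next_token)
    return with_cot - without_cot

context = "The sum of the first 10 positive integers is"
chain = "n(n+1)/2 with n = 10 gives 55"
print(rlp_reward(context, chain, "55"))                 # positive: the chain helps
print(rlp_reward(context, "irrelevant thought", "55"))  # zero: no information gain
```

Because the reward needs only the model's own next-token log-probabilities, it can be computed for every position in an ordinary document stream, which is what makes the signal dense and verifier-free.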
Related papers
- ReMiT: RL-Guided Mid-Training for Iterative LLM Evolution [49.496216822640974]
We analyze training dynamics and identify the mid-training phase as a critical turning point for model capabilities. We introduce ReMiT (Reinforcement Learning-Guided Mid-Training), which prioritizes tokens during the mid-training phase that are pivotal for reasoning.
arXiv Detail & Related papers (2026-02-03T04:04:41Z)
- PretrainZero: Reinforcement Active Pretraining [43.0311336005895]
We propose PretrainZero, a reinforcement active learning framework built on the pretraining corpus. PretrainZero learns a unified reasoning policy to actively identify reasonable and informative content in the pretraining corpus. In post-training, the pretrained models can also serve as reasoning foundation models for downstream RLVR tasks.
arXiv Detail & Related papers (2025-12-03T04:51:32Z)
- Midtraining Bridges Pretraining and Posttraining Distributions [73.84346031272473]
"Midtraining" is a phase in which higher-quality, often instruction-formatted data is mixed in at the end of pretraining. We conduct the first systematic investigation of midtraining through experiments with language models pretrained from scratch. We find that, when compared after supervised fine-tuning, the effectiveness of midtraining is highest in the math and code domains.
arXiv Detail & Related papers (2025-10-16T16:39:52Z)
- Reinforcement Mid-Training [16.826401071555704]
We propose a framework for efficient, adaptive, and unified reinforcement mid-training. We show that RMT achieves up to +64.91% performance improvement with only 21% of the reasoning length in language modeling. We also show that checkpoints obtained after reinforcement mid-training can benefit the subsequent post-training, yielding up to +18.76% improvement in the mathematical domain.
arXiv Detail & Related papers (2025-09-29T07:21:24Z)
- Reinforcement Learning on Pre-Training Data [55.570379963147424]
We introduce Reinforcement Learning on Pre-Training data (RLPT), a new training-time scaling paradigm for optimizing large language models (LLMs). RLPT enables the policy to autonomously explore meaningful trajectories to learn from pre-training data and improve its capability through reinforcement learning (RL). Extensive experiments on both general-domain and mathematical reasoning benchmarks across multiple models validate the effectiveness of RLPT.
arXiv Detail & Related papers (2025-09-23T17:10:40Z)
- BOW: Reinforcement Learning for Bottlenecked Next Word Prediction [9.219154888448736]
We present BOttlenecked next-Word prediction (BOW), an RL formulation of next-word prediction (NWP). BOW is a viable alternative to vanilla NWP, inducing explicit next-word reasoning and strengthening general reasoning ability.
arXiv Detail & Related papers (2025-06-16T13:58:54Z)
- Reinforcement Pre-Training [78.5355979575498]
We introduce Reinforcement Pre-Training (RPT) as a new scaling paradigm for large language models and reinforcement learning (RL). RPT offers a scalable method to leverage vast amounts of text data for general-purpose RL, rather than relying on domain-specific annotated answers. The results position RPT as an effective and promising scaling paradigm to advance language model pre-training.
arXiv Detail & Related papers (2025-06-09T17:59:53Z)
- The Journey Matters: Average Parameter Count over Pre-training Unifies Sparse and Dense Scaling Laws [51.608402959163925]
We present the first systematic exploration of optimal sparse pre-training configurations for large language models. We find that initiating pruning at 25% of total training compute and concluding at 75% achieves near-optimal final evaluation loss. We propose a new scaling law that modifies the Chinchilla scaling law to use the average parameter count over pre-training.
arXiv Detail & Related papers (2025-01-21T20:23:22Z)
- The Fine Line: Navigating Large Language Model Pretraining with Down-streaming Capability Analysis [27.310894780313618]
This paper undertakes a comprehensive comparison of model capabilities at various pretraining intermediate checkpoints.
We confirm that specific downstream metrics exhibit similar training dynamics across models of different sizes.
In addition to our core findings, we've reproduced Amber and OpenLLaMA, releasing their intermediate checkpoints.
arXiv Detail & Related papers (2024-04-01T16:00:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.