PretrainZero: Reinforcement Active Pretraining
- URL: http://arxiv.org/abs/2512.03442v1
- Date: Wed, 03 Dec 2025 04:51:32 GMT
- Title: PretrainZero: Reinforcement Active Pretraining
- Authors: Xingrun Xing, Zhiyuan Fan, Jie Lou, Guoqi Li, Jiajun Zhang, Debing Zhang
- Abstract summary: We propose PretrainZero, a reinforcement active learning framework built on the pretraining corpus. PretrainZero learns a unified reasoning policy that actively identifies reasonable and informative content in the pretraining corpus and reasons to predict that content via RL. In post-training, the pretrained models can also serve as reasoning foundation models for downstream RLVR tasks.
- Score: 43.0311336005895
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Mimicking human behavior by actively learning from general experience to achieve artificial general intelligence has long been a human dream. Recent reinforcement learning (RL) based large-thinking models demonstrate impressive expert-level abilities, e.g., in software and math, but still rely heavily on verifiable rewards in specific domains, creating a significant bottleneck for extending the performance boundary of general reasoning capabilities. In this work, we propose PretrainZero, a reinforcement active learning framework built on the pretraining corpus that extends RL from domain-specific post-training to general pretraining. PretrainZero has the following characteristics: 1) Active pretraining: inspired by the active learning ability of humans, PretrainZero learns a unified reasoning policy that actively identifies reasonable and informative content in the pretraining corpus and reasons to predict that content via RL. 2) Self-supervised learning: without any verifiable labels, pretrained reward models, or supervised fine-tuning, we directly pretrain reasoners from 3B to 30B base models on the general Wikipedia corpus using RL, significantly breaking the verification data wall for general reasoning. 3) Verification scaling: by tackling increasingly challenging masked spans, PretrainZero substantially enhances the general reasoning abilities of pretrained base models. In reinforcement pretraining, PretrainZero improves Qwen3-4B-Base by 8.43, 5.96, and 10.60 points on MMLU-Pro, SuperGPQA, and averaged math benchmarks, respectively. In post-training, the pretrained models can also serve as reasoning foundation models for downstream RLVR tasks.
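The masked-span loop described in points 1) to 3) can be pictured concretely. Below is a minimal, self-contained sketch of one active mask-and-predict step as we read it from the abstract: pick an informative span, mask it, sample predictions, and reward agreement with the hidden text. The rarity-based span scorer, the random stand-in policy, and the token-F1 reward are illustrative assumptions, not the paper's implementation.

```python
import random
from collections import Counter

def token_f1(pred: list[str], gold: list[str]) -> float:
    """Token-level F1 between a predicted span and the masked gold span.
    Illustrative self-supervised reward; the paper's exact reward may differ."""
    if not pred or not gold:
        return 0.0
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

def select_span(tokens: list[str], width: int = 3) -> int:
    """Stand-in for the learned selection policy: pick the window whose
    tokens are rarest in the passage, a crude proxy for 'informative'."""
    counts = Counter(tokens)
    scores = [sum(1.0 / counts[t] for t in tokens[i:i + width])
              for i in range(len(tokens) - width)]
    return max(range(len(scores)), key=scores.__getitem__)

def rollout_rewards(tokens, start, width, policy_sample, k=4):
    """Sample k predictions for the masked span and score each one.
    `policy_sample` stands in for the base LM generating with reasoning."""
    gold = tokens[start:start + width]
    context = tokens[:start] + ["<mask>"] * width + tokens[start + width:]
    rewards = [token_f1(policy_sample(context, width), gold) for _ in range(k)]
    baseline = sum(rewards) / k              # group-mean baseline (GRPO-style)
    advantages = [r - baseline for r in rewards]
    return rewards, advantages

# Toy usage with a random "policy"; a real run would sample from an LM.
passage = ("the transformer architecture relies on self attention "
           "to model long range dependencies in text").split()
start = select_span(passage)
vocab = list(set(passage))
policy = lambda ctx, w: random.choices(vocab, k=w)
rewards, advs = rollout_rewards(passage, start, 3, policy)
print("masked span:", passage[start:start + 3])
print("rewards:", [round(r, 2) for r in rewards])
```

In an actual run the advantages would feed a policy-gradient update of the language model; here they are only computed to show the shape of the reward signal.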
Related papers
- ReMiT: RL-Guided Mid-Training for Iterative LLM Evolution [49.496216822640974]
We analyze training dynamics and identify the mid-training phase as a critical turning point for model capabilities. We introduce ReMiT (Reinforcement Learning-Guided Mid-Training), which reweights tokens during the mid-training phase, prioritizing those pivotal for reasoning.
arXiv Detail & Related papers (2026-02-03T04:04:41Z) - On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models [73.10315509190623]
Recent reinforcement learning techniques have yielded impressive reasoning improvements in language models. It remains unclear whether post-training truly extends a model's reasoning ability beyond what it acquires during pre-training. We develop a fully controlled experimental framework that isolates the causal contributions of pre-training, mid-training, and RL-based post-training.
arXiv Detail & Related papers (2025-12-08T18:12:10Z) - Zero Reinforcement Learning Towards General Domains [27.62364890827269]
We propose a novel zero-RL paradigm designed to improve a model's reasoning ability across both verifiable and non-verifiable domains. By combining verifiable rewards with a generative reward model, we conduct multi-task zero-RL training across both domains. Experimental results on Qwen3-8B-Base and Qwen3-14B-Base demonstrate that our approach achieves superior reasoning performance.
arXiv Detail & Related papers (2025-10-29T13:52:44Z) - From Acceleration to Saturation: Scaling Behavior of Bootstrapped Language Model Pretraining [2.569647910019739]
We study the scaling behavior of bootstrapped pretraining and find that its scaling efficiency diminishes in a predictable manner. Our findings provide practical insights for efficient language model training and raise important considerations for the reuse of overtrained models.
arXiv Detail & Related papers (2025-10-08T00:59:33Z) - RLP: Reinforcement as a Pretraining Objective [103.45068938532923]
We present an information-driven reinforcement pretraining objective that brings the core spirit of reinforcement learning -- exploration -- to the last phase of pretraining. This training objective essentially encourages the model to think for itself before predicting what comes next, thus teaching independent thinking behavior earlier in pretraining. Specifically, RLP reframes reinforcement learning for reasoning as a pretraining objective on ordinary text, bridging the gap between next-token prediction and the emergence of useful chain-of-thought reasoning (a sketch of this think-then-predict reward appears after this list).
arXiv Detail & Related papers (2025-09-26T17:53:54Z) - Reinforcement Learning on Pre-Training Data [55.570379963147424]
We introduce Reinforcement Learning on Pre-Training data (RLPT), a new training-time scaling paradigm for optimizing large language models (LLMs). RLPT enables the policy to autonomously explore meaningful trajectories to learn from pre-training data and improve its capability through reinforcement learning (RL). Extensive experiments on both general-domain and mathematical reasoning benchmarks across multiple models validate the effectiveness of RLPT.
arXiv Detail & Related papers (2025-09-23T17:10:40Z) - Understanding R1-Zero-Like Training: A Critical Perspective [73.25430192337235]
We critically examine R1-Zero-like training by analyzing its two core components: base models and RL. We investigate a wide range of base models, including DeepSeek-V3-Base, to understand how pretraining characteristics influence RL performance. We present a minimalist R1-Zero recipe that achieves 43.3% accuracy on AIME 2024 with a 7B base model.
arXiv Detail & Related papers (2025-03-26T17:59:14Z) - Adversarial Robustness: From Self-Supervised Pre-Training to Fine-Tuning [134.15174177472807]
We introduce adversarial training into self-supervision to provide general-purpose robust pre-trained models for the first time.
We conduct extensive experiments to demonstrate that the proposed framework achieves large performance margins.
arXiv Detail & Related papers (2020-03-28T18:28:33Z)
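Several entries above (RLP, RLPT, and PretrainZero itself) share one idea: reward a model for reasoning before it predicts ordinary pretraining text. Below is a minimal sketch of RLP's information-gain flavor of that reward, as we read it from the abstract: a sampled thought is scored by how much it raises the likelihood of the true continuation over a no-thought baseline. The `lm_logprob` scorer is a hypothetical stand-in for querying the base LM, and the exact estimator in the paper may differ.

```python
import math

def information_gain_reward(logprob_with_cot: float,
                            logprob_without_cot: float) -> float:
    """Reward a thought by the log-likelihood lift it gives the true
    next segment, relative to predicting with no thought at all."""
    return logprob_with_cot - logprob_without_cot

def score_thoughts(lm_logprob, context: str, continuation: str,
                   thoughts: list[str]) -> list[float]:
    """`lm_logprob(prefix, target)` is a hypothetical scorer returning
    log p(target | prefix) under the base LM."""
    baseline = lm_logprob(context, continuation)
    return [information_gain_reward(lm_logprob(context + t, continuation),
                                    baseline)
            for t in thoughts]

# Toy usage with a fake scorer: thoughts that mention the continuation's
# first token make it more likely, so they earn a positive reward.
fake = lambda prefix, target: math.log(0.5 if target.split()[0] in prefix
                                       else 0.1)
rs = score_thoughts(fake, "Water boils at ", "100 degrees Celsius",
                    ["think: sea-level boiling point is 100 C",
                     "think: something unrelated"])
print([round(r, 2) for r in rs])  # helpful thought > 0, unrelated thought = 0
```

The same scaffold applies to PretrainZero's masked spans or RLPT's next-segment prediction by swapping in the appropriate target text; only the reward definition changes.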