Webscale-RL: Automated Data Pipeline for Scaling RL Data to Pretraining Levels
- URL: http://arxiv.org/abs/2510.06499v1
- Date: Tue, 07 Oct 2025 22:30:59 GMT
- Title: Webscale-RL: Automated Data Pipeline for Scaling RL Data to Pretraining Levels
- Authors: Zhepeng Cen, Haolin Chen, Shiyu Wang, Zuxin Liu, Zhiwei Liu, Ding Zhao, Silvio Savarese, Caiming Xiong, Huan Wang, Weiran Yao
- Abstract summary: We introduce the Webscale-RL pipeline, a scalable data engine for reinforcement learning. We construct the Webscale-RL dataset, containing 1.2 million examples across more than 9 domains. Our work presents a viable path toward scaling RL to pre-training levels, enabling more capable and efficient language models.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large Language Models (LLMs) have achieved remarkable success through imitation learning on vast text corpora, but this paradigm creates a training-generation gap and limits robust reasoning. Reinforcement learning (RL) offers a more data-efficient solution capable of bridging this gap, yet its application has been constrained by a critical data bottleneck: existing RL datasets are orders of magnitude smaller and less diverse than web-scale pre-training corpora. To address this, we introduce the Webscale-RL pipeline, a scalable data engine that systematically converts large-scale pre-training documents into millions of diverse, verifiable question-answer pairs for RL. Using this pipeline, we construct the Webscale-RL dataset, containing 1.2 million examples across more than 9 domains. Our experiments show that the model trained on this dataset significantly outperforms continual pretraining and strong data refinement baselines across a suite of benchmarks. Notably, RL training with our dataset proves substantially more efficient, achieving the performance of continual pre-training with up to 100$\times$ fewer tokens. Our work presents a viable path toward scaling RL to pre-training levels, enabling more capable and efficient language models.
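The abstract's core mechanism is turning ordinary pre-training documents into verifiable question-answer pairs that an RL reward can check. Below is a minimal sketch of that conversion step; the prompt template, the `generate` hook, and the exact-match grounding filter are illustrative assumptions, not the authors' released pipeline.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class RLExample:
    question: str  # verifiable question grounded in the source document
    answer: str    # short reference answer the RL reward checks against
    source: str    # provenance: the pre-training document it came from

QA_PROMPT = (
    "Read the document below and write ONE question whose answer is short, "
    "objective, and checkable against the text. Reply exactly as:\n"
    "Q: <question>\nA: <answer>\n\nDocument:\n{doc}"
)

def document_to_rl_example(doc: str, generate: Callable[[str], str]) -> Optional[RLExample]:
    """Convert one pre-training document into a verifiable QA pair.

    `generate` is any text-generation callable (e.g., a wrapper around a
    chat-completion API); it is a placeholder, not the paper's pipeline.
    """
    reply = generate(QA_PROMPT.format(doc=doc))
    lines = [line.strip() for line in reply.splitlines() if line.strip()]
    question = next((l[2:].strip() for l in lines if l.startswith("Q:")), None)
    answer = next((l[2:].strip() for l in lines if l.startswith("A:")), None)
    if not question or not answer:
        return None  # malformed generation: drop it
    # Crude verifiability filter: only keep pairs whose answer appears
    # verbatim in the document, so an exact-match reward is computable.
    if answer.lower() not in doc.lower():
        return None
    return RLExample(question=question, answer=answer, source=doc)
```

Applied over a web-scale corpus and combined with stronger filtering, this kind of conversion is what would let QA generation approach pre-training volumes.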
Related papers
- Theoretical Perspectives on Data Quality and Synergistic Effects in Pre- and Post-Training Reasoning Models [56.12341509545198]
Large Language Models (LLMs) are pretrained on massive datasets and later instruction-tuned via supervised fine-tuning (SFT) or reinforcement learning (RL). Best practices emphasize large, diverse pretraining data, whereas post-training operates differently. We theoretically analyze transformers trained on an in-context weight prediction task for linear regression.
arXiv Detail & Related papers (2026-03-01T21:58:09Z)
- Discover, Learn, and Reinforce: Scaling Vision-Language-Action Pretraining with Diverse RL-Generated Trajectories [33.872433985210876]
Scaling vision-language-action (VLA) model pre-training requires large volumes of diverse, high-quality manipulation trajectories. We propose Discover, Learn and Reinforce, which generates multiple distinct, high-success behavioral patterns for VLA pretraining. When adapted to unseen downstream task suites, VLA models pretrained on our diverse RL data surpass counterparts trained on equal-sized standard RL datasets.
arXiv Detail & Related papers (2025-11-24T07:54:49Z)
- Towards High Data Efficiency in Reinforcement Learning with Verifiable Reward [54.708851958671794]
We propose a Data-Efficient Policy Optimization pipeline that combines optimized strategies for both offline and online data selection. In the offline phase, we curate a high-quality subset of training samples based on diversity, influence, and appropriate difficulty. During online RLVR training, we introduce a sample-level explorability metric to dynamically filter samples with low exploration potential; a minimal sketch of this filtering idea follows the entry.
arXiv Detail & Related papers (2025-09-01T10:04:20Z)
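A minimal sketch of the online filtering idea in the entry above: keep only prompts whose rollouts are neither always solved nor always failed. The pass-rate band used here is a stand-in for the paper's sample-level explorability metric, which is not reproduced.

```python
from typing import Callable, List, Sequence

def filter_by_explorability(
    prompts: Sequence[str],
    rollout_pass: Callable[[str], List[bool]],
    low: float = 0.1,
    high: float = 0.9,
) -> List[str]:
    """Keep prompts whose rollouts sometimes succeed and sometimes fail.

    `rollout_pass` returns pass/fail outcomes for several rollouts of one
    prompt; the (low, high) pass-rate band only illustrates the notion of
    "exploration potential" and is not the paper's metric.
    """
    kept = []
    for prompt in prompts:
        outcomes = rollout_pass(prompt)
        rate = sum(outcomes) / max(len(outcomes), 1)
        if low <= rate <= high:  # neither trivially solved nor hopeless
            kept.append(prompt)
    return kept
```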
- Scaling DRL for Decision Making: A Survey on Data, Network, and Training Budget Strategies [66.83950068218033]
Scaling Laws demonstrate that scaling model parameters and training data enhances learning performance. Despite its potential to improve performance, the integration of scaling laws into deep reinforcement learning has not been fully realized. This review addresses this gap by systematically analyzing scaling strategies in three dimensions: data, network, and training budget.
arXiv Detail & Related papers (2025-08-05T08:03:12Z)
- Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining [74.83412846804977]
Reinforcement learning (RL)-based fine-tuning has become a crucial step in post-training language models. We present a systematic end-to-end study of RL fine-tuning for mathematical reasoning by training models entirely from scratch.
arXiv Detail & Related papers (2025-04-10T17:15:53Z)
- LIMR: Less is More for RL Scaling [25.477841726836836]
We introduce Learning Impact Measurement (LIM), an automated method to evaluate and prioritize training samples; a sketch of this prioritization idea follows the entry. Our method achieves comparable or even superior performance using only 1,389 samples versus the full 8,523-sample dataset. For reproducible research and future innovation, we are open-sourcing LIMR, including the implementation of LIM, training and evaluation code, curated datasets, and trained models.
arXiv Detail & Related papers (2025-02-17T15:13:29Z)
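A minimal sketch of the sample-prioritization idea in the LIMR entry above. Scoring each sample by how closely its reward trajectory tracks the average training trajectory is one plausible reading of a "learning impact" measure; the exact LIM formula is not reproduced here.

```python
import math
from typing import List, Sequence

def learning_impact_scores(sample_rewards: Sequence[Sequence[float]]) -> List[float]:
    """Score each sample by how closely its reward trajectory across training
    checkpoints follows the average trajectory over all samples.

    Cosine alignment with the mean curve is an illustrative stand-in for
    LIM, not its exact definition.
    """
    n_checkpoints = len(sample_rewards[0])
    mean_curve = [
        sum(traj[t] for traj in sample_rewards) / len(sample_rewards)
        for t in range(n_checkpoints)
    ]

    def cosine(a: Sequence[float], b: Sequence[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a)) or 1.0
        norm_b = math.sqrt(sum(y * y for y in b)) or 1.0
        return dot / (norm_a * norm_b)

    return [cosine(traj, mean_curve) for traj in sample_rewards]
```

Ranking samples by such a score and keeping, say, the 1,389 highest-scoring of 8,523 would mirror the budget quoted in the abstract.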
- Scaling Retrieval-Based Language Models with a Trillion-Token Datastore [85.4310806466002]
We find that increasing the size of the datastore used by a retrieval-based LM monotonically improves language modeling and several downstream tasks without obvious saturation.
By plotting compute-optimal scaling curves with varied datastore, model, and pretraining data sizes, we show that using larger datastores can significantly improve model performance for the same training compute budget.
arXiv Detail & Related papers (2024-07-09T08:27:27Z)
- When Less is More: Investigating Data Pruning for Pretraining LLMs at Scale [12.94829977468838]
Large volumes of text data have contributed significantly to the development of large language models.
To date, efforts to prune datasets down to a higher quality subset have relied on hand-crafted heuristics encoded as rule-based filters.
We take a wider view and explore scalable estimates of data quality that can be used to measure the quality of pretraining data.
arXiv Detail & Related papers (2023-09-08T19:34:05Z)
- D4: Improving LLM Pretraining via Document De-Duplication and Diversification [38.84592304799403]
We show that careful data selection via pre-trained model embeddings can speed up training; a sketch of this selection idea follows the entry. We also show that intelligently repeating data consistently outperforms baseline training.
arXiv Detail & Related papers (2023-08-23T17:58:14Z)
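A minimal sketch of embedding-based de-duplication and diversification in the spirit of the D4 entry above, assuming document embeddings from a pre-trained model are already available. The similarity threshold and the greedy farthest-point pass are illustrative choices, not the paper's exact clustering procedure.

```python
from typing import List, Optional

import numpy as np

def dedup_and_diversify(
    embeddings: np.ndarray,       # shape (n_docs, dim), from a pre-trained encoder
    sim_threshold: float = 0.9,   # near-duplicate cutoff (assumed value)
    n_keep: Optional[int] = None,
) -> List[int]:
    """Greedy near-duplicate removal in embedding space, followed by a
    farthest-point pass for diversity. A sketch of the D4 idea, not the
    paper's exact procedure.
    """
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept: List[int] = []
    for i in range(len(unit)):
        # Drop documents too similar to anything already kept.
        if all(float(unit[i] @ unit[j]) < sim_threshold for j in kept):
            kept.append(i)
    if n_keep is None or n_keep >= len(kept):
        return kept
    # Diversify: greedily add the document farthest from the current selection.
    selected = [kept[0]]
    candidates = kept[1:]
    while len(selected) < n_keep and candidates:
        distances = [
            min(1.0 - float(unit[i] @ unit[s]) for s in selected)
            for i in candidates
        ]
        selected.append(candidates.pop(int(np.argmax(distances))))
    return selected
```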