SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild
- URL: http://arxiv.org/abs/2503.18892v1
- Date: Mon, 24 Mar 2025 17:06:10 GMT
- Title: SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild
- Authors: Weihao Zeng, Yuzhen Huang, Qian Liu, Wei Liu, Keqing He, Zejun Ma, Junxian He,
- Abstract summary: Long chain-of-thought (CoT) reasoning can naturally emerge through a simple reinforcement learning framework with rule-based rewards.<n>We investigate zero RL training across 10 diverse base models including LLama3-8B, Mistral-7B/24B, DeepSeek-Math-7B, Qwen2.5-math-7B, and all Qwen2.5 models from 0.5B to 32B.
- Score: 46.25416990387885
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: DeepSeek-R1 has shown that long chain-of-thought (CoT) reasoning can naturally emerge through a simple reinforcement learning (RL) framework with rule-based rewards, where the training may directly start from the base models-a paradigm referred to as zero RL training. Most recent efforts to reproduce zero RL training have primarily focused on the Qwen2.5 model series, which may not be representative as we find the base models already exhibit strong instruction-following and self-reflection abilities. In this work, we investigate zero RL training across 10 diverse base models, spanning different families and sizes including LLama3-8B, Mistral-7B/24B, DeepSeek-Math-7B, Qwen2.5-math-7B, and all Qwen2.5 models from 0.5B to 32B. Leveraging several key design strategies-such as adjusting format reward and controlling query difficulty-we achieve substantial improvements in both reasoning accuracy and response length across most settings. However, by carefully monitoring the training dynamics, we observe that different base models exhibit distinct patterns during training. For instance, the increased response length does not always correlate with the emergence of certain cognitive behaviors such as verification (i.e., the "aha moment"). Notably, we observe the "aha moment" for the first time in small models not from the Qwen family. We share the key designs that enable successful zero RL training, along with our findings and practices. To facilitate further research, we open-source the code, models, and analysis tools.
Related papers
- Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? [67.30809748319486]
Reinforcement Learning with Verifiable Rewards (RLVR) has recently demonstrated notable success in enhancing the reasoning capabilities of LLMs.
We re-examine this assumption by measuring the pass@textitk metric with large values of textitk to explore the reasoning capability boundary of the models.
We find that the RL does emphnot, in fact, elicit fundamentally new reasoning patterns.
arXiv Detail & Related papers (2025-04-18T17:59:56Z) - Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining [74.83412846804977]
Reinforcement learning (RL)-based fine-tuning has become a crucial step in post-training language models.
We present a systematic end-to-end study of RL fine-tuning for mathematical reasoning by training models entirely from scratch.
arXiv Detail & Related papers (2025-04-10T17:15:53Z) - Understanding R1-Zero-Like Training: A Critical Perspective [38.515771096651356]
We critically examine R1-Zero-like training by analyzing its two core components: base models and RL.
We investigate a wide range of base models, including DeepSeek-V3-Base, to understand how pretraining characteristics influence RL performance.
We present a minimalist R1-Zero recipe that achieves 43.3% accuracy on AIME 2024 with a 7B base model.
arXiv Detail & Related papers (2025-03-26T17:59:14Z) - R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model [70.77691645678804]
We present the first successful replication of emergent characteristics for multimodal reasoning on only a non-SFT 2B model.<n>Our model achieves 59.47% accuracy on CVBench, outperforming the base model by approximately 30% and exceeding both SFT setting by 2%.<n>In addition, we share our failed attempts and insights in attempting to achieve R1-like reasoning using RL with instruct models.
arXiv Detail & Related papers (2025-03-07T04:21:47Z) - An Empirical Study on Eliciting and Improving R1-like Reasoning Models [90.52239241349504]
scaling RL training has become a central technique for implementing such reasoning models.<n>We demonstrate that our RL training approach consistently improves the Qwen2.5-32B base models.<n>We also explore the use of tool manipulation, finding that it significantly boosts the reasoning performance of large reasoning models.
arXiv Detail & Related papers (2025-03-06T15:34:27Z) - Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling [52.34735382627312]
Large language models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks.<n>Existing approaches mainly rely on imitation learning and struggle to achieve effective test-time scaling.<n>We present T1 to scale reinforcement learning by encouraging exploration and understand inference scaling.
arXiv Detail & Related papers (2025-01-20T18:33:33Z) - Model-based Offline Reinforcement Learning with Lower Expectile Q-Learning [6.345851712811528]
We introduce a novel model-based offline RL method, Lower Expectile Q-learning (LEQ)<n>LEQ provides a low-bias model-based value estimation via lower expectile regression of $lambda$-returns.<n>Our studies demonstrate that lower expectile regression, $lambda$-returns, and critic training on offline data are all crucial for LEQ.
arXiv Detail & Related papers (2024-06-30T13:44:59Z) - Improving Large Language Models via Fine-grained Reinforcement Learning with Minimum Editing Constraint [104.53687944498155]
Reinforcement learning (RL) has been widely used in training large language models (LLMs)
We propose a new RL method named RLMEC that incorporates a generative model as the reward model.
Based on the generative reward model, we design the token-level RL objective for training and an imitation-based regularization for stabilizing RL process.
arXiv Detail & Related papers (2024-01-11T17:58:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.