Related papers: Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model

URL: http://arxiv.org/abs/2503.24290v1
Date: Mon, 31 Mar 2025 16:36:05 GMT
Title: Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model
Authors: Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, Xiangyu Zhang, Heung-Yeung Shum
Abstract summary: We introduce Open-Reasoner-Zero, the first open source implementation of large-scale reasoning-oriented RL training. In the spirit of open source, we release our source code, parameter settings, training data, and model weights across various sizes.
Score: 47.108822717757945
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We introduce Open-Reasoner-Zero, the first open source implementation of large-scale reasoning-oriented RL training focusing on scalability, simplicity and accessibility. Through extensive experiments, we demonstrate that a minimalist approach, vanilla PPO with GAE ($\lambda=1$, $\gamma=1$) and straightforward rule-based rewards, without any KL regularization, is sufficient to scale up both response length and benchmark performance, similar to the phenomenon observed in DeepSeek-R1-Zero. Using the same base model as DeepSeek-R1-Zero-Qwen-32B, our implementation achieves superior performance on the AIME2024, MATH500, and GPQA Diamond benchmarks while demonstrating remarkable efficiency, requiring only a tenth of the training steps compared to the DeepSeek-R1-Zero pipeline. In the spirit of open source, we release our source code, parameter settings, training data, and model weights across various sizes.
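
The advantage computation named in the abstract can be illustrated with a minimal sketch. It is not the authors' released code: the function names `gae_advantages` and `rule_based_reward`, the exact-match reward rule, and the toy value estimates are all assumptions made for illustration. It shows that with $\lambda=1$ and $\gamma=1$, GAE reduces to the undiscounted sum of future rewards minus the critic's value estimate at each token.

```python
from typing import List


def rule_based_reward(answer: str, reference: str) -> float:
    """Hypothetical rule-based reward: 1.0 on exact match, else 0.0 (an assumption)."""
    return 1.0 if answer.strip() == reference.strip() else 0.0


def gae_advantages(
    rewards: List[float],
    values: List[float],
    gamma: float = 1.0,
    lam: float = 1.0,
) -> List[float]:
    """Generalized Advantage Estimation over one finished response.

    values[t] is the critic's estimate at token t; the value after the
    final token is taken to be 0 because the episode has ended.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        # One-step TD error at token t.
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of TD errors (weight gamma * lam).
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# A 4-token response with a correct final answer: the reward arrives
# only at the last token, as is typical for outcome-based rewards.
rewards = [0.0, 0.0, 0.0, rule_based_reward(" 42", "42")]
values = [0.5, 0.25, 0.75, 0.5]  # toy critic outputs (assumed)
advantages = gae_advantages(rewards, values)
print(advantages)  # [0.5, 0.75, 0.25, 0.5]
```

In this setting each advantage equals the total reward (1.0) minus the value at that token, so the critic acts purely as a per-token baseline and no bootstrapping bias enters the estimate.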

Related papers

SRPO: A Cross-Domain Implementation of Large-Scale Reinforcement Learning on LLM [18.275547804539016]
Two-Staged history-Resampling Policy Optimization (SRPO) surpasses the performance of DeepSeek-R1-Zero-32B on the AIME24 and LiveCodeBench benchmarks. We introduce two key methodological innovations: (1) a two-stage cross-domain training paradigm designed to balance the development of mathematical reasoning and coding proficiency, and (2) History Resampling (HR), a technique to address ineffective samples.
arXiv Detail & Related papers (2025-04-19T13:06:03Z)
Understanding R1-Zero-Like Training: A Critical Perspective [38.515771096651356]
We critically examine R1-Zero-like training by analyzing its two core components: base models and RL. We investigate a wide range of base models, including DeepSeek-V3-Base, to understand how pretraining characteristics influence RL performance. We present a minimalist R1-Zero recipe that achieves 43.3% accuracy on AIME 2024 with a 7B base model.
arXiv Detail & Related papers (2025-03-26T17:59:14Z)
R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model [70.77691645678804]
We present the first successful replication of emergent characteristics for multimodal reasoning on only a non-SFT 2B model. Our model achieves 59.47% accuracy on CVBench, outperforming the base model by approximately 30% and exceeding both SFT settings by 2%. In addition, we share our failed attempts and insights in attempting to achieve R1-like reasoning using RL with instruct models.
arXiv Detail & Related papers (2025-03-07T04:21:47Z)
START: Self-taught Reasoner with Tools [51.38785489790888]
We introduce START (Self-Taught Reasoner with Tools), a tool-integrated long Chain-of-thought (CoT) reasoning LLM. START is capable of performing complex computations, self-checking, exploring diverse methods, and self-debugging. It significantly outperforms the base QwQ-32B and achieves performance comparable to the state-of-the-art open-weight model R1-Distill-Qwen-32B.
arXiv Detail & Related papers (2025-03-06T17:11:51Z)
Think Inside the JSON: Reinforcement Strategy for Strict LLM Schema Adherence [0.0]
We address the challenge of enforcing strict schema adherence in large language model (LLM) generation by leveraging reasoning capabilities. Our approach trains structured reasoning skills of a 1.5B parameter model through a novel pipeline. We compare our ThinkJSON approach against the original DeepSeek R1 (671B), distilled versions of DeepSeek R1 (Qwen-1.5B and Qwen-7B), and Gemini 2.0 Flash (70B).
arXiv Detail & Related papers (2025-02-18T16:44:55Z)
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning [147.16121855209246]
We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero is trained via large-scale reinforcement learning. DeepSeek-R1 incorporates multi-stage training and cold-start data before RL.
arXiv Detail & Related papers (2025-01-22T15:19:35Z)
Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling [52.34735382627312]
Large language models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks. Existing approaches mainly rely on imitation learning and struggle to achieve effective test-time scaling. We present T1 to scale reinforcement learning by encouraging exploration and understand inference scaling.
arXiv Detail & Related papers (2025-01-20T18:33:33Z)
Model Extrapolation Expedites Alignment [135.12769233630362]
We propose a method called ExPO to expedite alignment training with human preferences. We demonstrate that ExPO boosts a DPO model trained with only 20% steps to outperform the fully-trained one. We show that ExPO notably improves existing open-source LLMs on the leading AlpacaEval 2.0 and MT-Bench benchmarks.
arXiv Detail & Related papers (2024-04-25T17:39:50Z)
DeepZero: Scaling up Zeroth-Order Optimization for Deep Model Training [33.11416096294998]
Zeroth-order (ZO) optimization has become a popular technique for solving machine learning (ML) problems. No prior work has demonstrated the effectiveness of ZO optimization in training deep neural networks (DNNs) without a significant decrease in performance. We develop DeepZero, a principled ZO deep learning (DL) framework that can scale ZO optimization to DNN training from scratch.
arXiv Detail & Related papers (2023-10-03T13:05:36Z)

This list is automatically generated from the titles and abstracts of the papers on this site.