DAPO: An Open-Source LLM Reinforcement Learning System at Scale
- URL: http://arxiv.org/abs/2503.14476v1
- Date: Tue, 18 Mar 2025 17:49:06 GMT
- Title: DAPO: An Open-Source LLM Reinforcement Learning System at Scale
- Authors: Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Weinan Dai, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu, Wei-Ying Ma, Ya-Qin Zhang, Lin Yan, Mu Qiao, Yonghui Wu, Mingxuan Wang
- Abstract summary: We open-source a state-of-the-art large-scale RL system that achieves 50 points on AIME 2024 using the Qwen2.5-32B base model. In addition, we open-source our training code, which is built on the verl framework, along with a carefully curated and processed dataset.
- Score: 63.24798333145823
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Inference scaling empowers LLMs with unprecedented reasoning ability, with reinforcement learning as the core technique to elicit complex reasoning. However, key technical details of state-of-the-art reasoning LLMs are concealed (such as in the OpenAI o1 blog and the DeepSeek R1 technical report), so the community still struggles to reproduce their RL training results. We propose the $\textbf{D}$ecoupled Clip and $\textbf{D}$ynamic s$\textbf{A}$mpling $\textbf{P}$olicy $\textbf{O}$ptimization ($\textbf{DAPO}$) algorithm, and fully open-source a state-of-the-art large-scale RL system that achieves 50 points on AIME 2024 using the Qwen2.5-32B base model. Unlike previous works that withhold training details, we introduce four key techniques of our algorithm that make large-scale LLM RL a success. In addition, we open-source our training code, which is built on the verl framework, along with a carefully curated and processed dataset. These components of our open-source system enhance reproducibility and support future research in large-scale LLM RL.
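The algorithm's name itself points to two of its ideas: a decoupled (asymmetric) clipping range and dynamic sampling. The sketch below is a minimal illustration of those two pieces only; the specific epsilon values, the token-level averaging, and the rule of dropping prompts whose sampled responses all receive the same reward are assumptions for illustration, not the released verl implementation.

```python
# Minimal sketch of decoupled clipping and dynamic sampling (illustrative only).
import torch

def decoupled_clip_loss(logp_new, logp_old, advantages, mask,
                        eps_low=0.2, eps_high=0.28):
    """PPO-style surrogate with separate lower/upper clip ranges.
    logp_new, logp_old, advantages, mask: tensors of shape [batch, seq_len];
    the loss is averaged over valid (non-padding) tokens."""
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
    per_token = torch.minimum(ratio * advantages, clipped * advantages)
    return -(per_token * mask).sum() / mask.sum()

def dynamic_sampling_filter(group_rewards):
    """Keep only prompts whose group of sampled responses has a non-zero
    reward spread, so every retained group carries a learning signal."""
    return [i for i, r in enumerate(group_rewards) if r.max() > r.min()]

# Toy usage: 3 prompts, 4 sampled responses each, 0/1 outcome rewards.
rewards = torch.tensor([[1., 1., 1., 1.],   # all correct -> filtered out
                        [0., 1., 0., 1.],   # mixed       -> kept
                        [0., 0., 0., 0.]])  # all wrong   -> filtered out
print(dynamic_sampling_filter(rewards))     # -> [1]
```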
Related papers
- Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't [0.0]
Our study investigates the potential of reinforcement learning to improve reasoning in small LLMs.
Training on 4 NVIDIA A40 GPUs (48 GB VRAM each) within 24 hours resulted in rapid reasoning gains.
These findings highlight the efficacy of RL-based fine-tuning for small LLMs, offering a cost-effective alternative to large-scale approaches.
arXiv Detail & Related papers (2025-03-20T15:13:23Z)
- R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning [87.30285670315334]
R1-Searcher is a novel two-stage outcome-based RL approach designed to enhance the search capabilities of Large Language Models. Our framework relies exclusively on RL, without requiring process rewards or distillation for a cold start. Our experiments demonstrate that our method significantly outperforms previous strong RAG methods, even when compared to the closed-source GPT-4o-mini.
arXiv Detail & Related papers (2025-03-07T17:14:44Z)
- SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution [46.5893728376551]
This paper introduces SWE-RL, the first approach to scale RL-based large language models (LLMs) for real-world software engineering. Llama3-SWE-RL-70B achieves a 41.0% solve rate on SWE-bench Verified -- a human-verified collection of real-world GitHub issues. Surprisingly, despite performing RL solely on software evolution data, Llama3-SWE-RL has even developed generalized reasoning skills.
arXiv Detail & Related papers (2025-02-25T18:45:04Z)
- Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search [57.28671084993782]
Large language models (LLMs) have demonstrated remarkable reasoning capabilities across diverse domains.
Recent studies have shown that increasing test-time computation enhances LLMs' reasoning capabilities.
We propose a two-stage training paradigm: 1) a small-scale format tuning stage to internalize the COAT reasoning format and 2) a large-scale self-improvement stage leveraging reinforcement learning.
arXiv Detail & Related papers (2025-02-04T17:26:58Z)
- Control LLM: Controlled Evolution for Intelligence Retention in LLM [4.67235851066221]
We propose Control LLM, a novel approach that leverages parallel pre-trained and expanded transformer blocks. Experiments demonstrate the effectiveness of Control LLM in both Continuous Pre-training (CPT) and Continuous Supervised Fine-Tuning (CSFT). It surpasses existing methods and achieves SOTA among open-source models tuned from the same base model, using substantially less data and compute.
arXiv Detail & Related papers (2025-01-19T08:06:06Z)
- KnowCoder: Coding Structured Knowledge into LLMs for Universal Information Extraction [59.039355258637315]
We propose KnowCoder, a Large Language Model (LLM) to conduct Universal Information Extraction (UIE) via code generation.
KnowCoder introduces a code-style schema representation method to uniformly transform different schemas into Python classes.
KnowCoder contains a two-phase learning framework that enhances its schema understanding ability via code pretraining and its schema following ability via instruction tuning.
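A hypothetical illustration of such a code-style schema is sketched below; the class names, fields, and docstrings are invented for this example and are not taken from the KnowCoder release.

```python
# Illustrative example of representing an extraction schema as Python classes.
from dataclasses import dataclass

@dataclass
class Entity:
    """Base type for all extracted entities."""
    mention: str

@dataclass
class Organization(Entity):
    """A company, institution, or other organized group mentioned in the text."""

@dataclass
class EmploymentRelation:
    """Relation: `person` works for `employer`."""
    person: Entity
    employer: Organization

# An LLM conditioned on such class definitions is then prompted to emit
# code-style extractions, e.g.
# EmploymentRelation(person=Entity("Alice"), employer=Organization("Acme Corp"))
```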
arXiv Detail & Related papers (2024-03-12T14:56:34Z)
- Teaching Large Language Models to Reason with Reinforcement Learning [38.17625148525193]
Reinforcement Learning from Human Feedback (RLHF) has emerged as a dominant approach for aligning LLM outputs with human preferences.
Inspired by the success of RLHF, we study the performance of multiple algorithms that learn from feedback.
arXiv Detail & Related papers (2024-03-07T16:36:29Z)
- How Can LLM Guide RL? A Value-Based Approach [68.55316627400683]
Reinforcement learning (RL) has become the de facto standard practice for sequential decision-making problems by improving future acting policies with feedback.
Recent developments in large language models (LLMs) have showcased impressive capabilities in language understanding and generation, yet they fall short in exploration and self-improvement capabilities.
We develop an algorithm named LINVIT that incorporates LLM guidance as a regularization factor in value-based RL, leading to significant reductions in the amount of data needed for learning.
arXiv Detail & Related papers (2024-02-25T20:07:13Z)
- Unleashing the Power of Pre-trained Language Models for Offline Reinforcement Learning [50.9692060692705]
This paper introduces Language Models for Motion Control (LaMo), a general framework based on Decision Transformers for offline RL. Our framework highlights four crucial components, including (1) initializing Decision Transformers with sequentially pre-trained LMs and (2) employing the LoRA fine-tuning method. In particular, our method demonstrates superior performance in scenarios with limited data samples.
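A rough sketch of those two components is shown below; the base model ("gpt2"), the LoRA hyperparameters, and the target modules are illustrative assumptions, not LaMo's actual configuration.

```python
# Sketch: reuse a pre-trained LM as a Decision Transformer backbone + LoRA.
from transformers import GPT2Model
from peft import LoraConfig, get_peft_model

# (1) initialize the trajectory backbone from a sequentially pre-trained LM
backbone = GPT2Model.from_pretrained("gpt2")

# (2) wrap it with LoRA so only low-rank adapters are trained during offline RL
lora_cfg = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                      target_modules=["c_attn"])
backbone = get_peft_model(backbone, lora_cfg)
backbone.print_trainable_parameters()

# A Decision Transformer would then embed (return-to-go, state, action) tokens,
# feed them through `backbone`, and predict actions from the hidden states.
```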
arXiv Detail & Related papers (2023-10-31T16:24:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.