R-Zero: Self-Evolving Reasoning LLM from Zero Data
- URL: http://arxiv.org/abs/2508.05004v1
- Date: Thu, 07 Aug 2025 03:38:16 GMT
- Title: R-Zero: Self-Evolving Reasoning LLM from Zero Data
- Authors: Chengsong Huang, Wenhao Yu, Xiaoyang Wang, Hongming Zhang, Zongxia Li, Ruosen Li, Jiaxin Huang, Haitao Mi, Dong Yu
- Abstract summary: Self-evolving Large Language Models (LLMs) offer a scalable path toward super-intelligence by autonomously generating, refining, and learning from their own experiences. Existing methods for training such models still rely heavily on vast human-curated tasks and labels. We introduce R-Zero, a fully autonomous framework that generates its own training data from scratch.
- Score: 56.74402018426378
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Self-evolving Large Language Models (LLMs) offer a scalable path toward super-intelligence by autonomously generating, refining, and learning from their own experiences. However, existing methods for training such models still rely heavily on vast human-curated tasks and labels, typically via fine-tuning or reinforcement learning, which poses a fundamental bottleneck to advancing AI systems toward capabilities beyond human intelligence. To overcome this limitation, we introduce R-Zero, a fully autonomous framework that generates its own training data from scratch. Starting from a single base LLM, R-Zero initializes two independent models with distinct roles, a Challenger and a Solver. These models are optimized separately and co-evolve through interaction: the Challenger is rewarded for proposing tasks near the edge of the Solver's capability, and the Solver is rewarded for solving increasingly challenging tasks posed by the Challenger. This process yields a targeted, self-improving curriculum without any pre-existing tasks or labels. Empirically, R-Zero substantially improves reasoning capability across different backbone LLMs, e.g., boosting Qwen3-4B-Base by +6.49 on math-reasoning benchmarks and +7.54 on general-domain reasoning benchmarks.
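The Challenger's incentive can be made concrete with a small sketch. The Python snippet below illustrates one natural reward for "tasks near the edge of the Solver's capability": sample several Solver answers per generated question, take the majority vote as a pseudo-label, and reward the Challenger most when the Solver agrees with that pseudo-label about half the time. This is a minimal illustration under those assumptions; the function name and exact reward shape are not taken from the paper's implementation.

```python
from collections import Counter

def challenger_reward(solver_answers: list[str]) -> float:
    """Uncertainty-style reward for one generated task: maximal when the
    Solver answers consistently only about half the time, i.e. the task
    sits near the edge of the Solver's capability. (Illustrative sketch.)"""
    counts = Counter(solver_answers)
    _pseudo_label, n_majority = counts.most_common(1)[0]  # majority-vote label
    p_hat = n_majority / len(solver_answers)   # empirical self-consistency
    return 1.0 - 2.0 * abs(p_hat - 0.5)        # 1.0 at p=0.5, 0.0 at p=0 or 1

# Example: 10 sampled Solver answers to one Challenger-generated question.
answers = ["42"] * 6 + ["41", "43", "40", "44"]
print(challenger_reward(answers))  # 0.8 -- a usefully hard task
```

Tasks the Solver always gets right (or always gets wrong) earn the Challenger nothing, so the generated curriculum tracks the Solver's frontier as it improves.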
Related papers
- Post-Training Large Language Models via Reinforcement Learning from Self-Feedback [3.73824942136665]
Large Language Models (LLMs) often produce plausible but poorly calibrated answers. We present Reinforcement Learning from Self-Feedback (RLSF), a post-training stage that uses the model's own confidence as an intrinsic reward.
arXiv Detail & Related papers (2025-07-29T15:46:26Z)
- L0: Reinforcement Learning to Become General Agents [9.210885263681119]
We introduce L-Zero (L0), a scalable, end-to-end training pipeline for general-purpose agents. NB-Agent, the agent scaffold within L0, operates in a "code-as-action" fashion via a Read-Eval-Print Loop (REPL). Experiments demonstrate that a base model can develop robust problem-solving skills using solely Reinforcement Learning with Verifiable Rewards (RLVR).
arXiv Detail & Related papers (2025-06-30T09:44:32Z)
- Learning to Reason without External Rewards [100.27210579418562]
Training large language models (LLMs) for complex reasoning via Reinforcement Learning with Verifiable Rewards (RLVR) is effective but limited by reliance on costly, domain-specific supervision. We explore Reinforcement Learning from Internal Feedback (RLIF), a framework that enables LLMs to learn from intrinsic signals without external rewards or labeled data. We propose Intuitor, an RLIF method that uses a model's own confidence, termed self-certainty, as its sole reward signal.
arXiv Detail & Related papers (2025-05-26T07:01:06Z)
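Both RLSF and Intuitor above reward a model with its own confidence instead of an external label. The sketch below computes one concrete version of such a signal: the mean KL divergence between a uniform distribution and the model's per-step next-token distributions, so that peaked (confident) predictions score higher than flat ones. The formula choice and function name here are illustrative assumptions, not code from either paper.

```python
import numpy as np

def self_certainty(logits: np.ndarray) -> float:
    """Mean KL(Uniform || p_t) over decoding steps, given per-step logits
    of shape (steps, vocab): higher when the model's next-token
    distributions are peaked, i.e. when the model is confident."""
    logits = logits - logits.max(axis=-1, keepdims=True)             # stability
    logp = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    vocab = logits.shape[-1]
    # KL(U || p) = -log(V) - (1/V) * sum_v log p(v)
    kl = -np.log(vocab) - logp.mean(axis=-1)
    return float(kl.mean())

# A peaked distribution scores higher than a flat one (flat scores 0.0).
peaked = np.zeros((1, 5)); peaked[0, 0] = 10.0
flat = np.zeros((1, 5))
print(self_certainty(peaked) > self_certainty(flat))  # True
```

Used as a reward, this quantity needs no verifier or labeled data, which is what makes the RLIF setting label-free.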
- Absolute Zero: Reinforced Self-play Reasoning with Zero Data [61.46462130246158]
Reinforcement learning with verifiable rewards (RLVR) has shown promise in enhancing the reasoning capabilities of large language models. We introduce the Absolute Zero Reasoner (AZR), a system that self-evolves its training curriculum and reasoning ability. AZR achieves overall SOTA performance on coding and mathematical reasoning tasks, outperforming existing zero-setting models.
arXiv Detail & Related papers (2025-05-06T09:08:00Z)
- Generalising from Self-Produced Data: Model Training Beyond Human Constraints [0.0]
This paper introduces a novel framework in which AI models autonomously generate and validate new knowledge. Central to this approach is an unbounded, ungamable numeric reward that guides learning without requiring human benchmarks.
arXiv Detail & Related papers (2025-04-07T03:48:02Z)
- Scaling Autonomous Agents via Automatic Reward Modeling And Planning [52.39395405893965]
Large language models (LLMs) have demonstrated remarkable capabilities across a range of tasks. However, they still struggle with problems requiring multi-step decision-making and environmental feedback. We propose a framework that can automatically learn a reward model from the environment without human annotations.
arXiv Detail & Related papers (2025-02-17T18:49:25Z)
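The entry above learns a reward model from environment interaction rather than from human labels. A common way to fit such a model from automatically collected (preferred, rejected) trajectory pairs is a Bradley-Terry objective; the sketch below trains a linear reward head on synthetic pairs with plain NumPy. The pairing heuristic and features are stand-ins (assumptions), not the paper's actual pipeline, where trajectories would come from an agent and an LLM encoder.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=8) * 0.01            # linear reward head (illustrative)

def reward(x: np.ndarray) -> float:
    """Scalar reward for one trajectory encoding."""
    return float(W @ x)

def update(x_pos: np.ndarray, x_neg: np.ndarray, lr: float = 0.1) -> None:
    """One gradient step on the Bradley-Terry loss -log sigmoid(r+ - r-)."""
    global W
    margin = reward(x_pos) - reward(x_neg)
    grad_margin = -1.0 / (1.0 + np.exp(margin))   # d(loss)/d(margin)
    W -= lr * grad_margin * (x_pos - x_neg)

# Synthetic data: the "preferred" trajectory aligns with a hidden direction.
true_dir = rng.normal(size=8)
for _ in range(500):
    a, b = rng.normal(size=8), rng.normal(size=8)
    pos, neg = (a, b) if a @ true_dir > b @ true_dir else (b, a)
    update(pos, neg)
print(reward(true_dir) > reward(-true_dir))  # True: preference recovered
```

Once fitted, the learned reward can score candidate actions during planning, removing the need for per-task human annotation.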
- SALMON: Self-Alignment with Instructable Reward Models [80.83323636730341]
This paper presents a novel approach, namely SALMON, to align base language models with minimal human supervision.
We develop an AI assistant named Dromedary-2 with only 6 exemplars for in-context learning and 31 human-defined principles.
arXiv Detail & Related papers (2023-10-09T17:56:53Z)
- SELF: Self-Evolution with Language Feedback [68.6673019284853]
'SELF' (Self-Evolution with Language Feedback) is a novel approach to advance large language models.
It enables LLMs to self-improve through self-reflection, akin to human learning processes.
Our experiments in mathematics and general tasks demonstrate that SELF can enhance the capabilities of LLMs without human intervention.
arXiv Detail & Related papers (2023-10-01T00:52:24Z)