CPMobius: Iterative Coach-Player Reasoning for Data-Free Reinforcement Learning
- URL: http://arxiv.org/abs/2602.02979v1
- Date: Tue, 03 Feb 2026 01:38:53 GMT
- Title: CPMobius: Iterative Coach-Player Reasoning for Data-Free Reinforcement Learning
- Authors: Ran Li, Zeyuan Liu, Yinghao Chen, Bingxiang He, Jiarui Yuan, Zixuan Fu, Weize Chen, Jinyi Hu, Zhiyuan Liu, Maosong Sun
- Abstract summary: We introduce CPMöbius (CPMobius), a collaborative Coach-Player paradigm for data-free reinforcement learning of reasoning models. Unlike traditional adversarial self-play, CPMöbius treats the Coach and Player as independent but cooperative roles. CPMöbius achieves substantial improvement without relying on any external training data, outperforming existing unsupervised approaches.
- Score: 55.425576693143285
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have demonstrated strong potential in complex reasoning, yet their progress remains fundamentally constrained by reliance on massive, high-quality, human-curated tasks and labels, whether through supervised fine-tuning (SFT) or reinforcement learning (RL) on reasoning-specific data. This dependence renders supervision-heavy training paradigms increasingly unsustainable, with signs of diminishing scalability already evident in practice. To overcome this limitation, we introduce CPMöbius (CPMobius), a collaborative Coach-Player paradigm for data-free reinforcement learning of reasoning models. Unlike traditional adversarial self-play, CPMöbius, inspired by coach-athlete collaboration in real-world sports and by multi-agent collaboration, treats the Coach and Player as independent but cooperative roles. The Coach proposes instructions targeted at the Player's capability and receives rewards based on changes in the Player's performance, while the Player is rewarded for solving the increasingly instructive tasks generated by the Coach. This cooperative optimization loop is designed to directly enhance the Player's mathematical reasoning ability. Remarkably, CPMöbius achieves substantial improvement without relying on any external training data, outperforming existing unsupervised approaches. For example, on Qwen2.5-Math-7B-Instruct, our method improves accuracy by an overall average of +4.9 and an out-of-distribution average of +5.4, exceeding RENT by +1.5 on overall accuracy and R-Zero by +4.2 on OOD accuracy.
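The cooperative loop is easiest to see in miniature. The sketch below follows the abstract's description: the Coach is rewarded by the change in the Player's evaluated performance, and the Player improves by solving tasks pitched just above its current ability. The scalar skill model, the update rules, and every name here are illustrative assumptions; the actual method trains two LLM roles with reinforcement learning.

```python
# Toy sketch of the Coach-Player cooperative loop described above.
# The scalar "skill" model and all update rules are illustrative
# assumptions; the real method trains two LLM roles with RL.
import random

class Player:
    """Player whose skill improves when it solves tasks above its level."""
    def __init__(self):
        self.skill = 0.3                      # latent ability in [0, 1]

    def attempt(self, difficulty, learn=True):
        # Chance of solving decays as difficulty exceeds current skill.
        p = max(0.0, min(1.0, 1.0 - (difficulty - self.skill)))
        solved = random.random() < p
        if learn and solved and difficulty > self.skill:
            self.skill = min(1.0, self.skill + 0.02)   # learning step
        return solved

class Coach:
    """Coach that adapts task difficulty to the Player's ability."""
    def __init__(self):
        self.difficulty = 0.35                # start near the Player's level

    def update(self, delta_perf):
        # Coach reward = change in Player performance: push difficulty
        # up while the Player keeps improving, ease off when it stalls.
        step = 0.05 if delta_perf > 0 else -0.05
        self.difficulty = max(0.0, min(1.0, self.difficulty + step))

def evaluate(player, n=500):
    """Held-out success rate on a fixed-difficulty benchmark."""
    return sum(player.attempt(0.6, learn=False) for _ in range(n)) / n

random.seed(0)
coach, player = Coach(), Player()
perf = evaluate(player)
for _ in range(100):                          # cooperative optimization loop
    for _ in range(20):                       # Player trains on Coach's tasks
        player.attempt(coach.difficulty)
    new_perf = evaluate(player)
    coach.update(new_perf - perf)             # Coach rewarded by Player's gain
    perf = new_perf
print(f"final skill={player.skill:.2f}, benchmark accuracy={perf:.2f}")
```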
Related papers
- When Actions Teach You to Think: Reasoning-Action Synergy via Reinforcement Learning in Conversational Agents [2.689316553293938]
Supervised fine-tuning (SFT) has emerged as one of the most effective ways to improve the performance of large language models (LLMs) in downstream tasks. We propose a pipeline in which LLMs generate reasoning steps that guide both the invocation of tools and the final answer generation for conversational agents.
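The pipeline's shape can be sketched in a few lines: the model alternates reasoning steps and tool calls until it commits to an answer. The `fake_llm` stub, the JSON step format, and the one-entry tool registry below are hypothetical stand-ins, not the paper's implementation.

```python
# Sketch of a reasoning-guided tool-use loop, as the summary describes.
# `fake_llm` is a hard-wired stub standing in for the trained model; the
# tool registry and prompt format are illustrative assumptions.
import json

TOOLS = {"calculator": lambda expr: str(eval(expr, {"__builtins__": {}}))}

def fake_llm(history: str) -> str:
    # Stand-in: a real system would sample from the fine-tuned LLM here.
    if "Observation" not in history:
        return json.dumps({"thought": "I should compute 17*23 first.",
                           "action": "calculator", "input": "17*23"})
    return json.dumps({"thought": "I now know the product.",
                       "answer": "17*23 = 391"})

def run(question: str) -> str:
    history = f"Question: {question}\n"
    for _ in range(5):                           # cap the reasoning-action turns
        step = json.loads(fake_llm(history))
        if "answer" in step:                     # reasoning ends in an answer
            return step["answer"]
        obs = TOOLS[step["action"]](step["input"])  # reasoning picks the tool
        history += f"Thought: {step['thought']}\nObservation: {obs}\n"
    return "no answer"

print(run("What is 17 * 23?"))
```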
arXiv Detail & Related papers (2025-12-12T04:44:40Z)
- More Than One Teacher: Adaptive Multi-Guidance Policy Optimization for Diverse Exploration [103.1589018460702]
"guidance-on-demand" approach expands exploration while preserving the value of self-discovery.<n>Experiments show AMPO substantially outperforms a strong baseline.<n>Using four peer-sized teachers, our method achieves comparable results to approaches that leverage a single, more powerful teacher.
arXiv Detail & Related papers (2025-10-02T17:14:00Z)
- SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning [27.20778530252474]
SPIRAL is a self-play framework where models learn by playing multi-turn, zero-sum games against continuously improving versions of themselves. Using SPIRAL, self-play on zero-sum games produces reasoning capabilities that transfer broadly. Analysis reveals that this transfer occurs through three cognitive patterns: systematic decomposition, expected value calculation, and case-by-case analysis.
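The self-play dynamic can be demonstrated in miniature. In the sketch below, a learner plays a zero-sum game (matching pennies, on the mismatching side) against a frozen copy of itself that is periodically refreshed, so the opponent keeps improving alongside it; the Bernoulli policy and REINFORCE-style update are toy assumptions, far from the LLM setting the paper studies.

```python
# Toy self-play on a zero-sum game: the learner (mismatching side of
# matching pennies) trains against a frozen earlier copy of itself that
# is periodically refreshed. The tabular policy and simplified
# REINFORCE update are illustrative assumptions; the paper trains LLMs.
import random

random.seed(0)
p = 0.9                  # learner's P(heads); it wins when actions differ
frozen = p               # opponent: frozen snapshot of an earlier self
lr, wins, n = 0.02, 0, 5000
for step in range(1, n + 1):
    a = int(random.random() < p)        # learner's action
    b = int(random.random() < frozen)   # frozen opponent's action
    r = 1 if a != b else -1             # zero-sum reward for the learner
    # Simplified REINFORCE surrogate on the Bernoulli parameter.
    p = min(0.99, max(0.01, p + lr * (a - p) * r))
    wins += r == 1
    if step % 500 == 0:
        frozen = p       # the opponent "continuously improves" alongside
print(f"final P(heads)={p:.2f}, win rate vs improving self={wins/n:.2f}")
```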
arXiv Detail & Related papers (2025-06-30T17:58:13Z)
- RM-R1: Reward Modeling as Reasoning [81.50471199906738]
Reasoning Reward Models (ReasRMs) formulate reward modeling as a reasoning task. We propose a reasoning-oriented training pipeline and train a family of ReasRMs, RM-R1. Our models achieve state-of-the-art performance across three reward model benchmarks on average.
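The core idea can be sketched briefly: instead of emitting a bare scalar, the judge writes out its reasoning and ends with a verdict that is parsed into the reward signal. The prompt template and the canned `judge` stub below are hypothetical stand-ins for the trained RM-R1 models.

```python
# Sketch of "reward modeling as reasoning": the judge produces a
# rationale followed by a parseable verdict, and only the verdict is
# consumed as the preference signal. `judge` is a canned stub and the
# rubric-style prompt is an assumption, not the paper's template.
import re

PROMPT = """Compare the two answers. First reason step by step,
then end with "Verdict: A" or "Verdict: B".
Question: {q}
Answer A: {a}
Answer B: {b}"""

def judge(prompt: str) -> str:
    # Stand-in for a trained reasoning reward model.
    return ("Answer A computes 6*7=42 correctly; Answer B drops a term.\n"
            "Verdict: A")

def preference(q: str, a: str, b: str) -> str:
    text = judge(PROMPT.format(q=q, a=a, b=b))
    m = re.search(r"Verdict:\s*([AB])", text)
    return m.group(1) if m else "tie"   # fall back when parsing fails

print(preference("What is 6*7?", "42", "35"))  # -> A
```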
arXiv Detail & Related papers (2025-05-05T06:11:12Z)
- GAT: Guided Adversarial Training with Pareto-optimal Auxiliary Tasks [73.88590165742721]
We propose a novel adversarial training technique that exploits auxiliary tasks under a limited set of training data.
Our approach extends single-task models into multi-task models during the min-max optimization of adversarial training.
We demonstrate that guided multi-task learning is an actionable and promising avenue to push further the boundaries of model robustness.
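A minimal sketch of that min-max structure with one auxiliary task is shown below: the inner step crafts a perturbation against a weighted sum of the main and auxiliary losses, and the outer step minimizes it on the perturbed batch. The tiny two-head logistic model, the fixed task weight, and the synthetic data are illustrative assumptions (the paper selects Pareto-optimal task weightings rather than fixing one).

```python
# Sketch of adversarial training extended with an auxiliary task: the
# inner FGSM-style step maximizes a weighted main+auxiliary loss, the
# outer step minimizes it. Model, data, and the fixed weight `lam` are
# toy assumptions, not the paper's Pareto-optimal weighting scheme.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 10))
y_main = (X[:, 0] > 0).astype(float)          # main task labels
y_aux  = (X[:, 1] > 0).astype(float)          # stand-in auxiliary labels
w_main, w_aux = np.zeros(10), np.zeros(10)
eps, lam, lr = 0.25, 0.5, 0.5                 # budget, task weight, step size

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def grads(Xb, w, y):
    """Logistic loss plus its gradients w.r.t. weights and inputs."""
    p = sigmoid(Xb @ w)
    loss = -np.mean(y*np.log(p+1e-9) + (1-y)*np.log(1-p+1e-9))
    return loss, Xb.T @ (p-y)/len(y), np.outer(p-y, w)/len(y)

for step in range(200):
    # Inner max: one signed-gradient step against the combined objective.
    _, _, gXm = grads(X, w_main, y_main)
    _, _, gXa = grads(X, w_aux,  y_aux)
    X_adv = X + eps * np.sign(gXm + lam * gXa)
    # Outer min: update both heads on the perturbed batch.
    lm, gwm, _ = grads(X_adv, w_main, y_main)
    la, gwa, _ = grads(X_adv, w_aux,  y_aux)
    w_main -= lr * gwm
    w_aux  -= lr * lam * gwa
print(f"robust main-task loss after training: {lm:.3f}")
```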
arXiv Detail & Related papers (2023-02-06T16:23:24Z)
- Probing Transfer in Deep Reinforcement Learning without Task Engineering [26.637254541454773]
We evaluate the use of original game curricula supported by the Atari 2600 console as a heterogeneous transfer benchmark for deep reinforcement learning agents.
Game designers created curricula using combinations of several discrete modifications to the basic versions of games such as Space Invaders, Breakout and Freeway.
We show that zero-shot transfer from the basic games to their variations is possible, but the variance in performance is also largely explained by interactions between factors.
arXiv Detail & Related papers (2022-10-22T13:40:12Z)
- Multi-trainer Interactive Reinforcement Learning System [7.3072544716528345]
We propose a more effective interactive reinforcement learning system by introducing multiple trainers.
In particular, our trainer-feedback aggregation experiments show that our aggregation method achieves the highest accuracy among the compared schemes.
Finally, we conduct a grid-world experiment to show that the policy trained by the multi-trainer interactive RL (MTIRL) system with the review model is closer to the optimal policy than the one trained without a review model.
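The core idea, aggregating feedback from trainers of unequal reliability, is sketched below; the reliability-weighted vote and its update rule are illustrative assumptions standing in for the paper's review model, not its exact aggregation method.

```python
# Sketch of aggregating binary action feedback from several trainers:
# votes are weighted by a running trust estimate that moves toward each
# trainer's agreement with the consensus. Hidden accuracies and the
# trust update rule are toy assumptions.
import random

random.seed(0)
TRAINERS = {"t1": 0.9, "t2": 0.8, "t3": 0.55}    # hidden accuracy per trainer
reliability = {t: 0.5 for t in TRAINERS}          # learned trust, starts neutral

def feedback(truth: bool, acc: float) -> bool:
    """A trainer reports the truth with probability `acc`."""
    return truth if random.random() < acc else not truth

for episode in range(500):
    truth = random.random() < 0.5                 # was the action truly good?
    votes = {t: feedback(truth, acc) for t, acc in TRAINERS.items()}
    # Weighted vote: trusted trainers count for more.
    score = sum((1 if v else -1) * reliability[t] for t, v in votes.items())
    consensus = score > 0
    for t, v in votes.items():                    # update trust toward consensus
        reliability[t] += 0.01 * ((v == consensus) - reliability[t])
print({t: round(r, 2) for t, r in reliability.items()})
```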
arXiv Detail & Related papers (2022-10-14T18:32:59Z)
- Predicting Game Engagement and Difficulty Using AI Players [3.0501851690100277]
This paper presents a novel approach to automated playtesting for the prediction of human player behavior and experience.
It has previously been demonstrated that Deep Reinforcement Learning game-playing agents can predict both game difficulty and player engagement.
We improve this approach by enhancing DRL with Monte Carlo Tree Search (MCTS).
arXiv Detail & Related papers (2021-07-26T09:31:57Z)
- PEBBLE: Feedback-Efficient Interactive Reinforcement Learning via Relabeling Experience and Unsupervised Pre-training [94.87393610927812]
We present an off-policy, interactive reinforcement learning algorithm that capitalizes on the strengths of both feedback and off-policy learning.
We demonstrate that our approach is capable of learning tasks of higher complexity than previously considered by human-in-the-loop methods.
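The two ingredients the title names, preference-based reward learning and relabeling stored experience, can be sketched in a few lines. Everything below (the 1-D features, the linear reward model, the synthetic "human" comparisons) is an illustrative assumption rather than the PEBBLE implementation.

```python
# Sketch of preference-based reward learning with relabeling: fit a
# reward model from pairwise comparisons (Bradley-Terry), then relabel
# every stored transition so off-policy data stays usable. The 1-D
# features and synthetic "human" are toy assumptions.
import numpy as np

rng = np.random.default_rng(0)
true_w = 2.0                                   # hidden preference: bigger is better
buffer = [{"feat": rng.normal()} for _ in range(200)]   # replay transitions
w = 0.0                                        # learned linear reward weight

def learn_from_preference(s_a, s_b, lr=0.1):
    """One Bradley-Terry gradient step from a single comparison."""
    global w
    label = 1.0 if true_w*s_a > true_w*s_b else 0.0   # synthetic human choice
    p = 1 / (1 + np.exp(-(w*s_a - w*s_b)))            # P(a preferred)
    w += lr * (label - p) * (s_a - s_b)

for _ in range(300):                           # feedback-efficient: few queries
    a, b = rng.choice(len(buffer), 2, replace=False)
    learn_from_preference(buffer[a]["feat"], buffer[b]["feat"])

for tr in buffer:                              # relabel the whole replay buffer
    tr["reward"] = w * tr["feat"]
print(f"learned reward weight: {w:.2f} (sign matches the hidden preference)")
```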
arXiv Detail & Related papers (2021-06-09T14:10:50Z)
- Robust Pre-Training by Adversarial Contrastive Learning [120.33706897927391]
Recent work has shown that, when integrated with adversarial training, self-supervised pre-training can lead to state-of-the-art robustness.
We improve robustness-aware self-supervised pre-training by learning representations consistent under both data augmentations and adversarial perturbations.
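A minimal sketch of that consistency objective is below: each encoder update pulls both an augmented view and an adversarially perturbed view toward the clean embedding. The one-layer tanh encoder, the plain L2 consistency loss, and the single FGSM inner step are simplifying assumptions; the paper uses a full contrastive loss on deep networks.

```python
# Sketch of representation consistency under both augmentations and
# adversarial perturbations. The tiny tanh encoder, L2 consistency
# loss, and one-step FGSM attack are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.5, size=(4, 8))          # encoder: f(x) = tanh(Wx)
f = lambda x: np.tanh(W @ x)

def adversarial_view(x, eps=0.3):
    """One FGSM step (random start) maximizing embedding distance."""
    delta = rng.uniform(-eps, eps, size=x.shape)
    z, z0 = f(x + delta), f(x)
    grad_delta = W.T @ ((z - z0) * (1 - z**2))  # d/d(delta) of 0.5||z-z0||^2
    return x + np.clip(delta + eps*np.sign(grad_delta), -eps, eps)

for step in range(500):
    x = rng.normal(size=8)
    for view in (x + 0.1*rng.normal(size=8), adversarial_view(x)):
        z, z0 = f(view), f(x)
        diff = z - z0
        # Gradient step on 0.5||f(view)-f(x)||^2 w.r.t. W, treating the
        # clean-branch embedding z0 as a stop-gradient target.
        W -= 0.05 * np.outer(diff * (1 - z**2), view)
print(f"final view-consistency distance: {np.linalg.norm(diff):.3f}")
```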
arXiv Detail & Related papers (2020-10-26T04:44:43Z)
- Adversarial Robustness: From Self-Supervised Pre-Training to Fine-Tuning [134.15174177472807]
We introduce adversarial training into self-supervision, to provide general-purpose robust pre-trained models for the first time.
We conduct extensive experiments to demonstrate that the proposed framework achieves large performance margins.
arXiv Detail & Related papers (2020-03-28T18:28:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.