Better LLM Reasoning via Dual-Play
- URL: http://arxiv.org/abs/2511.11881v2
- Date: Wed, 19 Nov 2025 01:20:49 GMT
- Title: Better LLM Reasoning via Dual-Play
- Authors: Zhengxin Zhang, Chengyu Huang, Aochong Oliver Li, Claire Cardie,
- Abstract summary: We introduce PasoDoble, a novel dual-play framework for large language models. PasoDoble adversarially trains two models from the same base model. Experimental results show that PasoDoble can improve the reasoning performance of LLMs.
- Score: 13.152283780379278
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have achieved remarkable progress through Reinforcement Learning with Verifiable Rewards (RLVR), yet still rely heavily on external supervision (e.g., curated labels). Adversarial learning, particularly through self-play, offers a promising alternative that enables models to iteratively learn from themselves - thus reducing reliance on external supervision. Dual-play extends adversarial learning by assigning specialized roles to two models and training them against each other, fostering sustained competition and mutual evolution. Despite its promise, adapting dual-play training to LLMs remains limited, largely due to their susceptibility to reward hacking and training instability. In this paper, we introduce PasoDoble, a novel LLM dual-play framework. PasoDoble adversarially trains two models initialized from the same base model: a Proposer, which generates challenging questions with ground-truth answers, and a Solver, which attempts to solve them. We enrich the Proposer with knowledge from a pre-training dataset to ensure the questions' quality and diversity. To avoid reward hacking, the Proposer is rewarded for producing only valid questions that push the Solver's limit, while the Solver is rewarded for solving them correctly, and both are updated jointly. To further enhance training stability, we introduce an optional offline paradigm that decouples Proposer and Solver updates, alternately updating each for several steps while holding the other fixed. Notably, PasoDoble operates without supervision during training. Experimental results show that PasoDoble can improve the reasoning performance of LLMs. Our project page is available at https://hcy123902.github.io/PasoDoble.
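The reward structure and the optional offline schedule described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give formulas, so everything concrete here is an assumption made for illustration: the function names (`solver_rewards`, `proposer_reward`, `offline_schedule`), exact string matching as the correctness check, the rule that a question no Solver sample answers correctly is treated as invalid, and the `1 - pass_rate` difficulty reward. None of it should be read as the paper's actual implementation.

```python
from typing import List


def solver_rewards(answers: List[str], reference: str) -> List[float]:
    """Reward 1.0 for each sampled Solver answer that matches the
    Proposer's reference answer (exact string match is an assumption)."""
    return [1.0 if a.strip() == reference.strip() else 0.0 for a in answers]


def proposer_reward(answers: List[str], reference: str) -> float:
    """Hypothetical Proposer reward: zero for questions treated as invalid,
    otherwise higher when the Solver succeeds less often."""
    if not answers:
        return 0.0
    rewards = solver_rewards(answers, reference)
    pass_rate = sum(rewards) / len(rewards)
    if pass_rate == 0.0:
        # Assumption: if no Solver sample agrees with the reference answer,
        # the question is treated as invalid and earns nothing.
        return 0.0
    # Assumption: harder (but still solvable) questions earn more.
    return 1.0 - pass_rate


def offline_schedule(total_steps: int, block: int) -> List[str]:
    """Alternate which model is updated, `block` steps at a time,
    while the other model is held fixed."""
    return ["proposer" if (step // block) % 2 == 0 else "solver"
            for step in range(total_steps)]


if __name__ == "__main__":
    samples = ["4", "5", "5", "5"]
    print(solver_rewards(samples, "4"))
    print(proposer_reward(samples, "4"))
    print(offline_schedule(6, 2))
```

In this sketch a question that every Solver sample answers correctly also earns the Proposer nothing, which is one simple way to express "pushing the Solver's limit"; the paper may define validity and difficulty differently.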
Related papers
- Decouple to Generalize: Context-First Self-Evolving Learning for Data-Scarce Vision-Language Reasoning [41.523848964102]
Recent vision-language models (VLMs) achieve remarkable reasoning through reinforcement learning (RL). RL provides a feasible solution for realizing continuous self-evolving large vision-language models (LVLMs) in the era of experience. Existing strategies such as synthetic data and self-rewarding mechanisms suffer from limited distributions and alignment difficulties. We propose DoGe, a dual-decoupling framework that guides models to first learn from context rather than problem solving.
arXiv Detail & Related papers (2025-12-07T13:17:31Z)
- Grounded in Reality: Learning and Deploying Proactive LLM from Offline Logs [72.08224879435762]
Learn-to-Ask is a simulator-free framework for learning and deploying proactive dialogue agents. Our approach culminates in the successful deployment of LLMs into a live, large-scale online AI service.
arXiv Detail & Related papers (2025-10-29T12:08:07Z)
- Multi-Agent Evolve: LLM Self-Improve through Co-evolution [53.00458074754831]
Reinforcement Learning (RL) has demonstrated significant potential in enhancing the reasoning capabilities of large language models (LLMs). Recent Self-Play RL methods, inspired by the success of the paradigm in games and Go, aim to enhance LLM reasoning capabilities without human-annotated data. We propose Multi-Agent Evolve (MAE), a framework that enables LLMs to self-evolve in solving diverse tasks, including mathematics, reasoning, and general knowledge Q&A.
arXiv Detail & Related papers (2025-10-27T17:58:02Z)
- EvolveNav: Empowering LLM-Based Vision-Language Navigation via Self-Improving Embodied Reasoning [145.32076310071434]
We propose EvolveNav, a novel embodied reasoning paradigm that realizes adaptable and generalizable navigational reasoning. EvolveNav involves a two-stage training process: (1) Formalized CoT Supervised Fine-Tuning, where we train the model with curated formalized CoT labels to first activate the model's navigational reasoning capabilities, and simultaneously increase the reasoning speed; (2) Self-Reflective Post-Training, where the model is iteratively trained with its own reasoning outputs as self-enriched CoT labels to enhance the supervision diversity.
arXiv Detail & Related papers (2025-06-02T11:28:32Z)
- First SFT, Second RL, Third UPT: Continual Improving Multi-Modal LLM Reasoning via Unsupervised Post-Training [37.80193099472551]
We propose MM-UPT, a simple yet effective framework for unsupervised post-training of MLLMs. Our experiments demonstrate that such training method effectively improves the reasoning ability of Qwen2.5-VL-7B. We extend our framework to a data self-generation setting, designing two strategies that prompt the MLLM to synthesize new training samples.
arXiv Detail & Related papers (2025-05-28T15:11:16Z)
- From Problem-Solving to Teaching Problem-Solving: Aligning LLMs with Pedagogy using Reinforcement Learning [82.50157695987558]
Large language models (LLMs) can transform education, but their optimization for direct question-answering often undermines effective pedagogy. We propose an online reinforcement learning (RL)-based alignment framework that can quickly adapt LLMs into effective tutors.
arXiv Detail & Related papers (2025-05-21T15:00:07Z)
- Towards Robust Knowledge Unlearning: An Adversarial Framework for Assessing and Improving Unlearning Robustness in Large Language Models [19.015202590038996]
We design Dynamic Unlearning Attack (DUA), a dynamic and automated framework to attack unlearned models.
We propose Latent Adversarial Unlearning (LAU), a universal framework that effectively enhances the robustness of the unlearned process.
We demonstrate that LAU improves unlearning effectiveness by over 53.5%, causes less than an 11.6% reduction in neighboring knowledge, and has almost no impact on the model's general capabilities.
arXiv Detail & Related papers (2024-08-20T09:36:04Z)
- MoExtend: Tuning New Experts for Modality and Task Extension [61.29100693866109]
MoExtend is an effective framework designed to streamline the modality adaptation and extension of Mixture-of-Experts (MoE) models.
MoExtend seamlessly integrates new experts into pre-trained MoE models, endowing them with novel knowledge without the need to tune pretrained models.
arXiv Detail & Related papers (2024-08-07T02:28:37Z)
- Arena Learning: Build Data Flywheel for LLMs Post-training via Simulated Chatbot Arena [126.70522244144088]
We introduce Arena Learning, an innovative offline strategy designed to simulate arena battles using AI-driven annotations.
Arena Learning ensures precise evaluations and maintains consistency between offline simulations and online competitions.
We apply Arena Learning to train our target model, WizardLM-$\beta$, and demonstrate significant performance enhancements.
arXiv Detail & Related papers (2024-07-15T11:26:07Z)
- Two-Step Offline Preference-Based Reinforcement Learning with Constrained Actions [38.48223545539604]
We develop a novel two-step learning method called PRC: preference-based reinforcement learning with constrained actions.
We empirically verify that our method has high learning efficiency on various datasets in robotic control environments.
arXiv Detail & Related papers (2023-12-30T21:37:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.