Parallel-R1: Towards Parallel Thinking via Reinforcement Learning
- URL: http://arxiv.org/abs/2509.07980v2
- Date: Fri, 12 Sep 2025 17:15:56 GMT
- Title: Parallel-R1: Towards Parallel Thinking via Reinforcement Learning
- Authors: Tong Zheng, Hongming Zhang, Wenhao Yu, Xiaoyang Wang, Runpeng Dai, Rui Liu, Huiwen Bao, Chengsong Huang, Heng Huang, Dong Yu,
- Abstract summary: Parallel thinking is a novel approach for enhancing the reasoning capabilities of large language models.<n>We propose textbfParallel-R1, the first reinforcement learning framework that enables parallel thinking behaviors.<n>Our framework employs a progressive curriculum that explicitly addresses the cold-start problem in training parallel thinking.
- Score: 65.68667585027232
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Parallel thinking has emerged as a novel approach for enhancing the reasoning capabilities of large language models (LLMs) by exploring multiple reasoning paths concurrently. However, activating such capabilities through training remains challenging, as existing methods predominantly rely on supervised fine-tuning (SFT) over synthetic data, which encourages teacher-forced imitation rather than exploration and generalization. Different from them, we propose \textbf{Parallel-R1}, the first reinforcement learning (RL) framework that enables parallel thinking behaviors for complex real-world reasoning tasks. Our framework employs a progressive curriculum that explicitly addresses the cold-start problem in training parallel thinking with RL. We first use SFT on prompt-generated trajectories from easier tasks to instill the parallel thinking ability, then transition to RL to explore and generalize this skill on harder problems. Experiments on various math benchmarks, including MATH, AMC23, and AIME, show that Parallel-R1 successfully instills parallel thinking, leading to 8.4% accuracy improvements over the sequential thinking model trained directly on challenging tasks with RL. Further analysis reveals a clear shift in the model's thinking behavior: at an early stage, it uses parallel thinking as an exploration strategy, while in a later stage, it uses the same capability for multi-perspective verification. Most significantly, we validate parallel thinking as a \textbf{mid-training exploration scaffold}, where this temporary exploratory phase unlocks a higher performance ceiling after RL, yielding a 42.9% improvement over the baseline on AIME25. Our model, data, and code will be open-source at https://github.com/zhengkid/Parallel-R1.
Related papers
- On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models [73.10315509190623]
Recent reinforcement learning techniques have yielded impressive reasoning improvements in language models.<n>It remains unclear whether post-training truly extends a model's reasoning ability beyond what it acquires during pre-training.<n>We develop a fully controlled experimental framework that isolates the causal contributions of pre-training, mid-training, and RL-based post-training.
arXiv Detail & Related papers (2025-12-08T18:12:10Z) - Native Parallel Reasoner: Reasoning in Parallelism via Self-Distilled Reinforcement Learning [68.9332598692234]
We introduce Native Parallel Reasoner (NPR), a teacher-free framework that enables Large Language Models (LLMs) to self-evolve genuine parallel reasoning capabilities.<n>NPR transforms the model from sequential emulation to native parallel cognition through three key innovations.
arXiv Detail & Related papers (2025-12-08T11:39:43Z) - A Survey on Parallel Reasoning [58.66122129692264]
We first present a formal definition of parallel reasoning and clarify its distinction from related concepts like Chain-of-Thought.<n>We then organize and discuss advanced techniques based on a novel taxonomy, including non-interactive reasoning, interactive reasoning, and efficiency-focused decoding strategies.<n>We highlight the core challenges of parallel reasoning and suggest potential directions for future research.
arXiv Detail & Related papers (2025-10-14T05:42:19Z) - Scaling up Multi-Turn Off-Policy RL and Multi-Agent Tree Search for LLM Step-Provers [16.135928990655422]
This paper introduces textttBFS-Prover-V2, a system designed to address the dual scaling problem.<n>The first is a novel multi-turn off-policy RL framework for continually improving the performance of LLM step-prover at training time.<n>The second innovation is a planner-enhanced multi-agent search architecture that scales reasoning capabilities at inference time.
arXiv Detail & Related papers (2025-09-08T09:54:18Z) - ParaThinker: Native Parallel Thinking as a New Paradigm to Scale LLM Test-time Compute [32.915370020808105]
ParaThinker is an end-to-end framework that trains an LLM to generate multiple, diverse reasoning paths in parallel.<n>It effectively sidesteps the Tunnel Vision issue and unlocks the model's latent reasoning potential.<n>On challenging reasoning benchmarks, ParaThinker achieves substantial accuracy improvements over sequential LLMs.
arXiv Detail & Related papers (2025-08-30T03:09:07Z) - History Rhymes: Accelerating LLM Reinforcement Learning with RhymeRL [14.506189610798929]
Reinforcement learning (RL) has emerged as a pivotal methodology for enhancing the reasoning capabilities of large language models (LLMs)<n>We introduce RhymeRL, an LLM RL system designed to accelerate RL training with two key innovations.<n>First, to enhance rollout generation, we present HistoSpec, a speculative decoding inference engine.<n>Second, to tackle rollout bubbles, we introduce HistoPipe, a two-tier scheduling strategy.
arXiv Detail & Related papers (2025-08-26T01:42:46Z) - Beyond Accuracy: Dissecting Mathematical Reasoning for LLMs Under Reinforcement Learning [93.00629872970364]
Reinforcement learning (RL) has become the dominant paradigm for improving the performance of language models on complex reasoning tasks.<n>We introduce SPARKLE, a fine-grained analytic framework to dissect the effects of RL across three key dimensions.<n>We study whether difficult problems -- those yielding no RL signals and mixed-quality reasoning traces -- can still be effectively used for training.
arXiv Detail & Related papers (2025-06-05T07:53:59Z) - Advancing Multimodal Reasoning: From Optimized Cold Start to Staged Reinforcement Learning [28.92744927199283]
ReVisual-R1 achieves a new state-of-the-art among open-source 7B MLLMs on challenging benchmarks including MathVerse, MathVision, WeMath, LogicVista, DynaMath, and challenging AIME2024 and AIME2025.
arXiv Detail & Related papers (2025-06-04T17:51:08Z) - AceReason-Nemotron: Advancing Math and Code Reasoning through Reinforcement Learning [50.02117478165099]
We show that large-scale reinforcement learning can significantly enhance the reasoning capabilities of strong, small- and mid-sized models.<n>We propose a simple yet effective approach: first training on math-only prompts, then on code-only prompts.
arXiv Detail & Related papers (2025-05-22T08:50:47Z) - To Backtrack or Not to Backtrack: When Sequential Search Limits Model Reasoning [31.21491548356213]
Backtracking naturally scales test-time compute by enabling sequential, linearized exploration via long chain-of-thought (CoT) generation.<n>Despite the growing adoption of sequential search, its advantages over parallel sampling remain poorly understood.<n>We show that models with backtracking capabilities benefit significantly from RL fine-tuning, while models without backtracking see limited, mixed gains.
arXiv Detail & Related papers (2025-04-09T17:12:49Z) - Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 [53.894789613838654]
We introduce SEED-Bench-R1, a benchmark designed to evaluate post-training methods for MLLMs in video understanding.<n>It includes intricate real-world videos and complex everyday planning tasks in the format of multiple-choice questions.<n>Using Qwen2-VL-Instruct-7B as a base model, we compare RL with supervised fine-tuning (SFT)<n>Our detailed analysis reveals that RL enhances visual perception but often produces less coherent reasoning chains.
arXiv Detail & Related papers (2025-03-31T17:55:23Z) - OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles [91.88062410741833]
We introduce OpenVLThinker, one of the first open-source large vision-language models (LVLMs) to exhibit sophisticated chain-of-thought reasoning.<n>We show that OpenVLThinker-7B consistently advances performance across six benchmarks demanding mathematical and general reasoning.
arXiv Detail & Related papers (2025-03-21T17:52:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.