Can One Domain Help Others? A Data-Centric Study on Multi-Domain Reasoning via Reinforcement Learning
- URL: http://arxiv.org/abs/2507.17512v1
- Date: Wed, 23 Jul 2025 13:51:04 GMT
- Title: Can One Domain Help Others? A Data-Centric Study on Multi-Domain Reasoning via Reinforcement Learning
- Authors: Yu Li, Zhuoshi Pan, Honglin Lin, Mengyuan Sun, Conghui He, Lijun Wu,
- Abstract summary: We present a systematic investigation of multi-domain reasoning within the RLVR framework. We focus on three primary domains: mathematical reasoning, code generation, and logical puzzle solving. Our results offer significant insights into the dynamics governing domain interactions.
- Score: 32.665418383317224
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful paradigm for enhancing the reasoning capabilities of LLMs. Existing research has predominantly concentrated on isolated reasoning domains such as mathematical problem-solving, coding tasks, or logical reasoning. However, real-world reasoning scenarios inherently demand an integrated application of multiple cognitive skills. Despite this, the interplay among these reasoning skills under reinforcement learning remains poorly understood. To bridge this gap, we present a systematic investigation of multi-domain reasoning within the RLVR framework, explicitly focusing on three primary domains: mathematical reasoning, code generation, and logical puzzle solving. We conduct a comprehensive study comprising four key components: (1) Leveraging the GRPO algorithm and the Qwen-2.5-7B model family, our study thoroughly evaluates the models' in-domain improvements and cross-domain generalization capabilities when trained on single-domain datasets. (2) Additionally, we examine the intricate interactions, including mutual enhancements and conflicts, that emerge during combined cross-domain training. (3) To further understand the influence of SFT on RL, we also analyze and compare performance differences between base and instruct models under identical RL configurations. (4) Furthermore, we delve into critical RL training details, systematically exploring the impacts of curriculum learning strategies, variations in reward design, and language-specific factors. Through extensive experiments, our results offer significant insights into the dynamics governing domain interactions, revealing key factors influencing both specialized and generalizable reasoning performance. These findings provide valuable guidance for optimizing RL methodologies to foster comprehensive, multi-domain reasoning capabilities in LLMs.
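As a rough illustration of the two ingredients the abstract names, the sketch below pairs a rule-based verifiable reward with GRPO's group-relative advantage over several sampled completions for one prompt. The exact-match answer check and the "Answer:" marker convention are assumptions made purely for illustration; the paper's actual reward design and GRPO implementation may differ.

```python
# Minimal sketch (assumptions noted above): a verifiable reward plus
# GRPO-style group-relative advantages, not the paper's exact setup.
from statistics import mean, pstdev


def verifiable_reward(completion: str, reference_answer: str) -> float:
    """Return 1.0 if the completion's final answer matches the reference, else 0.0."""
    # Hypothetical convention: the final answer follows an "Answer:" marker.
    answer = completion.rsplit("Answer:", 1)[-1].strip()
    return 1.0 if answer == reference_answer.strip() else 0.0


def grpo_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Group-relative advantages: (r - mean) / (std + eps) within one prompt's group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    completions = [
        "Reasoning ... Answer: 42",
        "Reasoning ... Answer: 41",
        "Reasoning ... Answer: 42",
        "Reasoning ... Answer: 7",
    ]
    rewards = [verifiable_reward(c, "42") for c in completions]
    print(rewards)                   # [1.0, 0.0, 1.0, 0.0]
    print(grpo_advantages(rewards))  # positive for correct samples, negative otherwise
```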
Related papers
- RL-PLUS: Countering Capability Boundary Collapse of LLMs in Reinforcement Learning with Hybrid-policy Optimization [86.30192066451256]
We propose RL-PLUS, a novel hybrid-policy optimization approach for Large Language Models (LLMs). RL-PLUS synergizes internal exploitation with external data to achieve stronger reasoning capabilities and surpass the boundaries of base models. We provide both theoretical analysis and extensive experiments to demonstrate the superiority and generalizability of our approach.
arXiv Detail & Related papers (2025-07-31T23:55:29Z)
- Reasoning Language Models for Root Cause Analysis in 5G Wireless Networks [10.074110713679739]
Root Cause Analysis (RCA) in mobile networks remains a challenging task due to the need for interpretability, domain expertise, and causal reasoning. We propose a lightweight framework that leverages Large Language Models (LLMs) for RCA.
arXiv Detail & Related papers (2025-07-29T16:21:42Z)
- Inverse Reinforcement Learning Meets Large Language Model Post-Training: Basics, Advances, and Opportunities [62.05713042908654]
This paper provides a review of advances in Large Language Model (LLM) alignment through the lens of inverse reinforcement learning (IRL). We highlight the necessity of constructing neural reward models from human data and discuss the formal and practical implications of this paradigm shift.
arXiv Detail & Related papers (2025-07-17T14:22:24Z)
- General-Reasoner: Advancing LLM Reasoning Across All Domains [64.70599911897595]
Reinforcement learning (RL) has recently demonstrated strong potential in enhancing the reasoning capabilities of large language models (LLMs). We propose General-Reasoner, a novel training paradigm designed to enhance LLM reasoning capabilities across diverse domains. We train a series of models and evaluate them on a wide range of datasets covering domains such as physics, chemistry, finance, and electronics.
arXiv Detail & Related papers (2025-05-20T17:41:33Z)
- A Survey of Frontiers in LLM Reasoning: Inference Scaling, Learning to Reason, and Agentic Systems [93.8285345915925]
Reasoning is a fundamental cognitive process that enables logical inference, problem-solving, and decision-making. With the rapid advancement of large language models (LLMs), reasoning has emerged as a key capability that distinguishes advanced AI systems. We categorize existing methods along two dimensions: (1) Regimes, which define the stage at which reasoning is achieved; and (2) Architectures, which determine the components involved in the reasoning process.
arXiv Detail & Related papers (2025-04-12T01:27:49Z)
- Crossing the Reward Bridge: Expanding RL with Verifiable Rewards Across Diverse Domains [92.36624674516553]
Reinforcement learning with verifiable rewards (RLVR) has demonstrated significant success in enhancing mathematical reasoning and coding performance of large language models (LLMs). We investigate the effectiveness and scalability of RLVR across diverse real-world domains including medicine, chemistry, psychology, economics, and education. We utilize a generative scoring technique that yields soft, model-based reward signals to overcome limitations posed by binary verifications.
arXiv Detail & Related papers (2025-03-31T08:22:49Z)
- R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning [87.30285670315334]
R1-Searcher is a novel two-stage outcome-based RL approach designed to enhance the search capabilities of Large Language Models. Our framework relies exclusively on RL, without requiring process rewards or distillation for a cold start. Our experiments demonstrate that our method significantly outperforms previous strong RAG methods, even when compared to the closed-source GPT-4o-mini.
arXiv Detail & Related papers (2025-03-07T17:14:44Z)
- Large Language Model Enhanced Knowledge Representation Learning: A Survey [15.602891714371342]
Knowledge Representation Learning (KRL) is crucial for enabling applications of symbolic knowledge from Knowledge Graphs to downstream tasks. This work provides a broad overview of downstream tasks while simultaneously identifying emerging research directions in these evolving domains.
arXiv Detail & Related papers (2024-07-01T03:37:35Z)
- Learning by Doing: An Online Causal Reinforcement Learning Framework with Causal-Aware Policy [38.86867078596718]
We consider explicitly modeling the generation process of states with a graphical causal model. We formulate causal structure updating within the RL interaction process through active intervention learning of the environment.
arXiv Detail & Related papers (2024-02-07T14:09:34Z)
- Structure in Deep Reinforcement Learning: A Survey and Open Problems [22.77618616444693]
Reinforcement Learning (RL), bolstered by the expressive capabilities of Deep Neural Networks (DNNs) for function approximation, has demonstrated considerable success in numerous applications.
However, its practicality in addressing various real-world scenarios, characterized by diverse and unpredictable dynamics, remains limited.
This limitation stems from poor data efficiency, limited generalization capabilities, a lack of safety guarantees, and the absence of interpretability.
arXiv Detail & Related papers (2023-06-28T08:48:40Z)
- Generalizing Goal-Conditioned Reinforcement Learning with Variational Causal Reasoning [24.09547181095033]
A Causal Graph is a structure built upon the relations between objects and events.
We propose a framework with theoretical performance guarantees that alternates between two steps.
Our performance improvement is attributed to the virtuous cycle of causal discovery, transition modeling, and policy training.
arXiv Detail & Related papers (2022-07-19T05:31:16Z)