Generalization of RLVR Using Causal Reasoning as a Testbed
- URL: http://arxiv.org/abs/2512.20760v1
- Date: Tue, 23 Dec 2025 20:45:31 GMT
- Title: Generalization of RLVR Using Causal Reasoning as a Testbed
- Authors: Brian Lu, Hongyu Zhao, Shuo Sun, Hao Peng, Rui Ding, Hongyuan Mei
- Abstract summary: Reinforcement learning with verifiable rewards (RLVR) has emerged as a promising paradigm for post-training large language models. This paper provides an empirical study of RLVR generalization in the setting of probabilistic inference over causal models.
- Score: 20.97376329817835
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement learning with verifiable rewards (RLVR) has emerged as a promising paradigm for post-training large language models (LLMs) on complex reasoning tasks. Yet, the conditions under which RLVR yields robust generalization remain poorly understood. This paper provides an empirical study of RLVR generalization in the setting of probabilistic inference over causal graphical models. This setting offers two natural axes along which to examine generalization: (i) the level of the probabilistic query -- associational, interventional, or counterfactual -- and (ii) the structural complexity of the query, measured by the size of its relevant subgraph. We construct datasets of causal graphs and queries spanning these difficulty axes and fine-tune Qwen-2.5-Instruct models using RLVR or supervised fine-tuning (SFT). We vary both the model scale (3B-32B) and the query level included in training. We find that RLVR yields stronger within-level and across-level generalization than SFT, but only for specific combinations of model size and training query level. Further analysis shows that RLVR's effectiveness depends on the model's initial reasoning competence. With sufficient initial competence, RLVR improves an LLM's marginalization strategy and reduces errors in intermediate probability calculations, producing substantial accuracy gains, particularly on more complex queries. These findings show that RLVR can improve specific causal reasoning subskills, with its benefits emerging only when the model has sufficient initial competence.
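To make the three query levels concrete, here is a minimal sketch in Python that answers one associational, one interventional, and one counterfactual query by exact enumeration over the noise variables of a toy structural causal model. The graph (Z -> X, Z -> Y, X -> Y), the structural equations, the noise probabilities, and the names `simulate`, `worlds`, `query`, and `verifiable_reward` are all illustrative assumptions; they are not taken from the paper's datasets, and the tolerance-based reward is only one plausible form of a verifiable reward.

```python
from itertools import product

# Hypothetical noise probabilities P(U=1) for a toy SCM: Z -> X, Z -> Y, X -> Y.
P_NOISE = {"uz": 0.5, "ux": 0.2, "uy": 0.1}

def simulate(u, do_x=None):
    """Evaluate the (assumed) structural equations for one noise setting."""
    z = u["uz"]
    x = (z ^ u["ux"]) if do_x is None else do_x  # do(X=x) overrides X's equation
    y = (x & z) ^ u["uy"]
    return {"Z": z, "X": x, "Y": y}

def worlds():
    """Yield every noise assignment together with its probability."""
    names = list(P_NOISE)
    for bits in product([0, 1], repeat=len(names)):
        u = dict(zip(names, bits))
        p = 1.0
        for name, bit in u.items():
            p *= P_NOISE[name] if bit else 1 - P_NOISE[name]
        yield u, p

def query(target, evidence=None, do_x=None):
    """P(target in the possibly intervened world | evidence in the factual world)."""
    evidence = evidence or {}
    num = den = 0.0
    for u, p in worlds():
        factual = simulate(u)
        if any(factual[k] != v for k, v in evidence.items()):
            continue  # this noise setting is inconsistent with the evidence
        den += p
        outcome = simulate(u, do_x=do_x)
        if all(outcome[k] == v for k, v in target.items()):
            num += p
    return num / den

def verifiable_reward(model_answer: str, truth: float, tol: float = 0.01) -> float:
    """Binary reward: 1.0 if the parsed answer is within tol of the truth (assumed form)."""
    try:
        return 1.0 if abs(float(model_answer) - truth) <= tol else 0.0
    except ValueError:
        return 0.0  # unparseable answers earn no reward

if __name__ == "__main__":
    print(round(query({"Y": 1}, evidence={"X": 1}), 4))                  # associational: 0.74
    print(round(query({"Y": 1}, do_x=1), 4))                             # interventional: 0.5
    print(round(query({"Y": 1}, evidence={"X": 0, "Y": 0}, do_x=1), 4))  # counterfactual: 0.2
```

In this toy model the associational and interventional answers differ (0.74 versus 0.5) because Z confounds X and Y, and the counterfactual query additionally conditions the noise on the observed factual outcome before applying the intervention. Brute-force enumeration is only feasible for very small graphs; the marginalization strategies discussed in the abstract matter precisely because the relevant subgraph grows with query complexity.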
Related papers
- New Skills or Sharper Primitives? A Probabilistic Perspective on the Emergence of Reasoning in RLVR [44.98294610511283]
We propose a probabilistic framework where capability is defined by instance-level solvability. We train models exclusively on single-step operations and evaluate their performance on unseen multi-step tasks. Our work offers a novel explanation for emergent abilities in RLVR, suggesting that the iterative optimization of solvable problems enables models to develop the capabilities to tackle previously unsolvable scenarios.
Date: 2026-02-09T05:23:13Z
- Adaptive Ability Decomposing for Unlocking Large Reasoning Model Effective Reinforcement Learning [82.91265691530351]
A$^2$D is an Adaptive Ability Decomposing method for enhancing the effectiveness of Reinforcement Learning with verifiable rewards. We first train a decomposer via RLVR without distillation, enabling it to decompose complex questions into a set of simpler sub-questions. Next, we use this decomposer to annotate sub-questions for each question in the training dataset, and then train the reasoner under RLVR with sub-question guidance.
Date: 2026-01-31T14:48:23Z
- Efficient Reinforcement Learning for Large Language Models with Intrinsic Exploration [33.02780998281276]
Reinforcement learning with verifiable rewards (RLVR) has improved the reasoning ability of large language models. This study investigates how simply leveraging intrinsic data properties, an almost free benefit during training, can improve data efficiency for RLVR.
Date: 2025-11-02T04:16:47Z
- Limits of Generalization in RLVR: Two Case Studies in Mathematical Reasoning [3.437656066916039]
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a promising approach for enhancing such capabilities. We investigate RLVR on two problems with fully verifiable solutions. We find that RLVR improves evaluation metrics but often by reinforcing superficial heuristics rather than acquiring new reasoning strategies.
Date: 2025-10-30T23:16:02Z
- Making Mathematical Reasoning Adaptive [61.45161826629692]
We propose the AdaR framework to enable adaptive reasoning in large language models (LLMs). AdaR synthesizes logically equivalent queries by varying variable values, and trains models with RLVR on these data to penalize spurious logic. Experimental results demonstrate that AdaR improves robustness and generalization, achieving substantial improvement in mathematical reasoning.
Date: 2025-10-06T09:30:05Z
- RL-PLUS: Countering Capability Boundary Collapse of LLMs in Reinforcement Learning with Hybrid-policy Optimization [111.1749164063616]
We propose RL-PLUS, a novel hybrid-policy optimization approach for Large Language Models (LLMs). RL-PLUS synergizes internal exploitation with external data to achieve stronger reasoning capabilities and surpass the boundaries of base models. We provide both theoretical analysis and extensive experiments to demonstrate the superiority and generalizability of our approach.
Date: 2025-07-31T23:55:29Z
- RLPR: Extrapolating RLVR to General Domains without Verifiers [103.14103272635893]
We propose RLPR, a simple verifier-free framework that extrapolates RLVR to broader general domains. We find that addressing the high variance of this noisy probability reward is crucial to make it work. RLPR consistently improves reasoning capabilities in both areas for Gemma, Llama, and Qwen based models.
Date: 2025-06-23T02:56:36Z
- Reshaping Reasoning in LLMs: A Theoretical Analysis of RL Training Dynamics through Pattern Selection [35.268183415853976]
We provide an explanation of the RL training process through empirical analysis and rigorous theoretical modeling. We develop a theoretical framework to understand the training dynamics of RL with two typical rewards: verifiable reward (RLVR) and the model's internal feedback (RLIF).
Date: 2025-06-05T07:17:04Z
- Table-R1: Inference-Time Scaling for Table Reasoning [56.812846737424245]
We develop and evaluate two post-training strategies to enable inference-time scaling. For distillation, we introduce a large-scale dataset of reasoning traces generated by DeepSeek-R1. For RLVR, we propose task-specific verifiable reward functions and apply the GRPO algorithm to obtain the Table-R1-Zero model.
Date: 2025-05-29T16:28:50Z
- Structured Thinking Matters: Improving LLMs Generalization in Causal Inference Tasks [0.7988085110283119]
Recent results from the Corr2Cause dataset benchmark reveal that state-of-the-art LLMs only marginally outperform random baselines. We provide the model with the capability to structure its thinking by guiding the model to build a structured knowledge graph. Experiments on the test subset of the Corr2Cause dataset benchmark with the Qwen3-32B model (a reasoning model) show substantial gains over standard direct prompting methods.
Date: 2025-05-23T15:37:40Z
- Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? [66.61292196146016]
Reinforcement Learning with Verifiable Rewards (RLVR) has recently demonstrated notable success in enhancing the reasoning performance of large language models (LLMs). This study critically examines the current state of RLVR. We find that the current training setup does not elicit fundamentally new reasoning patterns.
Date: 2025-04-18T17:59:56Z
- Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 [53.894789613838654]
We introduce SEED-Bench-R1, a benchmark designed to evaluate post-training methods for MLLMs in video understanding. It includes intricate real-world videos and complex everyday planning tasks in the format of multiple-choice questions. Using Qwen2-VL-Instruct-7B as a base model, we compare RL with supervised fine-tuning (SFT). Our detailed analysis reveals that RL enhances visual perception but often produces less coherent reasoning chains.
Date: 2025-03-31T17:55:23Z
This list is automatically generated from the titles and abstracts of the papers on this site.