Towards Generalizable Reasoning: Group Causal Counterfactual Policy Optimization for LLM Reasoning
- URL: http://arxiv.org/abs/2602.06475v1
- Date: Fri, 06 Feb 2026 08:03:11 GMT
- Title: Towards Generalizable Reasoning: Group Causal Counterfactual Policy Optimization for LLM Reasoning
- Authors: Jingyao Wang, Peizheng Guo, Wenwen Qiang, Jiahuan Zhou, Huijie Guo, Changwen Zheng, Hui Xiong
- Abstract summary: Large language models (LLMs) excel at complex tasks with advances in reasoning capabilities. We propose Group Causal Counterfactual Policy Optimization to explicitly train LLMs to learn generalizable reasoning patterns. We then construct token-level advantages from this reward and optimize the policy, encouraging LLMs to favor reasoning patterns that are process-valid and counterfactually robust.
- Score: 50.352417879912515
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) excel at complex tasks thanks to advances in reasoning capabilities. However, existing reward mechanisms remain tightly coupled to final correctness and pay little attention to the underlying reasoning process: trajectories with sound reasoning but wrong answers receive low credit, while lucky guesses with flawed logic may be highly rewarded, which harms reasoning generalization. From a causal perspective, we interpret multi-candidate reasoning for a fixed question as a family of counterfactual experiments with theoretical support. Building on this, we propose Group Causal Counterfactual Policy Optimization to explicitly train LLMs to learn generalizable reasoning patterns. It introduces an episodic causal counterfactual reward that jointly captures (i) robustness, encouraging the answer distribution induced by a reasoning step to remain stable under counterfactual perturbations; and (ii) effectiveness, enforcing sufficient variability so that the learned reasoning strategy can transfer across questions. We then construct token-level advantages from this reward and optimize the policy, encouraging LLMs to favor reasoning patterns that are process-valid and counterfactually robust. Extensive experiments on diverse benchmarks demonstrate its advantages.
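To ground the reward construction described above, here is a minimal Python sketch of how an episodic counterfactual reward with robustness and effectiveness terms could be turned into GRPO-style token-level advantages. The abstract gives no formulas, so the specific robustness/effectiveness proxies, the weight `lam`, and all function names below are illustrative assumptions, not the paper's actual method.

```python
import numpy as np

def counterfactual_reward(answer_dists, correctness, lam=0.5):
    """Illustrative episodic reward for one question with G trajectories.

    answer_dists: (G, K, A) array; answer_dists[g, k] is the answer
                  distribution trajectory g induces under counterfactual
                  perturbation k (K re-samples per trajectory).
    correctness:  (G,) array of 0/1 outcome rewards.
    lam:          assumed trade-off weight; the abstract gives no value.
    """
    traj_mean = answer_dists.mean(axis=1)                          # (G, A)
    # Robustness: each trajectory's answer distribution should stay
    # stable across its own K counterfactual perturbations.
    robustness = -np.abs(answer_dists - traj_mean[:, None, :]).sum(-1).mean(-1)
    # Effectiveness: the group should retain variability, so reward
    # trajectories whose induced distribution differs from the group mean.
    group_mean = traj_mean.mean(axis=0)                            # (A,)
    effectiveness = np.abs(traj_mean - group_mean).sum(-1)         # (G,)
    return correctness + lam * (robustness + effectiveness)

def token_level_advantages(rewards, token_counts):
    """GRPO-style group-normalised advantage, broadcast to every token."""
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    return [np.full(n, a) for a, n in zip(token_counts, adv)]

# Toy usage: G=3 trajectories, K=2 perturbations each, A=3 answers.
rng = np.random.default_rng(0)
dists = rng.dirichlet(np.ones(3), size=(3, 2))                     # (3, 2, 3)
rewards = counterfactual_reward(dists, correctness=np.array([1.0, 0.0, 1.0]))
advantages = token_level_advantages(rewards, token_counts=[12, 9, 15])
```

Note the design choice in this reading: robustness is measured within each trajectory (stability across its counterfactual re-samples), while effectiveness is measured across the group (diversity of induced distributions), mirroring the stability/variability tension the abstract describes.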
Related papers
- CausalFlip: A Benchmark for LLM Causal Judgment Beyond Semantic Matching [50.65932158912512]
We propose a new causal reasoning benchmark, CausalFlip, to encourage the development of new large language models. CausalFlip consists of causal judgment questions built over event triples that could form different confounder, chain, and collider relations. We evaluate LLMs under multiple training paradigms, including answer-only training, explicit Chain-of-Thought supervision, and a proposed internalized causal reasoning approach.
arXiv Detail & Related papers (2026-02-23T18:06:15Z)
- Reinforced Efficient Reasoning via Semantically Diverse Exploration [73.41112984160992]
Reinforcement learning with verifiable rewards (RLVR) has proven effective in enhancing the reasoning of large language models (LLMs). We propose reinforced efficient reasoning via semantically diverse exploration, i.e., ROSE, for LLMs. Our method incorporates a semantic-entropy-based branching strategy and an $\varepsilon$-exploration mechanism.
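A minimal sketch of what the branching strategy and $\varepsilon$-exploration might look like; the clustering function, entropy threshold, and selection convention are all assumptions, since the summary does not specify them.

```python
import math
import random
from collections import Counter

def semantic_entropy(continuations, cluster_fn):
    """Entropy over semantic clusters of sampled continuations; cluster_fn
    maps a continuation to a cluster id (e.g., via bidirectional entailment
    in the semantic-entropy literature; ROSE's exact choice is unspecified)."""
    counts = Counter(cluster_fn(c) for c in continuations)
    n = sum(counts.values())
    return -sum((k / n) * math.log(k / n) for k in counts.values())

def select_branches(continuations, cluster_fn, threshold=0.5, eps=0.1):
    """With probability eps, explore a random continuation; otherwise
    branch on one representative per semantic cluster only when the
    model is semantically uncertain."""
    if random.random() < eps:
        return [random.choice(continuations)]    # epsilon-exploration
    if semantic_entropy(continuations, cluster_fn) < threshold:
        return [continuations[0]]                # confident: single path
    reps = {}
    for c in continuations:
        reps.setdefault(cluster_fn(c), c)        # one rep per meaning
    return list(reps.values())

# Toy usage: cluster by the final answer token of each continuation.
conts = ["... so x = 4", "... hence x = 4", "... therefore x = 7"]
print(select_branches(conts, cluster_fn=lambda s: s.split()[-1]))
```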
arXiv Detail & Related papers (2026-01-08T15:56:44Z)
- Generative Adversarial Reasoner: Enhancing LLM Reasoning with Adversarial Reinforcement Learning [19.473649388687484]
Large language models (LLMs) with explicit reasoning capabilities excel at mathematical reasoning yet still commit process errors. We introduce Generative Adversarial Reasoner, an on-policy joint training framework designed to enhance reasoning. A compute-efficient review schedule partitions each reasoning chain into logically complete slices of comparable length, and the discriminator evaluates each slice's soundness with structured justifications.
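To make the review schedule concrete, here is a minimal sketch of partitioning a chain into contiguous slices of comparable length while cutting only at step boundaries, so each slice stays logically complete. The boundary heuristic (blank lines or `Step k:` markers) and the target length are assumptions; the summary does not specify them.

```python
import re

def slice_reasoning_chain(chain: str, target_len: int = 400) -> list[str]:
    """Split a chain-of-thought at step boundaries, then greedily merge
    steps into contiguous slices of roughly target_len characters."""
    steps = [s.strip() for s in re.split(r"\n\s*\n|(?=Step \d+:)", chain)
             if s.strip()]
    slices, current = [], ""
    for step in steps:
        if current and len(current) + len(step) > target_len:
            slices.append(current)   # close the slice at a step boundary
            current = step
        else:
            current = f"{current}\n{step}" if current else step
    if current:
        slices.append(current)
    return slices

chain = "Step 1: expand the square.\nStep 2: collect terms.\nStep 3: solve for x."
print(slice_reasoning_chain(chain, target_len=40))
```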
arXiv Detail & Related papers (2025-12-18T18:59:54Z)
- RAVR: Reference-Answer-guided Variational Reasoning for Large Language Models [21.671577399379885]
We introduce RAVR, an end-to-end framework that uses answer-conditioned reasoning as a variational surrogate for question-only reasoning. RAVR reduces hesitation, strengthens conclusion consolidation, and promotes problem-specific strategies in reasoning.
arXiv Detail & Related papers (2025-10-29T06:18:37Z)
- Latent Chain-of-Thought for Visual Reasoning [53.541579327424046]
Chain-of-thought (CoT) reasoning is critical for improving the interpretability and reliability of Large Vision-Language Models (LVLMs). We reformulate reasoning in LVLMs as posterior inference and propose a scalable training algorithm based on amortized variational inference. We empirically demonstrate that the proposed method enhances the state-of-the-art LVLMs on seven reasoning benchmarks.
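The posterior-inference view the summary alludes to is usually trained by maximizing an evidence lower bound with the reasoning chain treated as a latent variable. In textbook form (the generic amortized-VI objective, not necessarily the paper's exact loss):

```latex
\log p_\theta(y \mid x) \;\ge\;
\mathbb{E}_{q_\phi(z \mid x, y)}\!\bigl[\log p_\theta(y \mid x, z)\bigr]
\;-\; \mathrm{KL}\!\bigl(q_\phi(z \mid x, y) \,\|\, p_\theta(z \mid x)\bigr)
```

Here $x$ is the (visual) input, $y$ the answer, $z$ the latent chain-of-thought, and $q_\phi$ an amortized posterior network; training maximizes this bound in place of the intractable marginal likelihood.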
arXiv Detail & Related papers (2025-10-27T23:10:06Z)
- PixelThink: Towards Efficient Chain-of-Pixel Reasoning [70.32510083790069]
PixelThink is a simple yet effective scheme that integrates externally estimated task difficulty and internally measured model uncertainty. It learns to compress reasoning length in accordance with scene complexity and predictive confidence. Experimental results demonstrate that the proposed approach improves both reasoning efficiency and overall segmentation performance.
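As a concrete reading of this idea, the two signals can be mapped to a token budget for the reasoning chain. The interpolation, equal weighting, and bounds below are illustrative assumptions; the summary only states that length should grow with scene complexity and shrink with predictive confidence.

```python
def reasoning_budget(difficulty: float, uncertainty: float,
                     min_tokens: int = 32, max_tokens: int = 512) -> int:
    """Map externally estimated task difficulty and internally measured
    model uncertainty (both assumed normalised to [0, 1]) to a token
    budget for the reasoning chain. Equal weights are an assumption."""
    score = 0.5 * difficulty + 0.5 * uncertainty
    return int(min_tokens + score * (max_tokens - min_tokens))

# Easy, confident scene -> short chain; hard, uncertain scene -> long chain.
print(reasoning_budget(0.1, 0.2))   # 104 tokens
print(reasoning_budget(0.9, 0.8))   # 440 tokens
```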
arXiv Detail & Related papers (2025-05-29T17:55:49Z)
- Make LLMs better zero-shot reasoners: Structure-orientated autonomous reasoning [52.83539473110143]
We introduce a novel structure-oriented analysis method to help Large Language Models (LLMs) better understand a question.
To further improve the reliability in complex question-answering tasks, we propose a multi-agent reasoning system, Structure-oriented Autonomous Reasoning Agents (SARA).
Extensive experiments verify the effectiveness of the proposed reasoning system. Surprisingly, in some cases, the system even surpasses few-shot methods.
arXiv Detail & Related papers (2024-10-18T05:30:33Z)
- CSCE: Boosting LLM Reasoning by Simultaneous Enhancing of Causal Significance and Consistency [11.144164626192904]
Chain-based methods like chain of thought (CoT) play a rising role in solving reasoning tasks for large language models (LLMs). This paper proposes a non-chain-based reasoning framework that considers causal significance and consistency simultaneously.
arXiv Detail & Related papers (2024-09-20T08:28:23Z)
- Aggregation of Reasoning: A Hierarchical Framework for Enhancing Answer Selection in Large Language Models [84.15513004135576]
Current research enhances the reasoning performance of Large Language Models (LLMs) by sampling multiple reasoning chains and ensembling based on the answer frequency.
This approach fails in scenarios where the correct answers are in the minority.
We introduce a hierarchical reasoning aggregation framework AoR, which selects answers based on the evaluation of reasoning chains.
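The failure mode in the minority-answer scenario, and the AoR-style fix, can be sketched in a few lines. Here `chain_score` is a stand-in for AoR's LLM-based evaluation of each reasoning chain, and the flat selection omits AoR's hierarchical local/global aggregation phases:

```python
from collections import defaultdict

def majority_vote(samples):
    """Frequency-based ensembling: fails when the correct answer is rare."""
    counts = defaultdict(int)
    for answer, _chain_score in samples:
        counts[answer] += 1
    return max(counts, key=counts.get)

def aor_style_select(samples):
    """Select by reasoning-chain quality instead of answer frequency
    (illustrative stand-in for AoR's hierarchical evaluation)."""
    best = defaultdict(float)
    for answer, chain_score in samples:
        best[answer] = max(best[answer], chain_score)
    return max(best, key=best.get)

# Three flawed chains reach "42"; one sound chain reaches "17".
samples = [("42", 0.3), ("42", 0.4), ("42", 0.2), ("17", 0.9)]
print(majority_vote(samples))     # 42 -- frequency wins
print(aor_style_select(samples))  # 17 -- reasoning quality wins
```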
arXiv Detail & Related papers (2024-05-21T17:12:19Z)
- Self-Contradictory Reasoning Evaluation and Detection [31.452161594896978]
We investigate self-contradictory (Self-Contra) reasoning, where the model reasoning does not support its answers.
We find that LLMs often contradict themselves in reasoning tasks involving contextual information understanding or commonsense.
We find that GPT-4 can detect Self-Contra with a 52.2% F1 score, far below the 66.7% achieved by humans.
arXiv Detail & Related papers (2023-11-16T06:22:17Z)