Better Think Thrice: Learning to Reason Causally with Double Counterfactual Consistency
- URL: http://arxiv.org/abs/2602.16787v1
- Date: Wed, 18 Feb 2026 19:00:07 GMT
- Title: Better Think Thrice: Learning to Reason Causally with Double Counterfactual Consistency
- Authors: Victoria Lin, Xinnuo Xu, Rachel Lawrence, Risa Ueno, Amit Sharma, Javier Gonzalez, Niranjani Prasad,
- Abstract summary: We introduce double counterfactual consistency (DCC), a lightweight inference-time method for measuring and guiding the ability of large language models to reason causally. We evaluate the causal reasoning abilities of various leading LLMs across a range of reasoning tasks and interventions.
- Score: 11.717694690378686
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite their strong performance on reasoning benchmarks, large language models (LLMs) have proven brittle when presented with counterfactual questions, suggesting weaknesses in their causal reasoning ability. While recent work has demonstrated that labeled counterfactual tasks can be useful benchmarks of LLMs' causal reasoning, our capacity to produce such data at the scale required to cover the vast potential space of counterfactuals is limited. In this work, we introduce double counterfactual consistency (DCC), a lightweight inference-time method for measuring and guiding the ability of LLMs to reason causally. Without requiring labeled counterfactual data, DCC verifies a model's ability to execute two important elements of causal reasoning: causal intervention and counterfactual prediction. Using DCC, we evaluate the causal reasoning abilities of various leading LLMs across a range of reasoning tasks and interventions. Moreover, we demonstrate the effectiveness of DCC as a training-free, test-time rejection sampling criterion and show that it can directly improve performance on reasoning tasks across multiple model families.
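The abstract does not spell out the mechanics, but the two elements DCC verifies suggest a round trip: answer the question, re-answer under an intervention, then undo the intervention and check whether the original answer is recovered. Below is a minimal sketch of how such a check could drive test-time rejection sampling; the `ask` callback, the prompts, and the exact-match consistency test are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of DCC-style test-time rejection sampling.
# `ask` stands in for any LLM client: a function mapping a prompt
# string to the model's text reply.

def dcc_consistent(ask, question: str, answer: str, intervention: str) -> bool:
    """Round-trip check: intervene, predict the counterfactual, then
    undo the intervention and test whether the answer is recovered."""
    counterfactual = ask(
        f"Question: {question}\nSuppose instead that {intervention}. "
        "What is the answer now? Reply with the answer only."
    )
    recovered = ask(
        f"Question: {question}\nUnder the change '{intervention}' the "
        f"answer became '{counterfactual}'. Now undo that change and "
        "answer the original question. Reply with the answer only."
    )
    return recovered.strip().lower() == answer.strip().lower()

def dcc_rejection_sample(ask, question: str, intervention: str, n: int = 8) -> str:
    """Sample n candidate answers; return the first that passes the DCC
    check, falling back to the first candidate if none pass."""
    candidates = [ask(f"Question: {question}\nAnswer:") for _ in range(n)]
    for answer in candidates:
        if dcc_consistent(ask, question, answer, intervention):
            return answer
    return candidates[0]
```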
Related papers
- CausalFlip: A Benchmark for LLM Causal Judgment Beyond Semantic Matching [50.65932158912512]
We propose a new causal reasoning benchmark, CausalFlip, to encourage the development of new large language models. CausalFlip consists of causal judgment questions built over event triples that could form different confounder, chain, and collider relations. We evaluate LLMs under multiple training paradigms, including answer-only training, explicit Chain-of-Thought supervision, and a proposed internalized causal reasoning approach.
arXiv Detail & Related papers (2026-02-23T18:06:15Z)
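As a rough illustration of the kind of event-triple construction CausalFlip describes, the hypothetical sketch below enumerates the three named structures (chain, confounder, collider) over a triple (A, B, C) and derives the ground-truth answer to "does A cause C?" by checking for a directed path. The events and question template are invented, not taken from the benchmark.

```python
# Hypothetical sketch: one event triple (A, B, C) under the three
# causal structures the benchmark names, with the ground-truth
# judgment derived from the graph rather than from semantics.

STRUCTURES = {
    "chain":      [("A", "B"), ("B", "C")],  # A -> B -> C
    "confounder": [("B", "A"), ("B", "C")],  # A <- B -> C
    "collider":   [("A", "B"), ("C", "B")],  # A -> B <- C
}

def a_causes_c(structure: str) -> bool:
    """A causes C only if a directed path from A to C exists."""
    edges = set(STRUCTURES[structure])
    return ("A", "C") in edges or (("A", "B") in edges and ("B", "C") in edges)

# Invented events for illustration.
events = {"A": "it rains", "B": "the ground is wet", "C": "people slip"}

for structure in STRUCTURES:
    question = (f"[{structure}] A={events['A']}; B={events['B']}; "
                f"C={events['C']}. Does A cause C?")
    print(question, "->", "yes" if a_causes_c(structure) else "no")
```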
- Towards Generalizable Reasoning: Group Causal Counterfactual Policy Optimization for LLM Reasoning [50.352417879912515]
Large language models (LLMs) excel at complex tasks thanks to advances in reasoning capabilities. We propose Group Causal Counterfactual Policy Optimization to explicitly train LLMs to learn generalizable reasoning patterns. We construct token-level advantages from a counterfactual reward and optimize the policy, encouraging LLMs to favor reasoning patterns that are process-valid and counterfactually robust.
arXiv Detail & Related papers (2026-02-06T08:03:11Z)
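The summary leaves the counterfactual reward itself implicit, but the token-level advantage construction it mentions is easy to picture with the group-normalized scheme that GRPO-style methods share. The sketch below assumes a scalar reward per rollout is already available; it is a generic illustration, not the paper's exact recipe.

```python
# Generic sketch of group-normalized, token-level advantages.
import numpy as np

def token_level_advantages(rewards, token_counts):
    """Normalize scalar rollout rewards within the group, then
    broadcast each rollout's advantage to all of its tokens."""
    r = np.asarray(rewards, dtype=float)
    adv = (r - r.mean()) / (r.std() + 1e-8)  # group baseline and scale
    return [np.full(n, a) for a, n in zip(adv, token_counts)]

# Example: four rollouts with (assumed) counterfactual-robustness
# rewards and token lengths.
advs = token_level_advantages(rewards=[1.0, 0.0, 1.0, 0.5],
                              token_counts=[5, 3, 4, 6])
print([a.shape for a in advs])  # [(5,), (3,), (4,), (6,)]
```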
- A State-Transition Framework for Efficient LLM Reasoning [58.18141262230392]
Long Chain-of-Thought (CoT) reasoning significantly improves the performance of Large Language Models (LLMs) on complex reasoning tasks. Existing studies usually enhance the reasoning efficiency of LLMs by compressing CoT sequences. We propose an efficient reasoning framework that models the reasoning process of LLMs as a state-transition process.
arXiv Detail & Related papers (2026-02-01T12:40:40Z)
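To make the idea concrete, the toy sketch below models reasoning as repeated application of a transition function over a compact state (a set of derived facts) instead of an ever-growing transcript, stopping at a fixed point. The rule format and stopping test are assumptions for illustration, not the paper's framework.

```python
# Toy sketch: reasoning as a state-transition process over derived facts.
from dataclasses import dataclass

@dataclass
class ReasoningState:
    facts: frozenset
    done: bool = False

def transition(state: ReasoningState, rules) -> ReasoningState:
    """One derivation step: fire every rule whose premises hold; the
    process is done when the step derives nothing new."""
    new_facts = set(state.facts)
    for premises, conclusion in rules:
        if premises <= state.facts:
            new_facts.add(conclusion)
    return ReasoningState(frozenset(new_facts), done=(new_facts == set(state.facts)))

# Tiny worked example with rules a -> b and b -> c.
rules = [(frozenset({"a"}), "b"), (frozenset({"b"}), "c")]
state = ReasoningState(frozenset({"a"}))
while not state.done:
    state = transition(state, rules)
print(sorted(state.facts))  # ['a', 'b', 'c']
```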
- Code Execution as Grounded Supervision for LLM Reasoning [35.258360651475286]
Training large language models (LLMs) with chain-of-thought (CoT) supervision has proven effective for enhancing their reasoning abilities. We propose a scalable method for generating a high-quality CoT supervision dataset by leveraging the determinism of program execution. Our approach extracts verifiable, step-by-step reasoning traces from code execution and transforms them into natural language CoT reasoning.
arXiv Detail & Related papers (2025-06-12T04:36:57Z)
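One concrete way to exploit the determinism of program execution is Python's tracing hook: run a program, record the local variables at each executed line, and render the record as numbered reasoning steps. The sketch below shows this in miniature; the rendering template is an assumption, and the paper's actual extraction pipeline may differ.

```python
# Sketch: turn a deterministic program run into a step-by-step trace.
import sys

def trace_to_cot(func, *args):
    """Run func, recording local variables at each executed line, then
    render the record as numbered natural-language steps."""
    steps = []

    def tracer(frame, event, arg):
        if event == "line" and frame.f_code is func.__code__:
            steps.append((frame.f_lineno, dict(frame.f_locals)))
        return tracer

    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(None)

    lines = [f"Step {i + 1}: execute line {ln} with state {state}"
             for i, (ln, state) in enumerate(steps)]
    lines.append(f"Therefore, the answer is {result}.")
    return "\n".join(lines)

def total_cost(price, qty, tax):
    subtotal = price * qty
    return subtotal * (1 + tax)

print(trace_to_cot(total_cost, 3.0, 4, 0.25))  # ends with "... 15.0."
```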
- Ice Cream Doesn't Cause Drowning: Benchmarking LLMs Against Statistical Pitfalls in Causal Inference [16.706959860667133]
It remains unclear whether large language models (LLMs) can handle rigorous and trustworthy statistical causal inference. The CausalPitfalls benchmark provides essential guidance and quantitative metrics to advance the development of trustworthy causal reasoning systems.
arXiv Detail & Related papers (2025-05-19T23:06:00Z)
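The pitfall in the title is easy to reproduce numerically: a confounder (hot weather) drives both ice cream sales and drownings, producing a strong raw correlation that vanishes once the confounder is adjusted for. The simulation below uses made-up coefficients purely for illustration.

```python
# Simulating the ice-cream/drowning confounder with invented numbers.
import numpy as np

rng = np.random.default_rng(0)
temperature = rng.normal(25, 5, 10_000)                    # confounder
ice_cream = 2.0 * temperature + rng.normal(0, 3, 10_000)   # caused by heat
drownings = 0.5 * temperature + rng.normal(0, 3, 10_000)   # also caused by heat

def residual(y):
    """Remove the linear effect of temperature from y."""
    return y - np.polyval(np.polyfit(temperature, y, 1), temperature)

raw = np.corrcoef(ice_cream, drownings)[0, 1]
adjusted = np.corrcoef(residual(ice_cream), residual(drownings))[0, 1]
print(f"raw correlation: {raw:.2f}; after adjusting for heat: {adjusted:.2f}")
# Expected: raw around 0.6, adjusted around 0.0 -- correlation without causation.
```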
- Efficient Inference for Large Reasoning Models: A Survey [74.17203483365171]
Large Reasoning Models (LRMs) significantly improve the reasoning ability of Large Language Models (LLMs) by learning to reason. However, their deliberative reasoning process leads to inefficiencies in token usage, memory consumption, and inference time. This survey reviews efficient inference methods designed specifically for LRMs, focusing on mitigating token inefficiency while preserving reasoning quality.
arXiv Detail & Related papers (2025-03-29T13:27:46Z)
- Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models [49.61246073215651]
Large Language Models (LLMs) have demonstrated remarkable capabilities in complex tasks. Recent advancements in OpenAI o1 and DeepSeek-R1 have further improved performance in System-2 reasoning domains. However, they also introduce significant computational overhead due to verbose and redundant outputs.
arXiv Detail & Related papers (2025-03-20T17:59:38Z) - CounterBench: A Benchmark for Counterfactuals Reasoning in Large Language Models [5.409370027524351]
We evaluate the performance of large language models (LLMs) in counterfactual reasoning. We introduce a new benchmark dataset, CounterBench, comprising 1K counterfactual reasoning questions.
arXiv Detail & Related papers (2025-02-16T06:19:37Z) - COLD: Causal reasOning in cLosed Daily activities [7.782872276680731]
We propose the COLD (Causal reasOning in cLosed Daily activities) framework. It is built upon human understanding of daily real-world activities to reason about the causal nature of events. We show that the proposed framework facilitates the creation of an enormous number of causal queries.
arXiv Detail & Related papers (2024-11-29T06:37:13Z) - CSCE: Boosting LLM Reasoning by Simultaneous Enhancing of Causal Significance and Consistency [11.144164626192904]
Chain-based methods like chain of thought (CoT) play a rising role in solving reasoning tasks for large language models (LLMs). This paper proposes a non-chain-based reasoning framework that simultaneously considers causal significance and consistency.
arXiv Detail & Related papers (2024-09-20T08:28:23Z) - Evaluating Interventional Reasoning Capabilities of Large Language Models [58.52919374786108]
Large language models (LLMs) are used to automate decision-making tasks. In this paper, we evaluate whether LLMs can accurately update their knowledge of a data-generating process in response to an intervention. We create benchmarks that span diverse causal graphs (e.g., confounding, mediation) and variable types. These benchmarks allow us to isolate whether LLMs accurately predict the changes that result from an intervention, rather than relying on memorized facts or other shortcuts.
arXiv Detail & Related papers (2023-11-14T07:13:10Z) placeholder
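To make the observation/intervention distinction these benchmarks probe concrete, the hypothetical sketch below builds a tiny confounded graph Z -> X, Z -> Y, X -> Y and estimates both P(Y=1 | X=1) and P(Y=1 | do(X=1)); intervening cuts the Z -> X edge, so the two quantities differ. The probabilities are invented and are not the paper's benchmark.

```python
# Invented three-variable example: conditioning vs. intervening on X.
import random

def sample(do_x=None):
    z = random.random() < 0.5
    x = do_x if do_x is not None else (random.random() < (0.9 if z else 0.1))
    y = random.random() < (0.3 + 0.4 * x + 0.2 * z)   # X and Z both raise Y
    return z, x, y

def estimate_p_y(n=100_000, do_x=None, condition_x=None):
    hits = total = 0
    for _ in range(n):
        _, x, y = sample(do_x=do_x)
        if condition_x is not None and x != condition_x:
            continue  # conditioning: keep only samples where X matches
        total += 1
        hits += y
    return hits / total

random.seed(0)
print("P(Y=1 | X=1)     ~", round(estimate_p_y(condition_x=True), 2))  # ~0.88
print("P(Y=1 | do(X=1)) ~", round(estimate_p_y(do_x=True), 2))         # ~0.80
```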
- A Closer Look at the Self-Verification Abilities of Large Language Models in Logical Reasoning [73.77088902676306]
We take a closer look at the self-verification abilities of large language models (LLMs) in the context of logical reasoning.
Our main findings suggest that existing LLMs could struggle to identify fallacious reasoning steps accurately and may fall short of guaranteeing the validity of self-verification methods.
arXiv Detail & Related papers (2023-11-14T07:13:10Z)