Related papers: Failure Modes of LLMs for Causal Reasoning on Narratives

Failure Modes of LLMs for Causal Reasoning on Narratives

URL: http://arxiv.org/abs/2410.23884v1
Date: Thu, 31 Oct 2024 12:48:58 GMT
Title: Failure Modes of LLMs for Causal Reasoning on Narratives
Authors: Khurram Yamin, Shantanu Gupta, Gaurav R. Ghosal, Zachary C. Lipton, Bryan Wilder,
Abstract summary: We investigate the causal reasoning abilities of large language models (LLMs) through the representative problem of inferring causal relationships from narratives. We find that even state-of-the-art language models rely on unreliable shortcuts, both in terms of the narrative presentation and their parametric knowledge.
Score: 51.19592551510628
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In this work, we investigate the causal reasoning abilities of large language models (LLMs) through the representative problem of inferring causal relationships from narratives. We find that even state-of-the-art language models rely on unreliable shortcuts, both in terms of the narrative presentation and their parametric knowledge. For example, LLMs tend to determine causal relationships based on the topological ordering of events (i.e., earlier events cause later ones), resulting in lower performance whenever events are not narrated in their exact causal order. Similarly, we demonstrate that LLMs struggle with long-term causal reasoning and often fail when the narratives are long and contain many events. Additionally, we show LLMs appear to rely heavily on their parametric knowledge at the expense of reasoning over the provided narrative. This degrades their abilities whenever the narrative opposes parametric knowledge. We extensively validate these failure modes through carefully controlled synthetic experiments, as well as evaluations on real-world narratives. Finally, we observe that explicitly generating a causal graph generally improves performance while naive chain-of-thought is ineffective. Collectively, our results distill precise failure modes of current state-of-the-art models and can pave the way for future techniques to enhance causal reasoning in LLMs.

Related papers

CausalFlip: A Benchmark for LLM Causal Judgment Beyond Semantic Matching [50.65932158912512]
We propose a new causal reasoning benchmark, CausalFlip, to encourage the development of new large language models.<n>CaulFlip consists of causal judgment questions built over event triples that could form different confounder, chain, and collider relations.<n>We evaluate LLMs under multiple training paradigms, including answer-only training, explicit Chain-of-Thought supervision, and a proposed internalized causal reasoning approach.
arXiv Detail & Related papers (2026-02-23T18:06:15Z)
Learning Nonlinear Causal Reductions to Explain Reinforcement Learning Policies [50.30741668990102]
We take a causal perspective on explaining the behavior of reinforcement learning policies.<n>We learn a simplified high-level causal model that explains these relationships.<n>We prove that for a class of nonlinear causal models, there exists a unique solution.
arXiv Detail & Related papers (2025-07-20T10:25:24Z)
Causal Cartographer: From Mapping to Reasoning Over Counterfactual Worlds [9.153187514369849]
Causal world models can answer counterfactual questions about an environment of interest.<n>It requires understanding the underlying causes behind chains of events and conducting causal inference for unseen distributions.<n>We show that our approach can extract causal knowledge while reducing inference costs and spurious correlations.
arXiv Detail & Related papers (2025-05-20T14:14:05Z)
ExpliCa: Evaluating Explicit Causal Reasoning in Large Language Models [75.05436691700572]
We introduce ExpliCa, a new dataset for evaluating Large Language Models (LLMs) in explicit causal reasoning. We tested seven commercial and open-source LLMs on ExpliCa through prompting and perplexity-based metrics. Surprisingly, models tend to confound temporal relations with causal ones, and their performance is also strongly influenced by the linguistic order of the events.
arXiv Detail & Related papers (2025-02-21T14:23:14Z)
LogiDynamics: Unraveling the Dynamics of Logical Inference in Large Language Model Reasoning [49.58786377307728]
This paper adopts an exploratory approach by introducing a controlled evaluation environment for analogical reasoning.<n>We analyze the comparative dynamics of inductive, abductive, and deductive inference pipelines.<n>We investigate advanced paradigms such as hypothesis selection, verification, and refinement, revealing their potential to scale up logical inference.
arXiv Detail & Related papers (2025-02-16T15:54:53Z)
COLD: Causal reasOning in cLosed Daily activities [7.782872276680731]
We propose the COLD (Causal reasOning in cLosed Daily activities) framework. It is built upon human understanding of daily real-world activities to reason about the causal nature of events. We show that the proposed framework facilitates the creation of enormous causal queries.
arXiv Detail & Related papers (2024-11-29T06:37:13Z)
Language Agents Meet Causality -- Bridging LLMs and Causal World Models [50.79984529172807]
We propose a framework that integrates causal representation learning with large language models. This framework learns a causal world model, with causal variables linked to natural language expressions. We evaluate the framework on causal inference and planning tasks across temporal scales and environmental complexities.
arXiv Detail & Related papers (2024-10-25T18:36:37Z)
Counterfactual Causal Inference in Natural Language with Large Language Models [9.153187514369849]
We propose an end-to-end causal structure discovery and causal inference method from natural language. We first use an LLM to extract the instantiated causal variables from text data and build a causal graph. We then conduct counterfactual inference on the estimated graph.
arXiv Detail & Related papers (2024-10-08T21:53:07Z)
From Pre-training Corpora to Large Language Models: What Factors Influence LLM Performance in Causal Discovery Tasks? [51.42906577386907]
This study explores the factors influencing the performance of Large Language Models (LLMs) in causal discovery tasks. A higher frequency of causal mentions correlates with better model performance, suggesting that extensive exposure to causal information during training enhances the models' causal discovery capabilities.
arXiv Detail & Related papers (2024-07-29T01:45:05Z)
LLMs Are Prone to Fallacies in Causal Inference [33.9881589703843]
Recent work shows that causal facts can be effectively extracted from LLMs through prompting. This work investigates if this success is limited to explicitly-mentioned causal facts in the pretraining data which the model can memorize.
arXiv Detail & Related papers (2024-06-18T00:14:07Z)
Evaluating Consistency and Reasoning Capabilities of Large Language Models [0.0]
Large Language Models (LLMs) are extensively used today across various sectors, including academia, research, business, and finance. Despite their widespread adoption, these models often produce incorrect and misleading information, exhibiting a tendency to hallucinate. This paper aims to evaluate and compare the consistency and reasoning capabilities of both public and proprietary LLMs.
arXiv Detail & Related papers (2024-04-25T10:03:14Z)
Cause and Effect: Can Large Language Models Truly Understand Causality? [1.2334534968968969]
This research proposes a novel architecture called Context Aware Reasoning Enhancement with Counterfactual Analysis(CARE CA) framework. The proposed framework incorporates an explicit causal detection module with ConceptNet and counterfactual statements, as well as implicit causal detection through Large Language Models. The knowledge from ConceptNet enhances the performance of multiple causal reasoning tasks such as causal discovery, causal identification and counterfactual reasoning.
arXiv Detail & Related papers (2024-02-28T08:02:14Z)
How Likely Do LLMs with CoT Mimic Human Reasoning? [31.86489714330338]
Chain-of-thought (CoT) emerges as a promising technique to elicit reasoning capabilities from Large Language Models (LLMs) In this paper, we diagnose the underlying mechanism by comparing the reasoning process of LLMs with humans. Our empirical study reveals that LLMs often deviate from a causal chain, resulting in spurious correlations and potential consistency errors.
arXiv Detail & Related papers (2024-02-25T10:13:04Z)
CLadder: Assessing Causal Reasoning in Language Models [82.8719238178569]
We investigate whether large language models (LLMs) can coherently reason about causality. We propose a new NLP task, causal inference in natural language, inspired by the "causal inference engine" postulated by Judea Pearl et al.
arXiv Detail & Related papers (2023-12-07T15:12:12Z)
Deceptive Semantic Shortcuts on Reasoning Chains: How Far Can Models Go without Hallucination? [73.454943870226]
This work studies a specific type of hallucination induced by semantic associations. To quantify this phenomenon, we propose a novel probing method and benchmark called EureQA.
arXiv Detail & Related papers (2023-11-16T09:27:36Z)
A Closer Look at the Self-Verification Abilities of Large Language Models in Logical Reasoning [73.77088902676306]
We take a closer look at the self-verification abilities of large language models (LLMs) in the context of logical reasoning. Our main findings suggest that existing LLMs could struggle to identify fallacious reasoning steps accurately and may fall short of guaranteeing the validity of self-verification methods.
arXiv Detail & Related papers (2023-11-14T07:13:10Z)
From Heuristic to Analytic: Cognitively Motivated Strategies for Coherent Physical Commonsense Reasoning [66.98861219674039]
Heuristic-Analytic Reasoning (HAR) strategies drastically improve the coherence of rationalizations for model decisions. Our findings suggest that human-like reasoning strategies can effectively improve the coherence and reliability of PLM reasoning.
arXiv Detail & Related papers (2023-10-24T19:46:04Z)
Concise and Organized Perception Facilitates Reasoning in Large Language Models [32.71672086718057]
We show that large language models (LLMs) exhibit failure patterns akin to human-like cognitive biases when dealing with disordered and irrelevant content in reasoning tasks. We propose a novel reasoning approach named Concise and Organized Perception (COP) COP carefully analyzes the given statements to identify the most pertinent information while eliminating redundancy efficiently.
arXiv Detail & Related papers (2023-10-05T04:47:49Z)
Can Large Language Models Infer Causation from Correlation? [104.96351414570239]
We test the pure causal inference skills of large language models (LLMs) We formulate a novel task Corr2Cause, which takes a set of correlational statements and determines the causal relationship between the variables. We show that these models achieve almost close to random performance on the task.
arXiv Detail & Related papers (2023-06-09T12:09:15Z)
Systematic Evaluation of Causal Discovery in Visual Model Based Reinforcement Learning [76.00395335702572]
A central goal for AI and causality is the joint discovery of abstract representations and causal structure. Existing environments for studying causal induction are poorly suited for this objective because they have complicated task-specific causal graphs. In this work, our goal is to facilitate research in learning representations of high-level variables as well as causal structures among them.
arXiv Detail & Related papers (2021-07-02T05:44:56Z)

This list is automatically generated from the titles and abstracts of the papers in this site.