Related papers: Evaluating Interventional Reasoning Capabilities of Large Language Models

URL: http://arxiv.org/abs/2404.05545v2
Date: Sun, 22 Dec 2024 12:22:53 GMT
Title: Evaluating Interventional Reasoning Capabilities of Large Language Models
Authors: Tejas Kasetty, Divyat Mahajan, Gintare Karolina Dziugaite, Alexandre Drouin, Dhanya Sridhar
Abstract summary: Large language models (LLMs) are used to automate decision-making tasks. In this paper, we evaluate whether LLMs can accurately update their knowledge of a data-generating process in response to an intervention. We create benchmarks that span diverse causal graphs (e.g., confounding, mediation) and variable types. These benchmarks allow us to isolate the ability of LLMs to accurately predict changes resulting from their ability to memorize facts or find other shortcuts.
Score: 58.52919374786108
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Numerous decision-making tasks require estimating causal effects under interventions on different parts of a system. As practitioners consider using large language models (LLMs) to automate decisions, studying their causal reasoning capabilities becomes crucial. A recent line of work evaluates LLMs' ability to retrieve commonsense causal facts, but these evaluations do not sufficiently assess how LLMs reason about interventions. Motivated by the role that interventions play in causal inference, in this paper, we conduct empirical analyses to evaluate whether LLMs can accurately update their knowledge of a data-generating process in response to an intervention. We create benchmarks that span diverse causal graphs (e.g., confounding, mediation) and variable types, and enable a study of intervention-based reasoning. These benchmarks allow us to isolate the ability of LLMs to accurately predict changes resulting from their ability to memorize facts or find other shortcuts. We evaluate six LLMs on the benchmarks, finding that GPT models show promising accuracy at predicting the intervention effects.

Related papers

IV Co-Scientist: Multi-Agent LLM Framework for Causal Instrumental Variable Discovery [61.15184885636171]
In the presence of confounding between an endogenous variable and the outcome, instrumental variables (IVs) are used to isolate the causal effect of the endogenous variable. We investigate whether large language models (LLMs) can aid in this task. We introduce IV Co-Scientist, a multi-agent system that proposes, critiques, and refines IVs for a given treatment-outcome pair.
arXiv Detail & Related papers (2026-02-08T12:28:29Z)
Can Reasoning Help Large Language Models Capture Human Annotator Disagreement? [84.32752330104775]
Variation in human annotation (i.e., disagreements) is common in NLP. We evaluate the influence of different reasoning settings on Large Language Model disagreement modeling. Surprisingly, our results show that RLVR-style reasoning degrades performance in disagreement modeling.
arXiv Detail & Related papers (2025-06-24T09:49:26Z)
On the Eligibility of LLMs for Counterfactual Reasoning: A Decompositional Study [15.617243755155686]
Counterfactual reasoning has emerged as a crucial technique for generalizing the reasoning capabilities of large language models. We propose a decompositional strategy that breaks down the counterfactual generation from causality construction to the reasoning over counterfactual interventions.
arXiv Detail & Related papers (2025-05-17T04:59:32Z)
FACT-AUDIT: An Adaptive Multi-Agent Framework for Dynamic Fact-Checking Evaluation of Large Language Models [79.41859481668618]
Large Language Models (LLMs) have significantly advanced fact-checking studies. Existing automated fact-checking evaluation methods rely on static datasets and classification metrics. We introduce FACT-AUDIT, an agent-driven framework that adaptively and dynamically assesses LLMs' fact-checking capabilities.
arXiv Detail & Related papers (2025-02-25T07:44:22Z)
Exploring Knowledge Boundaries in Large Language Models for Retrieval Judgment [56.87031484108484]
Large Language Models (LLMs) are increasingly recognized for their practical applications. Retrieval-Augmented Generation (RAG) tackles this challenge and has shown a significant impact on LLMs. By minimizing retrieval requests that yield neutral or harmful results, we can effectively reduce both time and computational costs.
arXiv Detail & Related papers (2024-11-09T15:12:28Z)
The LLM Effect: Are Humans Truly Using LLMs, or Are They Being Influenced By Them Instead? [60.01746782465275]
Large Language Models (LLMs) have shown capabilities close to human performance in various analytical tasks. This paper investigates the efficiency and accuracy of LLMs in specialized tasks through a structured user study focusing on Human-LLM partnership.
arXiv Detail & Related papers (2024-10-07T02:30:18Z)
Do Influence Functions Work on Large Language Models? [10.463762448166714]
Influence functions are important for quantifying the impact of individual training data points on a model's predictions. We evaluate influence functions across multiple tasks and find that they consistently perform poorly in most settings.
arXiv Detail & Related papers (2024-09-30T06:50:18Z)
From Pre-training Corpora to Large Language Models: What Factors Influence LLM Performance in Causal Discovery Tasks? [51.42906577386907]
This study explores the factors influencing the performance of Large Language Models (LLMs) in causal discovery tasks. A higher frequency of causal mentions correlates with better model performance, suggesting that extensive exposure to causal information during training enhances the models' causal discovery capabilities.
arXiv Detail & Related papers (2024-07-29T01:45:05Z)
Evaluating Human Alignment and Model Faithfulness of LLM Rationale [66.75309523854476]
We study how well large language models (LLMs) explain their generations through rationales. We show that prompting-based methods are less "faithful" than attribution-based explanations.
arXiv Detail & Related papers (2024-06-28T20:06:30Z)
Do Large Language Models Exhibit Cognitive Dissonance? Studying the Difference Between Revealed Beliefs and Stated Answers [13.644277507363036]
We investigate whether these abilities are measurable outside of tailored prompting and MCQ. Our findings suggest that the Revealed Belief of LLMs significantly differs from their Stated Answer. As text completion is at the core of LLMs, these results suggest that common evaluation methods may only provide a partial picture.
arXiv Detail & Related papers (2024-06-21T08:56:35Z)
Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models -- A Survey [25.732397636695882]
Large language models (LLMs) have recently shown impressive performance on tasks involving reasoning. Despite these successes, the depth of LLMs' reasoning abilities remains uncertain.
arXiv Detail & Related papers (2024-04-02T11:46:31Z)
Explaining Large Language Models Decisions Using Shapley Values [1.223779595809275]
Large language models (LLMs) have opened up exciting possibilities for simulating human behavior and cognitive processes. However, the validity of utilizing LLMs as stand-ins for human subjects remains uncertain. This paper presents a novel approach based on Shapley values to interpret LLM behavior and quantify the relative contribution of each prompt component to the model's output.
arXiv Detail & Related papers (2024-03-29T22:49:43Z)
Comprehensive Reassessment of Large-Scale Evaluation Outcomes in LLMs: A Multifaceted Statistical Approach [64.42462708687921]
Evaluations have revealed that factors such as scaling, training types, and architectures profoundly impact the performance of LLMs. Our study embarks on a thorough re-examination of these LLMs, targeting the inadequacies in current evaluation methods. This includes the application of ANOVA, Tukey HSD tests, GAMM, and clustering techniques.
arXiv Detail & Related papers (2024-03-22T14:47:35Z)
Comparing Inferential Strategies of Humans and Large Language Models in Deductive Reasoning [25.732397636695882]
We show that large language models (LLMs) display reasoning patterns akin to those observed in humans. Our research demonstrates that the architecture and scale of the model significantly affect its preferred method of reasoning.
arXiv Detail & Related papers (2024-02-20T12:58:14Z)
CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
We introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark. In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship. We propose an innovative evaluation metric, the Self-Evaluation Score (SES), to directly evaluate the natural language output of LLMs.
arXiv Detail & Related papers (2023-11-29T08:29:54Z)
Survey on Factuality in Large Language Models: Knowledge, Retrieval and Domain-Specificity [61.54815512469125]
This survey addresses the crucial issue of factuality in Large Language Models (LLMs). As LLMs find applications across diverse domains, the reliability and accuracy of their outputs become vital.
arXiv Detail & Related papers (2023-10-11T14:18:03Z)

This list is automatically generated from the titles and abstracts of the papers on this site.