Evaluating Interventional Reasoning Capabilities of Large Language Models
- URL: http://arxiv.org/abs/2404.05545v2
- Date: Sun, 22 Dec 2024 12:22:53 GMT
- Title: Evaluating Interventional Reasoning Capabilities of Large Language Models
- Authors: Tejas Kasetty, Divyat Mahajan, Gintare Karolina Dziugaite, Alexandre Drouin, Dhanya Sridhar
- Abstract summary: Large language models (LLMs) are used to automate decision-making tasks.
In this paper, we evaluate whether LLMs can accurately update their knowledge of a data-generating process in response to an intervention.
We create benchmarks that span diverse causal graphs (e.g., confounding, mediation) and variable types.
These benchmarks let us separate the ability of LLMs to accurately predict the changes that result from an intervention from their ability to memorize facts or find other shortcuts.
- Score: 58.52919374786108
- Abstract: Numerous decision-making tasks require estimating causal effects under interventions on different parts of a system. As practitioners consider using large language models (LLMs) to automate decisions, studying their causal reasoning capabilities becomes crucial. A recent line of work evaluates LLMs' ability to retrieve commonsense causal facts, but these evaluations do not sufficiently assess how LLMs reason about interventions. Motivated by the role that interventions play in causal inference, in this paper, we conduct empirical analyses to evaluate whether LLMs can accurately update their knowledge of a data-generating process in response to an intervention. We create benchmarks that span diverse causal graphs (e.g., confounding, mediation) and variable types, and enable a study of intervention-based reasoning. These benchmarks let us separate the ability of LLMs to accurately predict the changes that result from an intervention from their ability to memorize facts or find other shortcuts. We evaluate six LLMs on the benchmarks, finding that GPT models show promising accuracy at predicting the intervention effects.
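To make the notion of "updating a data-generating process in response to an intervention" concrete, here is a minimal sketch of the standard graph-surgery view of an intervention: setting a variable by intervention removes every edge pointing into it. The graph encoding, the variable names, and the `intervene` helper are assumptions made for illustration; they are not taken from the paper or its benchmarks.

```python
# Illustrative sketch only: a causal graph as a dict mapping each node
# to the list of its children. This encoding is an assumption, not the
# paper's benchmark format.

def intervene(graph, target):
    """Return a copy of `graph` with every edge into `target` removed.

    This mirrors graph surgery for do(target): the intervened variable
    no longer depends on its former parents.
    """
    return {
        parent: [child for child in children if child != target]
        for parent, children in graph.items()
    }

# Hypothetical confounding graph: Z -> X, Z -> Y, X -> Y
confounding = {"Z": ["X", "Y"], "X": ["Y"], "Y": []}
after_do_x = intervene(confounding, "X")

# Hypothetical mediation graph: X -> M -> Y
mediation = {"X": ["M"], "M": ["Y"], "Y": []}
after_do_m = intervene(mediation, "M")

print(after_do_x)  # Z no longer points to X; X -> Y is kept
print(after_do_m)  # X no longer points to M; M -> Y is kept
```

Under these assumptions, a question such as "does Z still cause X after intervening on X?" has a well-defined answer that can be read off the modified graph.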
Related papers
- Do Influence Functions Work on Large Language Models? [10.463762448166714]
Influence functions are important for quantifying the impact of individual training data points on a model's predictions.
We evaluate influence functions across multiple tasks and find that they consistently perform poorly in most settings.
arXiv Detail & Related papers (2024-09-30T06:50:18Z)
- From Pre-training Corpora to Large Language Models: What Factors Influence LLM Performance in Causal Discovery Tasks? [51.42906577386907]
This study explores the factors influencing the performance of Large Language Models (LLMs) in causal discovery tasks.
A higher frequency of causal mentions correlates with better model performance, suggesting that extensive exposure to causal information during training enhances the models' causal discovery capabilities.
arXiv Detail & Related papers (2024-07-29T01:45:05Z)
- Evaluating Human Alignment and Model Faithfulness of LLM Rationale [66.75309523854476]
We study how well large language models (LLMs) explain their generations through rationales.
We show that prompting-based methods are less "faithful" than attribution-based explanations.
arXiv Detail & Related papers (2024-06-28T20:06:30Z)
- Do Large Language Models Exhibit Cognitive Dissonance? Studying the Difference Between Revealed Beliefs and Stated Answers [13.644277507363036]
We investigate whether these abilities are measurable outside of tailored prompting and MCQ.
Our findings suggest that the Revealed Belief of LLMs significantly differs from their Stated Answer.
As text completion is at the core of LLMs, these results suggest that common evaluation methods may only provide a partial picture.
arXiv Detail & Related papers (2024-06-21T08:56:35Z)
- DnA-Eval: Enhancing Large Language Model Evaluation through Decomposition and Aggregation [75.81096662788254]
Large Language Models (LLMs) are scalable and economical evaluators.
The question of how reliable these evaluators are has emerged as a crucial research question.
We propose Decompose and Aggregate, which breaks down the evaluation process into different stages based on pedagogical practices.
arXiv Detail & Related papers (2024-05-24T08:12:30Z)
- Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models -- A Survey [25.732397636695882]
Large language models (LLMs) have recently shown impressive performance on tasks involving reasoning.
Despite these successes, the depth of LLMs' reasoning abilities remains uncertain.
arXiv Detail & Related papers (2024-04-02T11:46:31Z)
- Explaining Large Language Models Decisions Using Shapley Values [1.223779595809275]
Large language models (LLMs) have opened up exciting possibilities for simulating human behavior and cognitive processes.
However, the validity of utilizing LLMs as stand-ins for human subjects remains uncertain.
This paper presents a novel approach based on Shapley values to interpret LLM behavior and quantify the relative contribution of each prompt component to the model's output.
arXiv Detail & Related papers (2024-03-29T22:49:43Z)
- Comprehensive Reassessment of Large-Scale Evaluation Outcomes in LLMs: A Multifaceted Statistical Approach [64.42462708687921]
Evaluations have revealed that factors such as scaling, training type, and architecture profoundly impact the performance of LLMs.
Our study embarks on a thorough re-examination of these LLMs, targeting the inadequacies in current evaluation methods.
This includes the application of ANOVA, Tukey HSD tests, GAMM, and clustering techniques.
arXiv Detail & Related papers (2024-03-22T14:47:35Z)
- Comparing Inferential Strategies of Humans and Large Language Models in Deductive Reasoning [25.732397636695882]
We show that large language models (LLMs) display reasoning patterns akin to those observed in humans.
Our research demonstrates that the architecture and scale of the model significantly affect its preferred method of reasoning.
arXiv Detail & Related papers (2024-02-20T12:58:14Z)
- CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
We introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark.
In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship.
We propose an innovative evaluation metric, the Self-Evaluation Score (SES), to directly evaluate the natural language output of LLMs.
arXiv Detail & Related papers (2023-11-29T08:29:54Z)