Related papers: CounterBench: A Benchmark for Counterfactuals Reasoning in Large Language Models

URL: http://arxiv.org/abs/2502.11008v1
Date: Sun, 16 Feb 2025 06:19:37 GMT
Title: CounterBench: A Benchmark for Counterfactuals Reasoning in Large Language Models
Authors: Yuefei Chen, Vivek K. Singh, Jing Ma, Ruxiang Tang
Abstract summary: We evaluate the performance of large language models (LLMs) in counterfactual reasoning. We introduce a new benchmark dataset, CounterBench, comprising 1K counterfactual reasoning questions.
Score: 5.409370027524351
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Counterfactual reasoning is widely recognized as one of the most challenging and intricate aspects of causality in artificial intelligence. In this paper, we evaluate the performance of large language models (LLMs) in counterfactual reasoning. In contrast to previous studies that primarily focus on commonsense causal reasoning, where LLMs often rely on prior knowledge for inference, we specifically assess their ability to perform counterfactual inference using a set of formal rules. To support this evaluation, we introduce a new benchmark dataset, CounterBench, comprising 1K counterfactual reasoning questions. The dataset is designed with varying levels of difficulty, diverse causal graph structures, distinct types of counterfactual questions, and multiple nonsensical name variants. Our experiments demonstrate that counterfactual reasoning poses a significant challenge for LLMs, with most models performing at levels comparable to random guessing. To enhance LLMs' counterfactual reasoning ability, we propose a novel reasoning paradigm, CoIn, which guides LLMs through iterative reasoning and backtracking to systematically explore counterfactual solutions. Experimental results show that our method significantly improves LLM performance on counterfactual reasoning tasks and consistently enhances performance across different LLMs. Our dataset is available at https://huggingface.co/datasets/CounterBench/CounterBench.

Related papers

Investigating the Shortcomings of LLMs in Step-by-Step Legal Reasoning [34.427730009102966]
We develop an automated evaluation framework to identify reasoning errors and evaluate the performance of LLMs. Our work will also serve as an evaluation framework that can be used in detailed error analysis of reasoning chains for logic-intensive complex tasks.
arXiv Detail & Related papers (2025-02-08T19:49:32Z)
CARL-GT: Evaluating Causal Reasoning Capabilities of Large Language Models [18.975064947089805]
Causal reasoning capabilities are essential for large language models (LLMs) in a wide range of applications, such as education and healthcare. We provide a benchmark, named CARL-GT, which evaluates CAusal Reasoning capabilities of large Language models using Graphs and Tabular data.
arXiv Detail & Related papers (2024-12-23T20:34:32Z)
Not All LLM Reasoners Are Created Equal [58.236453890457476]
We study the depth of grade-school math problem-solving capabilities of LLMs. We evaluate their performance on pairs of existing math word problems together.
arXiv Detail & Related papers (2024-10-02T17:01:10Z)
Order Matters in Hallucination: Reasoning Order as Benchmark and Reflexive Prompting for Large-Language-Models [0.0]
Large language models (LLMs) have generated significant attention since their inception, finding applications across various academic and industrial domains. LLMs often suffer from the "hallucination problem", where outputs, though grammatically and logically coherent, lack factual accuracy or are entirely fabricated.
arXiv Detail & Related papers (2024-08-09T14:34:32Z)
GSM-Plus: A Comprehensive Benchmark for Evaluating the Robustness of LLMs as Mathematical Problem Solvers [68.77382332826167]
Large language models (LLMs) have achieved impressive performance across various mathematical reasoning benchmarks. One essential and frequently occurring piece of evidence is that when math questions are slightly changed, LLMs can behave incorrectly. This motivates us to evaluate the robustness of LLMs' math reasoning capability by testing a wide range of question variations.
arXiv Detail & Related papers (2024-02-29T15:26:14Z)
Cofca: A Step-Wise Counterfactual Multi-hop QA benchmark [39.64489055580211]
We introduce a Step-wise Counterfactual benchmark (CofCA), a novel evaluation benchmark consisting of factual data and counterfactual data. Our experimental results reveal a significant performance gap between Wikipedia-based factual data and counterfactual data, deeming data contamination issues in existing benchmarks.
arXiv Detail & Related papers (2024-02-19T08:12:30Z)
CLadder: Assessing Causal Reasoning in Language Models [82.8719238178569]
We investigate whether large language models (LLMs) can coherently reason about causality. We propose a new NLP task, causal inference in natural language, inspired by the "causal inference engine" postulated by Judea Pearl et al.
arXiv Detail & Related papers (2023-12-07T15:12:12Z)
CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
We introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark. In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship. We propose an innovative evaluation metric, the Self-Evaluation Score (SES), to directly evaluate the natural language output of LLMs.
arXiv Detail & Related papers (2023-11-29T08:29:54Z)
InfiMM-Eval: Complex Open-Ended Reasoning Evaluation For Multi-Modal Large Language Models [50.03163753638256]
Multi-modal Large Language Models (MLLMs) are increasingly prominent in the field of artificial intelligence. Our benchmark comprises three key reasoning categories: deductive, abductive, and analogical reasoning. We evaluate a selection of representative MLLMs using this rigorously developed open-ended multi-step elaborate reasoning benchmark.
arXiv Detail & Related papers (2023-11-20T07:06:31Z)
Towards LogiGLUE: A Brief Survey and A Benchmark for Analyzing Logical Reasoning Capabilities of Language Models [56.34029644009297]
Large language models (LLMs) have demonstrated the ability to overcome various limitations of formal Knowledge Representation (KR) systems. LLMs excel most in abductive reasoning, followed by deductive reasoning, while they are least effective at inductive reasoning. We study single-task training, multi-task training, and a "chain-of-thought" knowledge distillation fine-tuning technique to assess model performance.
arXiv Detail & Related papers (2023-10-02T01:00:50Z)

This list is automatically generated from the titles and abstracts of the papers on this site.