A Critical Review of Causal Reasoning Benchmarks for Large Language Models
- URL: http://arxiv.org/abs/2407.08029v1
- Date: Wed, 10 Jul 2024 20:11:51 GMT
- Title: A Critical Review of Causal Reasoning Benchmarks for Large Language Models
- Authors: Linying Yang, Vik Shirvaikar, Oscar Clivio, Fabian Falck
- Abstract summary: We present a comprehensive overview of LLM benchmarks for causality.
We derive a set of criteria that a useful benchmark or set of benchmarks should aim to satisfy.
- Score: 2.1311710788645617
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Numerous benchmarks aim to evaluate the capabilities of Large Language Models (LLMs) for causal inference and reasoning. However, many of them can likely be solved through the retrieval of domain knowledge, questioning whether they achieve their purpose. In this review, we present a comprehensive overview of LLM benchmarks for causality. We highlight how recent benchmarks move towards a more thorough definition of causal reasoning by incorporating interventional or counterfactual reasoning. We derive a set of criteria that a useful benchmark or set of benchmarks should aim to satisfy. We hope this work will pave the way towards a general framework for the assessment of causal understanding in LLMs and the design of novel benchmarks.
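To make the gap between retrieval-style (associational) answers and genuine interventional reasoning concrete, the following minimal simulation (our illustration, not code from the paper) shows E[Y | X=1] and E[Y | do(X=1)] diverging on a confounded structural causal model:

```python
import random

random.seed(0)

def sample(do_x=None):
    """Draw one (z, x, y) from the SCM Z -> X, Z -> Y, X -> Y.
    Passing do_x overrides the mechanism for X (graph mutilation)."""
    z = random.random() < 0.5                         # binary confounder
    x = do_x if do_x is not None else (random.random() < (0.8 if z else 0.2))
    y = random.random() < 0.3 + 0.4 * x + 0.2 * z     # Y depends on X and Z
    return z, x, y

N = 100_000
obs = [sample() for _ in range(N)]
# Observational query: condition on X = 1 (this slice is skewed toward Z = 1).
e_y_given_x1 = sum(y for _, x, y in obs if x) / sum(1 for _, x, _ in obs if x)
# Interventional query: force X = 1 while leaving the distribution of Z alone.
intv = [sample(do_x=True) for _ in range(N)]
e_y_do_x1 = sum(y for _, _, y in intv) / N

print(f"E[Y | X=1]     ~ {e_y_given_x1:.2f}")  # ~0.86, inflated by confounding
print(f"E[Y | do(X=1)] ~ {e_y_do_x1:.2f}")     # ~0.80, the true causal effect
```

A benchmark item solvable from the first quantity alone can be answered by pattern retrieval; only the second requires a causal model, which is the distinction the review emphasizes.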
Related papers
- Evaluating Step-by-step Reasoning Traces: A Survey [3.895864050325129]
We propose a taxonomy of evaluation criteria with four top-level categories (groundedness, validity, coherence, and utility).
We then categorize metrics based on their implementations, survey which metrics are used for assessing each criterion, and explore whether evaluator models can transfer across different criteria.
arXiv Detail & Related papers (2025-02-17T19:58:31Z)
- CounterBench: A Benchmark for Counterfactuals Reasoning in Large Language Models [5.409370027524351]
We evaluate the performance of large language models (LLMs) in counterfactual reasoning.
We introduce a new benchmark dataset, CounterBench, comprising 1K counterfactual reasoning questions.
arXiv Detail & Related papers (2025-02-16T06:19:37Z)
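For context on what a counterfactual question such as those in CounterBench demands beyond observation or intervention, here is a toy abduction-action-prediction computation on a deterministic SCM (illustrative only; the benchmark's actual question format is not reproduced here):

```python
def forward(u_x: bool, u_y: bool) -> tuple[bool, bool]:
    """Toy SCM with exogenous noise: X = U_x, Y = X XOR U_y."""
    x = u_x
    y = x ^ u_y
    return x, y

# Factual observation: we saw X = True and Y = False.
x_obs, y_obs = True, False

# 1. Abduction: recover the exogenous noise consistent with the observation.
u_x = x_obs
u_y = x_obs ^ y_obs            # invert Y = X XOR U_y  =>  U_y = X XOR Y
assert forward(u_x, u_y) == (x_obs, y_obs)

# 2. Action: intervene on X, i.e. do(X = False), discarding its mechanism.
x_cf = False

# 3. Prediction: push the *same* noise through the mutilated model.
y_cf = x_cf ^ u_y

print(f"Had X been False, Y would have been {y_cf}")  # True, not False
```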
- RealCritic: Towards Effectiveness-Driven Evaluation of Language Model Critiques [59.861013614500024]
We introduce a new benchmark designed to assess the critique capabilities of Large Language Models (LLMs).
Unlike existing benchmarks, which typically function in an open-loop fashion, our approach employs a closed-loop methodology that evaluates the quality of corrections generated from critiques.
arXiv Detail & Related papers (2025-01-24T13:48:10Z)
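A minimal sketch of RealCritic's closed-loop idea, assuming a hypothetical `llm(prompt) -> text` interface (function names and prompts below are illustrative, not the paper's actual pipeline):

```python
from typing import Callable

LLM = Callable[[str], str]  # placeholder interface: prompt -> completion

def closed_loop_critique_score(llm: LLM, question: str,
                               draft: str, reference: str) -> bool:
    """Score a critique by whether the revision it induces is correct."""
    critique = llm(f"Critique this answer.\nQ: {question}\nA: {draft}")
    revised = llm(f"Revise the answer using the critique.\n"
                  f"Q: {question}\nA: {draft}\nCritique: {critique}")
    # An open-loop judge would grade `critique` directly; the closed loop
    # grades the critique's *effect* on the final answer instead.
    return revised.strip() == reference.strip()

# Stub run (a real evaluation would wire in an actual model):
replies = iter(["The sum is off by one: 17 + 25 = 42.", "42"])
print(closed_loop_critique_score(lambda _: next(replies),
                                 "17 + 25 = ?", "41", "42"))  # True
```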
- Critical-Questions-of-Thought: Steering LLM reasoning with Argumentative Querying [0.3659498819753633]
State-of-the-art Large Language Models (LLMs) continue to struggle with logical and mathematical reasoning.
This paper makes use of the notion of critical questions from the literature on argumentation theory, focusing in particular on Toulmin's model of argumentation.
We show that employing these critical questions can improve the reasoning capabilities of LLMs.
arXiv Detail & Related papers (2024-12-19T18:51:30Z)
- Evaluating Human Alignment and Model Faithfulness of LLM Rationale [66.75309523854476]
We study how well large language models (LLMs) explain their generations through rationales.
We show that prompting-based methods are less "faithful" than attribution-based explanations.
arXiv Detail & Related papers (2024-06-28T20:06:30Z)
- The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models [94.31327813151208]
BiGGen Bench is a principled generation benchmark designed to thoroughly evaluate nine distinct capabilities of LMs across 77 diverse tasks.
A key feature of the BiGGen Bench is its use of instance-specific evaluation criteria, closely mirroring the nuanced discernment of human evaluation.
arXiv Detail & Related papers (2024-06-09T12:30:30Z)
- Evaluating Interventional Reasoning Capabilities of Large Language Models [58.52919374786108]
Large language models (LLMs) are used to automate decision-making tasks.
In this paper, we evaluate whether LLMs can accurately update their knowledge of a data-generating process in response to an intervention.
We create benchmarks that span diverse causal graphs (e.g., confounding, mediation) and variable types.
These benchmarks allow us to isolate the ability of LLMs to accurately predict changes resulting from interventions, separate from their ability to memorize facts or find other shortcuts.
arXiv Detail & Related papers (2024-04-08T14:15:56Z)
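One way ground truth for such intervention questions can be derived (our illustration, not necessarily the paper's implementation) is graph reachability: in a causal DAG, do(A) can change B only if B is a descendant of A.

```python
def affected_by_intervention(children: dict[str, set[str]], a: str, b: str) -> bool:
    """In a causal DAG, do(a) can change b iff b is a descendant of a.
    `children` maps each node to its direct children."""
    stack, seen = [a], set()
    while stack:
        node = stack.pop()
        for child in children.get(node, set()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return b in seen

# Confounding: Z -> X, Z -> Y, X -> Y.
confounded = {"Z": {"X", "Y"}, "X": {"Y"}}
print(affected_by_intervention(confounded, "X", "Y"))  # True:  X -> Y
print(affected_by_intervention(confounded, "Y", "X"))  # False: no back-action

# Mediation: X -> M -> Y.
mediated = {"X": {"M"}, "M": {"Y"}}
print(affected_by_intervention(mediated, "X", "Y"))    # True, via mediator M
```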
- Making Reasoning Matter: Measuring and Improving Faithfulness of Chain-of-Thought Reasoning [38.60086807496399]
Large language models (LLMs) have been shown to perform better when asked to reason step-by-step before answering a question.
It is unclear to what degree the model's final answer is faithful to the stated reasoning steps.
We introduce FRODO, a framework to tailor small-sized LMs to generate correct reasoning steps and robustly reason over these steps.
arXiv Detail & Related papers (2024-02-21T17:23:59Z)
- Don't Make Your LLM an Evaluation Benchmark Cheater [142.24553056600627]
Large language models (LLMs) have greatly advanced the frontiers of artificial intelligence, attaining remarkable improvements in model capacity.
To assess model performance, a typical approach is to construct evaluation benchmarks that measure the capability level of LLMs.
We discuss the potential risks and impact of inappropriately using evaluation benchmarks and misleadingly interpreting the evaluation results.
arXiv Detail & Related papers (2023-11-03T14:59:54Z)
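A common heuristic for the benchmark-leakage risk discussed here (a generic illustration, not a method defined in the paper) is to flag benchmark items that share long n-grams with training text:

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """All word-level n-grams of a string, lowercased."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(train_text: str, benchmark_item: str, n: int = 8) -> bool:
    """Flag an item if any n-gram also occurs in the training text."""
    return bool(ngrams(train_text, n) & ngrams(benchmark_item, n))

corpus = "the quick brown fox jumps over the lazy dog near the river bank today"
item_leaked = "quick brown fox jumps over the lazy dog near the river"
item_fresh = "a slow red fox walks under an alert cat"
print(is_contaminated(corpus, item_leaked))  # True: shares an 8-gram
print(is_contaminated(corpus, item_fresh))   # False
```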
- Sentiment Analysis through LLM Negotiations [58.67939611291001]
A standard paradigm for sentiment analysis is to rely on a single LLM that makes the decision in a single round.
This paper introduces a multi-LLM negotiation framework for sentiment analysis.
arXiv Detail & Related papers (2023-11-03T12:35:29Z)
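A minimal sketch of such a negotiation loop, with a placeholder `LLM` interface and illustrative prompts (the paper's exact roles and protocol may differ):

```python
from typing import Callable

LLM = Callable[[str], str]  # placeholder interface: prompt -> completion

def negotiate_sentiment(generator: LLM, discriminator: LLM,
                        text: str, max_rounds: int = 3) -> str:
    """Generator proposes a label; discriminator agrees or rebuts; iterate."""
    transcript = ""
    label = generator(f"Label the sentiment of: {text}")
    for _ in range(max_rounds):
        verdict = discriminator(
            f"Text: {text}\nProposed label: {label}\nHistory: {transcript}\n"
            "Reply AGREE, or rebut with a better label."
        )
        if verdict.startswith("AGREE"):
            return label                      # consensus reached
        transcript += f"\n{verdict}"
        label = generator(
            f"Reconsider the sentiment of: {text}\nObjection: {verdict}"
        )
    return label  # fall back to the last proposal if no consensus

# Stub run (a real setup would wire in two actual LLMs):
gen = iter(["negative", "positive"])
disc = iter(["Rebuttal: the sarcasm flips it to positive.", "AGREE"])
print(negotiate_sentiment(lambda _: next(gen), lambda _: next(disc),
                          "Oh great, another Monday."))  # -> "positive"
```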
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.