A Critical Review of Causal Reasoning Benchmarks for Large Language Models
- URL: http://arxiv.org/abs/2407.08029v1
- Date: Wed, 10 Jul 2024 20:11:51 GMT
- Title: A Critical Review of Causal Reasoning Benchmarks for Large Language Models
- Authors: Linying Yang, Vik Shirvaikar, Oscar Clivio, Fabian Falck
- Abstract summary: We present a comprehensive overview of LLM benchmarks for causality.
We derive a set of criteria that a useful benchmark or set of benchmarks should aim to satisfy.
- Score: 2.1311710788645617
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Numerous benchmarks aim to evaluate the capabilities of Large Language Models (LLMs) for causal inference and reasoning. However, many of them can likely be solved through the retrieval of domain knowledge, questioning whether they achieve their purpose. In this review, we present a comprehensive overview of LLM benchmarks for causality. We highlight how recent benchmarks move towards a more thorough definition of causal reasoning by incorporating interventional or counterfactual reasoning. We derive a set of criteria that a useful benchmark or set of benchmarks should aim to satisfy. We hope this work will pave the way towards a general framework for the assessment of causal understanding in LLMs and the design of novel benchmarks.
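The abstract's key distinction, associational versus interventional queries, can be made concrete with a toy example. Below is a minimal sketch (our illustration, not from the paper) of a structural causal model with a confounder U, where the observational quantity P(Y=1 | X=1) and the interventional quantity P(Y=1 | do(X=1)) diverge; a benchmark item contrasting the two probes causal reasoning rather than retrieval of memorized associations. All names and numbers here are assumptions for illustration.

```python
# Toy structural causal model (our illustration, not from the paper):
# U -> X, U -> Y, X -> Y. X and Y are strongly correlated through the
# confounder U, but intervening on X severs the U -> X edge, so the
# interventional effect of X on Y is much weaker than the correlation.
import random

random.seed(0)

def sample(do_x=None):
    """Draw one (X, Y) sample; if do_x is given, intervene on X."""
    u = random.random() < 0.5                # confounder U ~ Bernoulli(0.5)
    x = u if do_x is None else bool(do_x)    # structural equation: X := U
    y = u or (x and random.random() < 0.2)   # Y := U OR (X AND weak noise)
    return x, y

def estimate_p_y(n=100_000, do_x=None, given_x=None):
    """Monte Carlo estimate of P(Y=1), under do(X) and/or conditioned on X."""
    ys = [y for x, y in (sample(do_x) for _ in range(n))
          if given_x is None or x == given_x]
    return sum(ys) / len(ys)

# Associational: high, because observing X=1 is evidence about U.
print("P(Y=1 | X=1)     ~", round(estimate_p_y(given_x=True), 3))
# Interventional: much lower once the U -> X edge is cut by do(X=1).
print("P(Y=1 | do(X=1)) ~", round(estimate_p_y(do_x=True), 3))
```

A model that answers both queries with the same value is pattern-matching the correlation; distinguishing them requires reasoning about the underlying causal structure.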
Related papers
- Improving Causal Reasoning in Large Language Models: A Survey [16.55801836321059]
Causal reasoning is a crucial aspect of intelligence, essential for problem-solving, decision-making, and understanding the world.
Large language models (LLMs) can generate rationales for their outputs, but their ability to reliably perform causal reasoning remains uncertain.
arXiv Detail & Related papers (2024-10-22T04:18:19Z)
- Benchmarking Defeasible Reasoning with Large Language Models -- Initial Experiments and Future Directions [0.36868085124383626]
This paper proposes a benchmark that corresponds to various defeasible rule-based reasoning patterns.
We modified an existing benchmark for defeasible logic reasoners by translating defeasible rules into text suitable for Large Language Models (a toy sketch of such a translation appears after this list).
We conducted preliminary experiments on nonmonotonic rule-based reasoning using ChatGPT and compared it with reasoning patterns defined by defeasible logic.
arXiv Detail & Related papers (2024-10-16T12:36:23Z)
- Evaluating Human Alignment and Model Faithfulness of LLM Rationale [66.75309523854476]
We study how well large language models (LLMs) explain their generations through rationales.
We show that prompting-based methods are less "faithful" than attribution-based explanations.
arXiv Detail & Related papers (2024-06-28T20:06:30Z)
- ECBD: Evidence-Centered Benchmark Design for NLP [95.50252564938417]
We propose Evidence-Centered Benchmark Design (ECBD), a framework which formalizes the benchmark design process into five modules.
Each module requires benchmark designers to describe, justify, and support benchmark design choices.
Our analysis reveals common trends in benchmark design and documentation that could threaten the validity of benchmarks' measurements.
arXiv Detail & Related papers (2024-06-13T00:59:55Z)
- The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models [94.31327813151208]
BiGGen Bench is a principled generation benchmark designed to thoroughly evaluate nine distinct capabilities of LMs across 77 diverse tasks.
A key feature of the BiGGen Bench is its use of instance-specific evaluation criteria, closely mirroring the nuanced discernment of human evaluation.
arXiv Detail & Related papers (2024-06-09T12:30:30Z)
- Making Reasoning Matter: Measuring and Improving Faithfulness of Chain-of-Thought Reasoning [38.60086807496399]
Large language models (LLMs) have been shown to perform better when asked to reason step-by-step before answering a question.
It is unclear to what degree the model's final answer is faithful to the stated reasoning steps.
We introduce FRODO, a framework to tailor small-sized LMs to generate correct reasoning steps and robustly reason over these steps.
arXiv Detail & Related papers (2024-02-21T17:23:59Z)
- Comparing Inferential Strategies of Humans and Large Language Models in Deductive Reasoning [25.732397636695882]
We show that large language models (LLMs) display reasoning patterns akin to those observed in humans.
Our research demonstrates that the architecture and scale of the model significantly affect its preferred method of reasoning.
arXiv Detail & Related papers (2024-02-20T12:58:14Z)
- InfiMM-Eval: Complex Open-Ended Reasoning Evaluation For Multi-Modal Large Language Models [50.03163753638256]
Multi-modal Large Language Models (MLLMs) are increasingly prominent in the field of artificial intelligence.
Our benchmark comprises three key reasoning categories: deductive, abductive, and analogical reasoning.
We evaluate a selection of representative MLLMs on this rigorously developed, open-ended, multi-step reasoning benchmark.
arXiv Detail & Related papers (2023-11-20T07:06:31Z)
- Don't Make Your LLM an Evaluation Benchmark Cheater [142.24553056600627]
Large language models (LLMs) have greatly advanced the frontiers of artificial intelligence, attaining remarkable improvements in model capability.
To assess model performance, a typical approach is to construct evaluation benchmarks that measure the ability level of LLMs.
We discuss the potential risk and impact of inappropriately using evaluation benchmarks and misleadingly interpreting the evaluation results.
arXiv Detail & Related papers (2023-11-03T14:59:54Z)
- Sentiment Analysis through LLM Negotiations [58.67939611291001]
A standard paradigm for sentiment analysis is to rely on a single LLM that makes its decision in a single round.
This paper introduces a multi-LLM negotiation framework for sentiment analysis.
arXiv Detail & Related papers (2023-11-03T12:35:29Z)
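As a concrete illustration of the rule-to-text translation mentioned in the defeasible-reasoning entry above, here is a minimal sketch under an assumed rule encoding; it is our own toy, not the benchmark's actual pipeline.

```python
# Toy translation of defeasible rules into prompt text (assumed encoding,
# not the benchmark's actual pipeline). Strict rules always hold;
# defeasible rules hold "typically" and can be defeated by more specific rules.
from dataclasses import dataclass

@dataclass
class Rule:
    body: str          # antecedent, in plain language
    head: str          # consequent, in plain language
    defeasible: bool   # True if the rule can be defeated

def verbalize(rule: Rule) -> str:
    """Render one rule as a natural-language sentence for the prompt."""
    if rule.defeasible:
        return f"Typically, if {rule.body}, then {rule.head}."
    return f"If {rule.body}, then it is always the case that {rule.head}."

rules = [
    Rule("something is a bird", "it flies", defeasible=True),
    Rule("something is a penguin", "it does not fly", defeasible=True),
    Rule("something is a penguin", "it is a bird", defeasible=False),
]

prompt = "\n".join(verbalize(r) for r in rules)
prompt += "\nTweety is a penguin. Does Tweety fly? Answer yes or no."
print(prompt)
```

A defeasible-logic reasoner resolves the Tweety conflict in favor of the more specific penguin rule; the benchmark's comparison asks whether an LLM prompted with the verbalized rules reaches the same conclusion.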