A Critical Review of Causal Reasoning Benchmarks for Large Language Models
- URL: http://arxiv.org/abs/2407.08029v1
- Date: Wed, 10 Jul 2024 20:11:51 GMT
- Title: A Critical Review of Causal Reasoning Benchmarks for Large Language Models
- Authors: Linying Yang, Vik Shirvaikar, Oscar Clivio, Fabian Falck,
- Abstract summary: We present a comprehensive overview of LLM benchmarks for causality.
We derive a set of criteria that a useful benchmark or set of benchmarks should aim to satisfy.
- Score: 2.1311710788645617
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Numerous benchmarks aim to evaluate the capabilities of Large Language Models (LLMs) for causal inference and reasoning. However, many of them can likely be solved through the retrieval of domain knowledge, questioning whether they achieve their purpose. In this review, we present a comprehensive overview of LLM benchmarks for causality. We highlight how recent benchmarks move towards a more thorough definition of causal reasoning by incorporating interventional or counterfactual reasoning. We derive a set of criteria that a useful benchmark or set of benchmarks should aim to satisfy. We hope this work will pave the way towards a general framework for the assessment of causal understanding in LLMs and the design of novel benchmarks.
Related papers
- MR-BEN: A Comprehensive Meta-Reasoning Benchmark for Large Language Models [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making.
We present a process-based benchmark that demands a meta reasoning skill.
MR-BEN is a comprehensive benchmark comprising 5,975 questions collected from human experts.
arXiv Detail & Related papers (2024-06-20T03:50:23Z) - ECBD: Evidence-Centered Benchmark Design for NLP [95.50252564938417]
We propose Evidence-Centered Benchmark Design (ECBD), a framework which formalizes the benchmark design process into five modules.
Each module requires benchmark designers to describe, justify, and support benchmark design choices.
Our analysis reveals common trends in benchmark design and documentation that could threaten the validity of benchmarks' measurements.
arXiv Detail & Related papers (2024-06-13T00:59:55Z) - The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models [94.31327813151208]
BiGGen Bench is a principled generation benchmark designed to thoroughly evaluate nine distinct capabilities of LMs across 77 diverse tasks.
A key feature of the BiGGen Bench is its use of instance-specific evaluation criteria, closely mirroring the nuanced discernment of human evaluation.
arXiv Detail & Related papers (2024-06-09T12:30:30Z) - Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models -- A Survey [25.732397636695882]
Large language models (LLMs) have recently shown impressive performance on tasks involving reasoning.
Despite these successes, the depth of LLMs' reasoning abilities remains uncertain.
arXiv Detail & Related papers (2024-04-02T11:46:31Z) - Making Reasoning Matter: Measuring and Improving Faithfulness of Chain-of-Thought Reasoning [38.60086807496399]
Large language models (LLMs) have been shown to perform better when asked to reason step-by-step before answering a question.
It is unclear to what degree the model's final answer is faithful to the stated reasoning steps.
We introduce FRODO, a framework to tailor small-sized LMs to generate correct reasoning steps and robustly reason over these steps.
arXiv Detail & Related papers (2024-02-21T17:23:59Z) - Comparing Inferential Strategies of Humans and Large Language Models in Deductive Reasoning [25.732397636695882]
We show that large language models (LLMs) display reasoning patterns akin to those observed in humans.
Our research demonstrates that the architecture and scale of the model significantly affect its preferred method of reasoning.
arXiv Detail & Related papers (2024-02-20T12:58:14Z) - InfiMM-Eval: Complex Open-Ended Reasoning Evaluation For Multi-Modal
Large Language Models [50.03163753638256]
Multi-modal Large Language Models (MLLMs) are increasingly prominent in the field of artificial intelligence.
Our benchmark comprises three key reasoning categories: deductive, abductive, and analogical reasoning.
We evaluate a selection of representative MLLMs using this rigorously developed open-ended multi-step elaborate reasoning benchmark.
arXiv Detail & Related papers (2023-11-20T07:06:31Z) - Don't Make Your LLM an Evaluation Benchmark Cheater [142.24553056600627]
Large language models(LLMs) have greatly advanced the frontiers of artificial intelligence, attaining remarkable improvement in model capacity.
To assess the model performance, a typical approach is to construct evaluation benchmarks for measuring the ability level of LLMs.
We discuss the potential risk and impact of inappropriately using evaluation benchmarks and misleadingly interpreting the evaluation results.
arXiv Detail & Related papers (2023-11-03T14:59:54Z) - Sentiment Analysis through LLM Negotiations [58.67939611291001]
A standard paradigm for sentiment analysis is to rely on a singular LLM and makes the decision in a single round.
This paper introduces a multi-LLM negotiation framework for sentiment analysis.
arXiv Detail & Related papers (2023-11-03T12:35:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.