Are Large Language Models Really Good Logical Reasoners? A Comprehensive Evaluation and Beyond
- URL: http://arxiv.org/abs/2306.09841v3
- Date: Tue, 8 Aug 2023 12:57:18 GMT
- Title: Are Large Language Models Really Good Logical Reasoners? A Comprehensive Evaluation and Beyond
- Authors: Fangzhi Xu, Qika Lin, Jiawei Han, Tianzhe Zhao, Jun Liu, Erik Cambria
- Abstract summary: Large Language Models (LLMs) have emerged as a noteworthy innovation in natural language processing (NLP). Whether they are truly capable logical reasoners, however, remains an open question; we aim to bridge this gap and provide comprehensive evaluations in this paper.
- Score: 32.797832207443896
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Logical reasoning consistently plays a fundamental and significant role in
the domains of knowledge engineering and artificial intelligence. Recently,
Large Language Models (LLMs) have emerged as a noteworthy innovation in natural
language processing (NLP), exhibiting impressive achievements across various
classic NLP tasks. However, whether LLMs can effectively address logical
reasoning, which requires gradual cognitive inference similar to human
intelligence, remains an open question. To this end, we aim to bridge this
gap and provide comprehensive evaluations in this paper.
Firstly, to offer systematic evaluations, we select fifteen typical logical
reasoning datasets and organize them into deductive, inductive, abductive and
mixed-form reasoning settings. To ensure comprehensive coverage, we include
three representative LLMs (i.e., text-davinci-003, ChatGPT and BARD) and
evaluate them on all selected datasets under zero-shot, one-shot and
three-shot settings. Secondly, unlike previous evaluations that rely only on
simple metrics (e.g., accuracy), we propose fine-grained evaluations from
both objective and subjective perspectives, covering both answers and
explanations. Additionally, to uncover the logical flaws of LLMs, problematic
cases are attributed to five error types across two dimensions, i.e., the
evidence selection process and the reasoning process. Thirdly, to avoid the
influence of knowledge bias and focus purely on benchmarking the logical
reasoning capability of LLMs, we propose a new dataset with neutral content. It
contains 3,000 samples and covers deductive, inductive and abductive settings.
Based on these in-depth evaluations, the paper finally distills a general
evaluation scheme of logical reasoning capability spanning six dimensions. It
reflects the strengths and weaknesses of LLMs and offers guiding directions
for future work.
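Concretely, the evaluation design above amounts to sweeping a grid of models x shot settings x datasets and scoring each cell. A minimal sketch of that loop follows; the `query_llm` and `load_examples` helpers and the dataset names are hypothetical placeholders, not the authors' code:

```python
# Hypothetical sketch of the models x shot-settings x datasets grid
# described in the abstract; query_llm and load_examples are assumed
# helpers, and the dataset names are placeholders.
from itertools import product

MODELS = ["text-davinci-003", "ChatGPT", "BARD"]  # the three evaluated LLMs
SHOTS = {"zero-shot": 0, "one-shot": 1, "three-shot": 3}
DATASETS = {                                      # 15 datasets, grouped by reasoning form
    "deductive": ["deductive_ds_1"],
    "inductive": ["inductive_ds_1"],
    "abductive": ["abductive_ds_1"],
    "mixed": ["mixed_ds_1"],
}

def build_prompt(demos: list[str], question: str) -> str:
    """Prepend the k in-context demonstrations to the test question."""
    return "\n\n".join(demos + [question])

def run_grid(query_llm, load_examples) -> dict:
    """Return accuracy per (model, shot setting, dataset) cell."""
    accuracy = {}
    for model, (setting, k) in product(MODELS, SHOTS.items()):
        for form, names in DATASETS.items():
            for name in names:
                hits, total = 0, 0
                for demos, question, gold in load_examples(name, k):
                    pred = query_llm(model, build_prompt(demos, question))
                    hits += int(pred.strip().lower() == gold.lower())
                    total += 1
                accuracy[(model, setting, name)] = hits / max(total, 1)
    return accuracy
```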
Related papers
- Dissecting Logical Reasoning in LLMs: A Fine-Grained Evaluation and Supervision Study [34.29839553042609]
We propose FineLogic, a fine-grained evaluation framework that assesses logical reasoning across three dimensions. We conduct a study on the effects of supervision format during fine-tuning. Our findings reveal that natural language supervision yields strong generalization even on out-of-distribution and long-context tasks.
arXiv Detail & Related papers (2025-06-05T09:34:12Z)
- Unveiling the Magic of Code Reasoning through Hypothesis Decomposition and Amendment [54.62926010621013]
We introduce a novel task, code reasoning, to provide a new perspective on the reasoning abilities of large language models.
We summarize three meta-benchmarks based on established forms of logical reasoning, and instantiate these into eight specific benchmark tasks.
We present a new pathway exploration pipeline inspired by how humans solve intricate problems.
arXiv Detail & Related papers (2025-02-17T10:39:58Z)
- Critical-Questions-of-Thought: Steering LLM reasoning with Argumentative Querying [0.3659498819753633]
State-of-the-art large language models (LLMs) continue to struggle when performing logical and mathematical reasoning.
This paper makes use of the notion of critical questions from the literature on argumentation theory, focusing in particular on Toulmin's model of argumentation.
We show that employing these critical questions can improve the reasoning capabilities of LLMs.
arXiv Detail & Related papers (2024-12-19T18:51:30Z)
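The mechanism in this entry is a prompting one: before the model commits to an answer, it is asked to audit its own argument with critical questions. A toy prompt builder in that spirit; the question list paraphrases generic critical questions from argumentation theory, and the paper's exact set may differ:

```python
# Toy prompt builder that steers a model with critical questions before
# it answers. The questions paraphrase generic critical questions from
# argumentation theory; the paper's exact set may differ.
CRITICAL_QUESTIONS = [
    "What evidence (data) supports your claim?",
    "Is the warrant linking the evidence to the claim valid?",
    "Are there rebuttals or exceptions that would defeat the claim?",
]

def critical_questions_prompt(problem: str) -> str:
    checks = "\n".join(f"- {q}" for q in CRITICAL_QUESTIONS)
    return (f"{problem}\n\n"
            f"Before giving a final answer, examine your own argument "
            f"against these critical questions:\n{checks}\n\nFinal answer:")

print(critical_questions_prompt(
    "If all A are B and some B are C, must some A be C?"))
```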
- Evaluating Human Alignment and Model Faithfulness of LLM Rationale [66.75309523854476]
We study how well large language models (LLMs) explain their generations through rationales.
We show that prompting-based explanations are less "faithful" than attribution-based ones.
arXiv Detail & Related papers (2024-06-28T20:06:30Z)
- Multi-LogiEval: Towards Evaluating Multi-Step Logical Reasoning Ability of Large Language Models [46.26140720993383]
Multi-LogiEval is a comprehensive evaluation dataset encompassing multi-step logical reasoning with various inference rules and depths.
We conduct evaluations on a range of Large Language Models including GPT-4, ChatGPT, Gemini-Pro, Yi, Orca, and Mistral.
arXiv Detail & Related papers (2024-06-24T23:02:56Z)
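The "depths" dimension in this entry has a simple reading: apply one inference rule d times in a row and ask for the end of the chain. A toy generator in that spirit; the wording and the single rule used here are illustrative, not Multi-LogiEval's actual templates:

```python
# Toy generator for a depth-d modus ponens chain, in the spirit of
# benchmarking "various inference rules and depths". The wording and
# the single rule are illustrative, not Multi-LogiEval's templates.
def modus_ponens_chain(depth: int) -> tuple[str, str]:
    props = [f"P{i}" for i in range(depth + 1)]
    premises = [f"If {a}, then {b}." for a, b in zip(props, props[1:])]
    premises.append(f"{props[0]} is true.")
    prompt = " ".join(premises) + " What can you conclude?"
    gold = f"{props[-1]} is true."
    return prompt, gold

prompt, gold = modus_ponens_chain(depth=3)
print(prompt)  # If P0, then P1. If P1, then P2. If P2, then P3. P0 is true. What can you conclude?
print(gold)    # P3 is true.
```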
- Disentangling Logic: The Role of Context in Large Language Model Reasoning Capabilities [31.728976421529577]
We investigate the contrast across abstract and contextualized logical problems from a comprehensive set of domains.
We focus on standard propositional logic, specifically propositional deductive and abductive logic reasoning.
Our experiments aim to provide insights into disentangling context in logical reasoning and the true reasoning capabilities of LLMs.
arXiv Detail & Related papers (2024-06-04T21:25:06Z)
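The abstract-versus-contextualized contrast studied here can be made concrete by instantiating the same propositional skeleton with and without real-world content. A toy illustration; the example sentences are invented:

```python
# Same propositional skeleton rendered abstractly and with real-world
# content; the example sentences are invented for this illustration.
def modus_ponens(p: str, q: str) -> str:
    cap = lambda s: s[0].upper() + s[1:]
    return f"If {p}, then {q}. {cap(p)}. Therefore, {q}."

print(modus_ponens("P", "Q"))
# If P, then Q. P. Therefore, Q.
print(modus_ponens("it rains", "the street gets wet"))
# If it rains, then the street gets wet. It rains. Therefore, the street gets wet.
```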
- LogicBench: Towards Systematic Evaluation of Logical Reasoning Ability of Large Language Models [52.03659714625452]
Recently developed large language models (LLMs) have been shown to perform remarkably well on a wide range of language understanding tasks.
But can they really "reason" over natural language?
This question has been receiving significant research attention and many reasoning skills such as commonsense, numerical, and qualitative have been studied.
arXiv Detail & Related papers (2024-04-23T21:08:49Z)
- LogicAsker: Evaluating and Improving the Logical Reasoning Ability of Large Language Models [63.14196038655506]
We introduce LogicAsker, a novel approach for evaluating and enhancing the logical reasoning capabilities of large language models (LLMs).
Our methodology reveals significant gaps in LLMs' learning of logical rules, with identified reasoning failures ranging from 29% to 90% across different models.
We leverage these findings to construct targeted demonstration examples and fine-tune data, notably enhancing logical reasoning in models like GPT-4o by up to 5%.
arXiv Detail & Related papers (2024-01-01T13:53:53Z)
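The improvement loop sketched in this entry is: find the logical rules a model fails on, then turn exactly those rules into few-shot demonstrations. A toy version; the rule texts are invented for illustration:

```python
# Toy version of turning failed logical rules into targeted few-shot
# demonstrations; the rule texts are invented for illustration.
RULE_DEMOS = {
    "modus_tollens": (
        "If it is a bird, then it can fly. It cannot fly. Is it a bird?", "No"),
    "disjunctive_syllogism": (
        "Either the key is here or it is lost. It is not here. Is it lost?", "Yes"),
}

def demos_for_failures(failed_rules: list[str]) -> str:
    """Build a few-shot prefix from the rules the model got wrong."""
    return "\n\n".join(
        f"Q: {q}\nA: {a}" for q, a in (RULE_DEMOS[r] for r in failed_rules))

print(demos_for_failures(["modus_tollens", "disjunctive_syllogism"]))
```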
- LINC: A Neurosymbolic Approach for Logical Reasoning by Combining Language Models with First-Order Logic Provers [60.009969929857704]
Logical reasoning is an important task for artificial intelligence with potential impacts on science, mathematics, and society.
In this work, we reformulate such tasks as modular neurosymbolic programming, which we call LINC.
We observe significant performance gains on FOLIO and a balanced subset of ProofWriter for three different models in nearly all experimental conditions we evaluate.
arXiv Detail & Related papers (2023-10-23T17:58:40Z)
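The modular split in LINC is: the LLM translates natural language into formal logic, and a symbolic prover does the inference. A minimal stand-in for the prover half, using propositional Horn clauses instead of the full first-order logic that LINC pairs with a real prover; the facts and rules below mimic a hypothetical LLM translation step:

```python
# Minimal forward-chaining entailment check over propositional Horn
# clauses, standing in for the symbolic-prover half of the pipeline.
# LINC itself uses a full first-order logic prover; the facts and rules
# below mimic a hypothetical LLM translation step.
def forward_chain(facts: set[str], rules: list[tuple[frozenset, str]], goal: str) -> bool:
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for body, head in rules:
            if body <= derived and head not in derived:
                derived.add(head)
                changed = True
    return goal in derived

facts = {"rainy"}                                  # "It is rainy."
rules = [(frozenset({"rainy"}), "wet_ground"),     # "If it is rainy, the ground is wet."
         (frozenset({"wet_ground"}), "slippery")]  # "If the ground is wet, it is slippery."
assert forward_chain(facts, rules, "slippery")     # the premises entail the goal
```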
- Towards LogiGLUE: A Brief Survey and A Benchmark for Analyzing Logical Reasoning Capabilities of Language Models [56.34029644009297]
Large language models (LLMs) have demonstrated the ability to overcome various limitations of formal Knowledge Representation (KR) systems.
LLMs excel most in abductive reasoning, followed by deductive reasoning, while they are least effective at inductive reasoning.
We study single-task training, multi-task training, and a "chain-of-thought" knowledge distillation fine-tuning technique to assess model performance.
arXiv Detail & Related papers (2023-10-02T01:00:50Z)
- Language Models Are Greedy Reasoners: A Systematic Formal Analysis of Chain-of-Thought [10.524051272257614]
Large language models (LLMs) have shown remarkable reasoning capabilities given chain-of-thought prompts.
We present a new synthetic question-answering dataset called PrOntoQA, where each example is generated as a synthetic world model.
This allows us to parse the generated chain-of-thought into symbolic proofs for formal analysis.
arXiv Detail & Related papers (2022-10-03T21:34:32Z)
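Because every PrOntoQA example comes from a known world model, a generated chain-of-thought can be checked step by step against that model. A toy version of such a checker; the ontology and sentence patterns are simplified stand-ins for PrOntoQA's actual grammar:

```python
import re

# Toy checker that parses a chain-of-thought into symbolic proof steps
# and validates each against a known world model, in the spirit of
# PrOntoQA's formal analysis. The ontology and sentence patterns are
# simplified stand-ins for PrOntoQA's actual grammar.
ONTOLOGY = {("cat", "mammal"), ("mammal", "animal")}   # (subtype, supertype)

def check_chain(initial_fact: tuple[str, str], chain: list[str]) -> bool:
    facts = {initial_fact}                             # e.g. ("Fido", "cat")
    for step in chain:
        m = re.fullmatch(r"(\w+) is an? (\w+)\.", step)
        if not m:
            return False                               # unparseable step
        entity, new_type = m.groups()
        known_types = {t for e, t in facts if e == entity}
        # valid iff one known fact plus one ontology rule yields the step
        if not any((t, new_type) in ONTOLOGY for t in known_types):
            return False
        facts.add((entity, new_type))
    return True

assert check_chain(("Fido", "cat"), ["Fido is a mammal.", "Fido is an animal."])
assert not check_chain(("Fido", "cat"), ["Fido is an animal."])  # skipped a step
```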