Benchmarking Defeasible Reasoning with Large Language Models -- Initial Experiments and Future Directions
- URL: http://arxiv.org/abs/2410.12509v1
- Date: Wed, 16 Oct 2024 12:36:23 GMT
- Title: Benchmarking Defeasible Reasoning with Large Language Models -- Initial Experiments and Future Directions
- Authors: Ilias Tachmazidis, Sotiris Batsakis, Grigoris Antoniou
- Abstract summary: This paper proposes a benchmark that corresponds to various defeasible rule-based reasoning patterns.
We modified an existing benchmark for defeasible logic reasoners by translating defeasible rules into text suitable for Large Language Models.
We conducted preliminary experiments on nonmonotonic rule-based reasoning using ChatGPT and compared it with reasoning patterns defined by defeasible logic.
- Abstract: Large Language Models (LLMs) have gained prominence in the AI landscape due to their exceptional performance. It is therefore essential to gain a better understanding of their capabilities and limitations, including their capacity for nonmonotonic reasoning. This paper proposes a benchmark that corresponds to various defeasible rule-based reasoning patterns. We modified an existing benchmark for defeasible logic reasoners by translating defeasible rules into text suitable for LLMs. We conducted preliminary experiments on nonmonotonic rule-based reasoning using ChatGPT and compared it with reasoning patterns defined by defeasible logic.
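The abstract does not show how the rules are rendered as text, so the following is only a minimal sketch of the general idea; the rule encoding and English templates are assumptions for illustration, not the paper's actual benchmark format.

```python
# Minimal sketch of translating defeasible rules into LLM-readable text.
# The rule encoding and English templates below are invented for
# illustration; they are not the paper's actual benchmark format.

# Defeasible logic distinguishes strict rules (always hold) from defeasible
# rules (hold by default but can be overridden by conflicting evidence).
RULES = [
    ("strict", "penguin", "bird"),      # penguin(X) -> bird(X)
    ("defeasible", "bird", "fly"),      # bird(X) => fly(X)
    ("defeasible", "penguin", "~fly"),  # penguin(X) => ~fly(X), overrides the above
]

def rule_to_text(kind: str, body: str, head: str) -> str:
    """Render a single rule as an English sentence."""
    negated = head.startswith("~")
    head = head.lstrip("~")
    if kind == "strict":
        return f"Every {body} is a {head}."
    if negated:
        return f"Typically, a {body} cannot {head}."
    return f"Typically, a {body} can {head}."

if __name__ == "__main__":
    for rule in RULES:
        print(rule_to_text(*rule))
    # A benchmark item would then append a fact ("Tweety is a penguin.") and
    # ask the model whether "Tweety can fly" should be concluded.
```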
Related papers
- Rulebreakers Challenge: Revealing a Blind Spot in Large Language Models' Reasoning with Formal Logic
This study introduces the concept of "rulebreakers", which refers to instances where logical entailment diverges from factually acceptable inference.
We present RULEBREAKERS, a novel dataset for evaluating Large Language Models' ability to distinguish between rulebreakers and non-rulebreakers.
arXiv Detail & Related papers (2024-10-21T20:48:16Z)
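To make the "rulebreaker" notion concrete, here is a hypothetical item in that spirit; it is invented for illustration and not drawn from the RULEBREAKERS dataset.

```python
# Hypothetical example (not from the dataset): modus tollens licenses the
# conclusion, but world knowledge rejects the universal premise, so a
# factually grounded reasoner should refuse the inference.
example = {
    "premises": [
        "If a person lives in the United States, then they live in New York.",
        "Alex does not live in New York.",
    ],
    "entailed_conclusion": "Alex does not live in the United States.",
    "factually_acceptable": False,  # the first premise is false of the real world
    "label": "rulebreaker",         # entailment and acceptable inference diverge
}
```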
- P-FOLIO: Evaluating and Improving Logical Reasoning with Abundant Human-Written Reasoning Chains
We present P-FOLIO, a human-annotated dataset consisting of diverse and complex reasoning chains.
We use P-FOLIO to evaluate and improve large-language-model (LLM) reasoning capabilities.
arXiv Detail & Related papers (2024-10-11T19:22:57Z)
- LogicBench: Towards Systematic Evaluation of Logical Reasoning Ability of Large Language Models
Recently developed large language models (LLMs) have been shown to perform remarkably well on a wide range of language understanding tasks.
But can they really "reason" over natural language?
This question has been receiving significant research attention, and many reasoning skills, such as commonsense, numerical, and qualitative reasoning, have been studied.
arXiv Detail & Related papers (2024-04-23T21:08:49Z)
- Can LLMs Reason with Rules? Logic Scaffolding for Stress-Testing and Improving LLMs
Large language models (LLMs) have achieved impressive human-like performance across various reasoning tasks.
However, their mastery of underlying inferential rules still falls short of human capabilities.
We propose a logic scaffolding inferential rule generation framework to construct an inferential rule base, ULogic.
arXiv Detail & Related papers (2024-02-18T03:38:51Z)
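The abstract gives no implementation detail, but rule bases of this kind are often built by composing primitive rules into compound ones; the sketch below illustrates that general idea with an invented encoding, not ULogic's actual format.

```python
# Toy illustration of composing primitive inferential rules into a compound
# rule; the encoding is invented here and is not ULogic's actual format.
from typing import NamedTuple, Optional

class Rule(NamedTuple):
    premises: tuple  # predicate strings such as "rich(X)"
    conclusion: str

def compose(r1: Rule, r2: Rule) -> Optional[Rule]:
    """Chain r1 into r2 when r1's conclusion discharges one of r2's premises."""
    if r1.conclusion not in r2.premises:
        return None
    rest = tuple(p for p in r2.premises if p != r1.conclusion)
    return Rule(r1.premises + rest, r2.conclusion)

buys = Rule(("wants(X, Y)", "can_afford(X, Y)"), "buys(X, Y)")
afford = Rule(("rich(X)", "cheap(Y)"), "can_afford(X, Y)")
print(compose(afford, buys))
# Rule(premises=('rich(X)', 'cheap(Y)', 'wants(X, Y)'), conclusion='buys(X, Y)')
```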
- LogicAsker: Evaluating and Improving the Logical Reasoning Ability of Large Language Models
We introduce LogicAsker, a novel approach for evaluating and enhancing the logical reasoning capabilities of large language models (LLMs).
Our methodology reveals significant gaps in LLMs' learning of logical rules, with identified reasoning failures ranging from 29% to 90% across different models.
We leverage these findings to construct targeted demonstration examples and fine-tuning data, notably enhancing logical reasoning in models like GPT-4o by up to 5%.
arXiv Detail & Related papers (2024-01-01T13:53:53Z)
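As a rough sketch of what generating a test case from one atomic logical skill could look like (the templates and vocabulary here are invented, not LogicAsker's own code), consider:

```python
# Rough sketch, not LogicAsker's actual code: instantiate an atomic inference
# rule (modus ponens) with fresh content words to produce a test prompt whose
# gold answer is fixed by the rule rather than by world knowledge.
import random

SUBJECTS = ["the metal", "the liquid", "the crystal"]
PREDICATES = ["is heated", "expands", "glows", "melts"]

def modus_ponens_case(rng: random.Random) -> dict:
    subject = rng.choice(SUBJECTS)
    p, q = rng.sample(PREDICATES, 2)
    prompt = (
        f"If {subject} {p}, then {subject} {q}. "
        f"{subject.capitalize()} {p}. Does it follow that {subject} {q}?"
    )
    return {"prompt": prompt, "expected": "yes", "skill": "modus ponens"}

print(modus_ponens_case(random.Random(0))["prompt"])
```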
- Assessing Step-by-Step Reasoning against Lexical Negation: A Case Study on Syllogism
Large language models (LLMs) take advantage of step-by-step reasoning instructions, e.g., chain-of-thought (CoT) prompting.
In this study, we inspect the step-by-step reasoning ability of LLMs with a focus on negation.
arXiv Detail & Related papers (2023-10-23T12:40:41Z)
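The study's actual stimuli are more controlled, but a probe in this general spirit might pair a valid negative syllogism with a chain-of-thought instruction; the wording below is invented, not taken from the paper's dataset.

```python
# Invented probe in the spirit of the study (not from its dataset): a valid
# syllogism whose first premise is negated, posed with a CoT instruction.
prompt = (
    "Premise 1: No reptiles are warm-blooded.\n"
    "Premise 2: All snakes are reptiles.\n"
    "Question: Are any snakes warm-blooded? Let's think step by step."
)
expected = "no"  # valid by the syllogistic form Celarent (no M are P, all S are M)
print(prompt)
```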
- Phenomenal Yet Puzzling: Testing Inductive Reasoning Capabilities of Language Models with Hypothesis Refinement
Language models (LMs) often fall short on inductive reasoning, despite achieving impressive success on research benchmarks.
We conduct a systematic study of the inductive reasoning capabilities of LMs through iterative hypothesis refinement.
We reveal several discrepancies between the inductive reasoning processes of LMs and humans, shedding light on both the potential and the limitations of using LMs in inductive reasoning tasks.
arXiv Detail & Related papers (2023-10-12T17:51:10Z)
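Abstractly, iterative hypothesis refinement can be read as a propose-test-revise loop; the sketch below is a schematic reading of that recipe, with `llm` standing in for any text-completion callable. It is an assumption about the general approach, not the paper's actual pipeline.

```python
# Schematic propose-test-revise loop; `llm` is a placeholder callable mapping
# a prompt string to a completion string, not the paper's actual pipeline.
def refine_iteratively(llm, examples, max_rounds: int = 5) -> str:
    hypothesis = llm(f"Propose a rule explaining these examples: {examples}")
    for _ in range(max_rounds):
        failures = [ex for ex in examples if not covers(llm, hypothesis, ex)]
        if not failures:
            break  # hypothesis accounts for every observed example
        hypothesis = llm(
            f"The rule '{hypothesis}' fails on {failures}. Propose a revised rule."
        )
    return hypothesis

def covers(llm, hypothesis: str, example) -> bool:
    """Ask the model whether the hypothesis correctly predicts one example."""
    reply = llm(f"Does the rule '{hypothesis}' correctly predict {example}? Answer yes or no.")
    return reply.strip().lower().startswith("yes")
```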
- Towards LogiGLUE: A Brief Survey and A Benchmark for Analyzing Logical Reasoning Capabilities of Language Models
Large language models (LLMs) have demonstrated the ability to overcome various limitations of formal Knowledge Representation (KR) systems.
LLMs excel most in abductive reasoning, followed by deductive reasoning, while they are least effective at inductive reasoning.
We study single-task training, multi-task training, and a "chain-of-thought" knowledge distillation fine-tuning technique to assess model performance.
arXiv Detail & Related papers (2023-10-02T01:00:50Z)
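Chain-of-thought knowledge distillation is commonly implemented by fine-tuning a student on a teacher's rationales; the sketch below shows that generic recipe, with `teacher` as a placeholder callable. This is an assumption about the standard technique, not the survey's exact setup.

```python
# Generic CoT distillation data construction (an assumption, not the survey's
# exact recipe): collect a teacher model's step-by-step rationales and
# fine-tune a smaller student to reproduce rationale plus answer.
def build_distillation_set(teacher, questions):
    records = []
    for q in questions:
        rationale = teacher(f"{q}\nLet's think step by step.")
        records.append({"input": q, "target": rationale})
    return records  # feed to any standard seq2seq / causal-LM fine-tuning loop
```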
- RobustLR: Evaluating Robustness to Logical Perturbation in Deductive Reasoning
Transformers have been shown to perform deductive reasoning over logical rulebases containing rules and statements written in natural English.
We propose RobustLR to evaluate the robustness of these models to minimal logical edits in rulebases.
We find that the models trained in prior works do not perform consistently on the different perturbations in RobustLR.
arXiv Detail & Related papers (2022-05-25T09:23:50Z)
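As an illustration of the kind of minimal logical edit RobustLR probes (the sentences below are invented, though RobustLR-style rulebases read similarly), flipping a single connective changes what is deducible while barely changing the surface text:

```python
# Invented illustration of a minimal logical edit: swapping one connective
# changes the deductive consequences while the text stays almost identical.
original = "If Alan is rough and Alan is big, then Alan is blue."
perturbed = original.replace(" and ", " or ", 1)

facts = ["Alan is rough."]
# Under `original`, nothing follows: "Alan is big" is unproven.
# Under `perturbed`, "Alan is blue" follows from "Alan is rough" alone.
# A robust model should change its answer accordingly; RobustLR checks this.
print(perturbed)
```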
This list is automatically generated from the titles and abstracts of the papers on this site.