Liar, Liar, Logical Mire: A Benchmark for Suppositional Reasoning in Large Language Models
- URL: http://arxiv.org/abs/2406.12546v2
- Date: Mon, 07 Oct 2024 19:07:37 GMT
- Title: Liar, Liar, Logical Mire: A Benchmark for Suppositional Reasoning in Large Language Models
- Authors: Philipp Mondorf, Barbara Plank
- Abstract summary: We introduce $\textit{TruthQuest}$, a benchmark for suppositional reasoning based on the principles of knights and knaves puzzles.
Evaluations show that large language models like Llama 3 and Mixtral-8x7B exhibit significant difficulties solving these tasks.
- Abstract: Knights and knaves problems represent a classic genre of logical puzzles where characters either tell the truth or lie. The objective is to logically deduce each character's identity based on their statements. The challenge arises from the truth-telling or lying behavior, which influences the logical implications of each statement. Solving these puzzles requires not only direct deductions from individual statements, but also the ability to assess the truthfulness of statements by reasoning through various hypothetical scenarios. As such, knights and knaves puzzles serve as compelling examples of suppositional reasoning. In this paper, we introduce $\textit{TruthQuest}$, a benchmark for suppositional reasoning based on the principles of knights and knaves puzzles. Our benchmark presents problems of varying complexity, considering both the number of characters and the types of logical statements involved. Evaluations on $\textit{TruthQuest}$ show that large language models like Llama 3 and Mixtral-8x7B exhibit significant difficulties solving these tasks. A detailed error analysis of the models' output reveals that lower-performing models exhibit a diverse range of reasoning errors, frequently failing to grasp the concept of truth and lies. In comparison, more proficient models primarily struggle with accurately inferring the logical implications of potentially false statements.
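To make the suppositional procedure concrete, here is a minimal sketch (illustrative only; the puzzle and code are not from $\textit{TruthQuest}$) that solves a two-character knights-and-knaves puzzle by enumerating every knight/knave assignment and keeping those under which each statement's truth value matches its speaker's type:

```python
from itertools import product

# Each statement is a (speaker, claim) pair; a claim maps an identity
# assignment (True = knight, False = knave) to the claim's truth value.
# Illustrative puzzle: A says "B is a knave"; B says "A and I are alike".
statements = [
    ("A", lambda w: not w["B"]),
    ("B", lambda w: w["A"] == w["B"]),
]

def solve(statements, characters=("A", "B")):
    """Return all identity assignments consistent with every statement."""
    solutions = []
    for values in product([True, False], repeat=len(characters)):
        world = dict(zip(characters, values))
        # Suppositional step: under this hypothesis, a knight's claim
        # must come out true and a knave's claim must come out false.
        if all(claim(world) == world[speaker] for speaker, claim in statements):
            solutions.append(world)
    return solutions

print(solve(statements))  # [{'A': True, 'B': False}]: A is a knight, B a knave
```

The brute-force search stands in for the hypothetical reasoning the benchmark probes: each candidate world is a supposition, and any statement whose truth value disagrees with its speaker's type refutes that supposition.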
Related papers
- On Memorization of Large Language Models in Logical Reasoning [70.94164038947078]
Large language models (LLMs) achieve good performance on challenging reasoning benchmarks, yet can also make basic reasoning mistakes.
One hypothesis is that the increasingly high and nearly saturated performance could be due to the memorization of similar problems.
We show that fine-tuning leads to heavy memorization, but it also consistently improves generalization performance.
arXiv Detail & Related papers (2024-10-30T15:31:54Z)
- LogicBench: Towards Systematic Evaluation of Logical Reasoning Ability of Large Language Models [52.03659714625452]
Recently developed large language models (LLMs) have been shown to perform remarkably well on a wide range of language understanding tasks.
But can they really "reason" over natural language?
This question has received significant research attention, and many reasoning skills, such as commonsense, numerical, and qualitative reasoning, have been studied.
arXiv Detail & Related papers (2024-04-23T21:08:49Z)
- NL2FOL: Translating Natural Language to First-Order Logic for Logical Fallacy Detection [45.28949266878263]
We design a process to reliably detect logical fallacies by translating natural language to First-order Logic.
We then utilize Satisfiability Modulo Theories (SMT) solvers to reason about the validity of the formula.
Our approach is robust, interpretable and does not require training data or fine-tuning.
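As a rough illustration of the SMT step (a sketch under assumed details, not the NL2FOL pipeline itself; it presumes the `z3-solver` Python package), an argument is valid iff its premises together with the negated conclusion are unsatisfiable, so any model the solver returns is a counterexample exposing the fallacy:

```python
from z3 import Bools, Implies, Not, Solver, sat

# Toy formalization of "affirming the consequent": P -> Q, Q, therefore P.
P, Q = Bools("P Q")
premises = [Implies(P, Q), Q]
conclusion = P

# Valid iff premises AND NOT(conclusion) is unsatisfiable.
s = Solver()
s.add(*premises, Not(conclusion))
if s.check() == sat:
    print("Invalid argument; counterexample:", s.model())  # P=False, Q=True
else:
    print("Valid argument")
```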
arXiv Detail & Related papers (2024-04-18T00:20:48Z)
- Language Models can be Logical Solvers [99.40649402395725]
We introduce LoGiPT, a novel language model that directly emulates the reasoning processes of logical solvers.
LoGiPT is fine-tuned on a newly constructed instruction-tuning dataset derived from revealing and refining the invisible reasoning process of deductive solvers.
arXiv Detail & Related papers (2023-11-10T16:23:50Z)
- Empower Nested Boolean Logic via Self-Supervised Curriculum Learning [67.46052028752327]
We find that pre-trained language models, including even large language models, behave like a random selector when faced with multi-nested Boolean logic.
To empower language models with this fundamental capability, this paper proposes a new self-supervised learning method, $\textit{Curriculum Logical Reasoning}$ ($\textsc{Clr}$).
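For intuition about what multi-nested Boolean logic looks like (an illustrative generator, not the paper's $\textsc{Clr}$ data or method), expressions can be built by recursively wrapping sub-expressions in and/or/not and evaluated mechanically for ground truth:

```python
import random

def nested_expr(depth, rng):
    """Build a randomly nested Boolean expression as a Python string."""
    if depth == 0:
        return rng.choice(["True", "False"])
    op = rng.choice(["and", "or", "not"])
    if op == "not":
        return f"(not {nested_expr(depth - 1, rng)})"
    return f"({nested_expr(depth - 1, rng)} {op} {nested_expr(depth - 1, rng)})"

rng = random.Random(0)
expr = nested_expr(4, rng)
print(expr, "=>", eval(expr))  # Python's evaluator supplies the label
```

Even at modest depth, correctly propagating negations through such expressions is exactly the capability the paper reports as near-random in pre-trained models.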
arXiv Detail & Related papers (2023-10-09T06:54:02Z)
- Measuring reasoning capabilities of ChatGPT [1.3597551064547502]
I shall quantify the logical faults generated by ChatGPT when applied to reasoning tasks.
The library contains puzzles of various types, including arithmetic puzzles, logical equations, Sudoku-like puzzles, zebra-like puzzles, truth-telling puzzles, grid puzzles, strange numbers, and self-reference puzzles.
arXiv Detail & Related papers (2023-10-08T20:18:50Z)
- MetaLogic: Logical Reasoning Explanations with Fine-Grained Structure [129.8481568648651]
We propose a benchmark to investigate models' logical reasoning capabilities in complex real-life scenarios.
The explanation form is based on a multi-hop chain of reasoning and includes three main components.
We evaluate the current best models' performance on this new explanation form.
arXiv Detail & Related papers (2022-10-22T16:01:13Z)
- Logic Embeddings for Complex Query Answering [56.25151854231117]
We propose Logic Embeddings, a new approach to embedding complex queries that uses Skolemisation to eliminate existential variables for efficient querying.
We show that Logic Embeddings are competitively fast and accurate in query answering over large, incomplete knowledge graphs, outperform prior methods on negation queries, and, in particular, provide improved modeling of answer uncertainty.
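Skolemisation, in general, replaces each existentially quantified variable with a fresh function of the universal variables in whose scope it occurs (a standard illustration, not a formula from the paper):

$$\forall y\,\exists x\; R(x, y) \;\longrightarrow\; \forall y\; R(f(y), y),$$

where $f$ is a fresh Skolem function, so queries can be answered without searching over values of $x$.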
arXiv Detail & Related papers (2021-02-28T07:52:37Z)
- Notes on neighborhood semantics for logics of unknown truths and false beliefs [1.827510863075184]
We study logics of unknown truths and false beliefs under neighborhood semantics.
It turns out that the two logics are incomparable over various classes of neighborhood models.
We extend the results to the case of public announcements.
arXiv Detail & Related papers (2020-02-22T04:27:04Z)
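As a closing illustration of neighborhood semantics (a minimal sketch under standard definitions, not code from the paper): a neighborhood model assigns each world $w$ a family $N(w)$ of sets of worlds, and $\Box\varphi$ holds at $w$ iff the truth set of $\varphi$ belongs to $N(w)$.

```python
# Minimal neighborhood model (illustrative): two worlds, a neighborhood
# map N, and a valuation for atomic propositions.
worlds = {1, 2}
N = {1: {frozenset({1})}, 2: set()}   # N(w): family of sets of worlds
val = {"p": frozenset({1})}           # worlds where atom p holds

def truth_set(phi):
    """Worlds where phi holds; phi is an atom or ('box', subformula)."""
    if isinstance(phi, str):
        return val[phi]
    op, sub = phi
    assert op == "box"
    ts = truth_set(sub)
    return frozenset(w for w in worlds if ts in N[w])

print(truth_set(("box", "p")))  # frozenset({1}): box p holds only at world 1
```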