Measuring reasoning capabilities of ChatGPT
- URL: http://arxiv.org/abs/2310.05993v1
- Date: Sun, 8 Oct 2023 20:18:50 GMT
- Title: Measuring reasoning capabilities of ChatGPT
- Authors: Adrian Groza
- Abstract summary: I shall quantify the logical faults generated by ChatGPT when applied to reasoning tasks.
The library contains puzzles of various types, including arithmetic puzzles, logical equations, Sudoku-like puzzles, zebra-like puzzles, truth-telling puzzles, grid puzzles, strange numbers, or self-reference puzzles.
- Score: 1.3597551064547502
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: I shall quantify the logical faults generated by ChatGPT when applied to
reasoning tasks. For experiments, I use the 144 puzzles from the library
\url{https://users.utcluj.ro/~agroza/puzzles/maloga}~\cite{groza:fol}. The
library contains puzzles of various types, including arithmetic puzzles,
logical equations, Sudoku-like puzzles, zebra-like puzzles, truth-telling
puzzles, grid puzzles, strange numbers, or self-reference puzzles. The correct
solutions for these puzzles were checked using the theorem prover
Prover9~\cite{mccune2005release} and the finite models finder
Mace4~\cite{mccune2003mace4} based on human-modelling in Equational First Order
Logic. A first output of this study is the benchmark of 100 logical puzzles.
For this dataset ChatGPT provided both correct answer and justification for 7\%
only. %, while BARD for 5\%. Since the dataset seems challenging, the
researchers are invited to test the dataset on more advanced or tuned models
than ChatGPT3.5 with more crafted prompts. A second output is the
classification of reasoning faults conveyed by ChatGPT. This classification
forms a basis for a taxonomy of reasoning faults generated by large language
models. I have identified 67 such logical faults, among which: inconsistencies,
implication does not hold, unsupported claim, lack of commonsense, wrong
justification. The 100 solutions generated by ChatGPT contain 698 logical
faults. That is on average, 7 fallacies for each reasoning task. A third ouput
is the annotated answers of the ChatGPT with the corresponding logical faults.
Each wrong statement within the ChatGPT answer was manually annotated, aiming
to quantify the amount of faulty text generated by the language model. On
average, 26.03\% from the generated text was a logical fault.
Related papers
- On Memorization of Large Language Models in Logical Reasoning [70.94164038947078]
Large language models (LLMs) achieve good performance on challenging reasoning benchmarks, yet could also make basic reasoning mistakes.
One hypothesis is that the increasingly high and nearly saturated performance could be due to the memorization of similar problems.
We show that fine-tuning leads to heavy memorization, but it also consistently improves generalization performance.
arXiv Detail & Related papers (2024-10-30T15:31:54Z) - Liar, Liar, Logical Mire: A Benchmark for Suppositional Reasoning in Large Language Models [25.732397636695882]
We introduce $textitTruthQuest$, a benchmark for suppositional reasoning based on the principles of knights and knaves puzzles.
Evaluations show that large language models like Llama 3 and Mixtral-8x7B exhibit significant difficulties solving these tasks.
arXiv Detail & Related papers (2024-06-18T12:24:22Z) - Logic Query of Thoughts: Guiding Large Language Models to Answer Complex Logic Queries with Knowledge Graphs [102.37496443389203]
'Logic-Query-of-Thoughts' (LGOT) is the first of its kind to combine knowledge graph reasoning and large language models.
Our experimental findings demonstrate substantial performance enhancements, with up to 20% improvement over ChatGPT.
arXiv Detail & Related papers (2024-03-17T17:01:45Z) - Language Models can be Logical Solvers [99.40649402395725]
We introduce LoGiPT, a novel language model that directly emulates the reasoning processes of logical solvers.
LoGiPT is fine-tuned on a newly constructed instruction-tuning dataset derived from revealing and refining the invisible reasoning process of deductive solvers.
arXiv Detail & Related papers (2023-11-10T16:23:50Z) - LINC: A Neurosymbolic Approach for Logical Reasoning by Combining
Language Models with First-Order Logic Provers [60.009969929857704]
Logical reasoning is an important task for artificial intelligence with potential impacts on science, mathematics, and society.
In this work, we reformulating such tasks as modular neurosymbolic programming, which we call LINC.
We observe significant performance gains on FOLIO and a balanced subset of ProofWriter for three different models in nearly all experimental conditions we evaluate.
arXiv Detail & Related papers (2023-10-23T17:58:40Z) - Solving and Generating NPR Sunday Puzzles with Large Language Models [0.0]
State-of-the-art large language models can solve many PUZZLEQA puzzles.
The best model achieves, GPT-3.5, 50.2% loose accuracy.
GPT-3.5 generates puzzles with answers that do not conform to the generated rules.
arXiv Detail & Related papers (2023-06-21T13:23:48Z) - True Detective: A Deep Abductive Reasoning Benchmark Undoable for GPT-3
and Challenging for GPT-4 [0.0]
Large language models (LLMs) have demonstrated solid zero-shot reasoning capabilities.
In this paper, we introduce such a benchmark, consisting of 191 long-form (1200 words on average) mystery narratives constructed as detective puzzles.
We show that GPT-3 models barely outperform random on this benchmark (with 28% accuracy) while state-of-the-art GPT-4 solves only 38% of puzzles.
arXiv Detail & Related papers (2022-12-20T09:34:43Z) - Making Large Language Models Better Reasoners with Step-Aware Verifier [49.16750018427259]
DIVERSE (Diverse Verifier on Reasoning Step) is a novel approach that further enhances the reasoning capability of language models.
We evaluate DIVERSE on the latest language model code-davinci and show that it achieves new state-of-the-art results on six of eight reasoning benchmarks.
arXiv Detail & Related papers (2022-06-06T03:38:36Z) - Natural language understanding for logical games [0.9594432031144714]
We developed a system able to automatically solve logical puzzles in natural language.
Our solution is composed by a and an inference module.
We also empower our software agent with the capability to provide Yes/No answers to natural language questions related to each puzzle.
arXiv Detail & Related papers (2021-10-01T17:36:14Z) - A new perspective of paramodulation complexity by solving massive 8
puzzles [0.4514386953429769]
A sliding puzzle is a combination puzzle where a player slide pieces along certain routes on a board to reach a certain end-configuration.
It turned out that by counting the number of clauses yielded with paramodulation, we can evaluate the difficulty of each puzzle.
arXiv Detail & Related papers (2020-12-15T11:47:47Z) - PuzzLing Machines: A Challenge on Learning From Small Data [64.513459448362]
We introduce a challenge on learning from small data, PuzzLing Machines, which consists of Rosetta Stone puzzles from Linguistic Olympiads for high school students.
Our challenge contains around 100 puzzles covering a wide range of linguistic phenomena from 81 languages.
We show that both simple statistical algorithms and state-of-the-art deep neural models perform inadequately on this challenge, as expected.
arXiv Detail & Related papers (2020-04-27T20:34:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.