Do Large Language Models Truly Grasp Mathematics? An Empirical Exploration From Cognitive Psychology
- URL: http://arxiv.org/abs/2410.14979v5
- Date: Fri, 15 Nov 2024 12:46:30 GMT
- Title: Do Large Language Models Truly Grasp Mathematics? An Empirical Exploration From Cognitive Psychology
- Authors: Wei Xie, Shuoyoucheng Ma, Zhenhua Wang, Enze Wang, Kai Chen, Xiaobing Sun, Baosheng Wang,
- Abstract summary: We show that, even with the use of Chains of Thought prompts, mainstream LLMs have a high error rate when solving modified CRT problems.
Specifically, the average accuracy rate dropped by up to 50% compared to the original questions.
This finding challenges the belief that LLMs have genuine mathematical reasoning abilities comparable to humans.
- Score: 13.964263002704582
- License:
- Abstract: The cognitive mechanism by which Large Language Models (LLMs) solve mathematical problems remains a widely debated and unresolved issue. Currently, there is little interpretable experimental evidence that connects LLMs' problem-solving with human cognitive psychology.To determine if LLMs possess human-like mathematical reasoning, we modified the problems used in the human Cognitive Reflection Test (CRT). Our results show that, even with the use of Chains of Thought (CoT) prompts, mainstream LLMs, including the latest o1 model (noted for its reasoning capabilities), have a high error rate when solving these modified CRT problems. Specifically, the average accuracy rate dropped by up to 50% compared to the original questions.Further analysis of LLMs' incorrect answers suggests that they primarily rely on pattern matching from their training data, which aligns more with human intuition (System 1 thinking) rather than with human-like reasoning (System 2 thinking). This finding challenges the belief that LLMs have genuine mathematical reasoning abilities comparable to humans. As a result, this work may adjust overly optimistic views on LLMs' progress towards artificial general intelligence.
Related papers
- On Memorization of Large Language Models in Logical Reasoning [70.94164038947078]
Large language models (LLMs) achieve good performance on challenging reasoning benchmarks, yet could also make basic reasoning mistakes.
One hypothesis is that the increasingly high and nearly saturated performance could be due to the memorization of similar problems.
We show that fine-tuning leads to heavy memorization, but it also consistently improves generalization performance.
arXiv Detail & Related papers (2024-10-30T15:31:54Z) - LLM The Genius Paradox: A Linguistic and Math Expert's Struggle with Simple Word-based Counting Problems [28.72485319617863]
LLMs struggle with some basic tasks that humans find trivial to handle, e.g., counting the number of character r's in the wordstrawberry.
We measure transferability of advanced mathematical and coding reasoning capabilities from specialized LLMs to simple counting tasks.
Compared with strategies such as finetuning and in-context learning, we show that engaging reasoning is the most robust and efficient way to help LLMs better perceive tasks.
arXiv Detail & Related papers (2024-10-18T04:17:16Z) - Not All LLM Reasoners Are Created Equal [58.236453890457476]
We study the depth of grade-school math problem-solving capabilities of LLMs.
We evaluate their performance on pairs of existing math word problems together.
arXiv Detail & Related papers (2024-10-02T17:01:10Z) - Cognitive phantoms in LLMs through the lens of latent variables [0.3441021278275805]
Large language models (LLMs) increasingly reach real-world applications, necessitating a better understanding of their behaviour.
Recent studies administering psychometric questionnaires to LLMs report human-like traits in LLMs, potentially influencing behaviour.
This approach suffers from a validity problem: it presupposes that these traits exist in LLMs and that they are measurable with tools designed for humans.
This study investigates this problem by comparing latent structures of personality between humans and three LLMs using two validated personality questionnaires.
arXiv Detail & Related papers (2024-09-06T12:42:35Z) - Achieving >97% on GSM8K: Deeply Understanding the Problems Makes LLMs Better Solvers for Math Word Problems [50.76385564061713]
Chain-of-Thought (CoT) prompting has enhanced the performance of Large Language Models (LLMs) across various reasoning tasks.
CoT usually suffers from three pitfalls: semantic misunderstanding errors, calculation errors, and step-missing errors.
We propose Deeply Understanding the Problems (DUP) to improve the LLMs' math problem-solving ability by addressing semantic misunderstanding errors.
arXiv Detail & Related papers (2024-04-23T12:16:05Z) - GSM-Plus: A Comprehensive Benchmark for Evaluating the Robustness of LLMs as Mathematical Problem Solvers [68.77382332826167]
Large language models (LLMs) have achieved impressive performance across various mathematical reasoning benchmarks.
One essential and frequently occurring evidence is that when the math questions are slightly changed, LLMs can behave incorrectly.
This motivates us to evaluate the robustness of LLMs' math reasoning capability by testing a wide range of question variations.
arXiv Detail & Related papers (2024-02-29T15:26:14Z) - Do Language Models Exhibit the Same Cognitive Biases in Problem Solving as Human Learners? [140.9751389452011]
We study the biases of large language models (LLMs) in relation to those known in children when solving arithmetic word problems.
We generate a novel set of word problems for each of these tests, using a neuro-symbolic approach that enables fine-grained control over the problem features.
arXiv Detail & Related papers (2024-01-31T18:48:20Z) - Large Language Models are Better Reasoners with Self-Verification [48.534270563880845]
Large language models (LLMs) have shown strong reasoning ability in several natural language processing tasks.
LLMs with chain of thought (CoT) prompting require multi-step prompting and multi-token prediction, which is highly sensitive to individual mistakes.
We propose and prove that LLMs also have similar self-verification abilities.
arXiv Detail & Related papers (2022-12-19T15:51:52Z) - Thinking Fast and Slow in Large Language Models [0.08057006406834465]
Large language models (LLMs) are currently at the forefront of intertwining AI systems with human communication and everyday life.
In this study, we show that LLMs like GPT-3 exhibit behavior that resembles human-like intuition - and the cognitive errors that come with it.
arXiv Detail & Related papers (2022-12-10T05:07:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.