Decoding Stumpers: Large Language Models vs. Human Problem-Solvers
- URL: http://arxiv.org/abs/2310.16411v1
- Date: Wed, 25 Oct 2023 06:54:39 GMT
- Title: Decoding Stumpers: Large Language Models vs. Human Problem-Solvers
- Authors: Alon Goldstein, Miriam Havin, Roi Reichart and Ariel Goldstein
- Abstract summary: We compare the performance of four state-of-the-art Large Language Models to human participants.
New-generation LLMs excel in solving stumpers and surpass human performance.
Humans exhibit superior skills in verifying solutions to the same problems.
- Score: 14.12892960275563
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper investigates the problem-solving capabilities of Large Language
Models (LLMs) by evaluating their performance on stumpers, unique single-step
intuition problems that pose challenges for human solvers but are easily
verifiable. We compare the performance of four state-of-the-art LLMs
(Davinci-2, Davinci-3, GPT-3.5-Turbo, GPT-4) to human participants. Our
findings reveal that the new-generation LLMs excel in solving stumpers and
surpass human performance. However, humans exhibit superior skills in verifying
solutions to the same problems. This research advances our understanding of
LLMs' cognitive abilities and offers insights for enhancing their
problem-solving potential across various domains.
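To make the paper's solve-versus-verify paradigm concrete, the sketch below poses the same stumper in two modes: a generation prompt (the mode in which new-generation LLMs are reported to surpass humans) and a verification prompt (the mode in which humans retain the edge). This is a minimal illustrative sketch, not the authors' code: the example riddle, the prompt wording, and the ask_model() placeholder are all assumptions.

```python
# Hedged sketch of the stumper solve-vs-verify paradigm. The riddle, the
# prompt wording, and ask_model() are illustrative assumptions, not the
# paper's actual materials.

def ask_model(prompt: str) -> str:
    """Placeholder for a call to one of the evaluated LLMs."""
    raise NotImplementedError

STUMPER = {
    # Illustrative single-step intuition riddle: hard to solve, easy to verify.
    "question": ("A man who lives on the tenth floor takes the elevator down "
                 "every morning, but on the way back up he rides only to the "
                 "seventh floor and walks the rest. Why?"),
    "proposed_answer": "He is short and can only reach the button for floor 7.",
}

def solve_trial(stumper: dict) -> str:
    # Solving mode: the model must produce the insight itself.
    return ask_model(f"Solve this riddle:\n{stumper['question']}")

def verify_trial(stumper: dict) -> str:
    # Verification mode: the model only judges a candidate solution,
    # the task on which the paper reports humans remain superior.
    return ask_model(
        f"Riddle: {stumper['question']}\n"
        f"Proposed solution: {stumper['proposed_answer']}\n"
        "Is this solution correct? Answer yes or no."
    )
```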
Related papers
- BloomWise: Enhancing Problem-Solving capabilities of Large Language Models using Bloom's-Taxonomy-Inspired Prompts [59.83547898874152]
We introduce BloomWise, a new prompting technique inspired by Bloom's taxonomy, to improve the performance of Large Language Models (LLMs).
The decision regarding the need to employ more sophisticated cognitive skills is based on self-evaluation performed by the LLM.
In extensive experiments across 4 popular math reasoning datasets, we have demonstrated the effectiveness of our proposed approach.
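The loop below is a hedged sketch of the escalation-with-self-evaluation idea described above, not the BloomWise implementation: the Bloom level names are standard, but the prompt wording and the ask_model() placeholder are assumptions.

```python
# Hedged sketch of a Bloom's-taxonomy-inspired prompting loop; prompt
# wording and ask_model() are assumptions, not BloomWise itself.

BLOOM_LEVELS = ["remember", "understand", "apply", "analyze", "evaluate", "create"]

def ask_model(prompt: str) -> str:
    """Placeholder for a call to the underlying LLM."""
    raise NotImplementedError

def bloomwise_style_answer(question: str) -> str:
    answer = ""
    for level in BLOOM_LEVELS:
        # Escalate to a more sophisticated cognitive skill on each pass.
        answer = ask_model(f"Using the '{level}' skill, solve:\n{question}")
        # Self-evaluation: the LLM itself decides whether a higher level is needed.
        verdict = ask_model(
            f"Question: {question}\nAnswer: {answer}\n"
            "Is this answer correct and complete? Reply yes or no."
        )
        if verdict.strip().lower().startswith("yes"):
            return answer
    return answer  # fall back to the highest-level attempt
```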
arXiv Detail & Related papers (2024-10-05T09:27:52Z)
- Easy Problems That LLMs Get Wrong [0.0]
We introduce a comprehensive Linguistic Benchmark designed to evaluate the limitations of Large Language Models (LLMs).
Through a series of straightforward questions, it uncovers the significant limitations of well-regarded models to perform tasks that humans manage with ease.
arXiv Detail & Related papers (2024-05-30T02:09:51Z)
- Do Language Models Exhibit the Same Cognitive Biases in Problem Solving as Human Learners? [140.9751389452011]
We study the biases of large language models (LLMs) in relation to those known in children when solving arithmetic word problems.
We generate a novel set of word problems for each of these tests, using a neuro-symbolic approach that enables fine-grained control over the problem features.
arXiv Detail & Related papers (2024-01-31T18:48:20Z)
- Predicting challenge moments from students' discourse: A comparison of GPT-4 to two traditional natural language processing approaches [0.3826704341650507]
This study investigates the potential of leveraging three distinct natural language processing models.
An expert knowledge rule-based model, a supervised machine learning (ML) model, and a Large Language Model (LLM) were investigated.
The results show that the supervised ML and the LLM approaches performed considerably well in both tasks.
arXiv Detail & Related papers (2024-01-03T11:54:30Z)
- Competition-Level Problems are Effective LLM Evaluators [121.15880285283116]
This paper aims to evaluate the reasoning capacities of large language models (LLMs) in solving recent programming problems in Codeforces.
We first provide a comprehensive evaluation of GPT-4's perceived zero-shot performance on this task, considering various aspects such as problems' release time, difficulties, and types of errors encountered.
Surprisingly, the perceived performance of GPT-4 has experienced a cliff-like decline on problems released after September 2021, consistently across all difficulties and types of problems.
arXiv Detail & Related papers (2023-12-04T18:58:57Z)
- MacGyver: Are Large Language Models Creative Problem Solvers? [87.70522322728581]
We explore the creative problem-solving capabilities of modern LLMs in a novel constrained setting.
We create MACGYVER, an automatically generated dataset consisting of over 1,600 real-world problems.
We present our collection to both LLMs and humans to compare and contrast their problem-solving abilities.
arXiv Detail & Related papers (2023-11-16T08:52:27Z)
- Evaluating the Deductive Competence of Large Language Models [0.2218292673050528]
We investigate whether several large language models (LLMs) can solve a classic type of deductive reasoning problem.
We do find performance differences between conditions; however, no condition improves overall performance.
We find that performance interacts with presentation format and content in unexpected ways that differ from human performance.
arXiv Detail & Related papers (2023-09-11T13:47:07Z)
- Revisiting the Reliability of Psychological Scales on Large Language Models [62.57981196992073]
This study aims to determine the reliability of applying personality assessments to Large Language Models.
Analysis of 2,500 settings per model, including GPT-3.5, GPT-4, Gemini-Pro, and LLaMA-3.1, reveals that various LLMs show consistency in responses to the Big Five Inventory.
arXiv Detail & Related papers (2023-05-31T15:03:28Z)
- Understanding the Usability Challenges of Machine Learning In High-Stakes Decision Making [67.72855777115772]
Machine learning (ML) is being applied to a diverse and ever-growing set of domains.
In many cases, domain experts -- who often have no expertise in ML or data science -- are asked to use ML predictions to make high-stakes decisions.
We investigate the ML usability challenges present in the domain of child welfare screening through a series of collaborations with child welfare screeners.
arXiv Detail & Related papers (2021-03-02T22:50:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.