Decoding Stumpers: Large Language Models vs. Human Problem-Solvers
- URL: http://arxiv.org/abs/2310.16411v1
- Date: Wed, 25 Oct 2023 06:54:39 GMT
- Title: Decoding Stumpers: Large Language Models vs. Human Problem-Solvers
- Authors: Alon Goldstein, Miriam Havin, Roi Reichart and Ariel Goldstein
- Abstract summary: We compare the performance of four state-of-the-art Large Language Models to human participants.
New-generation LLMs excel in solving stumpers and surpass human performance.
Humans exhibit superior skills in verifying solutions to the same problems.
- Score: 14.12892960275563
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper investigates the problem-solving capabilities of Large Language
Models (LLMs) by evaluating their performance on stumpers, unique single-step
intuition problems that pose challenges for human solvers but are easily
verifiable. We compare the performance of four state-of-the-art LLMs
(Davinci-2, Davinci-3, GPT-3.5-Turbo, GPT-4) to human participants. Our
findings reveal that the new-generation LLMs excel in solving stumpers and
surpass human performance. However, humans exhibit superior skills in verifying
solutions to the same problems. This research enhances our understanding of
LLMs' cognitive abilities and provides insights for enhancing their
problem-solving potential across various domains.
Related papers
- Navigating the Labyrinth: Evaluating and Enhancing LLMs' Ability to Reason About Search Problems [59.72548591120689]
We introduce a new benchmark, SearchBench, containing 11 unique search problem types.
We show that even the most advanced LLMs fail to solve these problems end-to-end in text.
Instructing LLMs to generate code that solves the problem helps, but only slightly, e.g., GPT4's performance rises to 11.7%.
arXiv Detail & Related papers (2024-06-18T00:44:58Z) - Easy Problems That LLMs Get Wrong [0.0]
We introduce a comprehensive Linguistic Benchmark designed to evaluate the limitations of Large Language Models (LLMs).
Through a series of straightforward questions, it uncovers the significant limitations of well-regarded models to perform tasks that humans manage with ease.
arXiv Detail & Related papers (2024-05-30T02:09:51Z) - Do Language Models Exhibit the Same Cognitive Biases in Problem Solving as Human Learners? [140.9751389452011]
We study the biases of large language models (LLMs) in relation to those known in children when solving arithmetic word problems.
We generate a novel set of word problems for each of these tests, using a neuro-symbolic approach that enables fine-grained control over the problem features.
arXiv Detail & Related papers (2024-01-31T18:48:20Z) - Predicting challenge moments from students' discourse: A comparison of GPT-4 to two traditional natural language processing approaches [0.3826704341650507]
This study investigates the potential of leveraging three distinct natural language processing models.
An expert-knowledge rule-based model, a supervised machine learning (ML) model and a large language model (LLM) were investigated.
The results show that the supervised ML and the LLM approaches performed considerably well in both tasks.
arXiv Detail & Related papers (2024-01-03T11:54:30Z) - Competition-Level Problems are Effective LLM Evaluators [121.15880285283116]
This paper aims to evaluate the reasoning capacities of large language models (LLMs) in solving recent programming problems in Codeforces.
We first provide a comprehensive evaluation of GPT-4's perceived zero-shot performance on this task, considering various aspects such as problems' release time, difficulties, and types of errors encountered.
Surprisingly, the perceived performance of GPT-4 has experienced a cliff-like decline in problems after September 2021, consistently across all the difficulties and types of problems.
arXiv Detail & Related papers (2023-12-04T18:58:57Z) - MacGyver: Are Large Language Models Creative Problem Solvers? [87.70522322728581]
We explore the creative problem-solving capabilities of modern LLMs in a novel constrained setting.
We create MACGYVER, an automatically generated dataset consisting of over 1,600 real-world problems.
We present our collection to both LLMs and humans to compare and contrast their problem-solving abilities.
arXiv Detail & Related papers (2023-11-16T08:52:27Z) - Evaluating the Deductive Competence of Large Language Models [0.2218292673050528]
We investigate whether several large language models (LLMs) can solve a classic type of deductive reasoning problem.
We do find performance differences between conditions; however, they do not improve overall performance.
We find that performance interacts with presentation format and content in unexpected ways that differ from human performance.
arXiv Detail & Related papers (2023-09-11T13:47:07Z) - Revisiting the Reliability of Psychological Scales on Large Language Models [66.31055885857062]
This study aims to determine the reliability of applying personality assessments to Large Language Models (LLMs).
By shedding light on the personalization of LLMs, our study endeavors to pave the way for future explorations in this field.
arXiv Detail & Related papers (2023-05-31T15:03:28Z) - Understanding the Usability Challenges of Machine Learning In High-Stakes Decision Making [67.72855777115772]
Machine learning (ML) is being applied to a diverse and ever-growing set of domains.
In many cases, domain experts -- who often have no expertise in ML or data science -- are asked to use ML predictions to make high-stakes decisions.
We investigate the ML usability challenges present in the domain of child welfare screening through a series of collaborations with child welfare screeners.
arXiv Detail & Related papers (2021-03-02T22:50:45Z)