Decoding Stumpers: Large Language Models vs. Human Problem-Solvers
- URL: http://arxiv.org/abs/2310.16411v1
- Date: Wed, 25 Oct 2023 06:54:39 GMT
- Title: Decoding Stumpers: Large Language Models vs. Human Problem-Solvers
- Authors: Alon Goldstein, Miriam Havin, Roi Reichart and Ariel Goldstein
- Abstract summary: We compare the performance of four state-of-the-art Large Language Models to human participants.
New-generation LLMs excel in solving stumpers and surpass human performance.
Humans exhibit superior skills in verifying solutions to the same problems.
- Score: 14.12892960275563
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper investigates the problem-solving capabilities of Large Language
Models (LLMs) by evaluating their performance on stumpers, unique single-step
intuition problems that pose challenges for human solvers but are easily
verifiable. We compare the performance of four state-of-the-art LLMs
(Davinci-2, Davinci-3, GPT-3.5-Turbo, GPT-4) to human participants. Our
findings reveal that the new-generation LLMs excel in solving stumpers and
surpass human performance. However, humans exhibit superior skills in verifying
solutions to the same problems. This research enhances our understanding of
LLMs' cognitive abilities and provides insights for enhancing their
problem-solving potential across various domains.
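To make the solve-versus-verify comparison concrete, below is a minimal sketch (not the authors' code) of how the two evaluation conditions could be run against a chat-completion API. The stumpers.json file, the prompts, and the string-match scoring are hypothetical placeholders; the paper's exact materials and scoring procedure are not reproduced here.

```python
# Minimal sketch (not the authors' code) of the solve-vs-verify comparison.
# Assumes the openai Python client (>=1.0) and a hypothetical stumpers.json
# file with entries like {"question": "...", "answer": "..."}.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(prompt: str, model: str = "gpt-4") -> str:
    """Send a single-turn prompt to a chat model and return its reply."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content or ""

def solve(item: dict) -> bool:
    """Solve condition: the model must produce the answer itself."""
    reply = ask(f"Answer this riddle in one short sentence:\n{item['question']}")
    return item["answer"].lower() in reply.lower()  # crude string-match scoring

def verify(item: dict) -> bool:
    """Verify condition: the model only judges a proposed solution."""
    reply = ask(
        f"Riddle: {item['question']}\n"
        f"Proposed solution: {item['answer']}\n"
        "Is the proposed solution correct? Answer yes or no."
    )
    return reply.strip().lower().startswith("yes")

if __name__ == "__main__":
    items = json.load(open("stumpers.json"))  # hypothetical dataset file
    n = len(items)
    print(f"solve accuracy:  {sum(solve(it) for it in items) / n:.2%}")
    print(f"verify accuracy: {sum(verify(it) for it in items) / n:.2%}")
```

A human baseline would be collected separately on the same items; the paper's reported pattern is that newer LLMs outperform humans in the solve condition while humans retain the edge in verification.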
Related papers
- Understanding LLMs' Fluid Intelligence Deficiency: An Analysis of the ARC Task [71.61879949813998]
In cognitive research, the latter ability is referred to as fluid intelligence, which is considered to be critical for assessing human intelligence.
Recent research on fluid intelligence assessments has highlighted significant deficiencies in LLMs' abilities.
Our study revealed three major limitations in existing LLMs: limited ability for skill composition, unfamiliarity with abstract input formats, and the intrinsic deficiency of left-to-right decoding.
arXiv Detail & Related papers (2025-02-11T02:31:09Z) - Humanlike Cognitive Patterns as Emergent Phenomena in Large Language Models [2.9312156642007294]
We systematically review Large Language Models' capabilities across three important cognitive domains: decision-making biases, reasoning, and creativity.
On decision-making, our synthesis reveals that while LLMs demonstrate several human-like biases, some biases observed in humans are absent.
On reasoning, advanced LLMs like GPT-4 exhibit deliberative reasoning akin to human System-2 thinking, while smaller models fall short of human-level performance.
A distinct dichotomy emerges in creativity: while LLMs excel in language-based creative tasks, such as storytelling, they struggle with divergent thinking tasks that require real-world context.
arXiv Detail & Related papers (2024-12-20T02:26:56Z) - BloomWise: Enhancing Problem-Solving capabilities of Large Language Models using Bloom's-Taxonomy-Inspired Prompts [59.83547898874152]
We introduce BloomWise, a new prompting technique inspired by Bloom's taxonomy, to improve the performance of Large Language Models (LLMs).
The decision regarding the need to employ more sophisticated cognitive skills is based on self-evaluation performed by the LLM.
In extensive experiments across 4 popular math reasoning datasets, we have demonstrated the effectiveness of our proposed approach.
arXiv Detail & Related papers (2024-10-05T09:27:52Z) - Do Language Models Exhibit the Same Cognitive Biases in Problem Solving as Human Learners? [140.9751389452011]
We study the biases of large language models (LLMs) in relation to those known in children when solving arithmetic word problems.
We generate a novel set of word problems for each of these tests, using a neuro-symbolic approach that enables fine-grained control over the problem features.
arXiv Detail & Related papers (2024-01-31T18:48:20Z) - Predicting challenge moments from students' discourse: A comparison of GPT-4 to two traditional natural language processing approaches [0.3826704341650507]
This study investigates the potential of leveraging three distinct natural language processing models.
An expert-knowledge rule-based model, a supervised machine learning (ML) model, and a Large Language Model (LLM) were investigated.
The results show that both the supervised ML and the LLM approaches performed well on both tasks.
arXiv Detail & Related papers (2024-01-03T11:54:30Z) - Competition-Level Problems are Effective LLM Evaluators [121.15880285283116]
This paper aims to evaluate the reasoning capacities of large language models (LLMs) in solving recent programming problems in Codeforces.
We first provide a comprehensive evaluation of GPT-4's perceived zero-shot performance on this task, considering various aspects such as problems' release time, difficulties, and types of errors encountered.
Surprisingly, the perceived performance of GPT-4 exhibits a cliff-like decline on problems released after September 2021, consistently across all difficulties and types of problems.
arXiv Detail & Related papers (2023-12-04T18:58:57Z) - MacGyver: Are Large Language Models Creative Problem Solvers? [87.70522322728581]
We explore the creative problem-solving capabilities of modern LLMs in a novel constrained setting.
We create MACGYVER, an automatically generated dataset consisting of over 1,600 real-world problems.
We present our collection to both LLMs and humans to compare and contrast their problem-solving abilities.
arXiv Detail & Related papers (2023-11-16T08:52:27Z) - Evaluating the Deductive Competence of Large Language Models [0.2218292673050528]
We investigate whether several large language models (LLMs) can solve a classic type of deductive reasoning problem.
We do find performance differences between conditions; however, these differences do not translate into improved overall performance.
We find that performance interacts with presentation format and content in unexpected ways that differ from human performance.
arXiv Detail & Related papers (2023-09-11T13:47:07Z) - Understanding the Usability Challenges of Machine Learning In High-Stakes Decision Making [67.72855777115772]
Machine learning (ML) is being applied to a diverse and ever-growing set of domains.
In many cases, domain experts -- who often have no expertise in ML or data science -- are asked to use ML predictions to make high-stakes decisions.
We investigate the ML usability challenges present in the domain of child welfare screening through a series of collaborations with child welfare screeners.
arXiv Detail & Related papers (2021-03-02T22:50:45Z)