Related papers: Decoding Stumpers: Large Language Models vs. Human Problem-Solvers

Decoding Stumpers: Large Language Models vs. Human Problem-Solvers

URL: http://arxiv.org/abs/2310.16411v1
Date: Wed, 25 Oct 2023 06:54:39 GMT
Title: Decoding Stumpers: Large Language Models vs. Human Problem-Solvers
Authors: Alon Goldstein, Miriam Havin, Roi Reichart and Ariel Goldstein
Abstract summary: We compare the performance of four state-of-the-art Large Language Models to human participants. New-generation LLMs excel in solving stumpers and surpass human performance. Humans exhibit superior skills in verifying solutions to the same problems.
Score: 14.12892960275563
License: http://creativecommons.org/licenses/by/4.0/
Abstract: This paper investigates the problem-solving capabilities of Large Language Models (LLMs) by evaluating their performance on stumpers, unique single-step intuition problems that pose challenges for human solvers but are easily verifiable. We compare the performance of four state-of-the-art LLMs (Davinci-2, Davinci-3, GPT-3.5-Turbo, GPT-4) to human participants. Our findings reveal that the new-generation LLMs excel in solving stumpers and surpass human performance. However, humans exhibit superior skills in verifying solutions to the same problems. This research enhances our understanding of LLMs' cognitive abilities and provides insights for enhancing their problem-solving potential across various domains.

Related papers

TextGames: Learning to Self-Play Text-Based Puzzle Games via Language Model Reasoning [26.680686158061192]
Reasoning is a fundamental capability of large language models (LLMs) This paper introduces TextGames, a benchmark specifically crafted to assess LLMs through demanding text-based games. Our findings reveal that although LLMs exhibit proficiency in addressing most easy and medium-level problems, they face significant challenges with more difficult tasks.
arXiv Detail & Related papers (2025-02-25T18:26:48Z)
Understanding LLMs' Fluid Intelligence Deficiency: An Analysis of the ARC Task [71.61879949813998]
In cognitive research, the latter ability is referred to as fluid intelligence, which is considered to be critical for assessing human intelligence. Recent research on fluid intelligence assessments has highlighted significant deficiencies in LLMs' abilities. Our study revealed three major limitations in existing LLMs: limited ability for skill composition, unfamiliarity with abstract input formats, and the intrinsic deficiency of left-to-right decoding.
arXiv Detail & Related papers (2025-02-11T02:31:09Z)
Humanlike Cognitive Patterns as Emergent Phenomena in Large Language Models [2.9312156642007294]
We systematically review Large Language Models' capabilities across three important cognitive domains: decision-making biases, reasoning, and creativity. On decision-making, our synthesis reveals that while LLMs demonstrate several human-like biases, some biases observed in humans are absent. On reasoning, advanced LLMs like GPT-4 exhibit deliberative reasoning akin to human System-2 thinking, while smaller models fall short of human-level performance. A distinct dichotomy emerges in creativity: while LLMs excel in language-based creative tasks, such as storytelling, they struggle with divergent thinking tasks that require real-world context.
arXiv Detail & Related papers (2024-12-20T02:26:56Z)
BloomWise: Enhancing Problem-Solving capabilities of Large Language Models using Bloom's-Taxonomy-Inspired Prompts [59.83547898874152]
We introduce BloomWise, a new prompting technique, inspired by Bloom's taxonomy, to improve the performance of Large Language Models (LLMs) The decision regarding the need to employ more sophisticated cognitive skills is based on self-evaluation performed by the LLM. In extensive experiments across 4 popular math reasoning datasets, we have demonstrated the effectiveness of our proposed approach.
arXiv Detail & Related papers (2024-10-05T09:27:52Z)
Easy Problems That LLMs Get Wrong [0.0]
We introduce a comprehensive Linguistic Benchmark designed to evaluate the limitations of Large Language Models (LLMs) Through a series of straightforward questions, it uncovers the significant limitations of well-regarded models to perform tasks that humans manage with ease.
arXiv Detail & Related papers (2024-05-30T02:09:51Z)
Do Language Models Exhibit the Same Cognitive Biases in Problem Solving as Human Learners? [140.9751389452011]
We study the biases of large language models (LLMs) in relation to those known in children when solving arithmetic word problems. We generate a novel set of word problems for each of these tests, using a neuro-symbolic approach that enables fine-grained control over the problem features.
arXiv Detail & Related papers (2024-01-31T18:48:20Z)
Predicting challenge moments from students' discourse: A comparison of GPT-4 to two traditional natural language processing approaches [0.3826704341650507]
This study investigates the potential of leveraging three distinct natural language processing models. An expert knowledge rule-based model, a supervised machine learning (ML) model and a Large Language model (LLM) were investigated. The results show that the supervised ML and the LLM approaches performed considerably well in both tasks.
arXiv Detail & Related papers (2024-01-03T11:54:30Z)
Competition-Level Problems are Effective LLM Evaluators [121.15880285283116]
This paper aims to evaluate the reasoning capacities of large language models (LLMs) in solving recent programming problems in Codeforces. We first provide a comprehensive evaluation of GPT-4's peiceived zero-shot performance on this task, considering various aspects such as problems' release time, difficulties, and types of errors encountered. Surprisingly, theThoughtived performance of GPT-4 has experienced a cliff like decline in problems after September 2021 consistently across all the difficulties and types of problems.
arXiv Detail & Related papers (2023-12-04T18:58:57Z)
MacGyver: Are Large Language Models Creative Problem Solvers? [87.70522322728581]
We explore the creative problem-solving capabilities of modern LLMs in a novel constrained setting. We create MACGYVER, an automatically generated dataset consisting of over 1,600 real-world problems. We present our collection to both LLMs and humans to compare and contrast their problem-solving abilities.
arXiv Detail & Related papers (2023-11-16T08:52:27Z)
Evaluating the Deductive Competence of Large Language Models [0.2218292673050528]
We investigate whether several large language models (LLMs) can solve a classic type of deductive reasoning problem. We do find performance differences between conditions; however, they do not improve overall performance. We find that performance interacts with presentation format and content in unexpected ways that differ from human performance.
arXiv Detail & Related papers (2023-09-11T13:47:07Z)
Revisiting the Reliability of Psychological Scales on Large Language Models [62.57981196992073]
This study aims to determine the reliability of applying personality assessments to Large Language Models. Analysis of 2,500 settings per model, including GPT-3.5, GPT-4, Gemini-Pro, and LLaMA-3.1, reveals that various LLMs show consistency in responses to the Big Five Inventory.
arXiv Detail & Related papers (2023-05-31T15:03:28Z)
Understanding the Usability Challenges of Machine Learning In High-Stakes Decision Making [67.72855777115772]
Machine learning (ML) is being applied to a diverse and ever-growing set of domains. In many cases, domain experts -- who often have no expertise in ML or data science -- are asked to use ML predictions to make high-stakes decisions. We investigate the ML usability challenges present in the domain of child welfare screening through a series of collaborations with child welfare screeners.
arXiv Detail & Related papers (2021-03-02T22:50:45Z)

This list is automatically generated from the titles and abstracts of the papers in this site.