Do Large Language Models Solve ARC Visual Analogies Like People Do?
- URL: http://arxiv.org/abs/2403.09734v2
- Date: Mon, 13 May 2024 11:20:23 GMT
- Title: Do Large Language Models Solve ARC Visual Analogies Like People Do?
- Authors: Gustaw Opiełka, Hannes Rosenbusch, Veerle Vijverberg, Claire E. Stevenson
- Abstract summary: We compared human and large language model (LLM) performance on a new child-friendly set of ARC items.
Results show that both children and adults outperform most LLMs on these tasks.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The Abstraction and Reasoning Corpus (ARC) is a visual analogical reasoning test designed for humans and machines (Chollet, 2019). We compared human and large language model (LLM) performance on a new child-friendly set of ARC items. Results show that both children and adults outperform most LLMs on these tasks. Error analysis revealed a similar "fallback" solution strategy in LLMs and young children, where part of the analogy is simply copied. In addition, we found two other error types, one based on seemingly grasping key concepts (e.g., Inside-Outside) and the other based on simple combinations of analogy input matrices. On the whole, "concept" errors were more common in humans, and "matrix" errors were more common in LLMs. This study sheds new light on LLM reasoning ability and the extent to which we can use error analyses and comparisons with human development to understand how LLMs solve visual analogies.
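To make the error taxonomy in the abstract concrete, here is a minimal sketch, assuming a simple mirror transformation and 2x2 grids that are my own illustration rather than items from the paper: an ARC-style analogy asks a solver to infer how A maps to B and apply the same rule to C, and a "copy" fallback simply returns part of the analogy unchanged.

```python
# A minimal sketch (not the authors' code) of an ARC-style analogy item and of
# how a "copy" fallback error can be detected. Grid contents, the example
# transformation, and all names below are illustrative assumptions.

Grid = list[list[int]]  # ARC items are small grids of integer colour codes


def mirror_horizontally(grid: Grid) -> Grid:
    """Example transformation rule: flip each row left-to-right."""
    return [row[::-1] for row in grid]


# Analogy item: A is to B as C is to ?
A: Grid = [[1, 0], [1, 0]]
B: Grid = mirror_horizontally(A)        # [[0, 1], [0, 1]]
C: Grid = [[2, 2], [0, 2]]
target: Grid = mirror_horizontally(C)   # the correct completion


def classify_response(response: Grid) -> str:
    """Label a response in the spirit of the paper's error analysis: correct,
    a verbatim copy of part of the analogy, or some other error (the paper's
    "concept" and "matrix" categories would need item-specific checks)."""
    if response == target:
        return "correct"
    if response in (A, B, C):
        return "copy fallback"
    return "other error"


print(classify_response(C))       # copy fallback: the test input echoed back
print(classify_response(target))  # correct
```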
Related papers
- Modeling Understanding of Story-Based Analogies Using Large Language Models [1.4999444543328293]
Recent advancements in Large Language Models have brought them closer to matching human cognition across a variety of tasks. How well do these models align with human performance in detecting and mapping analogies?
arXiv Detail & Related papers (2025-07-15T03:40:21Z) - Disparities in LLM Reasoning Accuracy and Explanations: A Case Study on African American English [66.97110551643722]
We investigate dialectal disparities in Large Language Models (LLMs) reasoning tasks.
We find that LLMs produce less accurate responses and simpler reasoning chains and explanations for AAE inputs.
These findings highlight systematic differences in how LLMs process and reason about different language varieties.
arXiv Detail & Related papers (2025-03-06T05:15:34Z) - Child vs. machine language learning: Can the logical structure of human language unleash LLMs? [0.0]
We argue that human language learning proceeds in a manner that is different in nature from current approaches to training LLMs.
We present evidence from German plural formation by LLMs that confirms our hypothesis: even very powerful implementations produce results that miss aspects of the logic inherent to language that humans handle without difficulty.
arXiv Detail & Related papers (2025-02-24T16:40:46Z) - Preference Leakage: A Contamination Problem in LLM-as-a-judge [69.96778498636071]
Large Language Models (LLMs) as judges and LLM-based data synthesis have emerged as two fundamental LLM-driven data annotation methods.
In this work, we expose preference leakage, a contamination problem in LLM-as-a-judge caused by the relatedness between the synthetic data generators and LLM-based evaluators.
arXiv Detail & Related papers (2025-02-03T17:13:03Z) - LLM+AL: Bridging Large Language Models and Action Languages for Complex Reasoning about Actions [7.575628120822444]
"LLM+AL" is a method that bridges the natural language understanding capabilities of LLMs with the symbolic reasoning strengths of action languages.
We compare "LLM+AL" against state-of-the-art LLMs, including ChatGPT-4, Claude 3 Opus, Gemini Ultra 1.0, and o1-preview.
Our findings indicate that, although all methods exhibit errors, LLM+AL, with relatively minimal human corrections, consistently leads to correct answers.
arXiv Detail & Related papers (2025-01-01T13:20:01Z) - Do Large Language Models Truly Grasp Mathematics? An Empirical Exploration From Cognitive Psychology [13.964263002704582]
We show that, even with the use of Chains of Thought prompts, mainstream LLMs have a high error rate when solving modified CRT problems.
Specifically, the average accuracy rate dropped by up to 50% compared to the original questions.
This finding challenges the belief that LLMs have genuine mathematical reasoning abilities comparable to humans.
arXiv Detail & Related papers (2024-10-19T05:01:56Z) - Not All LLM Reasoners Are Created Equal [58.236453890457476]
We study the depth of grade-school math problem-solving capabilities of LLMs.
We evaluate their performance on pairs of existing math word problems presented together, so that answering the second problem depends on correctly answering the first.
arXiv Detail & Related papers (2024-10-02T17:01:10Z) - Cognitive phantoms in LLMs through the lens of latent variables [0.3441021278275805]
Large language models (LLMs) increasingly reach real-world applications, necessitating a better understanding of their behaviour.
Recent studies administering psychometric questionnaires to LLMs report human-like traits in LLMs, potentially influencing behaviour.
This approach suffers from a validity problem: it presupposes that these traits exist in LLMs and that they are measurable with tools designed for humans.
This study investigates this problem by comparing latent structures of personality between humans and three LLMs using two validated personality questionnaires.
arXiv Detail & Related papers (2024-09-06T12:42:35Z) - Do Language Models Exhibit the Same Cognitive Biases in Problem Solving as Human Learners? [140.9751389452011]
We study the biases of large language models (LLMs) in relation to those known in children when solving arithmetic word problems.
We generate a novel set of word problems for each of these tests, using a neuro-symbolic approach that enables fine-grained control over the problem features.
arXiv Detail & Related papers (2024-01-31T18:48:20Z) - Towards Measuring Representational Similarity of Large Language Models [1.7228514699394508]
We measure the similarity of representations of a set of large language models with 7B parameters.
Our results suggest that some LLMs are substantially different from others.
We identify challenges in using representational similarity measures that suggest the need for careful study of similarity scores to avoid false conclusions.
arXiv Detail & Related papers (2023-12-05T12:48:04Z) - Rephrase and Respond: Let Large Language Models Ask Better Questions for Themselves [57.974103113675795]
We present a method named 'Rephrase and Respond' (RaR), which allows Large Language Models to rephrase and expand questions posed by humans before responding.
RaR serves as a simple yet effective prompting method for improving performance.
We show that RaR is complementary to the popular Chain-of-Thought (CoT) methods, both theoretically and empirically.
arXiv Detail & Related papers (2023-11-07T18:43:34Z) - Democratizing Reasoning Ability: Tailored Learning from Large Language Model [97.4921006089966]
We propose a tailored learning approach to distill such reasoning ability into smaller LMs.
We exploit the potential of the LLM as a reasoning teacher by building an interactive multi-round learning paradigm.
To exploit the reasoning potential of the smaller LM, we propose self-reflection learning to motivate the student to learn from self-made mistakes.
arXiv Detail & Related papers (2023-10-20T07:50:10Z) - Verbosity Bias in Preference Labeling by Large Language Models [10.242500241407466]
We examine the biases that come along with evaluating Large Language Models (LLMs).
We take a closer look into verbosity bias -- a bias where LLMs sometimes prefer more verbose answers even if they have similar qualities.
arXiv Detail & Related papers (2023-10-16T05:19:02Z) - Logic-LM: Empowering Large Language Models with Symbolic Solvers for Faithful Logical Reasoning [101.26814728062065]
Large Language Models (LLMs) have shown human-like reasoning abilities but still struggle with complex logical problems.
This paper introduces a novel framework, Logic-LM, which integrates LLMs with symbolic solvers to improve logical problem-solving.
arXiv Detail & Related papers (2023-05-20T22:25:38Z) - Benchmarking Large Language Models for News Summarization [79.37850439866938]
Large language models (LLMs) have shown promise for automatic summarization but the reasons behind their successes are poorly understood.
We find instruction tuning, and not model size, is the key to the LLM's zero-shot summarization capability.
arXiv Detail & Related papers (2023-01-31T18:46:19Z)