Comparing Humans, GPT-4, and GPT-4V On Abstraction and Reasoning Tasks
- URL: http://arxiv.org/abs/2311.09247v3
- Date: Mon, 11 Dec 2023 23:57:17 GMT
- Authors: Melanie Mitchell, Alessandro B. Palmarini, Arseny Moskvichev
- Abstract summary: We evaluate the reasoning abilities of text-only and multimodal versions of GPT-4.
Our experimental results support the conclusion that neither version of GPT-4 has developed robust abstraction abilities at humanlike levels.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We explore the abstract reasoning abilities of text-only and multimodal
versions of GPT-4, using the ConceptARC benchmark [10], which is designed to
evaluate robust understanding and reasoning with core-knowledge concepts. We
extend the work of Moskvichev et al. [10] by evaluating GPT-4 on more detailed,
one-shot prompting (rather than simple, zero-shot prompts) with text versions
of ConceptARC tasks, and by evaluating GPT-4V, the multimodal version of GPT-4,
on zero- and one-shot prompts using image versions of the simplest tasks. Our
experimental results support the conclusion that neither version of GPT-4 has
developed robust abstraction abilities at humanlike levels.
Related papers
- Notes on Applicability of GPT-4 to Document Understanding (arXiv, 2024-05-28)
  We evaluate all publicly available GPT-4 family models on document understanding tasks. Benchmark results indicate that although it is hard to achieve satisfactory results with text-only models, GPT-4 Vision Turbo performs well when given both text recognized by an external OCR engine and document images as input.
- Exploiting GPT-4 Vision for Zero-shot Point Cloud Understanding (arXiv, 2024-01-15)
  We tackle the challenge of classifying object categories in point clouds. We employ GPT-4 Vision (GPT-4V) to overcome these challenges and set a new benchmark in zero-shot point cloud classification.
- GPT-4V with Emotion: A Zero-shot Benchmark for Generalized Emotion Recognition (arXiv, 2023-12-07)
  GPT-4 with Vision (GPT-4V) has demonstrated remarkable visual capabilities across various tasks, but its performance in emotion recognition has not been fully evaluated. We present quantitative evaluation results for GPT-4V on 21 benchmark datasets covering 6 tasks.
- GPT4Vis: What Can GPT-4 Do for Zero-shot Visual Recognition? (arXiv, 2023-11-27)
  This paper centers on evaluating GPT-4's linguistic and visual capabilities in zero-shot visual recognition tasks. We conduct extensive experiments to evaluate GPT-4's performance across images, videos, and point clouds. Our findings show that GPT-4, enhanced with rich linguistic descriptions, significantly improves zero-shot recognition.
- GPT-4V(ision) as a Generalist Evaluator for Vision-Language Tasks (arXiv, 2023-11-02)
  We validate GPT-4V's capabilities for evaluation purposes, addressing tasks ranging from foundational image-to-text and text-to-image synthesis to high-level image-to-image translation and multi-image-to-text alignment. Notably, GPT-4V shows promising agreement with humans across various tasks and evaluation methods, demonstrating the potential of multimodal LLMs as evaluators.
- An Early Evaluation of GPT-4V(ision) (arXiv, 2023-10-25)
  We evaluate different abilities of GPT-4V, including visual understanding, language understanding, visual puzzle solving, and understanding of other modalities such as depth, thermal, video, and audio. To estimate GPT-4V's performance, we manually construct 656 test instances and carefully evaluate its results.
- The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision) (arXiv, 2023-09-29)
  We analyze the latest model, GPT-4V, to deepen the understanding of LMMs. GPT-4V's unprecedented ability to process arbitrarily interleaved multimodal inputs makes it a powerful multimodal generalist system, and its unique capability to understand visual markers drawn on input images can give rise to new human-computer interaction methods.
- GPT-4 Technical Report (arXiv, 2023-03-15)
  GPT-4 is a large-scale multimodal model that accepts image and text inputs and produces text outputs. It exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers.
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.