How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks
- URL: http://arxiv.org/abs/2507.01955v2
- Date: Wed, 23 Jul 2025 10:52:38 GMT
- Title: How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks
- Authors: Rahul Ramachandran, Ali Garjani, Roman Bachmann, Andrei Atanov, Oğuzhan Fatih Kar, Amir Zamir
- Abstract summary: We benchmark the performance of popular multimodal foundation models (GPT-4o, o4-mini, Gemini 1.5 Pro and Gemini 2.0 Flash, Claude 3.5 Sonnet, Qwen2-VL, Llama 3.2) on standard computer vision tasks. We address the challenges of text-only outputs and API-only access by translating standard vision tasks into equivalent text-promptable and API-compatible tasks via prompt chaining, creating a standardized benchmarking framework.
- Score: 11.628499518700572
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal foundation models, such as GPT-4o, have recently made remarkable progress, but it is not clear where exactly these models stand in terms of understanding vision. In this paper, we benchmark the performance of popular multimodal foundation models (GPT-4o, o4-mini, Gemini 1.5 Pro and Gemini 2.0 Flash, Claude 3.5 Sonnet, Qwen2-VL, Llama 3.2) on standard computer vision tasks (semantic segmentation, object detection, image classification, depth and surface normal prediction) using established datasets (e.g., COCO, ImageNet and its variants). The main challenges to performing this are: 1) most models are trained to output text and cannot natively express versatile domains, such as segments or 3D geometry, and 2) many leading models are proprietary and accessible only at an API level, i.e., there is no weight access to adapt them. We address these challenges by translating standard vision tasks into equivalent text-promptable and API-compatible tasks via prompt chaining to create a standardized benchmarking framework. We observe that 1) the models are not close to the state-of-the-art specialist models at any task. However, 2) they are respectable generalists; this is remarkable as they are presumably trained on primarily image-text-based tasks. 3) They perform semantic tasks notably better than geometric ones. 4) While the prompt-chaining techniques affect performance, better models exhibit less sensitivity to prompt variations. 5) GPT-4o performs the best among non-reasoning models, securing the top position in 4 out of 6 tasks. 6) Reasoning models, e.g. o3, show improvements in geometric tasks, and 7) a preliminary analysis of models with native image generation, like the latest GPT-4o, shows they exhibit quirks like hallucinations and spatial misalignments.
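The key idea behind the framework is prompt chaining: a vision task that an API-only model cannot answer natively is decomposed into a sequence of text-promptable sub-queries whose answers are combined into a standard prediction. Below is a minimal sketch of this idea for classification over a large label set, assuming the OpenAI Python client and a base64-encoded JPEG; the two-stage batching scheme, prompt wording, and helper names are illustrative assumptions, not the paper's exact protocol.

```python
# Minimal sketch: casting image classification as a text-promptable, API-compatible
# task via prompt chaining. Illustrative only; helper names, prompts, and the
# batching scheme are assumptions, not the paper's exact protocol.
import base64

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def ask(image_b64: str, question: str, model: str = "gpt-4o") -> str:
    """Send one text+image prompt and return the model's text reply."""
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content.strip()


def classify_by_chaining(image_path: str, labels: list[str], batch: int = 50) -> str:
    """Chain prompts over batches of candidate labels, then ask a final
    multiple-choice question over the per-batch winners."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()

    # Stage 1: one multiple-choice query per batch of candidate labels.
    survivors = []
    for i in range(0, len(labels), batch):
        options = ", ".join(labels[i:i + batch])
        survivors.append(ask(
            image_b64,
            "Which of these labels best describes the main object in the image? "
            f"Answer with exactly one label from this list: {options}"))

    # Stage 2: final choice among the surviving candidates.
    return ask(
        image_b64,
        "Which of these labels best describes the main object in the image? "
        f"Answer with exactly one label from this list: {', '.join(survivors)}")
```

Dense tasks such as segmentation or depth would need longer chains built on the same single-call helper, with the per-task decomposition defined by the benchmarking framework.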
Related papers
- Evaluating Visual Mathematics in Multimodal LLMs: A Multilingual Benchmark Based on the Kangaroo Tests [2.0176279176431744]
Multimodal Large Language Models (MLLMs) promise advanced vision-language capabilities, yet their effectiveness in visually presented mathematics remains underexplored. This paper analyzes the development and evaluation of MLLMs for mathematical problem solving, focusing on diagrams, multilingual text, and symbolic notation. We then assess several models, including GPT-4o, Pixtral, Qwen-VL, Llama 3.2 Vision variants, and Gemini 2.0 Flash, in a multilingual Kangaroo-style benchmark spanning English, French, Spanish, and Catalan.
arXiv Detail & Related papers (2025-06-09T04:35:02Z) - Feature4X: Bridging Any Monocular Video to 4D Agentic AI with Versatile Gaussian Feature Fields [56.184278668305076]
We introduce Feature4X, a universal framework to extend functionality from 2D vision foundation models into the 4D realm. The framework is the first to distill and lift the features of video foundation models into an explicit 4D feature field using Gaussian Splatting. Our experiments showcase novel view segment anything, geometric and appearance scene editing, and free-form VQA across all time steps.
arXiv Detail & Related papers (2025-03-26T17:56:16Z) - Improved Alignment of Modalities in Large Vision Language Models [1.4561960744147884]
We propose a training strategy for auto-regressive vision-language models, consisting of four training stages for aligning the vision model with the language model. We also devise different attention masks for training transformer-based language models.
arXiv Detail & Related papers (2025-03-25T09:59:46Z) - Evaluating the Performance of Large Language Models for SDG Mapping (Technical Report) [6.789534723913505]
Large language models (LLMs) enable users to protect data privacy by eliminating the need to provide data to third parties.
We compare the performance of various language models on the Sustainable Development Goal mapping task.
According to the results of this study, LLaMA 2 and Gemma still have significant room for improvement.
arXiv Detail & Related papers (2024-08-05T03:05:02Z) - CounterCurate: Enhancing Physical and Semantic Visio-Linguistic Compositional Reasoning via Counterfactual Examples [34.71588837946776]
We propose CounterCurate, a framework to improve visio-linguistic compositional reasoning.
In particular, we identify two critical under-explored problems: the neglect of physically grounded reasoning and the under-use of capable image generation models for producing counterfactual fine-tuning data.
We first spotlight the near-chance performance of multimodal models like CLIP and LLaVA in physically grounded compositional reasoning.
We then apply simple data augmentation using the grounded image generation model GLIGEN to generate fine-tuning data, resulting in significant performance improvements.
arXiv Detail & Related papers (2024-02-20T18:59:55Z) - Vision-Flan: Scaling Human-Labeled Tasks in Visual Instruction Tuning [53.93074108238167]
We construct Vision-Flan, the most diverse publicly available visual instruction tuning dataset to date.
We propose a two-stage instruction tuning framework, in which VLMs are finetuned on Vision-Flan and further tuned on GPT-4 synthesized data.
We find this two-stage tuning framework significantly outperforms the traditional single-stage visual instruction tuning framework.
arXiv Detail & Related papers (2024-02-18T19:38:44Z) - Data-efficient Large Vision Models through Sequential Autoregression [58.26179273091461]
We develop an efficient, autoregression-based vision model on a limited dataset.
We demonstrate how this model achieves proficiency in a spectrum of visual tasks spanning both high-level and low-level semantic understanding.
Our empirical evaluations underscore the model's agility in adapting to various tasks, heralding a significant reduction in the parameter footprint.
arXiv Detail & Related papers (2024-02-07T13:41:53Z) - ArtGPT-4: Towards Artistic-understanding Large Vision-Language Models with Enhanced Adapter [19.830089364830066]
ArtGPT-4 is a large vision-language model tailored to address the limitations of existing models in artistic comprehension.
It can render images with artistic understanding and convey the emotions they inspire, mirroring human interpretation.
arXiv Detail & Related papers (2023-05-12T14:04:30Z) - MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models [41.84885546518666]
GPT-4 has demonstrated extraordinary multi-modal abilities, such as directly generating websites from handwritten text.
We present MiniGPT-4, which aligns a frozen visual encoder with a frozen advanced large language model.
We also observe other emerging capabilities in MiniGPT-4, including writing stories and poems inspired by given images.
arXiv Detail & Related papers (2023-04-20T18:25:35Z) - Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks [86.66733026149892]
We propose Uni-Perceiver v2, which is the first generalist model capable of handling major large-scale vision and vision-language tasks.
Specifically, images are encoded as general region proposals, while texts are encoded via a Transformer-based language model.
Uni-Perceiver v2 achieves competitive performance on a broad range of vision and vision-language tasks.
arXiv Detail & Related papers (2022-11-17T18:59:52Z) - UViM: A Unified Modeling Approach for Vision with Learned Guiding Codes [91.24112204588353]
We introduce UViM, a unified approach capable of modeling a wide range of computer vision tasks.
In contrast to previous models, UViM has the same functional form for all tasks.
We demonstrate the effectiveness of UViM on three diverse and challenging vision tasks.
arXiv Detail & Related papers (2022-05-20T17:47:59Z) - Reframing Instructional Prompts to GPTk's Language [72.69833640335519]
We propose reframing techniques for model designers to create effective prompts for language models.
Our results show that reframing improves few-shot learning performance by 14% while reducing sample complexity.
The performance gains are particularly important for large language models, such as GPT-3, where tuning models or prompts on large datasets is not feasible.
arXiv Detail & Related papers (2021-09-16T09:44:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.