Related papers: GPT-ImgEval: A Comprehensive Benchmark for Diagnosing GPT4o in Image Generation

GPT-ImgEval: A Comprehensive Benchmark for Diagnosing GPT4o in Image Generation

URL: http://arxiv.org/abs/2504.02782v1
Date: Thu, 03 Apr 2025 17:23:16 GMT
Title: GPT-ImgEval: A Comprehensive Benchmark for Diagnosing GPT4o in Image Generation
Authors: Zhiyuan Yan, Junyan Ye, Weijia Li, Zilong Huang, Shenghai Yuan, Xiangyang He, Kaiqing Lin, Jun He, Conghui He, Li Yuan,
Abstract summary: OpenAI's GPT4o model has demonstrated surprisingly good capabilities in image generation and editing.<n>This report presents the first-look evaluation benchmark (named GPT-ImgEval)<n>We show GPT-4o's performance across three critical dimensions: generation quality, (2) editing proficiency, and (3) world knowledge-informed synthesis.
Score: 28.235805447825896
License: http://creativecommons.org/publicdomain/zero/1.0/
Abstract: The recent breakthroughs in OpenAI's GPT4o model have demonstrated surprisingly good capabilities in image generation and editing, resulting in significant excitement in the community. This technical report presents the first-look evaluation benchmark (named GPT-ImgEval), quantitatively and qualitatively diagnosing GPT-4o's performance across three critical dimensions: (1) generation quality, (2) editing proficiency, and (3) world knowledge-informed semantic synthesis. Across all three tasks, GPT-4o demonstrates strong performance, significantly surpassing existing methods in both image generation control and output quality, while also showcasing exceptional knowledge reasoning capabilities. Furthermore, based on the GPT-4o's generated data, we propose a classification-model-based approach to investigate the underlying architecture of GPT-4o, where our empirical results suggest the model consists of an auto-regressive (AR) combined with a diffusion-based head for image decoding, rather than the VAR-like architectures. We also provide a complete speculation on GPT-4o's overall architecture. In addition, we conduct a series of analyses to identify and visualize GPT-4o's specific limitations and the synthetic artifacts commonly observed in its image generation. We also present a comparative study of multi-round image editing between GPT-4o and Gemini 2.0 Flash, and discuss the safety implications of GPT-4o's outputs, particularly their detectability by existing image forensic models. We hope that our work can offer valuable insight and provide a reliable benchmark to guide future research, foster reproducibility, and accelerate innovation in the field of image generation and beyond. The codes and datasets used for evaluating GPT-4o can be found at https://github.com/PicoTrex/GPT-ImgEval.

Related papers

Can ChatGPT Perform Image Splicing Detection? A Preliminary Study [0.0]
Multimodal Large Language Models (MLLMs) like GPT-4V are capable of reasoning across text and image modalities.<n>We evaluate GPT-4V using three prompting strategies: Zero-Shot (ZS), Few-Shot (FS), and Chain-of-Thought (CoT)<n>Our results show that GPT-4V achieves competitive detection performance in zero-shot settings (more than 85% accuracy)
arXiv Detail & Related papers (2025-05-22T13:53:53Z)
A Preliminary Study for GPT-4o on Image Restoration [7.784948465884567]
OpenAI's GPT-4o model has demonstrated unprecedented performance in image generation.<n>We present the first systematic evaluation of GPT-4o across diverse restoration tasks.
arXiv Detail & Related papers (2025-05-08T20:00:11Z)
Have we unified image generation and understanding yet? An empirical study of GPT-4o's image generation ability [6.586119023242877]
OpenAI's multimodal GPT-4o has demonstrated remarkable capabilities in image generation and editing. But its ability to achieve world knowledge-informed semantic synthesis remains unproven. Our study calls for the development of more robust benchmarks and training strategies.
arXiv Detail & Related papers (2025-04-09T16:10:15Z)
An Empirical Study of GPT-4o Image Generation Capabilities [40.86026243294732]
We conduct an empirical study of GPT-4o's image generation capabilities, benchmarking it against leading open-source and commercial models. Our analysis highlights the strengths and limitations of GPT-4o under various settings, and situates it within the broader evolution of generative modeling.
arXiv Detail & Related papers (2025-04-08T12:34:36Z)
Exploiting GPT-4 Vision for Zero-shot Point Cloud Understanding [114.4754255143887]
We tackle the challenge of classifying the object category in point clouds. We employ GPT-4 Vision (GPT-4V) to overcome these challenges. We set a new benchmark in zero-shot point cloud classification.
arXiv Detail & Related papers (2024-01-15T10:16:44Z)
Gemini Pro Defeated by GPT-4V: Evidence from Education [1.0226894006814744]
GPT-4V significantly outperforms Gemini Pro in terms of scoring accuracy and Quadratic Weighted Kappa. Findings suggest GPT-4V's superior capability in handling complex educational tasks.
arXiv Detail & Related papers (2023-12-27T02:56:41Z)
DeepArt: A Benchmark to Advance Fidelity Research in AI-Generated Content [9.482738088610535]
This paper explores the image synthesis capabilities of GPT-4, a leading multi-modal large language model. We establish a benchmark for evaluating the fidelity of texture features in images generated by GPT-4, comprising manually painted pictures and their AI-generated counterparts. We have compiled a unique benchmark of manual drawings and corresponding GPT-4-generated images, introducing a new task to advance fidelity research in AI-generated content.
arXiv Detail & Related papers (2023-12-16T10:17:09Z)
GPT4Vis: What Can GPT-4 Do for Zero-shot Visual Recognition? [82.40761196684524]
This paper centers on the evaluation of GPT-4's linguistic and visual capabilities in zero-shot visual recognition tasks. We conduct extensive experiments to evaluate GPT-4's performance across images, videos, and point clouds. Our findings show that GPT-4, enhanced with rich linguistic descriptions, significantly improves zero-shot recognition.
arXiv Detail & Related papers (2023-11-27T11:29:10Z)
Comparing Humans, GPT-4, and GPT-4V On Abstraction and Reasoning Tasks [53.936643052339]
We evaluate the reasoning abilities of text-only and multimodal versions of GPT-4. Our experimental results support the conclusion that neither version of GPT-4 has developed robust abstraction abilities at humanlike levels.
arXiv Detail & Related papers (2023-11-14T04:33:49Z)
Holistic Evaluation of GPT-4V for Biomedical Imaging [113.46226609088194]
GPT-4V represents a breakthrough in artificial general intelligence for computer vision. We assess GPT-4V's performance across 16 medical imaging categories, including radiology, oncology, ophthalmology, pathology, and more. Results show GPT-4V's proficiency in modality and anatomy recognition but difficulty with disease diagnosis and localization.
arXiv Detail & Related papers (2023-11-10T18:40:44Z)
The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision) [121.42924593374127]
We analyze the latest model, GPT-4V, to deepen the understanding of LMMs. GPT-4V's unprecedented ability in processing arbitrarily interleaved multimodal inputs makes it a powerful multimodal generalist system. GPT-4V's unique capability of understanding visual markers drawn on input images can give rise to new human-computer interaction methods.
arXiv Detail & Related papers (2023-09-29T17:34:51Z)
SentimentGPT: Exploiting GPT for Advanced Sentiment Analysis and its Departure from Current Machine Learning [5.177947445379688]
This study presents a thorough examination of various Generative Pretrained Transformer (GPT) methodologies in sentiment analysis. Three primary strategies are employed: 1) prompt engineering using the advanced GPT-3.5 Turbo, 2) fine-tuning GPT models, and 3) an inventive approach to embedding classification. The research yields detailed comparative insights among these strategies and individual GPT models, revealing their unique strengths and potential limitations.
arXiv Detail & Related papers (2023-07-16T05:33:35Z)
Can GPT-4 Perform Neural Architecture Search? [56.98363718371614]
We investigate the potential of GPT-4 to perform Neural Architecture Search (NAS) Our proposed approach, textbfGPT-4 textbfEnhanced textbfNeural archtextbfItecttextbfUre textbfSearch (GENIUS) We assess GENIUS across several benchmarks, comparing it with existing state-of-the-art NAS techniques to illustrate its effectiveness.
arXiv Detail & Related papers (2023-04-21T14:06:44Z)
GPT-4 Technical Report [116.90398195245983]
GPT-4 is a large-scale, multimodal model which can accept image and text inputs and produce text outputs. It exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers.
arXiv Detail & Related papers (2023-03-15T17:15:04Z)

This list is automatically generated from the titles and abstracts of the papers in this site.