SridBench: Benchmark of Scientific Research Illustration Drawing of Image Generation Model
- URL: http://arxiv.org/abs/2505.22126v1
- Date: Wed, 28 May 2025 08:51:01 GMT
- Title: SridBench: Benchmark of Scientific Research Illustration Drawing of Image Generation Model
- Authors: Yifan Chang, Yukang Feng, Jianwen Sun, Jiaxin Ai, Chuanhao Li, S. Kevin Zhou, Kaipeng Zhang
- Abstract summary: SridBench is the first benchmark for scientific figure generation. It comprises 1,120 instances from leading scientific papers across 13 natural and computer science disciplines. Results reveal that even top-tier models like GPT-4o-image lag behind human performance.
- Score: 21.81341169834812
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Recent years have seen rapid advances in AI-driven image generation. Early diffusion models emphasized perceptual quality, while newer multimodal models like GPT-4o-image integrate high-level reasoning, improving semantic understanding and structural composition. Scientific illustration generation exemplifies this evolution: unlike general image synthesis, it demands accurate interpretation of technical content and transformation of abstract ideas into clear, standardized visuals. This task is significantly more knowledge-intensive and laborious, often requiring hours of manual work and specialized tools. Automating it in a controllable, intelligent manner would provide substantial practical value. Yet, no benchmark currently exists to evaluate AI on this front. To fill this gap, we introduce SridBench, the first benchmark for scientific figure generation. It comprises 1,120 instances curated from leading scientific papers across 13 natural and computer science disciplines, collected via human experts and MLLMs. Each sample is evaluated along six dimensions, including semantic fidelity and structural accuracy. Experimental results reveal that even top-tier models like GPT-4o-image lag behind human performance, with common issues in text/visual clarity and scientific correctness. These findings highlight the need for more advanced reasoning-driven visual generation capabilities.
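The abstract states that each sample is evaluated along six dimensions, including semantic fidelity and structural accuracy. A minimal sketch of how such per-dimension scores might be aggregated into a single sample score is shown below; note that the full set of dimension names and the aggregation rule are assumptions for illustration, since the abstract names only two of the six dimensions.

```python
# Hypothetical sketch of per-sample score aggregation for a SridBench-style
# evaluation. Only "semantic_fidelity" and "structural_accuracy" come from the
# abstract; the remaining four dimension names are illustrative placeholders.

DIMENSIONS = [
    "semantic_fidelity",
    "structural_accuracy",
    "text_clarity",           # assumed
    "visual_clarity",         # assumed
    "scientific_correctness", # assumed
    "aesthetics",             # assumed
]

def aggregate_score(scores: dict) -> float:
    """Unweighted mean over the six dimensions (scores assumed in [0, 1])."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    return sum(scores[d] for d in DIMENSIONS) / len(DIMENSIONS)

# Usage: score one generated figure rated 0.8 on every dimension.
example = {d: 0.8 for d in DIMENSIONS}
print(aggregate_score(example))
```

An unweighted mean is the simplest choice; a real benchmark might weight dimensions such as scientific correctness more heavily.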
Related papers
- GOBench: Benchmarking Geometric Optics Generation and Understanding of MLLMs [66.55945133516776]
We introduce GOBench, the first benchmark to evaluate MLLMs' ability across two tasks: Generating Optically Authentic Imagery and Understanding Underlying Optical Phenomena. We use MLLMs to construct the GOBench-Gen-1k dataset, then organize subjective experiments to assess the generated imagery based on Optical Authenticity, Aesthetic Quality, and Instruction Fidelity. For the understanding task, we apply crafted evaluation instructions to test the optical understanding ability of eleven prominent MLLMs. The experimental results demonstrate that current models face significant challenges in both optical generation and understanding.
arXiv Detail & Related papers (2025-06-01T12:46:14Z) - Preliminary Explorations with GPT-4o(mni) Native Image Generation [7.700772640399941]
Recently, the visual generation ability of GPT-4o(mni) has been unlocked by OpenAI. In this paper, we aim to explore the capabilities of GPT-4o across various tasks.
arXiv Detail & Related papers (2025-05-06T19:35:29Z) - An Empirical Study of GPT-4o Image Generation Capabilities [40.86026243294732]
We conduct an empirical study of GPT-4o's image generation capabilities, benchmarking it against leading open-source and commercial models. Our analysis highlights the strengths and limitations of GPT-4o under various settings, and situates it within the broader evolution of generative modeling.
arXiv Detail & Related papers (2025-04-08T12:34:36Z) - Generative Physical AI in Vision: A Survey [78.07014292304373]
Generative Artificial Intelligence (AI) has rapidly advanced the field of computer vision by enabling machines to create and interpret visual data with unprecedented sophistication. This transformation builds upon a foundation of generative models to produce realistic images, videos, and 3D/4D content. As generative models evolve to increasingly integrate physical realism and dynamic simulation, their potential to function as "world simulators" expands.
arXiv Detail & Related papers (2025-01-19T03:19:47Z) - KITTEN: A Knowledge-Intensive Evaluation of Image Generation on Visual Entities [93.74881034001312]
We conduct a systematic study on the fidelity of entities in text-to-image generation models.
We focus on their ability to generate a wide range of real-world visual entities, such as landmark buildings, aircraft, plants, and animals.
Our findings reveal that even the most advanced text-to-image models often fail to generate entities with accurate visual details.
arXiv Detail & Related papers (2024-10-15T17:50:37Z) - Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation [51.750634349748736]
Text-to-video (T2V) models have made significant strides in visualizing complex prompts.
However, the capacity of these models to accurately represent intuitive physics remains largely unexplored.
We introduce PhyGenBench to evaluate physical commonsense correctness in T2V generation.
arXiv Detail & Related papers (2024-10-07T17:56:04Z) - PhyBench: A Physical Commonsense Benchmark for Evaluating Text-to-Image Models [50.33699462106502]
Text-to-image (T2I) models frequently fail to produce images consistent with physical commonsense.
Current T2I evaluation benchmarks focus on metrics such as accuracy, bias, and safety, neglecting the evaluation of models' internal knowledge.
We introduce PhyBench, a comprehensive T2I evaluation dataset comprising 700 prompts across 4 primary categories: mechanics, optics, thermodynamics, and material properties.
arXiv Detail & Related papers (2024-06-17T17:49:01Z) - Multimodal Deep Learning for Scientific Imaging Interpretation [0.0]
This study presents a novel methodology to linguistically emulate and evaluate human-like interactions with Scanning Electron Microscopy (SEM) images.
Our approach distills insights from both textual and visual data harvested from peer-reviewed articles.
Our model (GlassLLaVA) excels in crafting accurate interpretations, identifying key features, and detecting defects in previously unseen SEM images.
arXiv Detail & Related papers (2023-09-21T20:09:22Z) - GM-NeRF: Learning Generalizable Model-based Neural Radiance Fields from Multi-view Images [79.39247661907397]
We introduce an effective framework, Generalizable Model-based Neural Radiance Fields (GM-NeRF), to synthesize free-viewpoint images.
Specifically, we propose a geometry-guided attention mechanism to register the appearance code from multi-view 2D images to a geometry proxy.
arXiv Detail & Related papers (2023-03-24T03:32:02Z) - Perception Over Time: Temporal Dynamics for Robust Image Understanding [5.584060970507506]
Deep learning surpasses human-level performance in narrow and specific vision tasks.
Human visual perception is orders of magnitude more robust to changes in the input stimulus.
We introduce a novel method of incorporating temporal dynamics into static image understanding.
arXiv Detail & Related papers (2022-03-11T21:11:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.