GOBench: Benchmarking Geometric Optics Generation and Understanding of MLLMs
- URL: http://arxiv.org/abs/2506.00991v2
- Date: Tue, 05 Aug 2025 09:17:28 GMT
- Title: GOBench: Benchmarking Geometric Optics Generation and Understanding of MLLMs
- Authors: Xiaorong Zhu, Ziheng Jia, Jiarui Wang, Xiangyu Zhao, Haodong Duan, Xiongkuo Min, Jia Wang, Zicheng Zhang, Guangtao Zhai
- Abstract summary: We introduce GOBench, the first benchmark to evaluate MLLMs' ability across two tasks: Generating Optically Authentic Imagery and Understanding Underlying Optical Phenomena. We use MLLMs to construct the GOBench-Gen-1k dataset, then organize subjective experiments to assess the generated imagery on Optical Authenticity, Aesthetic Quality, and Instruction Fidelity. For the understanding task, we apply crafted evaluation instructions to test the optical understanding ability of eleven prominent MLLMs. The experimental results demonstrate that current models face significant challenges in both optical generation and understanding.
- Score: 66.55945133516776
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rapid evolution of Multi-modality Large Language Models (MLLMs) is driving significant advancements in visual understanding and generation. Nevertheless, a comprehensive assessment of their grasp of fine-grained physical principles, especially in geometric optics, remains underexplored. To address this gap, we introduce GOBench, the first benchmark to systematically evaluate MLLMs' ability across two tasks: 1) Generating Optically Authentic Imagery and 2) Understanding Underlying Optical Phenomena. We curate high-quality prompts of geometric optical scenarios and use MLLMs to construct the GOBench-Gen-1k dataset. We then organize subjective experiments to assess the generated imagery on Optical Authenticity, Aesthetic Quality, and Instruction Fidelity, revealing generation flaws that violate optical principles. For the understanding task, we apply crafted evaluation instructions to test the optical understanding ability of eleven prominent MLLMs. The experimental results demonstrate that current models face significant challenges in both optical generation and understanding. The top-performing generative model, GPT-4o-Image, cannot perfectly complete all generation tasks, and the best-performing MLLM, Gemini-2.5-Pro, attains a mere 37.35% accuracy in optical understanding. The database and code are publicly available at https://github.com/aiben-ch/GOBench.
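The evaluation protocol described above reduces to two measurements: mean subjective ratings per axis for the generation task, and multiple-choice accuracy for the understanding task. The following Python sketch illustrates that scoring logic only; the item schema, rating scale, and `query_mllm` stub are hypothetical stand-ins, not the actual GOBench code.

```python
from statistics import mean

# Hypothetical item layout for the understanding task; the real GOBench
# schema may differ (see https://github.com/aiben-ch/GOBench).
ITEMS = [
    {"image": "refraction_01.png",
     "question": "Why does a straw in a glass of water look bent at the surface?",
     "choices": ["Reflection", "Refraction", "Diffraction", "Dispersion"],
     "answer": "Refraction"},
]

# Hypothetical subjective ratings for one generated image, on the paper's
# three axes (rating scale assumed, e.g. 1-5).
RATINGS = [
    {"optical_authenticity": 3, "aesthetic_quality": 4, "instruction_fidelity": 5},
    {"optical_authenticity": 2, "aesthetic_quality": 4, "instruction_fidelity": 4},
]

def query_mllm(image_path, question, choices):
    """Stand-in for a real MLLM call; an actual evaluator would send the
    image and question to a model API and parse the returned choice."""
    return choices[0]  # stub: always picks the first option

def understanding_accuracy(items):
    """Fraction of items where the model's choice matches the ground truth."""
    correct = sum(
        query_mllm(it["image"], it["question"], it["choices"]) == it["answer"]
        for it in items
    )
    return correct / len(items)

def generation_scores(ratings):
    """Mean subjective rating per evaluation axis."""
    axes = ("optical_authenticity", "aesthetic_quality", "instruction_fidelity")
    return {axis: mean(r[axis] for r in ratings) for axis in axes}

if __name__ == "__main__":
    print(f"understanding accuracy: {understanding_accuracy(ITEMS):.2%}")
    print(f"generation scores: {generation_scores(RATINGS)}")
```

Under this reading, the paper's headline numbers (e.g., Gemini-2.5-Pro's 37.35% accuracy) correspond to `understanding_accuracy` computed over the full benchmark rather than this toy sample.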
Related papers
- VOILA: Evaluation of MLLMs For Perceptual Understanding and Analogical Reasoning [63.0285363282581]
Multimodal Large Language Models (MLLMs) have become a powerful tool for integrating visual and textual information. We introduce VOILA, a benchmark designed to evaluate MLLMs' perceptual understanding and abstract relational reasoning. We reveal that current MLLMs struggle to comprehend inter-image relationships and exhibit limited capabilities in high-level relational reasoning.
arXiv Detail & Related papers (2025-02-25T23:36:19Z) - Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions [23.294711275107606]
This paper introduces Geoperception, a benchmark to evaluate an MLLM's ability to accurately transcribe 2D geometric information from an image. We then conduct a comprehensive empirical study to explore strategies for improving MLLM performance on geometric tasks. We develop Euclid, a family of models specifically optimized for strong low-level geometric perception.
arXiv Detail & Related papers (2024-12-11T19:12:13Z) - EAGLE: Elevating Geometric Reasoning through LLM-empowered Visual Instruction Tuning [16.631783647518706]
Existing MLLMs mostly optimize the LLM backbone to acquire geometric reasoning capabilities, while rarely emphasizing improvements in visual comprehension.
Our findings reveal that current MLLMs severely suffer from inaccurate geometric perception and hallucinations.
We propose a novel two-stage end-to-end visual enhancement MLLM framework designed to ElevAte Geometric reasoning.
arXiv Detail & Related papers (2024-08-21T07:43:50Z) - II-Bench: An Image Implication Understanding Benchmark for Multimodal Large Language Models [49.070801221350486]
Multimodal large language models (MLLMs) have consistently led to new breakthroughs on various benchmarks. We propose the Image Implication understanding Benchmark, II-Bench, which aims to evaluate models' higher-order perception of images.
arXiv Detail & Related papers (2024-06-09T17:25:47Z) - Q-Bench+: A Benchmark for Multi-modal Foundation Models on Low-level Vision from Single Images to Pairs [71.07108539262721]
We design benchmark settings to emulate human language responses related to low-level vision.
We extend the low-level perception-related question-answering and description evaluations of MLLMs from single images to image pairs.
We demonstrate that several MLLMs have decent low-level visual competencies on single images, but only GPT-4V exhibits higher accuracy on pairwise comparisons than humans.
arXiv Detail & Related papers (2024-02-11T06:44:11Z) - From Training-Free to Adaptive: Empirical Insights into MLLMs' Understanding of Detection Information [32.57246173437492]
Vision detection models excel at recognizing fine-grained image details. One simple and effective strategy is to infuse detection information in text format into MLLMs. This paper addresses the question: how does training impact MLLMs' understanding of infused textual detection information?
arXiv Detail & Related papers (2024-01-31T16:38:32Z) - AesBench: An Expert Benchmark for Multimodal Large Language Models on Image Aesthetics Perception [64.25808552299905]
AesBench is an expert benchmark aiming to comprehensively evaluate the aesthetic perception capacities of MLLMs.
We construct an Expert-labeled Aesthetics Perception Database (EAPD), which features diversified image contents and high-quality annotations provided by professional aesthetic experts.
We propose a set of integrative criteria to measure the aesthetic perception abilities of MLLMs from four perspectives: Perception (AesP), Empathy (AesE), Assessment (AesA), and Interpretation (AesI).
arXiv Detail & Related papers (2024-01-16T10:58:07Z) - Q-Bench: A Benchmark for General-Purpose Foundation Models on Low-level Vision [85.6008224440157]
Multi-modality Large Language Models (MLLMs) have catalyzed a shift in computer vision from specialized models to general-purpose foundation models.
We present Q-Bench, a holistic benchmark crafted to evaluate potential abilities of MLLMs on three realms: low-level visual perception, low-level visual description, and overall visual quality assessment.
arXiv Detail & Related papers (2023-09-25T14:43:43Z)