Related papers: II-Bench: An Image Implication Understanding Benchmark for Multimodal Large Language Models

II-Bench: An Image Implication Understanding Benchmark for Multimodal Large Language Models

URL: http://arxiv.org/abs/2406.05862v2
Date: Tue, 11 Jun 2024 12:33:42 GMT
Title: II-Bench: An Image Implication Understanding Benchmark for Multimodal Large Language Models
Authors: Ziqiang Liu, Feiteng Fang, Xi Feng, Xinrun Du, Chenhao Zhang, Zekun Wang, Yuelin Bai, Qixuan Zhao, Liyang Fan, Chengguang Gan, Hongquan Lin, Jiaming Li, Yuansheng Ni, Haihong Wu, Yaswanth Narsupalli, Zhigang Zheng, Chengming Li, Xiping Hu, Ruifeng Xu, Xiaojun Chen, Min Yang, Jiaheng Liu, Ruibo Liu, Wenhao Huang, Ge Zhang, Shiwen Ni,
Abstract summary: multimodal large language models (MLLMs) have consistently led to new breakthroughs on various benchmarks. We propose the Image Implication understanding Benchmark, II-Bench, which aims to evaluate the model's higher-order perception of images.
Score: 49.070801221350486
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The rapid advancements in the development of multimodal large language models (MLLMs) have consistently led to new breakthroughs on various benchmarks. In response, numerous challenging and comprehensive benchmarks have been proposed to more accurately assess the capabilities of MLLMs. However, there is a dearth of exploration of the higher-order perceptual capabilities of MLLMs. To fill this gap, we propose the Image Implication understanding Benchmark, II-Bench, which aims to evaluate the model's higher-order perception of images. Through extensive experiments on II-Bench across multiple MLLMs, we have made significant findings. Initially, a substantial gap is observed between the performance of MLLMs and humans on II-Bench. The pinnacle accuracy of MLLMs attains 74.8%, whereas human accuracy averages 90%, peaking at an impressive 98%. Subsequently, MLLMs perform worse on abstract and complex images, suggesting limitations in their ability to understand high-level semantics and capture image details. Finally, it is observed that most models exhibit enhanced accuracy when image sentiment polarity hints are incorporated into the prompts. This observation underscores a notable deficiency in their inherent understanding of image sentiment. We believe that II-Bench will inspire the community to develop the next generation of MLLMs, advancing the journey towards expert artificial general intelligence (AGI). II-Bench is publicly available at https://huggingface.co/datasets/m-a-p/II-Bench.

Related papers

GOBench: Benchmarking Geometric Optics Generation and Understanding of MLLMs [66.55945133516776]
We introduce GOBench, the first benchmark to evaluate MLLMs' ability across two tasks: Generating Optically Authentic Imagery and Understanding Underlying Optical Phenomena.<n>We use MLLMs to construct GOBench-Gen-1k dataset. We then organize subjective experiments to assess the generated imagery based on Optical Authenticity, Aesthetic Quality, and Instruction Fidelity.<n>For the understanding task, we apply crafted evaluation instructions to test optical understanding ability of eleven prominent MLLMs. The experimental results demonstrate that current models face significant challenges in both optical generation and understanding.
arXiv Detail & Related papers (2025-06-01T12:46:14Z)
Seeing from Another Perspective: Evaluating Multi-View Understanding in MLLMs [41.072699990427374]
Multi-view understanding is a fundamental challenge in Multi-Modal Large Language Models (MLLMs) to be used as embodied agents. We propose All-Angles Bench, a benchmark of over 2,100 human carefully annotated multi-view question-answer pairs across 90 real-world scenes. Our experiments, benchmark on 27 representative MLLMs including Gemini-2.0-Flash, Claude-3.7-Sonnet, and GPT-4o against human evaluators reveals a substantial performance gap.
arXiv Detail & Related papers (2025-04-21T17:59:53Z)
VOILA: Evaluation of MLLMs For Perceptual Understanding and Analogical Reasoning [63.0285363282581]
Multimodal Large Language Models (MLLMs) have become a powerful tool for integrating visual and textual information. We introduce VOILA, a benchmark designed to evaluate MLLMs' perceptual understanding and abstract relational reasoning. We reveal that current MLLMs struggle to comprehend inter-image relationships and exhibit limited capabilities in high-level relational reasoning.
arXiv Detail & Related papers (2025-02-25T23:36:19Z)
Can MLLMs Understand the Deep Implication Behind Chinese Images? [29.007010549079098]
We introduce the **C**hinese **I**mage **I**mplication understanding **Bench**mark, **CII-Bench**, which aims to assess the higher-order perception and understanding capabilities of MLLMs for Chinese images. Images in CII-Bench are sourced from the Chinese Internet and manually reviewed, with corresponding answers also manually crafted. CII-Bench incorporates images that represent Chinese traditional culture, such as famous Chinese traditional paintings, which can deeply reflect the model's understanding of Chinese traditional culture.
arXiv Detail & Related papers (2024-10-17T17:59:24Z)
MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs [61.56904387052982]
This paper proposes a new visual grounding task called multi-context visual grounding. It aims to localize instances of interest across multiple images based on open-ended text prompts. We benchmark over 20 state-of-the-art MLLMs and foundation models with potential multi-context visual grounding capabilities.
arXiv Detail & Related papers (2024-10-16T07:52:57Z)
Divide, Conquer and Combine: A Training-Free Framework for High-Resolution Image Perception in Multimodal Large Language Models [57.280853324896306]
Multimodal large language models (MLLMs) struggle to recognize and interpret intricate details in high-resolution (HR) images effectively. We introduce HR-Bench, the first deliberately designed benchmark to rigorously evaluate MLLM performance on 4K&8K images. We propose Divide, Conquer and Combine (DC$2$), a novel training-free framework for enhancing MLLM perception of HR images.
arXiv Detail & Related papers (2024-08-28T06:09:02Z)
MileBench: Benchmarking MLLMs in Long Context [31.211260223575092]
We introduce MileBench, a benchmark designed to test the MultImodal Long-contExt capabilities of MLLMs. We systematically assess MLLMs' long-context adaptation capacity and their ability to complete tasks in long-context scenarios. Results show that while the closed-source GPT-4o outperforms others, most open-source MLLMs struggle in long-context situations.
arXiv Detail & Related papers (2024-04-29T09:19:05Z)
The Instinctive Bias: Spurious Images lead to Illusion in MLLMs [34.91795817316696]
We identify a typical class of inputs that baffles MLLMs, which consist of images that are highly relevant but inconsistent with answers. We propose CorrelationQA, the first benchmark that assesses the visual illusion level given spurious images. We conduct a thorough analysis on 9 mainstream MLLMs, illustrating that they universally suffer from this instinctive bias to varying degrees.
arXiv Detail & Related papers (2024-02-06T06:48:46Z)
AesBench: An Expert Benchmark for Multimodal Large Language Models on Image Aesthetics Perception [64.25808552299905]
AesBench is an expert benchmark aiming to comprehensively evaluate the aesthetic perception capacities of MLLMs. We construct an Expert-labeled Aesthetics Perception Database (EAPD), which features diversified image contents and high-quality annotations provided by professional aesthetic experts. We propose a set of integrative criteria to measure the aesthetic perception abilities of MLLMs from four perspectives, including Perception (AesP), Empathy (AesE), Assessment (AesA) and Interpretation (AesI)
arXiv Detail & Related papers (2024-01-16T10:58:07Z)
SEED-Bench-2: Benchmarking Multimodal Large Language Models [67.28089415198338]
Multimodal large language models (MLLMs) have recently demonstrated exceptional capabilities in generating not only texts but also images given interleaved multimodal inputs. SEED-Bench-2 comprises 24K multiple-choice questions with accurate human annotations, which spans 27 dimensions. We evaluate the performance of 23 prominent open-source MLLMs and summarize valuable observations.
arXiv Detail & Related papers (2023-11-28T05:53:55Z)
Q-Bench: A Benchmark for General-Purpose Foundation Models on Low-level Vision [85.6008224440157]
Multi-modality Large Language Models (MLLMs) have catalyzed a shift in computer vision from specialized models to general-purpose foundation models. We present Q-Bench, a holistic benchmark crafted to evaluate potential abilities of MLLMs on three realms: low-level visual perception, low-level visual description, and overall visual quality assessment.
arXiv Detail & Related papers (2023-09-25T14:43:43Z)

This list is automatically generated from the titles and abstracts of the papers in this site.