SEED-Bench-2: Benchmarking Multimodal Large Language Models
- URL: http://arxiv.org/abs/2311.17092v1
- Date: Tue, 28 Nov 2023 05:53:55 GMT
- Title: SEED-Bench-2: Benchmarking Multimodal Large Language Models
- Authors: Bohao Li, Yuying Ge, Yixiao Ge, Guangzhi Wang, Rui Wang, Ruimao Zhang,
Ying Shan
- Abstract summary: Multimodal large language models (MLLMs) have recently demonstrated exceptional capabilities in generating not only texts but also images given interleaved multimodal inputs.
SEED-Bench-2 comprises 24K multiple-choice questions with accurate human annotations, spanning 27 dimensions.
We evaluate the performance of 23 prominent open-source MLLMs and summarize valuable observations.
- Score: 67.28089415198338
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal large language models (MLLMs), building upon the foundation of
powerful large language models (LLMs), have recently demonstrated exceptional
capabilities in generating not only texts but also images given interleaved
multimodal inputs (acting like a combination of GPT-4V and DALL-E 3). However,
existing MLLM benchmarks remain limited to assessing models' comprehension of
single image-text inputs and fail to keep pace with the strides made in MLLMs.
A comprehensive benchmark is imperative for investigating the
progress and uncovering the limitations of current MLLMs. In this work, we
categorize the capabilities of MLLMs into hierarchical levels from $L_0$ to
$L_4$ based on the modalities they can accept and generate, and propose
SEED-Bench-2, a comprehensive benchmark that evaluates the
hierarchical capabilities of MLLMs. Specifically, SEED-Bench-2
comprises 24K multiple-choice questions with accurate human annotations,
spanning 27 dimensions, including the evaluation of both text and image
generation. Multiple-choice questions with ground-truth options derived from
human annotation enable an objective and efficient assessment of model
performance, eliminating the need for human or GPT intervention during
evaluation. We further evaluate the performance of 23 prominent open-source
MLLMs and summarize valuable observations. By revealing the limitations of
existing MLLMs through extensive evaluations, we aim for SEED-Bench-2 to
provide insights that will motivate future research towards the goal of General
Artificial Intelligence. Dataset and evaluation code are available at
https://github.com/AILab-CVC/SEED-Bench.
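The evaluation protocol described above (ground-truth multiple-choice options, no human or GPT judge) can be illustrated with a short sketch that scores each candidate option by its language-model likelihood and picks the most probable one. The text-only placeholder model ("gpt2") and the helper names below are illustrative assumptions, not the official SEED-Bench-2 implementation.

```python
# Minimal sketch of likelihood-based multiple-choice scoring, the kind of
# protocol SEED-Bench-style benchmarks use to grade models objectively.
# The placeholder LM and helper names are assumptions for illustration only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")          # text-only stand-in for an MLLM
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def choice_loss(question: str, choice: str) -> float:
    """Average negative log-likelihood of `choice` conditioned on `question`.
    Assumes the prompt tokenization is a prefix of the full tokenization,
    which holds for typical BPE tokenizers with a space-separated answer."""
    prompt_ids = tokenizer(question, return_tensors="pt").input_ids
    full_ids = tokenizer(question + " " + choice, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100   # mask prompt tokens; score only the answer
    with torch.no_grad():
        out = model(input_ids=full_ids, labels=labels)
    return out.loss.item()

def pick_answer(question: str, choices: list[str]) -> int:
    """Return the index of the lowest-loss (most likely) choice."""
    return min(range(len(choices)), key=lambda i: choice_loss(question, choices[i]))

# Accuracy over a toy set of (question, choices, ground-truth index) items.
items = [("The capital of France is", ["Berlin", "Paris", "Rome", "Madrid"], 1)]
accuracy = sum(pick_answer(q, c) == gt for q, c, gt in items) / len(items)
print(f"accuracy: {accuracy:.2%}")
```

Because the ground-truth option is fixed in advance, accuracy reduces to an exact-match count, which is what makes this style of evaluation cheap and reproducible compared with free-form generation judged by humans or GPT.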
Related papers
- MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs [61.56904387052982]
This paper proposes a new visual grounding task called multi-context visual grounding.
It aims to localize instances of interest across multiple images based on open-ended text prompts.
We benchmark over 20 state-of-the-art MLLMs and foundation models with potential multi-context visual grounding capabilities.
arXiv Detail & Related papers (2024-10-16T07:52:57Z)
- SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension [62.40482764691584]
We introduce SEED-Bench-2-Plus, a benchmark specifically designed for evaluating the text-rich visual comprehension of MLLMs.
Our benchmark comprises 2.3K multiple-choice questions with precise human annotations, spanning three broad categories: Charts, Maps, and Webs.
We conduct a thorough evaluation involving 34 prominent MLLMs and emphasize the current limitations of MLLMs in text-rich visual comprehension.
arXiv Detail & Related papers (2024-04-25T17:39:35Z)
- Q-Bench+: A Benchmark for Multi-modal Foundation Models on Low-level Vision from Single Images to Pairs [71.07108539262721]
We design benchmark settings to emulate human language responses related to low-level vision.
We extend the low-level perception-related question-answering and description evaluations of MLLMs from single images to image pairs.
We demonstrate that several MLLMs have decent low-level visual competencies on single images, but only GPT-4V exhibits higher accuracy than humans on pairwise comparisons.
arXiv Detail & Related papers (2024-02-11T06:44:11Z)
- ChEF: A Comprehensive Evaluation Framework for Standardized Assessment of Multimodal Large Language Models [49.48109472893714]
Multimodal Large Language Models (MLLMs) have shown impressive abilities in interacting with visual content, with myriad potential downstream tasks.
We present the first Comprehensive Evaluation Framework (ChEF) that can holistically profile each MLLM and fairly compare different MLLMs.
We will publicly release all the detailed implementations for further analysis, as well as an easy-to-use modular toolkit for the integration of new recipes and models.
arXiv Detail & Related papers (2023-11-05T16:01:40Z)
- SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension [27.53415400454066]
We introduce a benchmark named SEED-Bench to assess the generative comprehension of MLLMs.
SEED-Bench consists of 19K multiple-choice questions with accurate human annotations.
We evaluate the performance of 18 models across all 12 dimensions, covering both spatial and temporal understanding.
arXiv Detail & Related papers (2023-07-30T04:25:16Z)
- MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models [73.86954509967416]
A Multimodal Large Language Model (MLLM) relies on a powerful LLM to perform multimodal tasks.
This paper presents MME, the first comprehensive MLLM evaluation benchmark.
It measures both perception and cognition abilities on a total of 14 subtasks.
arXiv Detail & Related papers (2023-06-23T09:22:36Z)