SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension
- URL: http://arxiv.org/abs/2307.16125v2
- Date: Wed, 2 Aug 2023 08:02:35 GMT
- Title: SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension
- Authors: Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, Ying Shan
- Abstract summary: We introduce a benchmark named SEED-Bench to assess generative comprehension in Multimodal LLMs.
SEED-Bench consists of 19K multiple-choice questions with accurate human annotations.
We evaluate the performance of 18 models across all 12 dimensions, covering both spatial and temporal understanding.
- Score: 27.53415400454066
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Based on powerful Large Language Models (LLMs), recent generative Multimodal
Large Language Models (MLLMs) have gained prominence as a pivotal research
area, exhibiting remarkable capability for both comprehension and generation.
In this work, we address the evaluation of generative comprehension in MLLMs as
a preliminary step towards a comprehensive assessment of generative models, by
introducing a benchmark named SEED-Bench. SEED-Bench consists of 19K
multiple-choice questions with accurate human annotations (6× larger than
existing benchmarks), spanning 12 evaluation dimensions that cover
comprehension of both the image and video modalities. We develop an advanced
pipeline for generating multiple-choice questions that target specific
evaluation dimensions, integrating both automatic filtering and manual
verification processes. Multiple-choice questions with ground-truth options
derived from human annotation enable an objective and efficient assessment of
model performance, eliminating the need for human or GPT intervention during
evaluation. We further evaluate the performance of 18 models across all 12
dimensions, covering both spatial and temporal understanding. By revealing
the limitations of existing MLLMs through these evaluation results, we aim for
SEED-Bench to provide insights that motivate future research. We will launch
and consistently maintain a leaderboard to provide a platform for the community
to assess and investigate model capability.
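The protocol described above reduces to scoring each candidate option with the model under test and counting exact matches against the human-labelled answer, so no human or GPT judging is needed at evaluation time. Below is a minimal sketch of such likelihood-ranking multiple-choice scoring with per-dimension accuracy aggregation; the `log_likelihood` callable and the record fields are illustrative assumptions, not SEED-Bench's actual interface.

```python
# Minimal sketch of likelihood-based multiple-choice evaluation, in the spirit
# of SEED-Bench's objective protocol. The scorer interface and sample layout
# are assumptions for illustration only.
from collections import defaultdict
from typing import Callable, Dict, List


def pick_choice(question: str,
                choices: List[str],
                log_likelihood: Callable[[str, str], float]) -> int:
    """Return the index of the choice the model scores as most likely,
    i.e. argmax over per-choice log-likelihood given the question."""
    scores = [log_likelihood(question, c) for c in choices]
    return max(range(len(choices)), key=lambda i: scores[i])


def evaluate(samples: List[Dict],
             log_likelihood: Callable[[str, str], float]) -> Dict[str, float]:
    """Compute accuracy per evaluation dimension against ground-truth answers."""
    correct, total = defaultdict(int), defaultdict(int)
    for s in samples:
        pred = pick_choice(s["question"], s["choices"], log_likelihood)
        total[s["dimension"]] += 1
        correct[s["dimension"]] += int(pred == s["answer_idx"])
    return {d: correct[d] / total[d] for d in total}


if __name__ == "__main__":
    # Toy stand-in scorer: favors the longest choice. A real run would score
    # each (question, choice) pair with the MLLM under test.
    dummy = lambda q, c: float(len(c))
    demo = [{"dimension": "instance counting",
             "question": "How many cats are in the image?",
             "choices": ["one", "two", "three cats"],
             "answer_idx": 2}]
    print(evaluate(demo, dummy))
```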
Related papers
- MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models [71.36392373876505]
We introduce MMIE, a large-scale benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs).
MMIE comprises 20K meticulously curated multimodal queries, spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and arts.
It supports both interleaved inputs and outputs, offering a mix of multiple-choice and open-ended question formats to evaluate diverse competencies.
arXiv Detail & Related papers (2024-10-14T04:15:00Z)
- SEED-Bench-2: Benchmarking Multimodal Large Language Models [67.28089415198338]
Multimodal large language models (MLLMs) have recently demonstrated exceptional capabilities in generating not only texts but also images given interleaved multimodal inputs.
SEED-Bench-2 comprises 24K multiple-choice questions with accurate human annotations, spanning 27 evaluation dimensions.
We evaluate the performance of 23 prominent open-source MLLMs and summarize valuable observations.
arXiv Detail & Related papers (2023-11-28T05:53:55Z)
- MM-BigBench: Evaluating Multimodal Models on Multimodal Content Comprehension Tasks [56.60050181186531]
We introduce MM-BigBench, which incorporates a diverse range of metrics to offer an extensive evaluation of the performance of various models and instructions.
Our paper evaluates a total of 20 language models (14 MLLMs) on 14 multimodal datasets spanning 6 tasks, with 10 instructions for each task, and derives novel insights.
arXiv Detail & Related papers (2023-10-13T11:57:04Z)
- Generative Judge for Evaluating Alignment [84.09815387884753]
We propose a generative judge with 13B parameters, Auto-J, designed to address these challenges.
Our model is trained on user queries and LLM-generated responses under massive real-world scenarios.
Experimentally, Auto-J outperforms a series of strong competitors, including both open-source and closed-source models.
arXiv Detail & Related papers (2023-10-09T07:27:15Z)
- MMBench: Is Your Multi-modal Model an All-around Player? [114.45702807380415]
We propose MMBench, a benchmark for assessing the multi-modal capabilities of vision-language models.
MMBench is meticulously curated with well-designed quality control schemes.
MMBench incorporates multiple-choice questions in both English and Chinese versions.
arXiv Detail & Related papers (2023-07-12T16:23:09Z)
- Unlocking the Potential of User Feedback: Leveraging Large Language Model as User Simulator to Enhance Dialogue System [65.93577256431125]
We propose an alternative approach called User-Guided Response Optimization (UGRO) that combines a large language model with a smaller task-oriented dialogue (TOD) model.
This approach uses the LLM as an annotation-free user simulator to assess dialogue responses, pairing it with smaller fine-tuned end-to-end TOD models.
Our approach outperforms previous state-of-the-art (SOTA) results.
arXiv Detail & Related papers (2023-06-16T13:04:56Z)
This list is automatically generated from the titles and abstracts of the papers listed on this site.