MULTI: Multimodal Understanding Leaderboard with Text and Images
- URL: http://arxiv.org/abs/2402.03173v3
- Date: Tue, 07 Jan 2025 07:05:05 GMT
- Title: MULTI: Multimodal Understanding Leaderboard with Text and Images
- Authors: Zichen Zhu, Yang Xu, Lu Chen, Jingkai Yang, Yichuan Ma, Yiming Sun, Hailin Wen, Jiaqi Liu, Jinyu Cai, Yingzi Ma, Situo Zhang, Zihan Zhao, Liangtai Sun, Kai Yu,
- Abstract summary: We present MULTI, a Chinese multimodal dataset derived from authentic examination questions. MULTI evaluates models using real-world examination standards, encompassing image-text comprehension, complex reasoning, and knowledge recall. Our evaluation highlights substantial room for MLLM advancement, with Qwen2-VL-72B achieving 76.9% accuracy on MULTI and 53.1% on MULTI-Elite, leading 25 evaluated models.
- Score: 24.04211732343361
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rapid development of multimodal large language models (MLLMs) raises the question of how they compare to human performance. While existing datasets often feature synthetic or overly simplistic tasks, some models have already surpassed human expert baselines. In this paper, we present MULTI, a Chinese multimodal dataset derived from authentic examination questions. Comprising over 18,000 carefully selected and refined questions, MULTI evaluates models using real-world examination standards, encompassing image-text comprehension, complex reasoning, and knowledge recall. We also introduce MULTI-Elite, a 500-question hard subset, and MULTI-Extend, which provides more than 4,500 external knowledge context pieces for testing in-context learning capabilities. Our evaluation highlights substantial room for MLLM advancement, with Qwen2-VL-72B achieving 76.9% accuracy on MULTI and 53.1% on MULTI-Elite, leading 25 evaluated models, compared to human expert baselines of 86.1% and 73.1%. MULTI serves not only as a robust evaluation platform but also paves the way for the development of expert-level AI.
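To make the accuracy-based scoring described above concrete, here is a minimal sketch of how an exact-match accuracy over a MULTI-style multiple-choice split could be computed. The record fields and the `predict_option` stub are illustrative assumptions, not the benchmark's actual data schema or model interface; a real run would pass each question's image and text to an MLLM and compare the returned option letter against the gold answer.

```python
# Minimal sketch: exact-match accuracy over multiple-choice exam questions.
# The record fields ("question", "options", "answer") and predict_option() are
# hypothetical placeholders, not MULTI's actual data schema or model interface.
from typing import Callable


def accuracy(questions: list[dict], predict_option: Callable[[dict], str]) -> float:
    """Fraction of questions whose predicted option letter matches the gold letter."""
    if not questions:
        return 0.0
    correct = sum(
        1 for q in questions if predict_option(q).strip().upper() == q["answer"].upper()
    )
    return correct / len(questions)


def dummy_predictor(question: dict) -> str:
    """Stand-in for an MLLM call; always guesses option 'B'."""
    return "B"


if __name__ == "__main__":
    sample = [
        {"question": "2 + 2 = ?", "options": {"A": "3", "B": "4"}, "answer": "B"},
        {"question": "Capital of France?", "options": {"A": "Paris", "B": "Rome"}, "answer": "A"},
    ]
    print(f"accuracy = {accuracy(sample, dummy_predictor):.1%}")  # 50.0%
```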
Related papers
- VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models [121.03333569013148]
We introduce VisuLogic: a benchmark of 1,000 human-verified problems across six categories.
These questions can be used to assess the visual reasoning capabilities of MLLMs from multiple perspectives.
Most models score below 30% accuracy, only slightly above the 25% random baseline and far below the 51.4% achieved by humans.
arXiv Detail & Related papers (2025-04-21T17:59:53Z) - ProBench: Judging Multimodal Foundation Models on Open-ended Multi-domain Expert Tasks [43.509761349059914]
ProBench is a benchmark of open-ended user queries that require professional expertise and advanced reasoning.
It spans 10 fields and 56 sub-fields, including science, arts, humanities, coding, mathematics, and creative writing.
ProBench presents significant challenges in visual perception, textual understanding, domain knowledge and advanced reasoning.
arXiv Detail & Related papers (2025-03-10T03:29:18Z) - EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents [63.43699771428243]
EmbodiedBench is an extensive benchmark designed to evaluate vision-driven embodied agents.
We evaluated 19 leading proprietary and open-source MLLMs within EmbodiedBench.
MLLMs excel at high-level tasks but struggle with low-level manipulation.
arXiv Detail & Related papers (2025-02-13T18:11:34Z) - Protecting Privacy in Multimodal Large Language Models with MLLMU-Bench [17.73279547506514]
We introduce Multimodal Large Language Model Unlearning Benchmark (MLLMU-Bench), a novel benchmark aimed at advancing the understanding of multimodal machine unlearning.
MLLMU-Bench consists of 500 fictitious profiles and 153 profiles of public celebrities; each profile features over 14 customized question-answer pairs, evaluated from both multimodal (image+text) and unimodal (text) perspectives.
Surprisingly, our experiments show that unimodal unlearning algorithms excel in generation and cloze tasks, while multimodal unlearning approaches perform better in classification tasks with multimodal inputs.
arXiv Detail & Related papers (2024-10-29T15:07:23Z) - MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs [61.56904387052982]
This paper proposes a new visual grounding task called multi-context visual grounding.
It aims to localize instances of interest across multiple images based on open-ended text prompts.
We benchmark over 20 state-of-the-art MLLMs and foundation models with potential multi-context visual grounding capabilities.
arXiv Detail & Related papers (2024-10-16T07:52:57Z) - Understanding the Role of LLMs in Multimodal Evaluation Benchmarks [77.59035801244278]
This paper investigates the role of the Large Language Model (LLM) backbone in Multimodal Large Language Models (MLLMs) evaluation.
Our study encompasses four diverse MLLM benchmarks and eight state-of-the-art MLLMs.
Key findings reveal that some benchmarks allow high performance even without visual inputs, and that up to 50% of error rates can be attributed to insufficient world knowledge in the LLM backbone.
arXiv Detail & Related papers (2024-10-16T07:49:13Z) - MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models [71.36392373876505]
We introduce MMIE, a large-scale benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs).
MMIE comprises 20K meticulously curated multimodal queries, spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and arts.
It supports both interleaved inputs and outputs, offering a mix of multiple-choice and open-ended question formats to evaluate diverse competencies.
arXiv Detail & Related papers (2024-10-14T04:15:00Z) - Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark [53.61633384281524]
PolyMATH is a benchmark aimed at evaluating the general cognitive reasoning abilities of MLLMs.
The best scores achieved on PolyMATH are 41%, 36%, and 27%, obtained by Claude-3.5 Sonnet, GPT-4o and Gemini-1.5 Pro respectively.
A further fine-grained error analysis reveals that these models struggle to understand spatial relations and perform drawn-out, high-level reasoning.
arXiv Detail & Related papers (2024-10-06T20:35:41Z) - A Comprehensive Review of Multimodal Large Language Models: Performance and Challenges Across Different Tasks [74.52259252807191]
Multimodal Large Language Models (MLLMs) address the complexities of real-world applications far beyond the capabilities of single-modality systems.
This paper systematically surveys the applications of MLLMs in multimodal tasks such as natural language, vision, and audio.
arXiv Detail & Related papers (2024-08-02T15:14:53Z) - MM-PhyQA: Multimodal Physics Question-Answering With Multi-Image CoT Prompting [0.6675160100853794]
We curated a novel dataset, MM-PhyQA, which comprises well-constructed, high school-level multimodal physics problems.
For generating answers to questions with multimodal input, we employed zero-shot prediction using GPT-4 and utilized LLaVA (LLaVA and LLaVA-1.5), the latter of which were fine-tuned on our dataset.
For evaluating performance on textual input alone, we tested the base and fine-tuned versions of the Mistral-7B and LLaMA2-7b models.
arXiv Detail & Related papers (2024-04-11T07:11:47Z) - Are We on the Right Way for Evaluating Large Vision-Language Models? [92.5761176224556]
Large vision-language models (LVLMs) have recently achieved rapid progress, sparking numerous studies to evaluate their multi-modal capabilities.
We identify two primary issues: visual content is unnecessary for many samples, and unintentional data leakage exists.
We present MMStar, an elite vision-indispensable multi-modal benchmark comprising 1,500 samples meticulously selected by humans.
arXiv Detail & Related papers (2024-03-29T17:59:34Z) - An Improved Traditional Chinese Evaluation Suite for Foundation Model [15.669799471464676]
We present TMMLU+, a new benchmark designed for Traditional Chinese language understanding.
It is a multiple-choice question-answering dataset with 66 subjects ranging from elementary to professional level.
We also benchmark closed-source models and 26 open-weight Chinese large language models (LLMs) of parameters ranging from 1.8B to 72B.
arXiv Detail & Related papers (2024-03-04T09:13:33Z) - MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark [41.68821233828375]
This paper introduces a novel benchmark, termed MLLM-as-a-Judge, to assess the ability of MLLMs in assisting judges across diverse modalities.
Our study reveals that, while MLLMs demonstrate remarkable human-like discernment in Pair Comparison, there is a significant divergence from human preferences in Scoring Evaluation and Batch Ranking.
arXiv Detail & Related papers (2024-02-07T12:28:32Z) - MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning [42.68425777473114]
Vision-language models (VLMs) enhanced by large language models (LLMs) have grown exponentially in popularity.
We introduce the Vision-language Model with Multi-Modal In-Context Learning (MMICL), a new approach that allows VLMs to deal with multi-modal inputs efficiently.
Our experiments confirm that MMICL achieves new state-of-the-art zero-shot performance on a wide range of general vision-language tasks.
arXiv Detail & Related papers (2023-09-14T17:59:17Z) - Sight Beyond Text: Multi-Modal Training Enhances LLMs in Truthfulness and Ethics [32.123919380959485]
Multi-modal large language models (MLLMs) are trained based on large language models (LLMs).
While they excel in multi-modal tasks, the pure NLP abilities of MLLMs are often underestimated and left untested.
We show that visual instruction tuning, a prevailing strategy for transitioning LLMs into MLLMs, unexpectedly and interestingly helps models attain both improved truthfulness and ethical alignment.
arXiv Detail & Related papers (2023-09-13T17:57:21Z) - L-Eval: Instituting Standardized Evaluation for Long Context Language Models [91.05820785008527]
We propose L-Eval to institute a more standardized evaluation for long context language models (LCLMs).
We build a new evaluation suite containing 20 sub-tasks, 508 long documents, and over 2,000 human-labeled query-response pairs.
Results show that popular n-gram matching metrics generally cannot correlate well with human judgment.
arXiv Detail & Related papers (2023-07-20T17:59:41Z) - A Survey on Multimodal Large Language Models [71.63375558033364]
Multimodal Large Language Models (MLLMs), represented by GPT-4V, have become a rising research hotspot.
This paper aims to trace and summarize the recent progress of MLLMs.
arXiv Detail & Related papers (2023-06-23T15:21:52Z) - LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark [81.42376626294812]
We present the Language-Assisted Multi-Modal (LAMM) instruction tuning dataset, framework, and benchmark.
Our aim is to establish LAMM as a growing ecosystem for training and evaluating MLLMs.
We present a comprehensive dataset and benchmark, which cover a wide range of 2D and 3D vision tasks.
arXiv Detail & Related papers (2023-06-11T14:01:17Z)