MANBench: Is Your Multimodal Model Smarter than Human?
- URL: http://arxiv.org/abs/2506.11080v1
- Date: Wed, 04 Jun 2025 08:42:14 GMT
- Title: MANBench: Is Your Multimodal Model Smarter than Human?
- Authors: Han Zhou, Qitong Xu, Yiheng Dong, Xin Yang,
- Abstract summary: We introduce MANBench, a bilingual benchmark (English and Chinese) comprising 1,314 questions across nine tasks.<n>We compared human performance against state-of-the-art Multimodal Large Language Models (MLLMs)<n>Results indicate that while MLLMs excel in tasks like Knowledge and Text-Image Understanding, they struggle with deeper cross-modal reasoning tasks.<n>We hope MANBench will inspire efforts to bridge the gap between MLLMs and human multimodal capabilities.
- Score: 7.483339020254684
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The rapid advancement of Multimodal Large Language Models (MLLMs) has ignited discussions regarding their potential to surpass human performance in multimodal tasks. In response, we introduce MANBench (Multimodal Ability Norms Benchmark), a bilingual benchmark (English and Chinese) comprising 1,314 questions across nine tasks, spanning knowledge-based and non-knowledge-based domains. MANBench emphasizes intuitive reasoning, seamless cross-modal integration, and real-world complexity, providing a rigorous evaluation framework. Through extensive human experiments involving diverse participants, we compared human performance against state-of-the-art MLLMs. The results indicate that while MLLMs excel in tasks like Knowledge and Text-Image Understanding, they struggle with deeper cross-modal reasoning tasks such as Transmorphic Understanding, Image Consistency, and Multi-image Understanding. Moreover, both humans and MLLMs face challenges in highly complex tasks like Puzzles and Spatial Imagination. MANBench highlights the strengths and limitations of MLLMs, revealing that even advanced models fall short of achieving human-level performance across many domains. We hope MANBench will inspire efforts to bridge the gap between MLLMs and human multimodal capabilities. The code and dataset are available at https://github.com/micdz/MANBench.
Related papers
- MARBLE: A Hard Benchmark for Multimodal Spatial Reasoning and Planning [10.602434753538535]
The ability to process information from multiple modalities and to reason through it step-by-step remains a critical challenge in advancing artificial intelligence.<n>Here, we present MARBLE, a challenging multimodal reasoning benchmark that is designed to scrutinize multimodal language models.<n>We find that current MLLMs perform poorly on MARBLE -- all the 12 advanced models obtain near-random performance on M-Portal and 0% accuracy on M-Cube.
arXiv Detail & Related papers (2025-06-28T19:44:32Z) - Human-Aligned Bench: Fine-Grained Assessment of Reasoning Ability in MLLMs vs. Humans [9.315735862658244]
We propose Human-Aligned Bench, a benchmark for alignment of multimodal reasoning with human performance.<n>We collected 9,794 multimodal questions that solely rely on contextual reasoning, including bilingual (Chinese and English) multimodal questions and pure text-based questions.<n>Extensive experiments on the Human-Aligned Bench reveal notable differences between the performance of current MLLMs in multimodal reasoning and human performance.
arXiv Detail & Related papers (2025-05-16T11:41:19Z) - On Path to Multimodal Generalist: General-Level and General-Bench [153.9720740167528]
This project introduces General-Level, an evaluation framework that defines 5-scale levels of MLLM performance and generality.<n>At the core of the framework is the concept of Synergy, which measures whether models maintain consistent capabilities across comprehension and generation.<n>The evaluation results that involve over 100 existing state-of-the-art MLLMs uncover the capability rankings of generalists.
arXiv Detail & Related papers (2025-05-07T17:59:32Z) - Can MLLMs Reason in Multimodality? EMMA: An Enhanced MultiModal ReAsoning Benchmark [73.27104042215207]
We introduce EMMA, a benchmark targeting organic multimodal reasoning across mathematics, physics, chemistry, and coding.<n>EMMA tasks demand advanced cross-modal reasoning that cannot be addressed by reasoning independently in each modality.<n>Our evaluation of state-of-the-art MLLMs on EMMA reveals significant limitations in handling complex multimodal and multi-step reasoning tasks.
arXiv Detail & Related papers (2025-01-09T18:55:52Z) - HumanVBench: Exploring Human-Centric Video Understanding Capabilities of MLLMs with Synthetic Benchmark Data [55.739633494946204]
We present HumanVBench, an innovative benchmark meticulously crafted to bridge gaps in the evaluation of video MLLMs.<n>HumanVBench comprises 16 carefully designed tasks that explore two primary dimensions: inner emotion and outer manifestations, spanning static and dynamic, basic and complex, as well as single-modal and cross-modal aspects.<n>A comprehensive evaluation across 22 SOTA video MLLMs reveals notable limitations in current performance, especially in cross-modal and emotion perception.
arXiv Detail & Related papers (2024-12-23T13:45:56Z) - A Comprehensive Review of Multimodal Large Language Models: Performance and Challenges Across Different Tasks [74.52259252807191]
Multimodal Large Language Models (MLLMs) address the complexities of real-world applications far beyond the capabilities of single-modality systems.
This paper systematically sorts out the applications of MLLM in multimodal tasks such as natural language, vision, and audio.
arXiv Detail & Related papers (2024-08-02T15:14:53Z) - Mipha: A Comprehensive Overhaul of Multimodal Assistant with Small Language Models [25.724995114710165]
We investigate the design aspects of Multimodal Small Language Models (MSLMs) and propose an efficient multimodal assistant named Mipha.
Our Mipha-3B outperforms the state-of-the-art large MLLMs, especially LLaVA-1.5-13B, on multiple benchmarks.
arXiv Detail & Related papers (2024-03-10T12:43:27Z) - MMToM-QA: Multimodal Theory of Mind Question Answering [80.87550820953236]
Theory of Mind (ToM) is an essential ingredient for developing machines with human-level social intelligence.
Recent machine learning models, particularly large language models, seem to show some aspects of ToM understanding.
Human ToM, on the other hand, is more than video or text understanding.
People can flexibly reason about another person's mind based on conceptual representations extracted from any available data.
arXiv Detail & Related papers (2024-01-16T18:59:24Z) - Sight Beyond Text: Multi-Modal Training Enhances LLMs in Truthfulness
and Ethics [32.123919380959485]
Multi-modal large language models (MLLMs) are trained based on large language models (LLM)
While they excel in multi-modal tasks, the pure NLP abilities of MLLMs are often underestimated and left untested.
We show that visual instruction tuning, a prevailing strategy for transitioning LLMs into MLLMs, unexpectedly and interestingly helps models attain both improved truthfulness and ethical alignment.
arXiv Detail & Related papers (2023-09-13T17:57:21Z) - A Survey on Multimodal Large Language Models [71.63375558033364]
Multimodal Large Language Model (MLLM) represented by GPT-4V has been a new rising research hotspot.<n>This paper aims to trace and summarize the recent progress of MLLMs.
arXiv Detail & Related papers (2023-06-23T15:21:52Z) - LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset,
Framework, and Benchmark [81.42376626294812]
We present Language-Assisted Multi-Modal instruction tuning dataset, framework, and benchmark.
Our aim is to establish LAMM as a growing ecosystem for training and evaluating MLLMs.
We present a comprehensive dataset and benchmark, which cover a wide range of vision tasks for 2D and 3D vision.
arXiv Detail & Related papers (2023-06-11T14:01:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.