GAOKAO-MM: A Chinese Human-Level Benchmark for Multimodal Models Evaluation
- URL: http://arxiv.org/abs/2402.15745v1
- Date: Sat, 24 Feb 2024 06:57:15 GMT
- Title: GAOKAO-MM: A Chinese Human-Level Benchmark for Multimodal Models Evaluation
- Authors: Yi Zong, Xipeng Qiu
- Abstract summary: Large Vision-Language Models (LVLMs) have demonstrated great abilities in image perception and language understanding.
We propose GAOKAO-MM, a multimodal benchmark based on the Chinese College Entrance Examination (GAOKAO).
We evaluate 10 LVLMs and find that all of them score below 50% accuracy, with GPT-4-Vision (48.1%), Qwen-VL-Plus (41.2%) and Gemini-Pro-Vision (35.1%) ranking in the top three positions.
- Score: 65.268245109828
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Vision-Language Models (LVLMs) have demonstrated great abilities in
image perception and language understanding. However, existing multimodal
benchmarks focus on primary perception abilities and commonsense knowledge,
which are insufficient to reflect the comprehensive capabilities of LVLMs. We
propose GAOKAO-MM, a multimodal benchmark based on the Chinese College Entrance
Examination (GAOKAO), comprising 8 subjects and 12 types of images, such as
diagrams, function graphs, maps and photos. GAOKAO-MM derives from native
Chinese context and sets human-level requirements for the model's abilities,
including perception, understanding, knowledge and reasoning. We evaluate 10
LVLMs and find that all of them score below 50% accuracy, with
GPT-4-Vision (48.1%), Qwen-VL-Plus (41.2%) and Gemini-Pro-Vision (35.1%) ranking
in the top three positions. Our multi-dimensional analysis indicates that
LVLMs remain at a moderate distance from Artificial General Intelligence (AGI)
and provides insights that can facilitate the development of multilingual
LVLMs.
Related papers
- AlignMMBench: Evaluating Chinese Multimodal Alignment in Large Vision-Language Models [34.843603169616486]
We introduce AlignMMBench, a comprehensive alignment benchmark for emerging Chinese Vision-Language Models (VLMs).
This benchmark is meticulously curated from real-world scenarios and Chinese Internet sources, encompassing thirteen specific tasks across three categories, and includes both single-turn and multi-turn dialogue scenarios.
To facilitate the evaluation pipeline, we propose CritiqueVLM, a rule-calibrated evaluator that exceeds GPT-4's evaluation ability.
arXiv Detail & Related papers (2024-06-13T16:30:14Z)
- MultiPragEval: Multilingual Pragmatic Evaluation of Large Language Models [0.5822010906632046]
MultiPragEval is a robust test suite designed for the multilingual pragmatic evaluation of LLMs across English, German, Korean, and Chinese.
Our findings demonstrate that Claude3-Opus significantly outperforms other models in all tested languages.
arXiv Detail & Related papers (2024-06-11T21:46:03Z)
- How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites [114.22835695929682]
InternVL 1.5 is an open-source multimodal large language model (MLLM).
It bridges the capability gap between open-source and proprietary commercial models in multimodal understanding.
arXiv Detail & Related papers (2024-04-25T17:59:19Z)
- Finer: Investigating and Enhancing Fine-Grained Visual Concept Recognition in Large Vision Language Models [68.46457611340097]
In-depth analyses show that instruction-tuned LVLMs exhibit a modality gap: their outputs differ when given textual and visual inputs that correspond to the same concept.
We propose a multiple attribute-centric evaluation benchmark, Finer, to evaluate LVLMs' fine-grained visual comprehension ability and provide significantly improved explainability.
arXiv Detail & Related papers (2024-02-26T05:43:51Z)
- LION : Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge [58.82222646803248]
Multimodal Large Language Models (MLLMs) have endowed LLMs with the ability to perceive and understand multi-modal signals.
Most of the existing MLLMs mainly adopt vision encoders pretrained on coarsely aligned image-text pairs, leading to insufficient extraction and reasoning of visual knowledge.
We propose a dual-Level vIsual knOwledge eNhanced Multimodal Large Language Model (LION), which empowers the MLLM by injecting visual knowledge in two levels.
arXiv Detail & Related papers (2023-11-20T15:56:44Z)
- Q-Bench: A Benchmark for General-Purpose Foundation Models on Low-level Vision [85.6008224440157]
Multi-modality Large Language Models (MLLMs) have catalyzed a shift in computer vision from specialized models to general-purpose foundation models.
We present Q-Bench, a holistic benchmark crafted to evaluate potential abilities of MLLMs on three realms: low-level visual perception, low-level visual description, and overall visual quality assessment.
arXiv Detail & Related papers (2023-09-25T14:43:43Z)
- CMMLU: Measuring massive multitask language understanding in Chinese [133.70911295934746]
This paper introduces a comprehensive Chinese benchmark that covers various subjects, including natural science, social sciences, engineering, and humanities.
CMMLU fills the gap in evaluating the knowledge and reasoning capabilities of large language models within the Chinese context.
arXiv Detail & Related papers (2023-06-15T15:49:51Z)