Related papers: PathMMU: A Massive Multimodal Expert-Level Benchmark for Understanding and Reasoning in Pathology

PathMMU: A Massive Multimodal Expert-Level Benchmark for Understanding and Reasoning in Pathology

URL: http://arxiv.org/abs/2401.16355v3
Date: Wed, 20 Mar 2024 17:13:53 GMT
Title: PathMMU: A Massive Multimodal Expert-Level Benchmark for Understanding and Reasoning in Pathology
Authors: Yuxuan Sun, Hao Wu, Chenglu Zhu, Sunyi Zheng, Qizi Chen, Kai Zhang, Yunlong Zhang, Dan Wan, Xiaoxiao Lan, Mengyue Zheng, Jingxiong Li, Xinheng Lyu, Tao Lin, Lin Yang,
Abstract summary: We introduce PathMMU, the largest and highest-quality expert-validated pathology benchmark for Large Multimodal Models (LMMs) It comprises 33,428 multimodal multi-choice questions and 24,067 images from various sources, each accompanied by an explanation for the correct answer. The construction of PathMMU harnesses GPT-4V's advanced capabilities, utilizing over 30,000 image-caption pairs to enrich captions and generate corresponding Q&As.
Score: 14.944207181507135
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: The emergence of large multimodal models has unlocked remarkable potential in AI, particularly in pathology. However, the lack of specialized, high-quality benchmark impeded their development and precise evaluation. To address this, we introduce PathMMU, the largest and highest-quality expert-validated pathology benchmark for Large Multimodal Models (LMMs). It comprises 33,428 multimodal multi-choice questions and 24,067 images from various sources, each accompanied by an explanation for the correct answer. The construction of PathMMU harnesses GPT-4V's advanced capabilities, utilizing over 30,000 image-caption pairs to enrich captions and generate corresponding Q&As in a cascading process. Significantly, to maximize PathMMU's authority, we invite seven pathologists to scrutinize each question under strict standards in PathMMU's validation and test sets, while simultaneously setting an expert-level performance benchmark for PathMMU. We conduct extensive evaluations, including zero-shot assessments of 14 open-sourced and 4 closed-sourced LMMs and their robustness to image corruption. We also fine-tune representative LMMs to assess their adaptability to PathMMU. The empirical findings indicate that advanced LMMs struggle with the challenging PathMMU benchmark, with the top-performing LMM, GPT-4V, achieving only a 49.8% zero-shot performance, significantly lower than the 71.8% demonstrated by human pathologists. After fine-tuning, significantly smaller open-sourced LMMs can outperform GPT-4V but still fall short of the expertise shown by pathologists. We hope that the PathMMU will offer valuable insights and foster the development of more specialized, next-generation LMMs for pathology.

Related papers

MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency [63.23935582919081]
Chain-of-Thought (CoT) has significantly enhanced the reasoning capabilities of Large Language Models (LLMs) We introduce MME-CoT, a specialized benchmark evaluating the CoT reasoning performance of LMMs. We conduct an in-depth analysis of state-of-the-art LMMs, uncovering several key insights.
arXiv Detail & Related papers (2025-02-13T18:59:46Z)
CAMEL-Bench: A Comprehensive Arabic LMM Benchmark [10.20074702234283]
We develop a comprehensive LMM evaluation benchmark for the Arabic language to represent a large population of over 400 million speakers. The proposed benchmark comprises eight diverse domains and 38 sub-domains including, multi-image understanding, complex visual perception, handwritten document understanding, video understanding, medical imaging, plant diseases, and remote sensing-based land use understanding.
arXiv Detail & Related papers (2024-10-24T17:59:38Z)
The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio [118.75449542080746]
This paper presents the first systematic investigation of hallucinations in large multimodal models (LMMs) Our study reveals two key contributors to hallucinations: overreliance on unimodal priors and spurious inter-modality correlations. Our findings highlight key vulnerabilities, including imbalances in modality integration and biases from training data, underscoring the need for balanced cross-modal learning.
arXiv Detail & Related papers (2024-10-16T17:59:02Z)
MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark [77.93283927871758]
This paper introduces MMMU-Pro, a robust version of the Massive Multi-discipline Multimodal Understanding and Reasoning benchmark. MMMU-Pro rigorously assesses multimodal models' true understanding and reasoning capabilities.
arXiv Detail & Related papers (2024-09-04T15:31:26Z)
GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI [67.09501109871351]
Large Vision-Language Models (LVLMs) are capable of handling diverse data types such as imaging, text, and physiological signals. GMAI-MMBench is the most comprehensive general medical AI benchmark with well-categorized data structure and multi-perceptual granularity to date. It is constructed from 284 datasets across 38 medical image modalities, 18 clinical-related tasks, 18 departments, and 4 perceptual granularities in a Visual Question Answering (VQA) format.
arXiv Detail & Related papers (2024-08-06T17:59:21Z)
GAOKAO-MM: A Chinese Human-Level Benchmark for Multimodal Models Evaluation [55.2480439325792]
Large Vision-Language Models (LVLMs) have demonstrated great abilities in image perception and language understanding. We propose GAOKAO-MM, a multimodal benchmark based on the Chinese College Entrance Examination (GAOKAO) We evaluate 10 LVLMs and find that the accuracies of all of them are lower than 50%, with GPT-4-Vison (48.1%), Qwen-VL-Plus (41.2%) and Gemini-Pro-Vision (35.1%) ranking in the top three positions.
arXiv Detail & Related papers (2024-02-24T06:57:15Z)
CMMMU: A Chinese Massive Multi-discipline Multimodal Understanding Benchmark [53.24896036161829]
We introduce a new Chinese Massive Multi-discipline Multimodal Understanding benchmark designed to evaluate LMMs on tasks demanding college-level subject knowledge and deliberate reasoning in a Chinese context. CMMMU includes 12k manually collected multimodal questions from college exams, quizzes, and textbooks, covering six core disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering. CMMMU focuses on complex perception and reasoning with domain-specific knowledge in the Chinese context.
arXiv Detail & Related papers (2024-01-22T13:34:34Z)
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI [64.21953221846596]
MMMU is a new benchmark designed to evaluate multimodal models on massive multi-discipline tasks. Questions span 30 subjects and 183 subfields, comprising 30 highly heterogeneous image types. The evaluation of 14 open-source LMMs as well as the proprietary GPT-4V(ision) and Gemini highlights the substantial challenges posed by MMMU.
arXiv Detail & Related papers (2023-11-27T17:33:21Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.