MultiMedEval: A Benchmark and a Toolkit for Evaluating Medical
Vision-Language Models
- URL: http://arxiv.org/abs/2402.09262v2
- Date: Fri, 16 Feb 2024 16:36:00 GMT
- Title: MultiMedEval: A Benchmark and a Toolkit for Evaluating Medical
Vision-Language Models
- Authors: Corentin Royer, Bjoern Menze and Anjany Sekuboyina
- Abstract summary: MultiMedEval is an open-source toolkit for the fair and reproducible evaluation of large medical vision-language models (VLMs).
It comprehensively assesses the models' performance on a broad array of six multi-modal tasks, conducted over 23 datasets and spanning 11 medical domains.
We open-source a Python toolkit with a simple interface and setup process, enabling the evaluation of any VLM in just a few lines of code.
- Score: 1.3535643703577176
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We introduce MultiMedEval, an open-source toolkit for the fair and
reproducible evaluation of large medical vision-language models (VLMs).
MultiMedEval comprehensively assesses model performance on a broad array of
six multi-modal tasks, conducted over 23 datasets and spanning 11 medical
domains. The tasks and performance metrics were chosen for their widespread
adoption in the community and their diversity, ensuring a thorough evaluation
of a model's overall generalizability. We open-source a Python toolkit
(github.com/corentin-ryr/MultiMedEval) with a simple interface and setup
process, enabling the evaluation of any VLM in just a few lines of code. Our
goal is to simplify the intricate landscape of VLM evaluation, thus promoting
fair and uniform benchmarking of future models.
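As a rough illustration of the "few lines of code" claim above, the sketch below shows what driving such an evaluation could look like in Python. The import path, the class and method names (MultiMedEval, SetupParams, EvalParams, setup, eval), the parameter names, and the task names are assumptions made for illustration only, not the toolkit's documented API; consult github.com/corentin-ryr/MultiMedEval for the actual interface.

from typing import List

# All names below are illustrative assumptions; see the linked repository
# for the toolkit's real interface.
from multimedeval import MultiMedEval, SetupParams, EvalParams  # assumed import path


def batcher(prompts: List[dict]) -> List[str]:
    # The engine hands the batcher a batch of multimodal prompts (text plus
    # any associated images) and expects one generated answer string per
    # prompt. A real implementation would call the VLM under evaluation
    # here instead of returning placeholders.
    return ["placeholder answer" for _ in prompts]


engine = MultiMedEval()
engine.setup(SetupParams(storage_dir="multimedeval_data"))  # download/prepare datasets (parameter name assumed)
results = engine.eval(["MedQA", "VQA-Rad"], batcher,        # task names are examples only
                      EvalParams(batch_size=32))
print(results)

Plugging a new model in would then amount to replacing the placeholder body of batcher with calls to that model's inference code, which is the "evaluation of any VLM in just a few lines" workflow the abstract describes.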
Related papers
- WorldMedQA-V: a multilingual, multimodal medical examination dataset for multimodal language models evaluation [4.149844666297669]
Multimodal/vision language models (VLMs) are increasingly being deployed in healthcare settings worldwide.
Existing datasets are largely text-only and available in a limited subset of languages and countries.
WorldMedQA-V includes 568 labeled multiple-choice QAs paired with 568 medical images from four countries.
arXiv Detail & Related papers (2024-10-16T16:31:24Z) - MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models [71.36392373876505]
We introduce MMIE, a large-scale benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs)
MMIE comprises 20K meticulously curated multimodal queries, spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and arts.
It supports both interleaved inputs and outputs, offering a mix of multiple-choice and open-ended question formats to evaluate diverse competencies.
arXiv Detail & Related papers (2024-10-14T04:15:00Z) - VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks [60.5257456681402]
We build universal embedding models capable of handling a wide range of downstream tasks.
Our contributions are twofold: (1) MMEB (Massive Multimodal Embedding Benchmark), which covers 4 meta-tasks (i.e. classification, visual question answering, multimodal retrieval, and visual grounding) and 36 datasets, including 20 training and 16 evaluation datasets, and (2) VLM2Vec (Vision-Language Model -> Vector), a contrastive training framework that converts any state-of-the-art vision-language model into an embedding model via training on MMEB.
arXiv Detail & Related papers (2024-10-07T16:14:05Z) - FMBench: Benchmarking Fairness in Multimodal Large Language Models on Medical Tasks [11.094602017349928]
We propose FMBench, the first benchmark designed to evaluate the fairness of MLLMs performance across diverse demographic attributes.
We thoroughly evaluate the performance and fairness of eight state-of-the-art open-source MLLMs, including both general-purpose and medical models.
All data and code will be released upon acceptance.
arXiv Detail & Related papers (2024-10-01T21:38:15Z) - VLMEvalKit: An Open-Source Toolkit for Evaluating Large Multi-Modality Models [89.63342806812413]
We present an open-source toolkit for evaluating large multi-modality models based on PyTorch.
VLMEvalKit implements over 70 different large multi-modality models, including both proprietary APIs and open-source models.
We host OpenVLM Leaderboard to track the progress of multi-modality learning research.
arXiv Detail & Related papers (2024-07-16T13:06:15Z) - MM-BigBench: Evaluating Multimodal Models on Multimodal Content Comprehension Tasks [56.60050181186531]
We introduce MM-BigBench, which incorporates a diverse range of metrics to offer an extensive evaluation of the performance of various models and instructions.
Our paper evaluates a total of 20 language models (14 MLLMs) on 14 multimodal datasets spanning 6 tasks, with 10 instructions for each task, and derives novel insights.
arXiv Detail & Related papers (2023-10-13T11:57:04Z) - MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities [159.9847317300497]
We propose MM-Vet, an evaluation benchmark that examines large multimodal models (LMMs) on complicated multimodal tasks.
Recent LMMs have shown various intriguing abilities, such as solving math problems written on the blackboard, reasoning about events and celebrities in news images, and explaining visual jokes.
arXiv Detail & Related papers (2023-08-04T17:59:47Z) - MMBench: Is Your Multi-modal Model an All-around Player? [114.45702807380415]
We propose MMBench, a benchmark for assessing the multi-modal capabilities of vision-language models.
MMBench is meticulously curated with well-designed quality control schemes.
MMBench incorporates multiple-choice questions in both English and Chinese versions.
arXiv Detail & Related papers (2023-07-12T16:23:09Z) - MELINDA: A Multimodal Dataset for Biomedical Experiment Method Classification [14.820951153262685]
We introduce a new dataset, MELINDA, for Multimodal biomEdicaL experImeNt methoD clAssification.
The dataset is collected in a fully automated distant supervision manner, where the labels are obtained from an existing curated database.
We benchmark various state-of-the-art NLP and computer vision models, including unimodal models that take only the caption text or only the image as input.
arXiv Detail & Related papers (2020-12-16T19:11:36Z)
This list is automatically generated from the titles and abstracts of the papers listed on this site.