Related papers: MultiMedEval: A Benchmark and a Toolkit for Evaluating Medical Vision-Language Models

MultiMedEval: A Benchmark and a Toolkit for Evaluating Medical Vision-Language Models

URL: http://arxiv.org/abs/2402.09262v2
Date: Fri, 16 Feb 2024 16:36:00 GMT
Title: MultiMedEval: A Benchmark and a Toolkit for Evaluating Medical Vision-Language Models
Authors: Corentin Royer, Bjoern Menze and Anjany Sekuboyina
Abstract summary: MultiMedEval is an open-source toolkit for fair and reproducible evaluation of large, medical vision-language models (VLM) It comprehensively assesses the models' performance on a broad array of six multi-modal tasks, conducted over 23 datasets, and spanning over 11 medical domains. We open-source a Python toolkit with a simple interface and setup process, enabling the evaluation of any VLM in just a few lines of code.
Score: 1.3535643703577176
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: We introduce MultiMedEval, an open-source toolkit for fair and reproducible evaluation of large, medical vision-language models (VLM). MultiMedEval comprehensively assesses the models' performance on a broad array of six multi-modal tasks, conducted over 23 datasets, and spanning over 11 medical domains. The chosen tasks and performance metrics are based on their widespread adoption in the community and their diversity, ensuring a thorough evaluation of the model's overall generalizability. We open-source a Python toolkit (github.com/corentin-ryr/MultiMedEval) with a simple interface and setup process, enabling the evaluation of any VLM in just a few lines of code. Our goal is to simplify the intricate landscape of VLM evaluation, thus promoting fair and uniform benchmarking of future models.

Related papers

WorldMedQA-V: a multilingual, multimodal medical examination dataset for multimodal language models evaluation [4.149844666297669]
Multimodal/vision language models (VLMs) are increasingly being deployed in healthcare settings worldwide. Existing datasets are largely text-only and available in a limited subset of languages and countries. WorldMedQA-V includes 568 labeled multiple-choice QAs paired with 568 medical images from four countries.
arXiv Detail & Related papers (2024-10-16T16:31:24Z)
MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models [71.36392373876505]
We introduce MMIE, a large-scale benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs) MMIE comprises 20K meticulously curated multimodal queries, spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and arts. It supports both interleaved inputs and outputs, offering a mix of multiple-choice and open-ended question formats to evaluate diverse competencies.
arXiv Detail & Related papers (2024-10-14T04:15:00Z)
VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks [60.5257456681402]
We build universal embedding models capable of handling a wide range of downstream tasks. Our contributions are twofold: (1) MMEB (Massive Multimodal Embedding Benchmark), which covers 4 meta-tasks (i.e. classification, visual question answering, multimodal retrieval, and visual grounding) and 36 datasets, including 20 training and 16 evaluation datasets, and (2) VLM2Vec (Vision-Language Model -> Vector), a contrastive training framework that converts any state-of-the-art vision-language model into an embedding model via training on MMEB.
arXiv Detail & Related papers (2024-10-07T16:14:05Z)
FMBench: Benchmarking Fairness in Multimodal Large Language Models on Medical Tasks [11.094602017349928]
We propose FMBench, the first benchmark designed to evaluate the fairness of MLLMs performance across diverse demographic attributes. We thoroughly evaluate the performance and fairness of eight state-of-the-art open-source MLLMs, including both general and medical. All data and code will be released upon acceptance.
arXiv Detail & Related papers (2024-10-01T21:38:15Z)
MedTrinity-25M: A Large-scale Multimodal Dataset with Multigranular Annotations for Medicine [53.01393667775077]
This paper introduces MedTrinity-25M, a comprehensive, large-scale multimodal dataset for medicine. It covers over 25 million images across 10 modalities with multigranular annotations for more than 65 diseases. Unlike the existing multimodal datasets, which are limited by the availability of image-text pairs, we have developed the first automated pipeline.
arXiv Detail & Related papers (2024-08-06T02:09:35Z)
VLMEvalKit: An Open-Source Toolkit for Evaluating Large Multi-Modality Models [89.63342806812413]
We present an open-source toolkit for evaluating large multi-modality models based on PyTorch. VLMEvalKit implements over 70 different large multi-modality models, including both proprietary APIs and open-source models. We host OpenVLM Leaderboard to track the progress of multi-modality learning research.
arXiv Detail & Related papers (2024-07-16T13:06:15Z)
MM-BigBench: Evaluating Multimodal Models on Multimodal Content Comprehension Tasks [56.60050181186531]
We introduce MM-BigBench, which incorporates a diverse range of metrics to offer an extensive evaluation of the performance of various models and instructions. Our paper evaluates a total of 20 language models (14 MLLMs) on 14 multimodal datasets spanning 6 tasks, with 10 instructions for each task, and derives novel insights.
arXiv Detail & Related papers (2023-10-13T11:57:04Z)
MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities [159.9847317300497]
We propose MM-Vet, an evaluation benchmark that examines large multimodal models (LMMs) on complicated multimodal tasks. Recent LMMs have shown various intriguing abilities, such as solving math problems written on the blackboard, reasoning about events and celebrities in news images, and explaining visual jokes.
arXiv Detail & Related papers (2023-08-04T17:59:47Z)
MMBench: Is Your Multi-modal Model an All-around Player? [114.45702807380415]
We propose MMBench, a benchmark for assessing the multi-modal capabilities of vision-language models. MMBench is meticulously curated with well-designed quality control schemes. MMBench incorporates multiple-choice questions in both English and Chinese versions.
arXiv Detail & Related papers (2023-07-12T16:23:09Z)
MvCo-DoT:Multi-View Contrastive Domain Transfer Network for Medical Report Generation [42.804058630251305]
We propose the first multi-view medical report generation model, called MvCo-DoT. MvCo-DoT first propose a multi-view contrastive learning (MvCo) strategy to help the deep reinforcement learning based model utilize the consistency of multi-view inputs. Extensive experiments on the IU X-Ray public dataset show that MvCo-DoT outperforms the SOTA medical report generation baselines in all metrics.
arXiv Detail & Related papers (2023-04-15T03:42:26Z)
MELINDA: A Multimodal Dataset for Biomedical Experiment Method Classification [14.820951153262685]
We introduce a new dataset, MELINDA, for Multimodal biomEdicaL experImeNt methoD clAssification. The dataset is collected in a fully automated distant supervision manner, where the labels are obtained from an existing curated database. We benchmark various state-of-the-art NLP and computer vision models, including unimodal models which only take either caption texts or images as inputs.
arXiv Detail & Related papers (2020-12-16T19:11:36Z)

This list is automatically generated from the titles and abstracts of the papers in this site.