CARES: A Comprehensive Benchmark of Trustworthiness in Medical Vision Language Models
- URL: http://arxiv.org/abs/2406.06007v3
- Date: Sun, 03 Nov 2024 16:54:14 GMT
- Title: CARES: A Comprehensive Benchmark of Trustworthiness in Medical Vision Language Models
- Authors: Peng Xia, Ze Chen, Juanxi Tian, Yangrui Gong, Ruibo Hou, Yue Xu, Zhenbang Wu, Zhiyuan Fan, Yiyang Zhou, Kangyu Zhu, Wenhao Zheng, Zhaoyang Wang, Xiao Wang, Xuchao Zhang, Chetan Bansal, Marc Niethammer, Junzhou Huang, Hongtu Zhu, Yun Li, Jimeng Sun, Zongyuan Ge, Gang Li, James Zou, Huaxiu Yao,
- Abstract summary: We introduce CARES and aim to evaluate the Trustworthiness of Med-LVLMs across the medical domain.
We assess the trustworthiness of Med-LVLMs across five dimensions, including trustfulness, fairness, safety, privacy, and robustness.
- Score: 92.04812189642418
- License:
- Abstract: Artificial intelligence has significantly impacted medical applications, particularly with the advent of Medical Large Vision Language Models (Med-LVLMs), sparking optimism for the future of automated and personalized healthcare. However, the trustworthiness of Med-LVLMs remains unverified, posing significant risks for future model deployment. In this paper, we introduce CARES and aim to comprehensively evaluate the Trustworthiness of Med-LVLMs across the medical domain. We assess the trustworthiness of Med-LVLMs across five dimensions, including trustfulness, fairness, safety, privacy, and robustness. CARES comprises about 41K question-answer pairs in both closed and open-ended formats, covering 16 medical image modalities and 27 anatomical regions. Our analysis reveals that the models consistently exhibit concerns regarding trustworthiness, often displaying factual inaccuracies and failing to maintain fairness across different demographic groups. Furthermore, they are vulnerable to attacks and demonstrate a lack of privacy awareness. We publicly release our benchmark and code in https://cares-ai.github.io/.
Related papers
- Which Client is Reliable?: A Reliable and Personalized Prompt-based Federated Learning for Medical Image Question Answering [51.26412822853409]
We present a novel personalized federated learning (pFL) method for medical visual question answering (VQA) models.
Our method introduces learnable prompts into a Transformer architecture to efficiently train it on diverse medical datasets without massive computational costs.
arXiv Detail & Related papers (2024-10-23T00:31:17Z) - MediConfusion: Can you trust your AI radiologist? Probing the reliability of multimodal medical foundation models [20.781551849965357]
We introduce MediConfusion, a challenging medical Visual Question Answering (VQA) benchmark dataset.
We reveal that state-of-the-art models are easily confused by image pairs that are otherwise visually dissimilar and clearly distinct for medical experts.
We also extract common patterns of model failure that may help the design of a new generation of more trustworthy and reliable MLLMs in healthcare.
arXiv Detail & Related papers (2024-09-23T18:59:37Z) - A Survey on Trustworthiness in Foundation Models for Medical Image Analysis [27.876946673940452]
We present a novel taxonomy of foundation models used in medical imaging.
We focus on segmentation, medical report generation, medical question and answering (Q&A), and disease diagnosis.
Our analysis underscores the imperative for advancing towards trustworthy AI in medical image analysis.
arXiv Detail & Related papers (2024-07-03T18:07:57Z) - Benchmarking Trustworthiness of Multimodal Large Language Models: A Comprehensive Study [51.19622266249408]
MultiTrust is the first comprehensive and unified benchmark on the trustworthiness of MLLMs.
Our benchmark employs a rigorous evaluation strategy that addresses both multimodal risks and cross-modal impacts.
Extensive experiments with 21 modern MLLMs reveal some previously unexplored trustworthiness issues and risks.
arXiv Detail & Related papers (2024-06-11T08:38:13Z) - Asclepius: A Spectrum Evaluation Benchmark for Medical Multi-Modal Large
Language Models [59.60384461302662]
We introduce Asclepius, a novel benchmark for evaluating Medical Multi-Modal Large Language Models (Med-MLLMs)
Asclepius rigorously and comprehensively assesses model capability in terms of distinct medical specialties and different diagnostic capacities.
We also provide an in-depth analysis of 6 Med-MLLMs and compare them with 5 human specialists.
arXiv Detail & Related papers (2024-02-17T08:04:23Z) - OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM [48.16696073640864]
We introduce OmniMedVQA, a novel comprehensive medical Visual Question Answering (VQA) benchmark.
All images in this benchmark are sourced from authentic medical scenarios.
We have found that existing LVLMs struggle to address these medical VQA problems effectively.
arXiv Detail & Related papers (2024-02-14T13:51:56Z) - Gemini Goes to Med School: Exploring the Capabilities of Multimodal
Large Language Models on Medical Challenge Problems & Hallucinations [0.0]
We comprehensively evaluated open-source and Google's new multimodal LLM called Gemini.
While Gemini showed competence, it lagged behind state-of-the-art models like MedPaLM 2 and GPT-4 in diagnostic accuracy.
Gemini is highly susceptible to hallucinations, overconfidence, and knowledge gaps, which indicate risks if deployed uncritically.
arXiv Detail & Related papers (2024-02-10T19:08:28Z) - Medical Foundation Models are Susceptible to Targeted Misinformation
Attacks [3.252906830953028]
Large language models (LLMs) have broad medical knowledge and can reason about medical information across many domains.
We demonstrate a concerning vulnerability of LLMs in medicine through targeted manipulation of just 1.1% of the model's weights.
We validate our findings in a set of 1,038 incorrect biomedical facts.
arXiv Detail & Related papers (2023-09-29T06:44:36Z) - Med-Flamingo: a Multimodal Medical Few-shot Learner [58.85676013818811]
We propose Med-Flamingo, a multimodal few-shot learner adapted to the medical domain.
Based on OpenFlamingo-9B, we continue pre-training on paired and interleaved medical image-text data from publications and textbooks.
We conduct the first human evaluation for generative medical VQA where physicians review the problems and blinded generations in an interactive app.
arXiv Detail & Related papers (2023-07-27T20:36:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.