Gemini Goes to Med School: Exploring the Capabilities of Multimodal
Large Language Models on Medical Challenge Problems & Hallucinations
- URL: http://arxiv.org/abs/2402.07023v1
- Date: Sat, 10 Feb 2024 19:08:28 GMT
- Title: Gemini Goes to Med School: Exploring the Capabilities of Multimodal
Large Language Models on Medical Challenge Problems & Hallucinations
- Authors: Ankit Pal, Malaikannan Sankarasubbu
- Abstract summary: We comprehensively evaluated open-source LLMs and Google's new multimodal LLM, Gemini.
While Gemini showed competence, it lagged behind state-of-the-art models like MedPaLM 2 and GPT-4 in diagnostic accuracy.
Gemini is highly susceptible to hallucinations, overconfidence, and knowledge gaps, which indicate risks if deployed uncritically.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models have the potential to be valuable in the healthcare
industry, but it's crucial to verify their safety and effectiveness through
rigorous evaluation. For this purpose, we comprehensively evaluated both
open-source LLMs and Google's new multimodal LLM called Gemini across Medical
reasoning, hallucination detection, and Medical Visual Question Answering
tasks. While Gemini showed competence, it lagged behind state-of-the-art models
like MedPaLM 2 and GPT-4 in diagnostic accuracy. Additionally, Gemini achieved
an accuracy of 61.45% on the medical VQA dataset, significantly lower than
GPT-4V's score of 88%. Our analysis revealed that Gemini is highly susceptible
to hallucinations, overconfidence, and knowledge gaps, which indicate risks if
deployed uncritically. We also performed a detailed analysis by medical subject
and test type, providing actionable feedback for developers and clinicians. To
mitigate risks, we applied prompting strategies that improved performance.
Additionally, we facilitated future research and development by releasing a
Python module for medical LLM evaluation and establishing a dedicated
leaderboard on Hugging Face for medical-domain LLMs. The Python module can be
found at https://github.com/promptslab/RosettaEval
Related papers
- A Preliminary Study of o1 in Medicine: Are We Closer to an AI Doctor? [33.70022886795487]
OpenAI's o1 stands out as the first model with a chain-of-thought technique using reinforcement learning strategies.
This report provides a comprehensive exploration of o1 on different medical scenarios, examining 3 key aspects: understanding, reasoning, and multilinguality.
arXiv Detail & Related papers (2024-09-23T17:59:43Z)
- CARES: A Comprehensive Benchmark of Trustworthiness in Medical Vision Language Models [92.04812189642418]
We introduce CARES and aim to evaluate the Trustworthiness of Med-LVLMs across the medical domain.
We assess the trustworthiness of Med-LVLMs across five dimensions, including trustfulness, fairness, safety, privacy, and robustness.
arXiv Detail & Related papers (2024-06-10T04:07:09Z)
- Capabilities of Gemini Models in Medicine [100.60391771032887]
We introduce Med-Gemini, a family of highly capable multimodal models specialized in medicine.
We evaluate Med-Gemini on 14 medical benchmarks, establishing new state-of-the-art (SoTA) performance on 10 of them.
Our results offer compelling evidence for Med-Gemini's potential, although further rigorous evaluation will be crucial before real-world deployment.
arXiv Detail & Related papers (2024-04-29T04:11:28Z)
- Can LLMs Correct Physicians, Yet? Investigating Effective Interaction Methods in the Medical Domain [21.96129653695565]
Large Language Models (LLMs) can assist and potentially correct physicians in medical decision-making tasks.
We evaluate several LLMs, including Meditron, Llama2, and Mistral, to analyze the ability of these models to interact effectively with physicians across different scenarios.
arXiv Detail & Related papers (2024-03-29T16:59:13Z)
- AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator [69.51568871044454]
We introduce AI Hospital, a framework simulating dynamic medical interactions between a Doctor, as the player, and NPCs.
This setup allows for realistic assessments of LLMs in clinical scenarios.
We develop the Multi-View Medical Evaluation benchmark, utilizing high-quality Chinese medical records and NPCs.
arXiv Detail & Related papers (2024-02-15T06:46:48Z)
- A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise [78.54563675327198]
Gemini is Google's newest and most capable MLLM built from the ground up for multi-modality.
Can Gemini challenge GPT-4V's leading position in multi-modal learning?
We compare Gemini Pro with the state-of-the-art GPT-4V to evaluate its upper limits, along with the latest open-sourced MLLM, Sphinx.
arXiv Detail & Related papers (2023-12-19T18:59:22Z)
- An In-depth Look at Gemini's Language Abilities [49.897870833250494]
We compare the abilities of the OpenAI GPT and Google Gemini models.
We perform this analysis over 10 datasets testing a variety of language abilities.
We find that Gemini Pro achieves accuracy that is close to, but slightly below, that of the corresponding GPT-3.5 Turbo.
arXiv Detail & Related papers (2023-12-18T18:47:42Z)
- NoMIRACL: Knowing When You Don't Know for Robust Multilingual Retrieval-Augmented Generation [92.5132418788568]
Retrieval-augmented generation (RAG) grounds large language model (LLM) output by leveraging external knowledge sources to reduce factual hallucinations.
NoMIRACL is a human-annotated dataset for evaluating LLM robustness in RAG across 18 typologically diverse languages.
We measure robustness using two metrics: (i) hallucination rate, the model's tendency to hallucinate an answer when no answer is present in the passages of the non-relevant subset, and (ii) error rate, the model's failure to recognize relevant passages in the relevant subset.
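The two metrics above can be sketched in a few lines of Python. This is a simplified illustration, not NoMIRACL's actual evaluation code: it assumes abstentions are detected by an exact "I don't know" string, whereas the real benchmark uses prompted abstention phrasing.

```python
IDK = "I don't know"  # assumed abstention marker for this sketch


def hallucination_rate(non_relevant_preds):
    """Fraction of non-relevant-subset queries where the model answers
    anyway, instead of abstaining, although no answer is present."""
    answered = sum(1 for p in non_relevant_preds if p != IDK)
    return answered / len(non_relevant_preds)


def error_rate(relevant_preds):
    """Fraction of relevant-subset queries where the model abstains
    despite the answer being present in the retrieved passages."""
    abstained = sum(1 for p in relevant_preds if p == IDK)
    return abstained / len(relevant_preds)


# Toy predictions on each subset.
non_rel = [IDK, "Paris", IDK, "42"]  # 2 of 4 answered anyway
rel = ["Paris", IDK, "1969"]         # 1 of 3 abstained wrongly

print(hallucination_rate(non_rel))  # 0.5
print(error_rate(rel))
```

A robust model drives both rates toward zero simultaneously; a model that always abstains trivially achieves zero hallucination rate but a 100% error rate, which is why the two metrics are reported together.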
arXiv Detail & Related papers (2023-12-18T17:18:04Z)
- Med-HALT: Medical Domain Hallucination Test for Large Language Models [0.0]
This research paper focuses on the challenges posed by hallucinations in large language models (LLMs).
We propose a new benchmark and dataset, Med-HALT (Medical Domain Hallucination Test), designed specifically to evaluate and reduce hallucinations.
arXiv Detail & Related papers (2023-07-28T06:43:04Z)
- Complex Mixer for MedMNIST Classification Decathlon [12.402054374952485]
We develop a Complex Mixer (C-Mixer) with a pre-training framework to alleviate the problem of insufficient information and uncertainty in the label space.
Our method shows surprising potential on both the standard MedMNIST (v2) dataset and the customized weakly supervised datasets.
arXiv Detail & Related papers (2023-04-20T02:34:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information listed here and is not responsible for any consequences of its use.