Towards Expert-Level Medical Question Answering with Large Language
Models
- URL: http://arxiv.org/abs/2305.09617v1
- Date: Tue, 16 May 2023 17:11:29 GMT
- Title: Towards Expert-Level Medical Question Answering with Large Language
Models
- Authors: Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Le
Hou, Kevin Clark, Stephen Pfohl, Heather Cole-Lewis, Darlene Neal, Mike
Schaekermann, Amy Wang, Mohamed Amin, Sami Lachgar, Philip Mansfield, Sushant
Prakash, Bradley Green, Ewa Dominowska, Blaise Aguera y Arcas, Nenad Tomasev,
Yun Liu, Renee Wong, Christopher Semturs, S. Sara Mahdavi, Joelle Barral,
Dale Webster, Greg S. Corrado, Yossi Matias, Shekoofeh Azizi, Alan
Karthikesalingam, Vivek Natarajan
- Abstract summary: Large language models (LLMs) have catalyzed significant progress in medical question answering.
Here we present MedPaLM 2, which bridges gaps by leveraging combination of base improvements (PaLM 2), medical domain fine improvements, and prompting strategies.
We also observed approaching or exceeding state-the-art across MedMC-ofQA, PubMed, MMLU clinical topics datasets.
- Score: 16.882775912583355
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent artificial intelligence (AI) systems have reached milestones in "grand
challenges" ranging from Go to protein-folding. The capability to retrieve
medical knowledge, reason over it, and answer medical questions comparably to
physicians has long been viewed as one such grand challenge.
Large language models (LLMs) have catalyzed significant progress in medical
question answering; Med-PaLM was the first model to exceed a "passing" score in
US Medical Licensing Examination (USMLE) style questions with a score of 67.2%
on the MedQA dataset. However, this and other prior work suggested significant
room for improvement, especially when models' answers were compared to
clinicians' answers. Here we present Med-PaLM 2, which bridges these gaps by
leveraging a combination of base LLM improvements (PaLM 2), medical domain
finetuning, and prompting strategies including a novel ensemble refinement
approach.
Med-PaLM 2 scored up to 86.5% on the MedQA dataset, improving upon Med-PaLM
by over 19% and setting a new state-of-the-art. We also observed performance
approaching or exceeding state-of-the-art across MedMCQA, PubMedQA, and MMLU
clinical topics datasets.
We performed detailed human evaluations on long-form questions along multiple
axes relevant to clinical applications. In pairwise comparative ranking of 1066
consumer medical questions, physicians preferred Med-PaLM 2 answers to those
produced by physicians on eight of nine axes pertaining to clinical utility (p
< 0.001). We also observed significant improvements compared to Med-PaLM on
every evaluation axis (p < 0.001) on newly introduced datasets of 240 long-form
"adversarial" questions to probe LLM limitations.
While further studies are necessary to validate the efficacy of these models
in real-world settings, these results highlight rapid progress towards
physician-level performance in medical question answering.
Related papers
- A Preliminary Study of o1 in Medicine: Are We Closer to an AI Doctor? [33.70022886795487]
OpenAI's o1 stands out as the first model with a chain-of-thought technique using reinforcement learning strategies.
This report provides a comprehensive exploration of o1 on different medical scenarios, examining 3 key aspects: understanding, reasoning, and multilinguality.
arXiv Detail & Related papers (2024-09-23T17:59:43Z) - Towards Evaluating and Building Versatile Large Language Models for Medicine [57.49547766838095]
We present MedS-Bench, a benchmark designed to evaluate the performance of large language models (LLMs) in clinical contexts.
MedS-Bench spans 11 high-level clinical tasks, including clinical report summarization, treatment recommendations, diagnosis, named entity recognition, and medical concept explanation.
MedS-Ins comprises 58 medically oriented language corpora, totaling 13.5 million samples across 122 tasks.
arXiv Detail & Related papers (2024-08-22T17:01:34Z) - Capabilities of Gemini Models in Medicine [100.60391771032887]
We introduce Med-Gemini, a family of highly capable multimodal models specialized in medicine.
We evaluate Med-Gemini on 14 medical benchmarks, establishing new state-of-the-art (SoTA) performance on 10 of them.
Our results offer compelling evidence for Med-Gemini's potential, although further rigorous evaluation will be crucial before real-world deployment.
arXiv Detail & Related papers (2024-04-29T04:11:28Z) - Med42 -- Evaluating Fine-Tuning Strategies for Medical LLMs: Full-Parameter vs. Parameter-Efficient Approaches [7.3384872719063114]
We develop and refined a series of medical Large Language Models (LLMs) based on the Llama-2 architecture.
Our experiments systematically evaluate the effectiveness of these tuning strategies across various well-known medical benchmarks.
arXiv Detail & Related papers (2024-04-23T06:36:21Z) - Asclepius: A Spectrum Evaluation Benchmark for Medical Multi-Modal Large
Language Models [59.60384461302662]
We introduce Asclepius, a novel benchmark for evaluating Medical Multi-Modal Large Language Models (Med-MLLMs)
Asclepius rigorously and comprehensively assesses model capability in terms of distinct medical specialties and different diagnostic capacities.
We also provide an in-depth analysis of 6 Med-MLLMs and compare them with 5 human specialists.
arXiv Detail & Related papers (2024-02-17T08:04:23Z) - AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator [69.51568871044454]
We introduce textbfAI Hospital, a framework simulating dynamic medical interactions between emphDoctor as player and NPCs.
This setup allows for realistic assessments of LLMs in clinical scenarios.
We develop the Multi-View Medical Evaluation benchmark, utilizing high-quality Chinese medical records and NPCs.
arXiv Detail & Related papers (2024-02-15T06:46:48Z) - OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM [48.16696073640864]
We introduce OmniMedVQA, a novel comprehensive medical Visual Question Answering (VQA) benchmark.
All images in this benchmark are sourced from authentic medical scenarios.
We have found that existing LVLMs struggle to address these medical VQA problems effectively.
arXiv Detail & Related papers (2024-02-14T13:51:56Z) - SM70: A Large Language Model for Medical Devices [0.6906005491572401]
We introduce SM70, a large language model specifically designed for SpassMed's medical devices under the brand name 'JEE1' (pronounced as G1 and means 'Life')
To fine-tune SM70, we used around 800K data entries from the publicly available dataset MedAlpaca.
The evaluation is conducted across three benchmark datasets - MEDQA - USMLE, PUBMEDQA, and USMLE.
arXiv Detail & Related papers (2023-12-12T04:25:26Z) - Med-Flamingo: a Multimodal Medical Few-shot Learner [58.85676013818811]
We propose Med-Flamingo, a multimodal few-shot learner adapted to the medical domain.
Based on OpenFlamingo-9B, we continue pre-training on paired and interleaved medical image-text data from publications and textbooks.
We conduct the first human evaluation for generative medical VQA where physicians review the problems and blinded generations in an interactive app.
arXiv Detail & Related papers (2023-07-27T20:36:02Z) - Large Language Models Encode Clinical Knowledge [21.630872464930587]
Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation.
We propose a framework for human evaluation of model answers along multiple axes including factuality, precision, possible harm, and bias.
We show that comprehension, recall of knowledge, and medical reasoning improve with model scale and instruction prompt tuning.
arXiv Detail & Related papers (2022-12-26T14:28:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.