Assessing The Potential Of Mid-Sized Language Models For Clinical QA
- URL: http://arxiv.org/abs/2404.15894v1
- Date: Wed, 24 Apr 2024 14:32:34 GMT
- Title: Assessing The Potential Of Mid-Sized Language Models For Clinical QA
- Authors: Elliot Bolton, Betty Xiong, Vijaytha Muralidharan, Joel Schamroth, Vivek Muralidharan, Christopher D. Manning, Roxana Daneshjou,
- Abstract summary: Large language models, such as GPT-4 and Med-PaLM, have shown impressive performance on clinical tasks.
Mid-size models such as BioGPT-large, BioMedLM, LLaMA 2, and Mistral 7B avoid these drawbacks.
This study provides the first head-to-head assessment of open source mid-sized models on clinical tasks.
- Score: 24.116649037975762
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models, such as GPT-4 and Med-PaLM, have shown impressive performance on clinical tasks; however, they require access to compute, are closed-source, and cannot be deployed on device. Mid-size models such as BioGPT-large, BioMedLM, LLaMA 2, and Mistral 7B avoid these drawbacks, but their capacity for clinical tasks has been understudied. To help assess their potential for clinical use and help researchers decide which model they should use, we compare their performance on two clinical question-answering (QA) tasks: MedQA and consumer query answering. We find that Mistral 7B is the best performing model, winning on all benchmarks and outperforming models trained specifically for the biomedical domain. While Mistral 7B's MedQA score of 63.0% approaches the original Med-PaLM, and it often can produce plausible responses to consumer health queries, room for improvement still exists. This study provides the first head-to-head assessment of open source mid-sized models on clinical tasks.
Related papers
- Capabilities of Gemini Models in Medicine [100.60391771032887]
We introduce Med-Gemini, a family of highly capable multimodal models specialized in medicine.
We evaluate Med-Gemini on 14 medical benchmarks, establishing new state-of-the-art (SoTA) performance on 10 of them.
Our results offer compelling evidence for Med-Gemini's potential, although further rigorous evaluation will be crucial before real-world deployment.
arXiv Detail & Related papers (2024-04-29T04:11:28Z) - Small Language Models Learn Enhanced Reasoning Skills from Medical Textbooks [17.40940406100025]
We introduce Meerkat, a new family of medical AI systems ranging from 7 to 70 billion parameters.
Our systems achieved remarkable accuracy across six medical benchmarks.
Meerkat-70B correctly diagnosed 21 out of 38 complex clinical cases, outperforming humans' 13.8.
arXiv Detail & Related papers (2024-03-30T14:09:00Z) - BioMedLM: A 2.7B Parameter Language Model Trained On Biomedical Text [82.7001841679981]
BioMedLM is a 2.7 billion parameter GPT-style autoregressive model trained exclusively on PubMed abstracts and full articles.
When fine-tuned, BioMedLM can produce strong multiple-choice biomedical question-answering results competitive with larger models.
BioMedLM can also be fine-tuned to produce useful answers to patient questions on medical topics.
arXiv Detail & Related papers (2024-03-27T10:18:21Z) - Towards a clinically accessible radiology foundation model: open-access and lightweight, with automated evaluation [113.5002649181103]
Training open-source small multimodal models (SMMs) to bridge competency gaps for unmet clinical needs in radiology.
For training, we assemble a large dataset of over 697 thousand radiology image-text pairs.
For evaluation, we propose CheXprompt, a GPT-4-based metric for factuality evaluation, and demonstrate its parity with expert evaluation.
The inference of LlaVA-Rad is fast and can be performed on a single V100 GPU in private settings, offering a promising state-of-the-art tool for real-world clinical applications.
arXiv Detail & Related papers (2024-03-12T18:12:02Z) - Asclepius: A Spectrum Evaluation Benchmark for Medical Multi-Modal Large
Language Models [59.60384461302662]
We introduce Asclepius, a novel benchmark for evaluating Medical Multi-Modal Large Language Models (Med-MLLMs)
Asclepius rigorously and comprehensively assesses model capability in terms of distinct medical specialties and different diagnostic capacities.
We also provide an in-depth analysis of 6 Med-MLLMs and compare them with 5 human specialists.
arXiv Detail & Related papers (2024-02-17T08:04:23Z) - SM70: A Large Language Model for Medical Devices [0.6906005491572401]
We introduce SM70, a large language model specifically designed for SpassMed's medical devices under the brand name 'JEE1' (pronounced as G1 and means 'Life')
To fine-tune SM70, we used around 800K data entries from the publicly available dataset MedAlpaca.
The evaluation is conducted across three benchmark datasets - MEDQA - USMLE, PUBMEDQA, and USMLE.
arXiv Detail & Related papers (2023-12-12T04:25:26Z) - MEDITRON-70B: Scaling Medical Pretraining for Large Language Models [91.25119823784705]
Large language models (LLMs) can potentially democratize access to medical knowledge.
We release MEDITRON: a suite of open-source LLMs with 7B and 70B parameters adapted to the medical domain.
arXiv Detail & Related papers (2023-11-27T18:49:43Z) - MedAlign: A Clinician-Generated Dataset for Instruction Following with
Electronic Medical Records [60.35217378132709]
Large language models (LLMs) can follow natural language instructions with human-level fluency.
evaluating LLMs on realistic text generation tasks for healthcare remains challenging.
We introduce MedAlign, a benchmark dataset of 983 natural language instructions for EHR data.
arXiv Detail & Related papers (2023-08-27T12:24:39Z) - Towards Expert-Level Medical Question Answering with Large Language
Models [16.882775912583355]
Large language models (LLMs) have catalyzed significant progress in medical question answering.
Here we present MedPaLM 2, which bridges gaps by leveraging combination of base improvements (PaLM 2), medical domain fine improvements, and prompting strategies.
We also observed approaching or exceeding state-the-art across MedMC-ofQA, PubMed, MMLU clinical topics datasets.
arXiv Detail & Related papers (2023-05-16T17:11:29Z) - Large Language Models Encode Clinical Knowledge [21.630872464930587]
Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation.
We propose a framework for human evaluation of model answers along multiple axes including factuality, precision, possible harm, and bias.
We show that comprehension, recall of knowledge, and medical reasoning improve with model scale and instruction prompt tuning.
arXiv Detail & Related papers (2022-12-26T14:28:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.