A Comparative Study of Open-Source Large Language Models, GPT-4 and
Claude 2: Multiple-Choice Test Taking in Nephrology
- URL: http://arxiv.org/abs/2308.04709v1
- Date: Wed, 9 Aug 2023 05:01:28 GMT
- Authors: Sean Wu, Michael Koo, Lesley Blum, Andy Black, Liyo Kao, Fabien
Scalzo, Ira Kurtz
- Abstract summary: The study evaluated the ability of LLMs to provide correct answers to nephSAP multiple-choice questions. The findings may have significant implications for the future of medical training and patient care.
- Score: 0.6213359027997152
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, there have been significant breakthroughs in the field of
natural language processing, particularly with the development of large
language models (LLMs). These LLMs have showcased remarkable capabilities on
various benchmarks. In the healthcare field, the exact role LLMs and other
future AI models will play remains unclear. There is a potential for these
models in the future to be used as part of adaptive physician training, medical
co-pilot applications, and digital patient interaction scenarios. The ability
of AI models to participate in medical training and patient care will depend in
part on their mastery of the knowledge content of specific medical fields. This
study investigated the medical knowledge capability of LLMs, specifically in
the context of internal medicine subspecialty multiple-choice test-taking
ability. We compared the performance of several open-source LLMs (Koala 7B,
Falcon 7B, Stable-Vicuna 13B, and Orca Mini 13B) to GPT-4 and Claude 2 on
multiple-choice questions in the field of Nephrology. Nephrology was chosen as
an example of a particularly conceptually complex subspecialty field within
internal medicine. The study evaluated the ability of these LLMs to provide
correct answers to nephSAP (Nephrology Self-Assessment Program)
multiple-choice questions. The overall accuracy of the open-source LLMs on the
858 nephSAP multiple-choice questions ranged from 17.1% to 25.5%. In contrast,
Claude 2 answered 54.4% of the questions correctly, whereas GPT-4 achieved a
score of 73.3%. We show that current widely used open-source LLMs perform
poorly at zero-shot reasoning compared to GPT-4 and Claude 2. These findings
may have significant implications for the future of subspecialty medical
training and patient care.
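
To make the evaluation protocol concrete, here is a minimal sketch of a zero-shot multiple-choice grading loop in the spirit of the study. It is an illustration only: the prompt template, the question dictionary format, and the `query_model` callable are assumptions for this sketch, not the authors' published harness.

```python
# Minimal sketch of a zero-shot multiple-choice evaluation loop.
# The question format, prompt wording, and query_model() callable are
# hypothetical; the paper's actual test harness is not reproduced here.
import re

def build_prompt(stem: str, choices: dict[str, str]) -> str:
    """Format one question as a zero-shot prompt (no worked examples)."""
    options = "\n".join(f"{letter}. {text}" for letter, text in choices.items())
    return (
        "Answer the following nephrology multiple-choice question.\n"
        f"{stem}\n{options}\n"
        "Respond with the single letter of the best answer."
    )

def extract_choice(reply: str) -> str | None:
    """Pull the first standalone option letter out of the model's reply."""
    match = re.search(r"\b([A-E])\b", reply.upper())
    return match.group(1) if match else None

def score(questions: list[dict], query_model) -> float:
    """Accuracy = correct answers / total questions (e.g. 858 for nephSAP)."""
    correct = 0
    for q in questions:
        reply = query_model(build_prompt(q["stem"], q["choices"]))
        if extract_choice(reply) == q["answer"]:
            correct += 1
    return correct / len(questions)
```

Given a parsed question set and a wrapper around any of the models compared above, `score(questions, query_model)` returns the accuracy figure that the percentages in the abstract correspond to; both names are placeholders for this sketch.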
Related papers
- The Potential of LLMs in Medical Education: Generating Questions and Answers for Qualification Exams [9.802579169561781]
Large language models (LLMs) can generate medical qualification exam questions and corresponding answers based on few-shot prompts.
The study found that, with few-shot prompting, LLMs can effectively mimic real-world medical qualification exam questions.
arXiv Detail & Related papers (2024-10-31T09:33:37Z) - A Preliminary Study of o1 in Medicine: Are We Closer to an AI Doctor? [33.70022886795487]
OpenAI's o1 stands out as the first model with a chain-of-thought technique using reinforcement learning strategies.
This report provides a comprehensive exploration of o1 on different medical scenarios, examining 3 key aspects: understanding, reasoning, and multilinguality.
arXiv Detail & Related papers (2024-09-23T17:59:43Z) - GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI [67.09501109871351]
Large Vision-Language Models (LVLMs) are capable of handling diverse data types such as imaging, text, and physiological signals.
GMAI-MMBench is the most comprehensive general medical AI benchmark to date, with a well-categorized data structure and multi-perceptual granularity.
It is constructed from 284 datasets across 38 medical image modalities, 18 clinical-related tasks, 18 departments, and 4 perceptual granularities in a Visual Question Answering (VQA) format.
arXiv Detail & Related papers (2024-08-06T17:59:21Z) - Performance of large language models in numerical vs. semantic medical knowledge: Benchmarking on evidence-based Q&As [1.0034156461900003]
Large language models (LLMs) show promising results in many aspects of language-based clinical practice.
We used a comprehensive medical knowledge graph (encompassing data from more than 50,000 peer-reviewed articles) and created the "EBMQA" dataset.
We benchmarked this dataset using more than 24,500 questions on two state-of-the-art LLMs: Chat-GPT4 and Claude3-Opus.
We found that both LLMs excelled more in semantic than numerical QAs, with Claude3 surpassing GPT4 in numerical QAs.
arXiv Detail & Related papers (2024-06-06T08:41:46Z) - Multiple Choice Questions and Large Languages Models: A Case Study with Fictional Medical Data [3.471944921180245]
We developed a fictional medical benchmark focused on a non-existent gland, the Glianorex.
This approach allowed us to isolate the knowledge of the LLM from its test-taking abilities.
We evaluated various open-source, proprietary, and domain-specific LLMs using these questions in a zero-shot setting.
arXiv Detail & Related papers (2024-06-04T15:08:56Z) - Towards Building Multilingual Language Model for Medicine [54.1382395897071]
We construct a multilingual medical corpus, containing approximately 25.5B tokens encompassing 6 main languages.
We propose a multilingual medical multi-choice question-answering benchmark with rationale, termed as MMedBench.
Our final model, MMed-Llama 3, with only 8B parameters, achieves superior performance compared to all other open-source models on both MMedBench and English benchmarks.
arXiv Detail & Related papers (2024-02-21T17:47:20Z) - MEDITRON-70B: Scaling Medical Pretraining for Large Language Models [91.25119823784705]
Large language models (LLMs) can potentially democratize access to medical knowledge.
We release MEDITRON: a suite of open-source LLMs with 7B and 70B parameters adapted to the medical domain.
arXiv Detail & Related papers (2023-11-27T18:49:43Z) - A Survey of Large Language Models in Medicine: Progress, Application, and Challenge [85.09998659355038]
Large language models (LLMs) have received substantial attention due to their capabilities for understanding and generating human language.
This review aims to provide a detailed overview of the development and deployment of LLMs in medicine.
arXiv Detail & Related papers (2023-11-09T02:55:58Z) - Augmenting Black-box LLMs with Medical Textbooks for Biomedical Question Answering (Published in Findings of EMNLP 2024) [48.17095875619711]
We present a system called LLMs Augmented with Medical Textbooks (LLM-AMT).
LLM-AMT integrates authoritative medical textbooks into the LLMs' framework using plug-and-play modules.
We found that medical textbooks serve as a more effective retrieval corpus than Wikipedia in the medical domain.
arXiv Detail & Related papers (2023-09-05T13:39:38Z) - Large Language Models Leverage External Knowledge to Extend Clinical
Insight Beyond Language Boundaries [48.48630043740588]
Large Language Models (LLMs) such as ChatGPT and Med-PaLM have excelled in various medical question-answering tasks.
We develop a novel in-context learning framework to enhance their performance.
arXiv Detail & Related papers (2023-05-17T12:31:26Z)