A Comparative Study of Open-Source Large Language Models, GPT-4 and
Claude 2: Multiple-Choice Test Taking in Nephrology
- URL: http://arxiv.org/abs/2308.04709v1
- Date: Wed, 9 Aug 2023 05:01:28 GMT
- Title: A Comparative Study of Open-Source Large Language Models, GPT-4 and
Claude 2: Multiple-Choice Test Taking in Nephrology
- Authors: Sean Wu, Michael Koo, Lesley Blum, Andy Black, Liyo Kao, Fabien
Scalzo, Ira Kurtz
- Abstract summary: The study evaluated the ability of LLMs to provide correct answers to nephSAP multiple-choice questions.
The findings may have significant implications for the future of subspecialty medical training and patient care.
- Score: 0.6213359027997152
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, there have been significant breakthroughs in the field of
natural language processing, particularly with the development of large
language models (LLMs). These LLMs have showcased remarkable capabilities on
various benchmarks. In the healthcare field, the exact role LLMs and other
future AI models will play remains unclear. These models could eventually be
used as part of adaptive physician training, medical co-pilot applications, and
digital patient interaction scenarios. The ability
of AI models to participate in medical training and patient care will depend in
part on their mastery of the knowledge content of specific medical fields. This
study investigated the medical knowledge of LLMs, specifically their
multiple-choice test-taking ability in an internal medicine subspecialty. We
compared the performance of several open-source LLMs (Koala 7B, Falcon 7B,
Stable-Vicuna 13B, and Orca Mini 13B) to that of GPT-4 and Claude 2 on
multiple-choice questions in the field of nephrology. Nephrology was chosen as
an example of a particularly conceptually complex subspecialty field within
internal medicine. The study evaluated the ability of LLMs to provide correct
answers to nephSAP (Nephrology Self-Assessment Program) multiple-choice
questions. The accuracy of the open-source LLMs on the 858 nephSAP
multiple-choice questions ranged from 17.1% to 25.5%. In contrast, Claude 2
answered 54.4% of the questions correctly, whereas
GPT-4 achieved a score of 73.3%. We show that current widely used open-source
LLMs perform poorly at zero-shot reasoning compared to GPT-4 and Claude 2. The
findings of this study may have significant implications for the future of
subspecialty medical training and patient care.
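The zero-shot protocol the abstract describes (prompt each model with a question and its answer choices, parse the returned letter, and score it against the nephSAP answer key) can be sketched roughly as below. This is a minimal illustration, not the authors' actual harness: the prompt wording, the `ask_model` callable, and the letter-parsing regex are all assumptions. As a sanity check on the reported numbers, GPT-4's 73.3% on 858 items corresponds to roughly 629 correct answers.

```python
import re

def build_prompt(question: str, choices: dict[str, str]) -> str:
    """Format one nephSAP-style item as a zero-shot prompt (wording assumed)."""
    lines = [question]
    lines += [f"{letter}. {text}" for letter, text in sorted(choices.items())]
    lines.append("Answer with the single letter of the best choice.")
    return "\n".join(lines)

def parse_answer(reply: str) -> str | None:
    """Naively pull the first standalone choice letter (A-E) from a reply."""
    match = re.search(r"\b([A-E])\b", reply.upper())
    return match.group(1) if match else None

def accuracy(items: list[dict], ask_model) -> float:
    """Zero-shot accuracy: fraction of items whose parsed letter matches the key.

    `ask_model` is a hypothetical callable that sends one prompt to the LLM
    under test and returns its raw text reply.
    """
    correct = sum(
        parse_answer(ask_model(build_prompt(item["question"], item["choices"])))
        == item["key"]
        for item in items
    )
    return correct / len(items)
```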
Related papers
- Fact or Guesswork? Evaluating Large Language Model's Medical Knowledge with Structured One-Hop Judgment [108.55277188617035]
Large language models (LLMs) have been widely adopted in various downstream task domains, but their ability to directly recall and apply factual medical knowledge remains under-explored.
Most existing medical QA benchmarks assess complex reasoning or multi-hop inference, making it difficult to isolate LLMs' inherent medical knowledge from their reasoning capabilities.
We introduce the Medical Knowledge Judgment, a dataset specifically designed to measure LLMs' one-hop factual medical knowledge.
arXiv Detail & Related papers (2025-02-20T05:27:51Z)
- MedG-KRP: Medical Graph Knowledge Representation Probing [0.6496030410305753]
Large language models (LLMs) have recently emerged as powerful tools, finding many medical applications.
We introduce a knowledge graph (KG)-based method to evaluate the biomedical reasoning abilities of LLMs.
We test GPT-4, Llama3-70b, and PalmyraMed-70b, a specialized medical model.
arXiv Detail & Related papers (2024-12-14T22:23:20Z)
- The Potential of LLMs in Medical Education: Generating Questions and Answers for Qualification Exams [9.802579169561781]
Large language models (LLMs) can generate medical qualification exam questions and corresponding answers based on few-shot prompts.
The study found that, given few-shot prompts, LLMs can effectively mimic real-world medical qualification exam questions.
arXiv Detail & Related papers (2024-10-31T09:33:37Z)
- A Preliminary Study of o1 in Medicine: Are We Closer to an AI Doctor? [33.70022886795487]
OpenAI's o1 stands out as the first model to apply a chain-of-thought technique trained with reinforcement learning strategies.
This report provides a comprehensive exploration of o1 on different medical scenarios, examining 3 key aspects: understanding, reasoning, and multilinguality.
arXiv Detail & Related papers (2024-09-23T17:59:43Z)
- GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI [67.09501109871351]
Large Vision-Language Models (LVLMs) are capable of handling diverse data types such as imaging, text, and physiological signals.
GMAI-MMBench is the most comprehensive general medical AI benchmark with well-categorized data structure and multi-perceptual granularity to date.
It is constructed from 284 datasets across 38 medical image modalities, 18 clinical-related tasks, 18 departments, and 4 perceptual granularities in a Visual Question Answering (VQA) format.
arXiv Detail & Related papers (2024-08-06T17:59:21Z)
- Multiple Choice Questions and Large Language Models: A Case Study with Fictional Medical Data [3.471944921180245]
We developed a fictional medical benchmark focused on a non-existent gland, the Glianorex.
This approach allowed us to isolate the knowledge of the LLM from its test-taking abilities.
We evaluated various open-source, proprietary, and domain-specific LLMs using these questions in a zero-shot setting.
arXiv Detail & Related papers (2024-06-04T15:08:56Z)
- MEDITRON-70B: Scaling Medical Pretraining for Large Language Models [91.25119823784705]
Large language models (LLMs) can potentially democratize access to medical knowledge.
We release MEDITRON: a suite of open-source LLMs with 7B and 70B parameters adapted to the medical domain.
arXiv Detail & Related papers (2023-11-27T18:49:43Z)
- A Survey of Large Language Models in Medicine: Progress, Application, and Challenge [85.09998659355038]
Large language models (LLMs) have received substantial attention due to their capabilities for understanding and generating human language.
This review aims to provide a detailed overview of the development and deployment of LLMs in medicine.
arXiv Detail & Related papers (2023-11-09T02:55:58Z)
- Augmenting Black-box LLMs with Medical Textbooks for Biomedical Question Answering (Published in Findings of EMNLP 2024) [48.17095875619711]
We present a system called LLMs Augmented with Medical Textbooks (LLM-AMT).
LLM-AMT integrates authoritative medical textbooks into the LLMs' framework using plug-and-play modules.
We found that medical textbooks serve as a more effective retrieval corpus than Wikipedia in the medical domain.
arXiv Detail & Related papers (2023-09-05T13:39:38Z)
- Large Language Models Leverage External Knowledge to Extend Clinical Insight Beyond Language Boundaries [48.48630043740588]
Large Language Models (LLMs) such as ChatGPT and Med-PaLM have excelled in various medical question-answering tasks.
We develop a novel in-context learning framework to enhance their performance.
arXiv Detail & Related papers (2023-05-17T12:31:26Z)