A Comparative Study of Open-Source Large Language Models, GPT-4 and
  Claude 2: Multiple-Choice Test Taking in Nephrology
- URL: http://arxiv.org/abs/2308.04709v1
- Date: Wed, 9 Aug 2023 05:01:28 GMT
- Title: A Comparative Study of Open-Source Large Language Models, GPT-4 and
  Claude 2: Multiple-Choice Test Taking in Nephrology
- Authors: Sean Wu, Michael Koo, Lesley Blum, Andy Black, Liyo Kao, Fabien
  Scalzo, Ira Kurtz
- Abstract summary: The study evaluated the ability of LLMs to provide correct answers to nephSAP multiple-choice questions.
The findings of this study potentially have significant implications for the future of subspecialty medical training and patient care.
- Score: 0.6213359027997152
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, there have been significant breakthroughs in the field of
natural language processing, particularly with the development of large
language models (LLMs). These LLMs have showcased remarkable capabilities on
various benchmarks. In the healthcare field, the exact role LLMs and other
future AI models will play remains unclear. There is a potential for these
models in the future to be used as part of adaptive physician training, medical
co-pilot applications, and digital patient interaction scenarios. The ability
of AI models to participate in medical training and patient care will depend in
part on their mastery of the knowledge content of specific medical fields. This
study investigated the medical knowledge capability of LLMs, specifically in
the context of internal medicine subspecialty multiple-choice test-taking
ability. We compared the performance of several open-source LLMs (Koala 7B,
Falcon 7B, Stable-Vicuna 13B, and Orca Mini 13B), to GPT-4 and Claude 2 on
multiple-choice questions in the field of Nephrology. Nephrology was chosen as
an example of a particularly conceptually complex subspecialty field within
internal medicine. The study was conducted to evaluate the ability of LLM
models to provide correct answers to nephSAP (Nephrology Self-Assessment
Program) multiple-choice questions. The overall success of open-sourced LLMs in
answering the 858 nephSAP multiple-choice questions correctly was 17.1% -
25.5%. In contrast, Claude 2 answered 54.4% of the questions correctly, whereas
GPT-4 achieved a score of 73.3%. We show that current widely used open-sourced
LLMs do poorly in their ability for zero-shot reasoning when compared to GPT-4
and Claude 2. The findings of this study potentially have significant
implications for the future of subspecialty medical training and patient care.
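
The accuracies reported in the abstract can be turned into approximate counts of correct answers. The following is a minimal sketch, assuming only the figures quoted above (858 questions; 17.1% to 25.5% for the open-source models, 54.4% for Claude 2, 73.3% for GPT-4). The abstract does not state raw counts, so the results are estimates limited by the rounding of the published percentages; the names `reported_accuracy` and `implied_correct` are illustrative and not taken from the paper.

```python
# Approximate number of correct answers implied by the reported accuracies.
# Assumption: percentages are rounded to one decimal, so counts are estimates.
TOTAL_QUESTIONS = 858  # number of nephSAP questions stated in the abstract

reported_accuracy = {
    "open-source LLMs (low end)": 17.1,
    "open-source LLMs (high end)": 25.5,
    "Claude 2": 54.4,
    "GPT-4": 73.3,
}


def implied_correct(percent, total=TOTAL_QUESTIONS):
    """Return the nearest whole number of correct answers for a percentage."""
    return round(percent / 100 * total)


implied_counts = {
    name: implied_correct(percent) for name, percent in reported_accuracy.items()
}

for name, count in implied_counts.items():
    print(f"{name}: about {count} of {TOTAL_QUESTIONS} correct")
```

Under these assumptions the gap between GPT-4 and the best open-source model works out to roughly 400 questions out of 858.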
 
      
Related papers
- Fact or Guesswork? Evaluating Large Language Model's Medical Knowledge with Structured One-Hop Judgment [108.55277188617035]
 Large language models (LLMs) have been widely adopted in various downstream task domains, but their ability to directly recall and apply factual medical knowledge remains under-explored.
Most existing medical QA benchmarks assess complex reasoning or multi-hop inference, making it difficult to isolate LLMs' inherent medical knowledge from their reasoning capabilities.
We introduce the Medical Knowledge Judgment, a dataset specifically designed to measure LLMs' one-hop factual medical knowledge.
 arXiv (2025-02-20T05:27:51Z)
- MedG-KRP: Medical Graph Knowledge Representation Probing [0.6496030410305753]
 Large language models (LLMs) have recently emerged as powerful tools, finding many medical applications.
We introduce a knowledge graph (KG)-based method to evaluate the biomedical reasoning abilities of LLMs.
We test GPT-4, Llama3-70b, and PalmyraMed-70b, a specialized medical model.
 arXiv (2024-12-14T22:23:20Z)
- The Potential of LLMs in Medical Education: Generating Questions and Answers for Qualification Exams [9.802579169561781]
 Large language models (LLMs) can generate medical qualification exam questions and corresponding answers based on few-shot prompts.
The study found that LLMs, after using few-shot prompts, can effectively mimic real-world medical qualification exam questions.
 arXiv (2024-10-31T09:33:37Z)
- A Preliminary Study of o1 in Medicine: Are We Closer to an AI Doctor? [33.70022886795487]
 OpenAI's o1 stands out as the first model with a chain-of-thought technique using reinforcement learning strategies.
This report provides a comprehensive exploration of o1 on different medical scenarios, examining 3 key aspects: understanding, reasoning, and multilinguality.
 arXiv (2024-09-23T17:59:43Z)
- GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI [67.09501109871351]
 Large Vision-Language Models (LVLMs) are capable of handling diverse data types such as imaging, text, and physiological signals.
GMAI-MMBench is the most comprehensive general medical AI benchmark with well-categorized data structure and multi-perceptual granularity to date.
It is constructed from 284 datasets across 38 medical image modalities, 18 clinical-related tasks, 18 departments, and 4 perceptual granularities in a Visual Question Answering (VQA) format.
 arXiv (2024-08-06T17:59:21Z)
- MedExQA: Medical Question Answering Benchmark with Multiple Explanations [2.2246416434538308]
 This paper introduces MedExQA, a novel benchmark in medical question-answering to evaluate large language models' (LLMs) understanding of medical knowledge through explanations.
By constructing datasets across five distinct medical specialties, we address a major gap in current medical QA benchmarks.
Our work highlights the importance of explainability in medical LLMs, proposes an effective methodology for evaluating models beyond classification accuracy, and sheds light on one specific domain, speech language pathology.
 arXiv (2024-06-10T14:47:04Z)
- Performance of large language models in numerical vs. semantic medical knowledge: Benchmarking on evidence-based Q&As [1.0034156461900003]
 Large language models (LLMs) show promising results in many aspects of language-based clinical practice.
We used a comprehensive medical knowledge graph (encompassing data from more than 50,000 peer-reviewed articles) and created the "EBMQA" dataset.
We benchmarked this dataset using more than 24,500 questions on two state-of-the-art LLMs: Chat-GPT4 and Claude3-Opus.
We found that both LLMs excelled more in semantic than numerical QAs, with Claude3 surpassing GPT4 in numerical QAs.
 arXiv (2024-06-06T08:41:46Z)
- Multiple Choice Questions and Large Languages Models: A Case Study with Fictional Medical Data [3.471944921180245]
 We developed a fictional medical benchmark focused on a non-existent gland, the Glianorex.
This approach allowed us to isolate the knowledge of the LLM from its test-taking abilities.
We evaluated various open-source, proprietary, and domain-specific LLMs using these questions in a zero-shot setting.
 arXiv (2024-06-04T15:08:56Z)
- Towards Building Multilingual Language Model for Medicine [54.1382395897071]
 We construct a multilingual medical corpus, containing approximately 25.5B tokens encompassing 6 main languages.
We propose a multilingual medical multi-choice question-answering benchmark with rationale, termed MMedBench.
Our final model, MMed-Llama 3, with only 8B parameters, achieves superior performance compared to all other open-source models on both MMedBench and English benchmarks.
 arXiv (2024-02-21T17:47:20Z)
- MEDITRON-70B: Scaling Medical Pretraining for Large Language Models [91.25119823784705]
 Large language models (LLMs) can potentially democratize access to medical knowledge.
We release MEDITRON: a suite of open-source LLMs with 7B and 70B parameters adapted to the medical domain.
 arXiv (2023-11-27T18:49:43Z)
- A Survey of Large Language Models in Medicine: Progress, Application, and Challenge [85.09998659355038]
 Large language models (LLMs) have received substantial attention due to their capabilities for understanding and generating human language.
This review aims to provide a detailed overview of the development and deployment of LLMs in medicine.
 arXiv (2023-11-09T02:55:58Z)
- Augmenting Black-box LLMs with Medical Textbooks for Biomedical Question Answering (Published in Findings of EMNLP 2024) [48.17095875619711]
 We present a system called LLMs Augmented with Medical Textbooks (LLM-AMT).
LLM-AMT integrates authoritative medical textbooks into the LLMs' framework using plug-and-play modules.
We found that medical textbooks are a more effective retrieval corpus than Wikipedia in the medical domain.
 arXiv (2023-09-05T13:39:38Z)
- Large Language Models Leverage External Knowledge to Extend Clinical
  Insight Beyond Language Boundaries [48.48630043740588]
 Large Language Models (LLMs) such as ChatGPT and Med-PaLM have excelled in various medical question-answering tasks.
We develop a novel in-context learning framework to enhance their performance.
 arXiv (2023-05-17T12:31:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
       
     