A Comparative Study of Open-Source Large Language Models, GPT-4 and
Claude 2: Multiple-Choice Test Taking in Nephrology
- URL: http://arxiv.org/abs/2308.04709v1
- Date: Wed, 9 Aug 2023 05:01:28 GMT
- Title: A Comparative Study of Open-Source Large Language Models, GPT-4 and
Claude 2: Multiple-Choice Test Taking in Nephrology
- Authors: Sean Wu, Michael Koo, Lesley Blum, Andy Black, Liyo Kao, Fabien
Scalzo, Ira Kurtz
- Abstract summary: The study evaluated the ability of LLMs to provide correct answers to nephSAP multiple-choice questions.
The findings may have significant implications for the future of subspecialty medical training and patient care.
- Score: 0.6213359027997152
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, there have been significant breakthroughs in the field of
natural language processing, particularly with the development of large
language models (LLMs). These LLMs have showcased remarkable capabilities on
various benchmarks. In the healthcare field, the exact role LLMs and other
future AI models will play remains unclear. These models could eventually be
used as part of adaptive physician training, medical co-pilot applications, and
digital patient interaction scenarios. The ability
of AI models to participate in medical training and patient care will depend in
part on their mastery of the knowledge content of specific medical fields. This
study investigated the medical knowledge of LLMs, specifically their
multiple-choice test-taking ability in an internal medicine subspecialty. We
compared the performance of several open-source LLMs (Koala 7B, Falcon 7B,
Stable-Vicuna 13B, and Orca Mini 13B) to that of GPT-4 and Claude 2 on
multiple-choice questions in the field of nephrology. Nephrology was chosen as
an example of a particularly conceptually complex subspecialty field within
internal medicine. The study evaluated the ability of LLMs to provide correct
answers to nephSAP (Nephrology Self-Assessment Program) multiple-choice
questions. The accuracy of the open-source LLMs on the 858 nephSAP
multiple-choice questions ranged from 17.1% to 25.5%. In contrast, Claude 2
answered 54.4% of the questions correctly, whereas
GPT-4 achieved a score of 73.3%. We show that current widely used open-source
LLMs perform poorly at zero-shot reasoning compared to GPT-4 and Claude 2. The
findings of this study may have significant implications for the future of
subspecialty medical training and patient care.
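The zero-shot protocol the abstract describes (prompt each model with a question and its answer choices, parse the returned letter, and score it against the nephSAP answer key) can be sketched roughly as below. This is a minimal illustration, not the authors' actual harness: the prompt wording, the `ask_model` callable, and the letter-parsing regex are all assumptions. As a sanity check on the reported numbers, GPT-4's 73.3% on 858 items corresponds to roughly 629 correct answers.

```python
import re

def build_prompt(question: str, choices: dict[str, str]) -> str:
    """Format one nephSAP-style item as a zero-shot prompt (wording assumed)."""
    lines = [question]
    lines += [f"{letter}. {text}" for letter, text in sorted(choices.items())]
    lines.append("Answer with the single letter of the best choice.")
    return "\n".join(lines)

def parse_answer(reply: str) -> str | None:
    """Naively pull the first standalone choice letter (A-E) from a reply."""
    match = re.search(r"\b([A-E])\b", reply.upper())
    return match.group(1) if match else None

def accuracy(items: list[dict], ask_model) -> float:
    """Zero-shot accuracy: fraction of items whose parsed letter matches the key.

    `ask_model` is a hypothetical callable that sends one prompt to the LLM
    under test and returns its raw text reply.
    """
    correct = sum(
        parse_answer(ask_model(build_prompt(item["question"], item["choices"])))
        == item["key"]
        for item in items
    )
    return correct / len(items)
```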
Related papers
- Fact or Guesswork? Evaluating Large Language Model's Medical Knowledge with Structured One-Hop Judgment [108.55277188617035]
Large language models (LLMs) have been widely adopted in various downstream task domains, but their ability to directly recall and apply factual medical knowledge remains under-explored.
Most existing medical QA benchmarks assess complex reasoning or multi-hop inference, making it difficult to isolate LLMs' inherent medical knowledge from their reasoning capabilities.
We introduce the Medical Knowledge Judgment, a dataset specifically designed to measure LLMs' one-hop factual medical knowledge.
arXiv Detail & Related papers (2025-02-20T05:27:51Z)
- MedG-KRP: Medical Graph Knowledge Representation Probing [0.6496030410305753]
Large language models (LLMs) have recently emerged as powerful tools, finding many medical applications.
We introduce a knowledge graph (KG)-based method to evaluate the biomedical reasoning abilities of LLMs.
We test GPT-4, Llama3-70b, and PalmyraMed-70b, a specialized medical model.
arXiv Detail & Related papers (2024-12-14T22:23:20Z)
- The Potential of LLMs in Medical Education: Generating Questions and Answers for Qualification Exams [9.802579169561781]
Large language models (LLMs) can generate medical qualification exam questions and corresponding answers based on few-shot prompts.
The study found that, given few-shot prompts, LLMs can effectively mimic real-world medical qualification exam questions.
arXiv Detail & Related papers (2024-10-31T09:33:37Z)
- A Preliminary Study of o1 in Medicine: Are We Closer to an AI Doctor? [33.70022886795487]
OpenAI's o1 stands out as the first model to apply a chain-of-thought technique trained with reinforcement learning strategies.
This report provides a comprehensive exploration of o1 on different medical scenarios, examining 3 key aspects: understanding, reasoning, and multilinguality.
arXiv Detail & Related papers (2024-09-23T17:59:43Z)
- GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI [67.09501109871351]
Large Vision-Language Models (LVLMs) are capable of handling diverse data types such as imaging, text, and physiological signals.
GMAI-MMBench is the most comprehensive general medical AI benchmark with well-categorized data structure and multi-perceptual granularity to date.
It is constructed from 284 datasets across 38 medical image modalities, 18 clinical-related tasks, 18 departments, and 4 perceptual granularities in a Visual Question Answering (VQA) format.
arXiv Detail & Related papers (2024-08-06T17:59:21Z)
- Multiple Choice Questions and Large Language Models: A Case Study with Fictional Medical Data [3.471944921180245]
We developed a fictional medical benchmark focused on a non-existent gland, the Glianorex.
This approach allowed us to isolate the knowledge of the LLM from its test-taking abilities.
We evaluated various open-source, proprietary, and domain-specific LLMs using these questions in a zero-shot setting.
arXiv Detail & Related papers (2024-06-04T15:08:56Z)
- MEDITRON-70B: Scaling Medical Pretraining for Large Language Models [91.25119823784705]
Large language models (LLMs) can potentially democratize access to medical knowledge.
We release MEDITRON: a suite of open-source LLMs with 7B and 70B parameters adapted to the medical domain.
arXiv Detail & Related papers (2023-11-27T18:49:43Z)
- A Survey of Large Language Models in Medicine: Progress, Application, and Challenge [85.09998659355038]
Large language models (LLMs) have received substantial attention due to their capabilities for understanding and generating human language.
This review aims to provide a detailed overview of the development and deployment of LLMs in medicine.
arXiv Detail & Related papers (2023-11-09T02:55:58Z)
- Augmenting Black-box LLMs with Medical Textbooks for Biomedical Question Answering (Published in Findings of EMNLP 2024) [48.17095875619711]
We present a system called LLMs Augmented with Medical Textbooks (LLM-AMT).
LLM-AMT integrates authoritative medical textbooks into the LLMs' framework using plug-and-play modules.
We found that medical textbooks serve as a more effective retrieval corpus than Wikipedia in the medical domain.
arXiv Detail & Related papers (2023-09-05T13:39:38Z)
- Large Language Models Leverage External Knowledge to Extend Clinical Insight Beyond Language Boundaries [48.48630043740588]
Large Language Models (LLMs) such as ChatGPT and Med-PaLM have excelled in various medical question-answering tasks.
We develop a novel in-context learning framework to enhance their performance.
arXiv Detail & Related papers (2023-05-17T12:31:26Z)