GPT-4 can pass the Korean National Licensing Examination for Korean
Medicine Doctors
- URL: http://arxiv.org/abs/2303.17807v2
- Date: Fri, 17 Nov 2023 01:49:57 GMT
- Title: GPT-4 can pass the Korean National Licensing Examination for Korean
Medicine Doctors
- Authors: Dongyeop Jang, Tae-Rim Yun, Choong-Yeol Lee, Young-Kyu Kwon, Chang-Eop
Kim
- Abstract summary: This study assessed the capabilities of GPT-4 in traditional Korean medicine (TKM).
We optimized prompts with Chinese-term annotation, English translation for questions and instruction, exam-optimized instruction, and self-consistency.
GPT-4 with optimized prompts achieved 66.18% accuracy, surpassing the examination's average pass mark of 60% and the 40% minimum for each subject.
- Score: 9.374652839580182
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Traditional Korean medicine (TKM) emphasizes individualized diagnosis and
treatment. This uniqueness makes AI modeling difficult due to limited data and
implicit processes. Large language models (LLMs) have demonstrated impressive
medical inference, even without advanced training in medical texts. This study
assessed the capabilities of GPT-4 in TKM, using the Korean National Licensing
Examination for Korean Medicine Doctors (K-NLEKMD) as a benchmark. The
K-NLEKMD, administered by a national organization, encompasses 12 major
subjects in TKM. We optimized prompts with Chinese-term annotation, English
translation for questions and instruction, exam-optimized instruction, and
self-consistency. GPT-4 with optimized prompts achieved 66.18% accuracy,
surpassing both the examination's average pass mark of 60% and the 40% minimum
for each subject. The gradual introduction of language-related prompts and
prompting techniques enhanced the accuracy from 51.82% to its maximum accuracy.
GPT-4 showed low accuracy in subjects that are localized to Korea and TKM,
including public health & medicine-related law and internal medicine (2). The
model's accuracy was lower for questions requiring TKM-specialized
knowledge. It exhibited higher accuracy in diagnosis-based and recall-based
questions than in intervention-based questions. A positive correlation was
observed between the consistency and accuracy of GPT-4's responses. This study
unveils both the potential and challenges of applying LLMs to TKM. These
findings underline the potential of LLMs like GPT-4 in culturally adapted
medicine, especially TKM, for tasks such as clinical assistance, medical
education, and research. However, they also point to the need to develop
methods that mitigate the cultural bias inherent in large language models and
to validate their efficacy in real-world clinical settings.
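The two mechanics the abstract relies on, self-consistency (sampling several answers per question and majority-voting) and the K-NLEKMD pass rule (at least 60% overall and 40% in every subject), can be sketched as follows. This is a minimal illustration, not the authors' code; the equal-weight overall average is a simplifying assumption, since the real exam weights subjects by question count.

```python
from collections import Counter

def self_consistency_answer(sample_answers):
    """Majority vote over several sampled answers to one question.

    Returns the winning answer and the fraction of samples that agree
    with it -- the 'consistency' the paper correlates with accuracy.
    """
    counts = Counter(sample_answers)
    answer, votes = counts.most_common(1)[0]
    consistency = votes / len(sample_answers)
    return answer, consistency

def passes_exam(subject_scores, overall_cutoff=0.60, subject_cutoff=0.40):
    """Pass rule described in the abstract: >= 60% on average and
    >= 40% in each subject. subject_scores maps subject -> accuracy.
    (Equal subject weighting is assumed here for simplicity.)
    """
    overall = sum(subject_scores.values()) / len(subject_scores)
    return overall >= overall_cutoff and all(
        score >= subject_cutoff for score in subject_scores.values()
    )

# Example: five sampled answers to one multiple-choice question.
best, agreement = self_consistency_answer(["B", "B", "A", "B", "C"])
```

Under this rule, a model at 66.18% overall passes only if no single subject falls below 40%, which is why the paper reports per-subject accuracy separately.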
Related papers
- Fact or Guesswork? Evaluating Large Language Models' Medical Knowledge with Structured One-Hop Judgments [108.55277188617035]
Large language models (LLMs) have been widely adopted in various downstream task domains, but their ability to directly recall and apply factual medical knowledge remains under-explored.
We introduce the Medical Knowledge Judgment dataset (MKJ), derived from the Unified Medical Language System (UMLS), a comprehensive repository of standardized vocabularies and knowledge graphs.
Through a binary classification framework, MKJ evaluates LLMs' grasp of fundamental medical facts by having them assess the validity of concise, one-hop statements.
arXiv Detail & Related papers (2025-02-20T05:27:51Z) - Polish-English medical knowledge transfer: A new benchmark and results [0.6804079979762627]
This study introduces a novel benchmark dataset based on Polish medical licensing and specialization exams.
It comprises over 24,000 exam questions, including a subset of parallel Polish-English corpora.
We evaluate state-of-the-art LLMs, including general-purpose, domain-specific, and Polish-specific models, and compare their performance against human medical students.
arXiv Detail & Related papers (2024-11-30T19:02:34Z) - GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI [67.09501109871351]
Large Vision-Language Models (LVLMs) are capable of handling diverse data types such as imaging, text, and physiological signals.
GMAI-MMBench is the most comprehensive general medical AI benchmark with well-categorized data structure and multi-perceptual granularity to date.
It is constructed from 284 datasets across 38 medical image modalities, 18 clinical-related tasks, 18 departments, and 4 perceptual granularities in a Visual Question Answering (VQA) format.
arXiv Detail & Related papers (2024-08-06T17:59:21Z) - Are Large Language Models True Healthcare Jacks-of-All-Trades? Benchmarking Across Health Professions Beyond Physician Exams [32.77551245372691]
Existing benchmarks for evaluating Large Language Models (LLMs) in healthcare predominantly focus on medical doctors.
We introduce the Examinations for Medical Personnel in Chinese (EMPEC), a pioneering large-scale healthcare knowledge benchmark in traditional Chinese.
EMPEC consists of 157,803 exam questions across 124 subjects and 20 healthcare professions, including underrepresented occupations like Optometrists and Audiologists.
arXiv Detail & Related papers (2024-06-17T08:40:36Z) - Pattern Recognition or Medical Knowledge? The Problem with Multiple-Choice Questions in Medicine [3.471944921180245]
Large Language Models (LLMs) demonstrate significant potential in the medical domain.
They are often evaluated using multiple-choice questions (MCQs) modeled on exams like the USMLE.
We created a fictional medical benchmark centered on an imaginary organ, the Glianorex, allowing us to separate memorized knowledge from reasoning ability.
arXiv Detail & Related papers (2024-06-04T15:08:56Z) - KorMedMCQA: Multi-Choice Question Answering Benchmark for Korean Healthcare Professional Licensing Examinations [7.8387874506025215]
We present KorMedMCQA, the first Korean Medical Multiple-Choice Question Answering benchmark.
The dataset contains 7,469 questions from examinations for doctor, nurse, pharmacist, and dentist.
arXiv Detail & Related papers (2024-03-03T10:31:49Z) - MedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large
Language Models [56.36916128631784]
We introduce MedBench, a comprehensive benchmark for the Chinese medical domain.
This benchmark is composed of key components including the Chinese Medical Licensing Examination, the Resident Standardization Training Examination, and real-world clinic cases.
We perform extensive experiments and conduct an in-depth analysis from diverse perspectives, which culminate in the following findings.
arXiv Detail & Related papers (2023-12-20T07:01:49Z) - Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case
Study in Medicine [89.46836590149883]
We build on a prior study of GPT-4's capabilities on medical challenge benchmarks in the absence of special training.
We find that prompting innovation can unlock deeper specialist capabilities and show that GPT-4 easily tops prior leading results for medical benchmarks.
With Medprompt, GPT-4 achieves state-of-the-art results on all nine of the benchmark datasets in the MultiMedQA suite.
arXiv Detail & Related papers (2023-11-28T03:16:12Z) - ChiMed-GPT: A Chinese Medical Large Language Model with Full Training Regime and Better Alignment to Human Preferences [51.66185471742271]
We propose ChiMed-GPT, a benchmark LLM designed explicitly for the Chinese medical domain.
ChiMed-GPT undergoes a comprehensive training regime with pre-training, SFT, and RLHF.
We analyze possible biases by prompting ChiMed-GPT to complete attitude scales regarding discrimination against patients.
arXiv Detail & Related papers (2023-11-10T12:25:32Z) - MedAlign: A Clinician-Generated Dataset for Instruction Following with
Electronic Medical Records [60.35217378132709]
Large language models (LLMs) can follow natural language instructions with human-level fluency.
However, evaluating LLMs on realistic text generation tasks for healthcare remains challenging.
We introduce MedAlign, a benchmark dataset of 983 natural language instructions for EHR data.
arXiv Detail & Related papers (2023-08-27T12:24:39Z) - Benchmarking Large Language Models on CMExam -- A Comprehensive Chinese
Medical Exam Dataset [31.047827145874844]
We introduce CMExam, sourced from the Chinese National Medical Licensing Examination.
CMExam consists of 60K+ multiple-choice questions for standardized and objective evaluations, as well as solution explanations for model reasoning evaluation in an open-ended manner.
For in-depth analyses of LLMs, we invited medical professionals to label five additional question-wise annotations, including disease groups, clinical departments, medical disciplines, areas of competency, and question difficulty levels.
arXiv Detail & Related papers (2023-06-05T16:48:41Z) - Large Language Models Leverage External Knowledge to Extend Clinical
Insight Beyond Language Boundaries [48.48630043740588]
Large Language Models (LLMs) such as ChatGPT and Med-PaLM have excelled in various medical question-answering tasks.
We develop a novel in-context learning framework to enhance their performance.
arXiv Detail & Related papers (2023-05-17T12:31:26Z) - Benchmarking ChatGPT-4 on ACR Radiation Oncology In-Training (TXIT) Exam
and Red Journal Gray Zone Cases: Potentials and Challenges for AI-Assisted
Medical Education and Decision Making in Radiation Oncology [7.094683738932199]
We evaluate the performance of ChatGPT-4 in radiation oncology using the 38th American College of Radiology (ACR) radiation oncology in-training (TXIT) exam and the 2022 Red Journal Gray Zone cases.
For the TXIT exam, ChatGPT-3.5 and ChatGPT-4 have achieved the scores of 63.65% and 74.57%, respectively.
ChatGPT-4 performs better in diagnosis, prognosis, and toxicity than in brachytherapy and dosimetry.
arXiv Detail & Related papers (2023-04-24T09:50:39Z) - Capabilities of GPT-4 on Medical Challenge Problems [23.399857819743158]
GPT-4 is a general-purpose model that is neither specialized for medical problems through training nor engineered to solve clinical tasks.
We present a comprehensive evaluation of GPT-4 on medical competency examinations and benchmark datasets.
arXiv Detail & Related papers (2023-03-20T16:18:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides (including all listed content) and is not responsible for any consequences of its use.