Evaluating multiple large language models in pediatric ophthalmology
- URL: http://arxiv.org/abs/2311.04368v1
- Date: Tue, 7 Nov 2023 22:23:51 GMT
- Title: Evaluating multiple large language models in pediatric ophthalmology
- Authors: Jason Holmes, Rui Peng, Yiwei Li, Jinyu Hu, Zhengliang Liu, Zihao Wu,
Huan Zhao, Xi Jiang, Wei Liu, Hong Wei, Jie Zou, Tianming Liu, Yi Shao
- Abstract summary: The response effectiveness of different large language models (LLMs) and various individuals in pediatric ophthalmology consultations has not been clearly established yet.
This survey evaluated the performance of LLMs in highly specialized scenarios and compared it with the performance of medical students and physicians at different levels.
- Score: 37.16480878552708
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: IMPORTANCE The response effectiveness of different large language models
(LLMs) and various individuals, including medical students, graduate students,
and practicing physicians, in pediatric ophthalmology consultations has not
been clearly established yet. OBJECTIVE To design a 100-question exam based on
pediatric ophthalmology to evaluate the performance of LLMs in highly
specialized scenarios and compare them with the performance of medical students
and physicians at different levels. DESIGN, SETTING, AND PARTICIPANTS This
survey study assessed three LLMs, namely ChatGPT (GPT-3.5), GPT-4, and PaLM2,
alongside three human cohorts: medical students, postgraduate students, and
attending physicians, in their ability to answer questions related to
pediatric ophthalmology. It was conducted by administering
questionnaires in the form of test papers through the LLM network interface,
with the valuable participation of volunteers. MAIN OUTCOMES AND MEASURES Mean
scores of LLM and humans on 100 multiple-choice questions, as well as the
answer stability, correlation, and response confidence of each LLM. RESULTS
GPT-4 performed comparably to attending physicians, while ChatGPT (GPT-3.5) and
PaLM2 outperformed medical students but slightly trailed behind postgraduate
students. Furthermore, GPT-4 exhibited greater stability and confidence when
responding to inquiries compared to ChatGPT (GPT-3.5) and PaLM2. CONCLUSIONS
AND RELEVANCE Our results underscore the potential for LLMs to provide medical
assistance in pediatric ophthalmology and suggest significant capacity to guide
the education of medical students.
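
For illustration only, a minimal Python sketch of how the reported outcome measures could be computed: the mean score on a multiple-choice exam and an answer-stability estimate from repeated administrations of the same questions. The answer key, item numbers, and stability definition below are hypothetical placeholders, not the authors' protocol.

```python
# Illustrative sketch (not the authors' code): scoring a multiple-choice exam
# and estimating answer stability across repeated runs of the same model.
from collections import Counter

ANSWER_KEY = {1: "B", 2: "D"}  # hypothetical key: item number -> correct choice

def mean_score(responses: dict[int, str]) -> float:
    """Fraction of items answered correctly in a single administration."""
    correct = sum(responses.get(q) == a for q, a in ANSWER_KEY.items())
    return correct / len(ANSWER_KEY)

def answer_stability(runs: list[dict[int, str]]) -> float:
    """Average per-item agreement with the modal answer across repeated runs."""
    per_item = []
    for q in ANSWER_KEY:
        answers = [run.get(q) for run in runs]
        modal_count = Counter(answers).most_common(1)[0][1]
        per_item.append(modal_count / len(runs))
    return sum(per_item) / len(per_item)

# Example: two hypothetical administrations of the same exam to one model
runs = [{1: "B", 2: "C"}, {1: "B", 2: "D"}]
print(mean_score(runs[0]), answer_stability(runs))
```

Per-LLM response confidence, also reported as an outcome, could be tallied analogously from confidence labels collected alongside each answer, although the paper's exact procedure is not shown here.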
Related papers
- The Potential of LLMs in Medical Education: Generating Questions and Answers for Qualification Exams [9.802579169561781]
Large language models (LLMs) can generate medical qualification exam questions and corresponding answers based on few-shot prompts.
The study found that LLMs with few-shot prompting can effectively mimic real-world medical qualification exam questions.
arXiv Detail & Related papers (2024-10-31T09:33:37Z)
- CliMedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models in Clinical Scenarios [50.032101237019205]
CliMedBench is a comprehensive benchmark with 14 expert-guided core clinical scenarios.
The reliability of this benchmark has been confirmed in several ways.
arXiv Detail & Related papers (2024-10-04T15:15:36Z)
- RuleAlign: Making Large Language Models Better Physicians with Diagnostic Rule Alignment [54.91736546490813]
We introduce the RuleAlign framework, designed to align Large Language Models with specific diagnostic rules.
We develop a medical dialogue dataset comprising rule-based communications between patients and physicians.
Experimental results demonstrate the effectiveness of the proposed approach.
arXiv Detail & Related papers (2024-08-22T17:44:40Z)
- Towards Evaluating and Building Versatile Large Language Models for Medicine [57.49547766838095]
We present MedS-Bench, a benchmark designed to evaluate the performance of large language models (LLMs) in clinical contexts.
MedS-Bench spans 11 high-level clinical tasks, including clinical report summarization, treatment recommendations, diagnosis, named entity recognition, and medical concept explanation.
MedS-Ins comprises 58 medically oriented language corpora, totaling 13.5 million samples across 122 tasks.
arXiv Detail & Related papers (2024-08-22T17:01:34Z)
- Specialized curricula for training vision-language models in retinal image analysis [8.167708226285932]
Vision-language models (VLMs) automatically interpret images and summarize their findings as text.
In this work, we demonstrate that OpenAI's ChatGPT-4o model markedly underperforms compared to practicing ophthalmologists on specialist tasks.
arXiv Detail & Related papers (2024-07-11T11:31:48Z)
- Quality of Answers of Generative Large Language Models vs Peer Patients for Interpreting Lab Test Results for Lay Patients: Evaluation Study [5.823006266363981]
Large language models (LLMs) have opened a promising avenue for patients to get their questions answered.
We generated responses to 53 questions from four LLMs including GPT-4, Meta LLaMA 2, MedAlpaca, and ORCA_mini.
We find that GPT-4's responses are more accurate, helpful, relevant, and safer.
arXiv Detail & Related papers (2024-01-23T22:03:51Z)
- MedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models [56.36916128631784]
We introduce MedBench, a comprehensive benchmark for the Chinese medical domain.
This benchmark is composed of four key components, including the Chinese Medical Licensing Examination, the Resident Standardization Training Examination, and real-world clinic cases.
We perform extensive experiments and conduct an in-depth analysis from diverse perspectives, which culminate in the following findings.
arXiv Detail & Related papers (2023-12-20T07:01:49Z)
- Evaluating Large Language Models in Ophthalmology [34.13457684015814]
The performance of three different large language models (LLMs) in answering ophthalmology professional questions was evaluated.
GPT-4 showed significantly higher answer stability and confidence than GPT-3.5 and PaLM2.
arXiv Detail & Related papers (2023-11-07T16:19:45Z)
- Integrating UMLS Knowledge into Large Language Models for Medical Question Answering [18.06960842747575]
Large language models (LLMs) have demonstrated powerful text generation capabilities, bringing unprecedented innovation to the healthcare field.
We develop an augmented LLM framework based on the Unified Medical Language System (UMLS), aiming to better serve the healthcare community.
We employ LLaMa2-13b-chat and ChatGPT-3.5 as our benchmark models, and conduct automatic evaluations using the ROUGE Score and BERTScore on 104 questions from the LiveQA test set (a minimal sketch of this style of scoring appears after this list).
arXiv Detail & Related papers (2023-10-04T12:50:26Z)
- Augmenting Black-box LLMs with Medical Textbooks for Biomedical Question Answering (Published in Findings of EMNLP 2024) [48.17095875619711]
We present a system called LLMs Augmented with Medical Textbooks (LLM-AMT).
LLM-AMT integrates authoritative medical textbooks into the LLMs' framework using plug-and-play modules.
We find that medical textbooks serve as a more effective retrieval corpus than Wikipedia in the medical domain.
arXiv Detail & Related papers (2023-09-05T13:39:38Z)
- Capabilities of GPT-4 on Medical Challenge Problems [23.399857819743158]
GPT-4 is a general-purpose model that is not specialized for medical problems through training or engineered to solve clinical tasks.
We present a comprehensive evaluation of GPT-4 on medical competency examinations and benchmark datasets.
arXiv Detail & Related papers (2023-03-20T16:18:38Z)
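
As a rough illustration of the ROUGE/BERTScore evaluation mentioned in the UMLS-augmentation entry above, the following sketch scores model answers against reference answers. The specific libraries (rouge-score, bert-score) and the toy reference/candidate pair are assumptions made for illustration, not the authors' actual pipeline.

```python
# Minimal sketch of automatic answer scoring with ROUGE-L and BERTScore.
# The example texts are invented; a real run would iterate over the full
# question set with one reference and one model answer per question.
from rouge_score import rouge_scorer
from bert_score import score as bert_score

references = ["Iron-deficiency anemia is treated with oral iron supplements."]
candidates = ["Oral iron supplementation is the usual treatment for iron-deficiency anemia."]

# ROUGE-L F1 per question (lexical overlap)
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = [scorer.score(ref, cand)["rougeL"].fmeasure
           for ref, cand in zip(references, candidates)]

# BERTScore F1 per question (semantic similarity from contextual embeddings)
_, _, f1 = bert_score(candidates, references, lang="en")

print(f"mean ROUGE-L F1: {sum(rouge_l) / len(rouge_l):.3f}")
print(f"mean BERTScore F1: {f1.mean().item():.3f}")
```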