Evaluating multiple large language models in pediatric ophthalmology
- URL: http://arxiv.org/abs/2311.04368v1
- Date: Tue, 7 Nov 2023 22:23:51 GMT
- Title: Evaluating multiple large language models in pediatric ophthalmology
- Authors: Jason Holmes, Rui Peng, Yiwei Li, Jinyu Hu, Zhengliang Liu, Zihao Wu,
Huan Zhao, Xi Jiang, Wei Liu, Hong Wei, Jie Zou, Tianming Liu, Yi Shao
- Abstract summary: The response effectiveness of different large language models (LLMs) and various individuals in pediatric ophthalmology consultations has not been clearly established yet.
This survey evaluated the performance of LLMs in highly specialized scenarios and compare them with the performance of medical students and physicians at different levels.
- Score: 37.16480878552708
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: IMPORTANCE The response effectiveness of different large language models
(LLMs) and various individuals, including medical students, graduate students,
and practicing physicians, in pediatric ophthalmology consultations, has not
been clearly established yet. OBJECTIVE Design a 100-question exam based on
pediatric ophthalmology to evaluate the performance of LLMs in highly
specialized scenarios and compare them with the performance of medical students
and physicians at different levels. DESIGN, SETTING, AND PARTICIPANTS This
survey study assessed three LLMs, namely ChatGPT (GPT-3.5), GPT-4, and PaLM2,
were assessed alongside three human cohorts: medical students, postgraduate
students, and attending physicians, in their ability to answer questions
related to pediatric ophthalmology. It was conducted by administering
questionnaires in the form of test papers through the LLM network interface,
with the valuable participation of volunteers. MAIN OUTCOMES AND MEASURES Mean
scores of LLM and humans on 100 multiple-choice questions, as well as the
answer stability, correlation, and response confidence of each LLM. RESULTS
GPT-4 performed comparably to attending physicians, while ChatGPT (GPT-3.5) and
PaLM2 outperformed medical students but slightly trailed behind postgraduate
students. Furthermore, GPT-4 exhibited greater stability and confidence when
responding to inquiries compared to ChatGPT (GPT-3.5) and PaLM2. CONCLUSIONS
AND RELEVANCE Our results underscore the potential for LLMs to provide medical
assistance in pediatric ophthalmology and suggest significant capacity to guide
the education of medical students.
Related papers
- Quality of Answers of Generative Large Language Models vs Peer Patients
for Interpreting Lab Test Results for Lay Patients: Evaluation Study [5.823006266363981]
Large language models (LLMs) have opened a promising avenue for patients to get their questions answered.
We generated responses to 53 questions from four LLMs including GPT-4, Meta LLaMA 2, MedAlpaca, and ORCA_mini.
We find that GPT-4's responses are more accurate, helpful, relevant, and safer.
arXiv Detail & Related papers (2024-01-23T22:03:51Z) - MedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large
Language Models [56.36916128631784]
We introduce MedBench, a comprehensive benchmark for the Chinese medical domain.
This benchmark is composed of four key components: the Chinese Medical Licensing Examination, the Resident Standardization Training Examination, and real-world clinic cases.
We perform extensive experiments and conduct an in-depth analysis from diverse perspectives, which culminate in the following findings.
arXiv Detail & Related papers (2023-12-20T07:01:49Z) - A Survey of Large Language Models in Medicine: Progress, Application, and Challenge [85.09998659355038]
Large language models (LLMs) have received substantial attention due to their capabilities for understanding and generating human language.
This review aims to provide a detailed overview of the development and deployment of LLMs in medicine.
arXiv Detail & Related papers (2023-11-09T02:55:58Z) - Evaluating Large Language Models in Ophthalmology [34.13457684015814]
The performance of three different large language models (LLMS) in answering ophthalmology professional questions was evaluated.
GPT-4 showed significantly higher answer stability and confidence than GPT-3.5 and PaLM2.
arXiv Detail & Related papers (2023-11-07T16:19:45Z) - Integrating UMLS Knowledge into Large Language Models for Medical
Question Answering [18.06960842747575]
Large language models (LLMs) have demonstrated powerful text generation capabilities, bringing unprecedented innovation to the healthcare field.
We develop an augmented LLM framework based on the Unified Medical Language System (UMLS), aiming to better serve the healthcare community.
We employ LLaMa2-13b-chat and ChatGPT-3.5 as our benchmark models, and conduct automatic evaluations using the ROUGE Score and BERTScore on 104 questions from the LiveQA test set.
arXiv Detail & Related papers (2023-10-04T12:50:26Z) - Augmenting Black-box LLMs with Medical Textbooks for Clinical Question
Answering [54.13933019557655]
We present a system called LLMs Augmented with Medical Textbooks (LLM-AMT)
LLM-AMT integrates authoritative medical textbooks into the LLMs' framework using plug-and-play modules.
We found that medical textbooks as a retrieval corpus is proven to be a more effective knowledge database than Wikipedia in the medical domain.
arXiv Detail & Related papers (2023-09-05T13:39:38Z) - MedAlign: A Clinician-Generated Dataset for Instruction Following with
Electronic Medical Records [60.35217378132709]
Large language models (LLMs) can follow natural language instructions with human-level fluency.
evaluating LLMs on realistic text generation tasks for healthcare remains challenging.
We introduce MedAlign, a benchmark dataset of 983 natural language instructions for EHR data.
arXiv Detail & Related papers (2023-08-27T12:24:39Z) - Large Language Models Leverage External Knowledge to Extend Clinical
Insight Beyond Language Boundaries [48.48630043740588]
Large Language Models (LLMs) such as ChatGPT and Med-PaLM have excelled in various medical question-answering tasks.
We develop a novel in-context learning framework to enhance their performance.
arXiv Detail & Related papers (2023-05-17T12:31:26Z) - MedGPTEval: A Dataset and Benchmark to Evaluate Responses of Large
Language Models in Medicine [16.75133391080187]
A set of evaluation criteria is designed based on a comprehensive literature review.
Existing candidate criteria are optimized for using a Delphi method by five experts in medicine and engineering.
Three chatbots are evaluated, ChatGPT by OpenAI, ERNIE Bot by Baidu Inc., and Doctor PuJiang (Dr. PJ) by Shanghai Artificial Intelligence Laboratory.
arXiv Detail & Related papers (2023-05-12T09:37:13Z) - Capabilities of GPT-4 on Medical Challenge Problems [23.399857819743158]
GPT-4 is a general-purpose model that is not specialized for medical problems through training or to solve clinical tasks.
We present a comprehensive evaluation of GPT-4 on medical competency examinations and benchmark datasets.
arXiv Detail & Related papers (2023-03-20T16:18:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.