Evaluating Large Language Models in Ophthalmology
- URL: http://arxiv.org/abs/2311.04933v1
- Date: Tue, 7 Nov 2023 16:19:45 GMT
- Title: Evaluating Large Language Models in Ophthalmology
- Authors: Jason Holmes, Shuyuan Ye, Yiwei Li, Shi-Nan Wu, Zhengliang Liu, Zihao
Wu, Jinyu Hu, Huan Zhao, Xi Jiang, Wei Liu, Hong Wei, Jie Zou, Tianming Liu,
Yi Shao
- Abstract summary: The performance of three different large language models (LLMs) in answering ophthalmology professional questions was evaluated.
GPT-4 showed significantly higher answer stability and confidence than GPT-3.5 and PaLM2.
- Score: 34.13457684015814
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Purpose: The performance of three different large language models (LLMs)
(GPT-3.5, GPT-4, and PaLM2) in answering ophthalmology professional questions
was evaluated and compared with that of three different professional
populations (medical undergraduates, medical masters, and attending
physicians). Methods: A 100-item ophthalmology single-choice test was
administered to three different LLMs (GPT-3.5, GPT-4, and PaLM2) and three
different professional levels (medical undergraduates, medical masters, and
attending physicians), respectively. The performance of each LLM was comprehensively
evaluated and compared with that of the human groups in terms of average score,
stability, and confidence. Results: Each LLM outperformed undergraduates in
general, with GPT-3.5 and PaLM2 being slightly below the master's level, while
GPT-4 showed a level comparable to that of attending physicians. In addition,
GPT-4 showed significantly higher answer stability and confidence than GPT-3.5
and PaLM2. Conclusion: Our study shows that LLMs, represented by GPT-4, perform
well in the field of ophthalmology. With further improvements, LLMs may bring
substantial benefits to medical education and clinical decision making in the
near future.
Related papers
- GPT-4 passes most of the 297 written Polish Board Certification Examinations [0.5461938536945723]
This study evaluated the performance of three Generative Pretrained Transformer (GPT) models on the Polish Board Certification Exam (Państwowy Egzamin Specjalizacyjny, PES) dataset.
The GPT models varied significantly, displaying excellence in exams related to certain specialties while completely failing others.
arXiv Detail & Related papers (2024-04-29T09:08:22Z) - A Continued Pretrained LLM Approach for Automatic Medical Note Generation [10.981182525560751]
We introduce HEAL, the first continuously trained 13B LLaMA2-based LLM that is purpose-built for medical conversations and measured on automated scribing.
Our results demonstrate that HEAL outperforms GPT-4 and PMC-LLaMA in PubMedQA, with an accuracy of 78.4%.
Remarkably, HEAL surpasses GPT-4 and Med-PaLM 2 in identifying more correct medical concepts and exceeds the performance of human scribes in correctness and completeness.
arXiv Detail & Related papers (2024-03-14T02:55:37Z) - Asclepius: A Spectrum Evaluation Benchmark for Medical Multi-Modal Large
Language Models [59.60384461302662]
We introduce Asclepius, a novel benchmark for evaluating Medical Multi-Modal Large Language Models (Med-MLLMs).
Asclepius rigorously and comprehensively assesses model capability in terms of distinct medical specialties and different diagnostic capacities.
We also provide an in-depth analysis of 6 Med-MLLMs and compare them with 5 human specialists.
arXiv Detail & Related papers (2024-02-17T08:04:23Z) - MEDITRON-70B: Scaling Medical Pretraining for Large Language Models [91.25119823784705]
Large language models (LLMs) can potentially democratize access to medical knowledge.
We release MEDITRON: a suite of open-source LLMs with 7B and 70B parameters adapted to the medical domain.
arXiv Detail & Related papers (2023-11-27T18:49:43Z) - HuatuoGPT-II, One-stage Training for Medical Adaption of LLMs [62.73042700847977]
HuatuoGPT-II has shown state-of-the-art performance in Chinese medicine domain on a number of benchmarks.
It even outperforms proprietary models like ChatGPT and GPT-4 in some aspects, especially in Traditional Chinese Medicine.
arXiv Detail & Related papers (2023-11-16T10:56:24Z) - Evaluating multiple large language models in pediatric ophthalmology [37.16480878552708]
The response effectiveness of different large language models (LLMs) and various individuals in pediatric ophthalmology consultations has not been clearly established yet.
This survey evaluated the performance of LLMs in highly specialized scenarios and compared them with the performance of medical students and physicians at different levels.
arXiv Detail & Related papers (2023-11-07T22:23:51Z) - A Comparative Study of Open-Source Large Language Models, GPT-4 and
Claude 2: Multiple-Choice Test Taking in Nephrology [0.6213359027997152]
The study was conducted to evaluate the ability of LLMs to provide correct answers to NephSAP multiple-choice questions.
The findings of this study potentially have significant implications for future medical training and patient care.
arXiv Detail & Related papers (2023-08-09T05:01:28Z) - Large Language Models Leverage External Knowledge to Extend Clinical
Insight Beyond Language Boundaries [48.48630043740588]
Large Language Models (LLMs) such as ChatGPT and Med-PaLM have excelled in various medical question-answering tasks.
We develop a novel in-context learning framework to enhance their performance.
arXiv Detail & Related papers (2023-05-17T12:31:26Z) - Capabilities of GPT-4 on Medical Challenge Problems [23.399857819743158]
GPT-4 is a general-purpose model that is not specialized for medical problems through training or engineered to solve clinical tasks.
We present a comprehensive evaluation of GPT-4 on medical competency examinations and benchmark datasets.
arXiv Detail & Related papers (2023-03-20T16:18:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.