Evaluating Large Language Models in Ophthalmology
- URL: http://arxiv.org/abs/2311.04933v1
- Date: Tue, 7 Nov 2023 16:19:45 GMT
- Title: Evaluating Large Language Models in Ophthalmology
- Authors: Jason Holmes, Shuyuan Ye, Yiwei Li, Shi-Nan Wu, Zhengliang Liu, Zihao
Wu, Jinyu Hu, Huan Zhao, Xi Jiang, Wei Liu, Hong Wei, Jie Zou, Tianming Liu,
Yi Shao
- Abstract summary: The performance of three different large language models (LLMs) in answering ophthalmology professional questions was evaluated.
GPT-4 showed significantly higher answer stability and confidence than GPT-3.5 and PaLM2.
- Score: 34.13457684015814
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Purpose: The performance of three different large language models (LLMs)
(GPT-3.5, GPT-4, and PaLM2) in answering ophthalmology professional questions
was evaluated and compared with that of three different professional
populations (medical undergraduates, medical masters, and attending
physicians). Methods: A 100-item ophthalmology single-choice test was
administered to three different LLMs (GPT-3.5, GPT-4, and PaLM2) and three
different professional levels (medical undergraduates, medical masters, and
attending physicians), respectively. The performance of LLM was comprehensively
evaluated and compared with the human group in terms of average score,
stability, and confidence. Results: Each LLM outperformed undergraduates in
general, with GPT-3.5 and PaLM2 being slightly below the master's level, while
GPT-4 showed a level comparable to that of attending physicians. In addition,
GPT-4 showed significantly higher answer stability and confidence than GPT-3.5
and PaLM2. Conclusion: Our study shows that LLMs, as represented by GPT-4,
perform well in the field of ophthalmology. With further improvement, LLMs may
bring substantial benefits to medical education and clinical decision making in
the near future.
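As a rough illustration of the protocol above (a fixed single-choice test, repeated administrations, and comparison on average score, stability, and confidence), here is a minimal Python sketch. It is not the authors' code: the `ask_llm` helper, the prompt wording, the reply-parsing patterns, and the run count `N_RUNS` are all assumptions, since the abstract specifies none of them.

```python
import re

N_RUNS = 3  # number of repeated administrations (assumed; not stated above)


def ask_llm(model: str, prompt: str) -> str:
    """Placeholder for a chat-completion call; the study's actual interface
    is not specified, so wire this to a real provider before use."""
    raise NotImplementedError("connect an LLM provider here")


def parse_reply(reply: str) -> tuple[str, float | None]:
    """Extract the chosen option letter and a 0-100 self-rated confidence
    from a free-text reply. Both patterns are assumptions about format."""
    letter = re.search(r"\b([A-E])\b", reply)
    conf = re.search(r"\b(\d{1,3})\b", reply)
    return (
        letter.group(1) if letter else "",
        float(conf.group(1)) if conf else None,
    )


def administer_test(model: str, questions: list[dict]) -> dict:
    """Give the single-choice test N_RUNS times and summarize the metrics.

    Each question is assumed to look like:
      {"stem": "...", "options": {"A": "...", "B": "..."}, "answer": "B"}
    """
    runs: list[list[str]] = []
    confidences: list[float] = []
    for _ in range(N_RUNS):
        answers = []
        for q in questions:
            options = "\n".join(f"{k}. {v}" for k, v in q["options"].items())
            prompt = (
                f"{q['stem']}\n{options}\n"
                "Answer with a single option letter, then rate your "
                "confidence from 0 to 100."
            )
            letter, conf = parse_reply(ask_llm(model, prompt))
            answers.append(letter)
            if conf is not None:
                confidences.append(conf)
        runs.append(answers)

    # Average score: mean fraction of correct answers across the runs.
    avg_score = sum(
        sum(a == q["answer"] for a, q in zip(run, questions)) / len(questions)
        for run in runs
    ) / N_RUNS

    # Stability: share of questions answered identically on every run.
    stability = sum(len(set(col)) == 1 for col in zip(*runs)) / len(questions)

    # Confidence: mean of the model's self-reported ratings, if any parsed.
    avg_conf = sum(confidences) / len(confidences) if confidences else None

    return {"model": model, "avg_score": avg_score,
            "stability": stability, "avg_confidence": avg_conf}
```

Here "stability" is read as the share of items answered identically on every run, and "confidence" as the model's self-reported rating; the paper may operationalize both metrics differently.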
Related papers
- Evaluating the Impact of a Specialized LLM on Physician Experience in Clinical Decision Support: A Comparison of Ask Avo and ChatGPT-4 [0.3999851878220878]
Using large language models (LLMs) to augment clinical decision support systems is a topic of growing interest.
Current shortcomings, such as hallucinations and a lack of clear source citations, make them unreliable for use in a rapidly evolving clinical environment.
This study evaluates Ask Avo, software by AvoMD that incorporates a proprietary Model Augmented Language Retrieval system.
arXiv Detail & Related papers (2024-09-06T17:53:29Z) - Towards Evaluating and Building Versatile Large Language Models for Medicine [57.49547766838095]
We present MedS-Bench, a benchmark designed to evaluate the performance of large language models (LLMs) in clinical contexts.
MedS-Bench spans 11 high-level clinical tasks, including clinical report summarization, treatment recommendations, diagnosis, named entity recognition, and medical concept explanation.
A companion instruction-tuning dataset, MedS-Ins, comprises 58 medically oriented language corpora totaling 13.5 million samples across 122 tasks.
arXiv Detail & Related papers (2024-08-22T17:01:34Z) - A Continued Pretrained LLM Approach for Automatic Medical Note Generation [10.981182525560751]
We introduce HEAL, the first continuously trained 13B LLaMA2-based LLM that is purpose-built for medical conversations and measured on automated scribing.
Our results demonstrate that HEAL outperforms GPT-4 and PMC-LLaMA in PubMedQA, with an accuracy of 78.4%.
Remarkably, HEAL surpasses GPT-4 and Med-PaLM 2 in identifying more correct medical concepts and exceeds the performance of human scribes in correctness and completeness.
arXiv Detail & Related papers (2024-03-14T02:55:37Z) - MEDITRON-70B: Scaling Medical Pretraining for Large Language Models [91.25119823784705]
Large language models (LLMs) can potentially democratize access to medical knowledge.
We release MEDITRON: a suite of open-source LLMs with 7B and 70B parameters adapted to the medical domain.
arXiv Detail & Related papers (2023-11-27T18:49:43Z) - Evaluating multiple large language models in pediatric ophthalmology [37.16480878552708]
The response effectiveness of different large language models (LLMs) and various individuals in pediatric ophthalmology consultations has not yet been clearly established.
This survey evaluated the performance of LLMs in highly specialized scenarios and compared it with that of medical students and physicians at different levels.
arXiv Detail & Related papers (2023-11-07T22:23:51Z) - A Comparative Study of Open-Source Large Language Models, GPT-4 and
Claude 2: Multiple-Choice Test Taking in Nephrology [0.6213359027997152]
The study was conducted to evaluate the ability of LLMs to provide correct answers to nephSAP multiple-choice questions.
The findings of this study potentially have significant implications for future medical training and patient care.
arXiv Detail & Related papers (2023-08-09T05:01:28Z) - Revisiting the Reliability of Psychological Scales on Large Language Models [62.57981196992073]
This study aims to determine the reliability of applying personality assessments to Large Language Models.
Analysis of 2,500 settings per model, including GPT-3.5, GPT-4, Gemini-Pro, and LLaMA-3.1, reveals that various LLMs show consistency in responses to the Big Five Inventory.
arXiv Detail & Related papers (2023-05-31T15:03:28Z) - Large Language Models Leverage External Knowledge to Extend Clinical
Insight Beyond Language Boundaries [48.48630043740588]
Large Language Models (LLMs) such as ChatGPT and Med-PaLM have excelled in various medical question-answering tasks.
We develop a novel in-context learning framework to enhance their performance.
arXiv Detail & Related papers (2023-05-17T12:31:26Z) - Capabilities of GPT-4 on Medical Challenge Problems [23.399857819743158]
GPT-4 is a general-purpose model that is not specialized for medical problems through training, nor engineered to solve clinical tasks.
We present a comprehensive evaluation of GPT-4 on medical competency examinations and benchmark datasets.
arXiv Detail & Related papers (2023-03-20T16:18:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.