Fine-tuning Large Language Model (LLM) Artificial Intelligence Chatbots
in Ophthalmology and LLM-based evaluation using GPT-4
- URL: http://arxiv.org/abs/2402.10083v1
- Date: Thu, 15 Feb 2024 16:43:41 GMT
- Title: Fine-tuning Large Language Model (LLM) Artificial Intelligence Chatbots
in Ophthalmology and LLM-based evaluation using GPT-4
- Authors: Ting Fang Tan, Kabilan Elangovan, Liyuan Jin, Yao Jie, Li Yong, Joshua
Lim, Stanley Poh, Wei Yan Ng, Daniel Lim, Yuhe Ke, Nan Liu, Daniel Shu Wei
Ting
- Abstract summary: 400 ophthalmology questions and paired answers were created by ophthalmologists to represent commonly asked patient questions.
We fine-tuned 5 different LLMs, including LLAMA2-7b, LLAMA2-7b-Chat, LLAMA2-13b, and LLAMA2-13b-Chat.
A customized clinical evaluation was used to guide GPT-4 evaluation, grounded on clinical accuracy, relevance, patient safety, and ease of understanding.
- Score: 2.3715885775680925
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Purpose: To assess the alignment of GPT-4-based evaluation with human clinician
experts, for the evaluation of responses to ophthalmology-related patient
queries generated by fine-tuned LLM chatbots. Methods: 400 ophthalmology
questions and paired answers were created by ophthalmologists to represent
commonly asked patient questions, divided into fine-tuning (368; 92%), and
testing (40; 8%). We fine-tuned 5 different LLMs, including LLAMA2-7b,
LLAMA2-7b-Chat, LLAMA2-13b, and LLAMA2-13b-Chat. For the testing dataset, an
additional 8 glaucoma QnA pairs were included. 200 responses to the testing
dataset were generated by 5 fine-tuned LLMs for evaluation. A customized
clinical evaluation rubric was used to guide GPT-4 evaluation, grounded on
clinical accuracy, relevance, patient safety, and ease of understanding. GPT-4
evaluation was then compared against ranking by 5 clinicians for clinical
alignment. Results: Among all fine-tuned LLMs, GPT-3.5 scored the highest
(87.1%), followed by LLAMA2-13b (80.9%), LLAMA2-13b-Chat (75.5%),
LLAMA2-7b-Chat (70%) and LLAMA2-7b (68.8%) based on the GPT-4 evaluation. GPT-4
evaluation demonstrated significant agreement with human clinician rankings,
with Spearman and Kendall Tau correlation coefficients of 0.90 and 0.80,
respectively, while agreement based on Cohen's Kappa was more modest at 0.50.
Notably, qualitative analysis and the glaucoma sub-analysis revealed clinical
inaccuracies in the LLM-generated responses, which were appropriately
identified by the GPT-4 evaluation. Conclusion: The notable clinical alignment
of GPT-4 evaluation highlighted its potential to streamline the clinical
evaluation of LLM chatbot responses to healthcare-related queries. By
complementing the existing clinician-dependent manual grading, this efficient
and automated evaluation could assist the validation of future developments in
LLM applications for healthcare.
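
For concreteness, the fine-tuning step might look like the following Hugging Face sketch. The paper does not publish its training recipe, so the model ID, LoRA hyperparameters, prompt template, and training arguments here are all assumptions, not the authors' setup.

```python
# A minimal supervised fine-tuning sketch (transformers + PEFT/LoRA);
# every hyperparameter below is an illustrative assumption.
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

MODEL = "meta-llama/Llama-2-7b-chat-hf"  # one of the five models evaluated
tok = AutoTokenizer.from_pretrained(MODEL)
tok.pad_token = tok.eos_token  # Llama 2 ships without a pad token
model = get_peft_model(
    AutoModelForCausalLM.from_pretrained(MODEL),
    LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),
)

def to_features(ex):
    # One causal-LM training string per ophthalmologist-written QnA pair.
    text = f"Question: {ex['q']}\nAnswer: {ex['a']}{tok.eos_token}"
    return tok(text, truncation=True, max_length=512)

# The 368-pair fine-tuning split would be loaded here; one toy row shown.
train = Dataset.from_list(
    [{"q": "What is glaucoma?", "a": "Glaucoma is ..."}]
).map(to_features, remove_columns=["q", "a"])

Trainer(
    model=model,
    args=TrainingArguments(output_dir="ft-out", num_train_epochs=3,
                           per_device_train_batch_size=2),
    train_dataset=train,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
).train()
```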
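The rubric-guided GPT-4 grading could be wired up roughly as below. This is a sketch, not the authors' actual prompt: the rubric wording, the 0-10 scale, and the JSON output contract are assumptions layered on the four criteria the abstract names (clinical accuracy, relevance, patient safety, ease of understanding).

```python
# Illustrative rubric-guided grading; the real rubric text is not public.
import json
from openai import OpenAI  # official openai-python >= 1.0 client

client = OpenAI()  # expects OPENAI_API_KEY in the environment

RUBRIC = (
    "You are grading an ophthalmology chatbot. Score the ANSWER to the "
    "PATIENT QUESTION on four criteria, each 0-10: clinical_accuracy, "
    "relevance, patient_safety, ease_of_understanding. Reply with only a "
    "JSON object containing those four keys and a short 'rationale'."
)

def gpt4_grade(question: str, answer: str) -> dict:
    """Ask GPT-4 to grade one fine-tuned-LLM response against the rubric."""
    resp = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # deterministic grading across the test responses
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user",
             "content": f"PATIENT QUESTION:\n{question}\n\nANSWER:\n{answer}"},
        ],
    )
    # Naive parse; production code would validate or retry on malformed JSON.
    return json.loads(resp.choices[0].message.content)
```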
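The reported agreement statistics are standard rank measures. The sketch below uses hypothetical rank vectors in which a single adjacent swap among the five models happens to reproduce the reported values exactly; it also shows why Cohen's Kappa, which credits only exact matches, lands lower than the two rank correlations.

```python
# Hypothetical rank vectors (1 = best) over the five fine-tuned models;
# the actual per-model ranks are not published.
from scipy.stats import spearmanr, kendalltau
from sklearn.metrics import cohen_kappa_score

gpt4_rank      = [1, 2, 3, 4, 5]  # GPT-4's ordering (placeholder)
clinician_rank = [1, 2, 4, 3, 5]  # pooled clinician ordering (placeholder)

rho, _ = spearmanr(gpt4_rank, clinician_rank)   # -> 0.90
tau, _ = kendalltau(gpt4_rank, clinician_rank)  # -> 0.80
# Kappa treats ranks as categorical labels and only credits exact matches,
# one reason it comes out lower than the rank correlations.
kappa = cohen_kappa_score(gpt4_rank, clinician_rank)  # -> 0.50
print(f"Spearman rho={rho:.2f}  Kendall tau={tau:.2f}  Kappa={kappa:.2f}")
```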
Related papers
- Retrieval Augmented Generation for 10 Large Language Models and its Generalizability in Assessing Medical Fitness [4.118721833273984]
Large Language Models (LLMs) show potential for medical applications but often lack specialized clinical knowledge.
Retrieval Augmented Generation (RAG) allows customization with domain-specific information, making it suitable for healthcare.
This study evaluates the accuracy, consistency, and safety of RAG models in determining fitness for surgery and providing preoperative instructions.
arXiv Detail & Related papers (2024-10-11T00:34:20Z) - Language Enhanced Model for Eye (LEME): An Open-Source Ophthalmology-Specific Large Language Model [25.384237687766024]
We introduce an open-source, specialized LLM for ophthalmology, termed Language Enhanced Model for Eye (LEME).
LEME was initially pre-trained on the Llama2 70B framework and further fine-tuned with a corpus of 127,000 non-copyrighted training instances.
We benchmarked LEME against eight other LLMs, namely, GPT-3.5, GPT-4, three Llama2 models (7B, 13B, 70B), PMC-LLAMA 13B, Meditron 70B, and EYE-Llama.
arXiv Detail & Related papers (2024-10-01T02:43:54Z) - Evaluating the Impact of a Specialized LLM on Physician Experience in Clinical Decision Support: A Comparison of Ask Avo and ChatGPT-4 [0.3999851878220878]
The use of large language models (LLMs) to augment clinical decision support systems is a topic of growing interest.
Current shortcomings, such as hallucinations and a lack of clear source citations, make them unreliable for use in a rapidly growing clinical environment.
This study evaluates Ask Avo-derived software by AvoMD that incorporates a proprietary Model Augmented Language Retrieval system.
arXiv Detail & Related papers (2024-09-06T17:53:29Z) - GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI [67.09501109871351]
Large Vision-Language Models (LVLMs) are capable of handling diverse data types such as imaging, text, and physiological signals.
GMAI-MMBench is the most comprehensive general medical AI benchmark with well-categorized data structure and multi-perceptual granularity to date.
It is constructed from 284 datasets across 38 medical image modalities, 18 clinical-related tasks, 18 departments, and 4 perceptual granularities in a Visual Question Answering (VQA) format.
arXiv Detail & Related papers (2024-08-06T17:59:21Z) - Quality of Answers of Generative Large Language Models vs Peer Patients
for Interpreting Lab Test Results for Lay Patients: Evaluation Study [5.823006266363981]
Large language models (LLMs) have opened a promising avenue for patients to get their questions answered.
We generated responses to 53 questions from four LLMs including GPT-4, Meta LLaMA 2, MedAlpaca, and ORCA_mini.
We find that GPT-4's responses are more accurate, helpful, relevant, and safe.
arXiv Detail & Related papers (2024-01-23T22:03:51Z) - A Systematic Evaluation of GPT-4V's Multimodal Capability for Medical
Image Analysis [87.25494411021066]
GPT-4V's multimodal capability for medical image analysis is evaluated.
It is found that GPT-4V excels in understanding medical images and generates high-quality radiology reports.
It is found that its performance for medical visual grounding needs to be substantially improved.
arXiv Detail & Related papers (2023-10-31T11:39:09Z) - Prometheus: Inducing Fine-grained Evaluation Capability in Language
Models [66.12432440863816]
We propose Prometheus, a fully open-source Large Language Model (LLM) that is on par with GPT-4's evaluation capabilities.
Prometheus scores a Pearson correlation of 0.897 with human evaluators when evaluating with 45 customized score rubrics.
Prometheus achieves the highest accuracy on two human preference benchmarks.
arXiv Detail & Related papers (2023-10-12T16:50:08Z) - MedAlign: A Clinician-Generated Dataset for Instruction Following with
Electronic Medical Records [60.35217378132709]
Large language models (LLMs) can follow natural language instructions with human-level fluency.
However, evaluating LLMs on realistic text generation tasks for healthcare remains challenging.
We introduce MedAlign, a benchmark dataset of 983 natural language instructions for EHR data.
arXiv Detail & Related papers (2023-08-27T12:24:39Z) - Large Language Models Leverage External Knowledge to Extend Clinical
Insight Beyond Language Boundaries [48.48630043740588]
Large Language Models (LLMs) such as ChatGPT and Med-PaLM have excelled in various medical question-answering tasks.
We develop a novel in-context learning framework to enhance their performance.
arXiv Detail & Related papers (2023-05-17T12:31:26Z) - Improving Patient Pre-screening for Clinical Trials: Assisting
Physicians with Large Language Models [0.0]
Large Language Models (LLMs) have shown to perform well for clinical information extraction and clinical reasoning.
This paper investigates the use of InstructGPT to assist physicians in determining eligibility for clinical trials based on a patient's summarised medical profile.
arXiv Detail & Related papers (2023-04-14T21:19:46Z) - Human Evaluation and Correlation with Automatic Metrics in Consultation
Note Generation [56.25869366777579]
In recent years, machine learning models have rapidly become better at generating clinical consultation notes.
We present an extensive human evaluation study where 5 clinicians listen to 57 mock consultations, write their own notes, post-edit a number of automatically generated notes, and extract all the errors.
We find that a simple, character-based Levenshtein distance metric performs on par with, if not better than, common model-based metrics like BertScore.
arXiv Detail & Related papers (2022-04-01T14:04:16Z)