Evaluation of GPT-3.5 and GPT-4 for supporting real-world information
needs in healthcare delivery
- URL: http://arxiv.org/abs/2304.13714v3
- Date: Mon, 1 May 2023 00:41:37 GMT
- Title: Evaluation of GPT-3.5 and GPT-4 for supporting real-world information
needs in healthcare delivery
- Authors: Debadutta Dash, Rahul Thapa, Juan M. Banda, Akshay Swaminathan, Morgan
Cheatham, Mehr Kashyap, Nikesh Kotecha, Jonathan H. Chen, Saurabh Gombar,
Lance Downing, Rachel Pedreira, Ethan Goh, Angel Arnaout, Garret Kenn Morris,
Honor Magon, Matthew P Lungren, Eric Horvitz, Nigam H. Shah
- Abstract summary: Our objective was to determine whether two large language models (LLMs) can serve information needs submitted by physicians as questions to an informatics consultation service in a safe and concordant manner.
For GPT-3.5, responses to 8 questions were concordant with the informatics consult report, 20 discordant, and 9 were unable to be assessed.
Fewer than 20% of the responses agreed with an answer from an informatics consultation service, some responses contained hallucinated references, and physicians were divided on what constitutes harm.
- Score: 17.47170218010073
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite growing interest in using large language models (LLMs) in healthcare,
current explorations do not assess the real-world utility and safety of LLMs in
clinical settings. Our objective was to determine whether two LLMs can serve
information needs submitted by physicians as questions to an informatics
consultation service in a safe and concordant manner. Sixty-six questions from
an informatics consult service were submitted to GPT-3.5 and GPT-4 via simple
prompts. Twelve physicians assessed each LLM response for potential patient harm
and for concordance with existing reports from the informatics consultation
service. Physician assessments were summarized by majority vote. For no question
did a majority of physicians deem either LLM's response harmful. For GPT-3.5,
responses to 8 questions were concordant with the informatics consult report,
20 discordant, and 9 were unable to be assessed. There were 29 responses with
no majority on "Agree", "Disagree", and "Unable to assess". For GPT-4,
responses to 13 questions were concordant, 15 discordant, and 3 were unable to
be assessed. There were 35 responses with no majority. Responses from both LLMs
were largely devoid of overt harm, but fewer than 20% of the responses agreed
with an answer from the informatics consultation service, some responses
contained hallucinated references, and physicians were divided on what
constitutes harm.
These results suggest that while general-purpose LLMs are able to provide safe
and credible responses, they often do not meet the specific information need of
a given question. A definitive evaluation of the usefulness of LLMs in
healthcare settings will likely require additional research on prompt
engineering, calibration, and custom-tailoring of general-purpose models.
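The majority-vote summarization described above is simple to make concrete. The following is a minimal Python sketch, not the authors' code: the label strings, the `majority_vote` helper, and the strict-majority rule (a label must receive more than half of the 12 physician votes) are assumptions inferred from the abstract.

```python
from collections import Counter

# Hypothetical labels; the abstract reports "Agree", "Disagree", and
# "Unable to assess" as the physician assessment categories.
NO_MAJORITY = "No majority"

def majority_vote(assessments: list[str]) -> str:
    """Summarize one question's physician assessments by majority vote.

    Assumption: "majority" means a strict majority (more than half of the
    votes); otherwise the question is counted as having no majority, which
    is how the abstract's "no majority" tallies are read here.
    """
    label, votes = Counter(assessments).most_common(1)[0]
    return label if votes > len(assessments) / 2 else NO_MAJORITY

# Example with 12 physician raters, as in the study:
print(majority_vote(["Agree"] * 7 + ["Disagree"] * 3 + ["Unable to assess"] * 2))
# -> Agree
print(majority_vote(["Agree"] * 5 + ["Disagree"] * 5 + ["Unable to assess"] * 2))
# -> No majority
```

Under this rule, ties and three-way splits fall into the "no majority" bucket, which is consistent in kind with the 29 (GPT-3.5) and 35 (GPT-4) no-majority counts reported above, though the authors' exact adjudication rule may differ.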
Related papers
- Usefulness of LLMs as an Author Checklist Assistant for Scientific Papers: NeurIPS'24 Experiment [59.09144776166979]
Large language models (LLMs) represent a promising, but controversial, tool in aiding scientific peer review.
This study evaluates the usefulness of LLMs in a conference setting as a tool for vetting paper submissions against submission standards.
arXiv Detail & Related papers (2024-11-05T18:58:00Z)
- Evaluating the Impact of a Specialized LLM on Physician Experience in Clinical Decision Support: A Comparison of Ask Avo and ChatGPT-4 [0.3999851878220878]
Using large language models (LLMs) to augment clinical decision support systems is a topic of growing interest.
Current shortcomings, such as hallucinations and a lack of clear source citations, make them unreliable for use in rapidly evolving clinical environments.
This study evaluates Ask Avo-derived software by AvoMD that incorporates a proprietary Model Augmented Language Retrieval system.
arXiv Detail & Related papers (2024-09-06T17:53:29Z)
- Towards Evaluating and Building Versatile Large Language Models for Medicine [57.49547766838095]
We present MedS-Bench, a benchmark designed to evaluate the performance of large language models (LLMs) in clinical contexts.
MedS-Bench spans 11 high-level clinical tasks, including clinical report summarization, treatment recommendations, diagnosis, named entity recognition, and medical concept explanation.
A companion instruction-tuning dataset, MedS-Ins, comprises 58 medically oriented language corpora, totaling 13.5 million samples across 122 tasks.
arXiv Detail & Related papers (2024-08-22T17:01:34Z)
- GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI [67.09501109871351]
Large Vision-Language Models (LVLMs) are capable of handling diverse data types such as imaging, text, and physiological signals.
GMAI-MMBench is the most comprehensive general medical AI benchmark to date, with a well-categorized data structure and multiple perceptual granularities.
It is constructed from 284 datasets across 38 medical image modalities, 18 clinical-related tasks, 18 departments, and 4 perceptual granularities in a Visual Question Answering (VQA) format.
arXiv Detail & Related papers (2024-08-06T17:59:21Z)
- Addressing cognitive bias in medical language models [25.58126133789956]
BiasMedQA is a benchmark for evaluating cognitive biases in large language models (LLMs) applied to medical tasks.
We tested six models on 1,273 questions from the US Medical Licensing Exam (USMLE) Steps 1, 2, and 3.
GPT-4 stood out for its resilience to bias, in contrast to Llama 2 70B-chat and PMC Llama 13B, which were disproportionately affected by cognitive bias.
arXiv Detail & Related papers (2024-02-12T23:08:37Z)
- How well do LLMs cite relevant medical references? An evaluation framework and analyses [18.1921791355309]
Large language models (LLMs) are currently being used to answer medical questions across a variety of clinical domains.
In this paper, we ask: do the sources that LLMs generate actually support the claims that they make?
We demonstrate that GPT-4 is highly accurate in validating source relevance, agreeing 88% of the time with a panel of medical doctors; a minimal sketch of this kind of source-relevance check appears after this list.
arXiv Detail & Related papers (2024-02-03T03:44:57Z)
- Quality of Answers of Generative Large Language Models vs Peer Patients for Interpreting Lab Test Results for Lay Patients: Evaluation Study [5.823006266363981]
Large language models (LLMs) have opened a promising avenue for patients to get their questions answered.
We generated responses to 53 questions from each of four LLMs: GPT-4, Meta LLaMA 2, MedAlpaca, and ORCA_mini.
We find that GPT-4's responses are more accurate, helpful, relevant, and safe.
arXiv Detail & Related papers (2024-01-23T22:03:51Z)
- A Systematic Evaluation of GPT-4V's Multimodal Capability for Medical Image Analysis [87.25494411021066]
This paper systematically evaluates GPT-4V's multimodal capability for medical image analysis.
GPT-4V excels at understanding medical images and generates high-quality radiology reports.
However, its performance on medical visual grounding needs to be substantially improved.
arXiv Detail & Related papers (2023-10-31T11:39:09Z)
- Augmenting Black-box LLMs with Medical Textbooks for Biomedical Question Answering (Published in Findings of EMNLP 2024) [48.17095875619711]
We present a system called LLMs Augmented with Medical Textbooks (LLM-AMT).
LLM-AMT integrates authoritative medical textbooks into the LLMs' framework using plug-and-play modules.
We find that medical textbooks, used as a retrieval corpus, are a more effective knowledge base than Wikipedia in the medical domain.
arXiv Detail & Related papers (2023-09-05T13:39:38Z)
- Challenges of GPT-3-based Conversational Agents for Healthcare [11.517862889784293]
This paper investigates the challenges and risks of using GPT-3-based models for medical question-answering (MedQA).
We provide a procedure for manually designing patient queries to stress-test high-risk limitations of LLMs in MedQA systems.
Our analysis reveals that LLMs fail to respond adequately to these queries, generating erroneous medical information, unsafe recommendations, and content that may be considered offensive.
arXiv Detail & Related papers (2023-08-28T15:12:34Z)
- MedAlign: A Clinician-Generated Dataset for Instruction Following with Electronic Medical Records [60.35217378132709]
Large language models (LLMs) can follow natural language instructions with human-level fluency.
However, evaluating LLMs on realistic text generation tasks for healthcare remains challenging.
We introduce MedAlign, a benchmark dataset of 983 natural language instructions for EHR data.
arXiv Detail & Related papers (2023-08-27T12:24:39Z)
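As flagged in the citation-evaluation entry above, the idea of using GPT-4 to judge whether a cited source supports a medical claim can be sketched in a few lines. This is a minimal illustration, not the authors' evaluation framework: the prompt wording, the YES/NO verdict format, and the model choice are assumptions. It assumes the `openai` Python package (v1 client interface) and an `OPENAI_API_KEY` in the environment.

```python
# Minimal LLM-as-judge sketch for source-relevance checking.
# NOT the paper's framework: prompt, labels, and model are assumptions.
from openai import OpenAI

client = OpenAI()

def source_supports_claim(claim: str, source_text: str) -> bool:
    """Ask GPT-4 whether a cited source actually supports a medical claim."""
    prompt = (
        "You are verifying citations in a medical answer.\n"
        f"Claim: {claim}\n"
        f"Source excerpt: {source_text}\n"
        "Does the source support the claim? Answer YES or NO only."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep verdicts as deterministic as possible
    )
    return response.choices[0].message.content.strip().upper().startswith("YES")
```

In the paper's setting, such model verdicts were compared against a physician panel (88% agreement); any real deployment of a judge like this would likewise need human adjudication of the judge itself.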