Evaluation of GPT-3.5 and GPT-4 for supporting real-world information
needs in healthcare delivery
- URL: http://arxiv.org/abs/2304.13714v3
- Date: Mon, 1 May 2023 00:41:37 GMT
- Title: Evaluation of GPT-3.5 and GPT-4 for supporting real-world information
needs in healthcare delivery
- Authors: Debadutta Dash, Rahul Thapa, Juan M. Banda, Akshay Swaminathan, Morgan
Cheatham, Mehr Kashyap, Nikesh Kotecha, Jonathan H. Chen, Saurabh Gombar,
Lance Downing, Rachel Pedreira, Ethan Goh, Angel Arnaout, Garret Kenn Morris,
Honor Magon, Matthew P Lungren, Eric Horvitz, Nigam H. Shah
- Abstract summary: Our objective was to determine whether two large language models (LLMs) can serve information needs submitted by physicians as questions to an informatics consultation service in a safe and concordant manner.
For GPT-3.5, responses to 8 questions were concordant with the informatics consult report, 20 discordant, and 9 were unable to be assessed.
Less than 20% of the responses agreed with an answer from an informatics consultation service, responses contained hallucinated references, and physicians were divided on what constitutes harm.
- Score: 17.47170218010073
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite growing interest in using large language models (LLMs) in healthcare,
current explorations do not assess the real-world utility and safety of LLMs in
clinical settings. Our objective was to determine whether two LLMs can serve
information needs submitted by physicians as questions to an informatics
consultation service in a safe and concordant manner. Sixty six questions from
an informatics consult service were submitted to GPT-3.5 and GPT-4 via simple
prompts. 12 physicians assessed the LLM responses' possibility of patient harm
and concordance with existing reports from an informatics consultation service.
Physician assessments were summarized based on majority vote. For no question
did a majority of physicians deem either LLM response harmful. For GPT-3.5,
responses to 8 questions were concordant with the informatics consult report,
20 discordant, and 9 were unable to be assessed. There were 29 responses with
no majority on "Agree", "Disagree", and "Unable to assess". For GPT-4,
responses to 13 questions were concordant, 15 discordant, and 3 were unable to
be assessed. There were 35 responses with no majority. Responses from both LLMs
were largely devoid of overt harm, but less than 20% of the responses agreed
with an answer from an informatics consultation service, responses contained
hallucinated references, and physicians were divided on what constitutes harm.
These results suggest that while general purpose LLMs are able to provide safe
and credible responses, they often do not meet the specific information need of
a given question. A definitive evaluation of the usefulness of LLMs in
healthcare settings will likely require additional research on prompt
engineering, calibration, and custom-tailoring of general purpose models.
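The abstract describes summarizing twelve physician assessments per response by majority vote over three labels ("Agree", "Disagree", "Unable to assess"), with some responses having no majority. As a minimal sketch of how such a summarization might be computed (the strict-majority threshold and label names are assumptions, not details confirmed by the paper):

```python
from collections import Counter

def summarize_votes(votes):
    """Summarize physician assessments by majority vote.

    Returns the label chosen by a strict majority of voters,
    or "No majority" when no label exceeds half the votes.
    """
    counts = Counter(votes)
    label, n = counts.most_common(1)[0]
    return label if n > len(votes) / 2 else "No majority"

# Hypothetical example: 12 physicians rate one LLM response.
votes = ["Agree"] * 7 + ["Disagree"] * 3 + ["Unable to assess"] * 2
print(summarize_votes(votes))  # "Agree" (7 of 12 is a strict majority)
```

Under this reading, a response counted as "no majority on Agree, Disagree, and Unable to assess" is one where no single label clears the half-of-voters threshold.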
Related papers
- SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal Behaviors [64.9938658716425]
Existing evaluations of large language models' (LLMs) ability to recognize and reject unsafe user requests face three limitations.
First, existing methods often use a coarse-grained taxonomy of unsafe topics and over-represent some fine-grained topics.
Second, the linguistic characteristics and formatting of prompts -- such as different languages and dialects -- are often overlooked and only implicitly considered in many evaluations.
Third, existing evaluations rely on large LLMs for evaluation, which can be expensive.
arXiv Detail & Related papers (2024-06-20T17:56:07Z)
- Asclepius: A Spectrum Evaluation Benchmark for Medical Multi-Modal Large Language Models [59.60384461302662]
We introduce Asclepius, a novel benchmark for evaluating Medical Multi-Modal Large Language Models (Med-MLLMs).
Asclepius rigorously and comprehensively assesses model capability in terms of distinct medical specialties and different diagnostic capacities.
We also provide an in-depth analysis of 6 Med-MLLMs and compare them with 5 human specialists.
arXiv Detail & Related papers (2024-02-17T08:04:23Z)
- Addressing cognitive bias in medical language models [25.58126133789956]
BiasMedQA is a benchmark for evaluating cognitive biases in large language models (LLMs) applied to medical tasks.
We tested six models on 1,273 questions from the US Medical Licensing Exam (USMLE) Steps 1, 2, and 3.
GPT-4 stood out for its resilience to bias, in contrast to Llama 2 70B-chat and PMC Llama 13B, which were disproportionately affected by cognitive bias.
arXiv Detail & Related papers (2024-02-12T23:08:37Z)
- How well do LLMs cite relevant medical references? An evaluation framework and analyses [18.1921791355309]
Large language models (LLMs) are currently being used to answer medical questions across a variety of clinical domains.
In this paper, we ask: do the sources that LLMs generate actually support the claims that they make?
We demonstrate that GPT-4 is highly accurate in validating source relevance, agreeing 88% of the time with a panel of medical doctors.
arXiv Detail & Related papers (2024-02-03T03:44:57Z)
- Quality of Answers of Generative Large Language Models vs Peer Patients for Interpreting Lab Test Results for Lay Patients: Evaluation Study [5.823006266363981]
Large language models (LLMs) have opened a promising avenue for patients to get their questions answered.
We generated responses to 53 questions from four LLMs including GPT-4, Meta LLaMA 2, MedAlpaca, and ORCA_mini.
We find that GPT-4's responses are more accurate, helpful, relevant, and safer.
arXiv Detail & Related papers (2024-01-23T22:03:51Z)
- Evaluating multiple large language models in pediatric ophthalmology [37.16480878552708]
The response effectiveness of different large language models (LLMs) and various individuals in pediatric ophthalmology consultations has not been clearly established yet.
This survey evaluated the performance of LLMs in highly specialized scenarios and compared them with the performance of medical students and physicians at different levels.
arXiv Detail & Related papers (2023-11-07T22:23:51Z)
- A Systematic Evaluation of GPT-4V's Multimodal Capability for Medical Image Analysis [87.25494411021066]
GPT-4V's multimodal capability for medical image analysis is evaluated.
It is found that GPT-4V excels in understanding medical images and generates high-quality radiology reports.
It is found that its performance for medical visual grounding needs to be substantially improved.
arXiv Detail & Related papers (2023-10-31T11:39:09Z)
- Augmenting Black-box LLMs with Medical Textbooks for Clinical Question Answering [54.13933019557655]
We present a system called LLMs Augmented with Medical Textbooks (LLM-AMT).
LLM-AMT integrates authoritative medical textbooks into the LLMs' framework using plug-and-play modules.
We find that medical textbooks, used as a retrieval corpus, are a more effective knowledge base than Wikipedia in the medical domain.
arXiv Detail & Related papers (2023-09-05T13:39:38Z)
- Challenges of GPT-3-based Conversational Agents for Healthcare [11.517862889784293]
This paper investigates the challenges and risks of using GPT-3-based models for medical question-answering (MedQA).
We provide a procedure for manually designing patient queries to stress-test high-risk limitations of LLMs in MedQA systems.
Our analysis reveals that LLMs fail to respond adequately to these queries, generating erroneous medical information, unsafe recommendations, and content that may be considered offensive.
arXiv Detail & Related papers (2023-08-28T15:12:34Z)
- MedAlign: A Clinician-Generated Dataset for Instruction Following with Electronic Medical Records [60.35217378132709]
Large language models (LLMs) can follow natural language instructions with human-level fluency, but evaluating LLMs on realistic text generation tasks for healthcare remains challenging.
We introduce MedAlign, a benchmark dataset of 983 natural language instructions for EHR data.
arXiv Detail & Related papers (2023-08-27T12:24:39Z)
- Large Language Models Leverage External Knowledge to Extend Clinical Insight Beyond Language Boundaries [48.48630043740588]
Large Language Models (LLMs) such as ChatGPT and Med-PaLM have excelled in various medical question-answering tasks.
We develop a novel in-context learning framework to enhance their performance.
arXiv Detail & Related papers (2023-05-17T12:31:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.