Performance of ChatGPT-3.5 and GPT-4 on the United States Medical
Licensing Examination With and Without Distractions
- URL: http://arxiv.org/abs/2309.08625v1
- Date: Tue, 12 Sep 2023 05:54:45 GMT
- Title: Performance of ChatGPT-3.5 and GPT-4 on the United States Medical
Licensing Examination With and Without Distractions
- Authors: Myriam Safrai and Amos Azaria
- Abstract summary: This study investigates the impact of medical data mixed with small talk on the accuracy of medical advice provided by ChatGPT.
We gathered small talk sentences from human participants using the Mechanical Turk platform.
ChatGPT-4 seems more accurate than the earlier 3.5 version, and it appears that small talk does not impair its capability to provide medical recommendations.
- Score: 17.813396230160095
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: As Large Language Models (LLMs) are predictive models that build their
responses from the words in the prompt, there is a risk that small talk and
irrelevant information may alter the response and the advice given.
Therefore, this study aims to investigate the impact of medical data mixed with
small talk on the accuracy of medical advice provided by ChatGPT. USMLE step 3
questions were used as a model for relevant medical data. We used both
multiple-choice and open-ended questions. We gathered small talk sentences from
human participants using the Mechanical Turk platform. Both sets of USMLE questions
were arranged in a pattern where each sentence from the original questions was
followed by a small talk sentence. ChatGPT 3.5 and 4 were asked to answer both
sets of questions with and without the small talk sentences. A board-certified
physician analyzed the answers by ChatGPT and compared them to the formal
correct answer. The analysis shows that adding small talk to the medical data
impaired ChatGPT-3.5's ability to answer correctly on both multiple-choice
questions (72.1% vs. 68.9%) and open questions (61.5% vs. 44.3%; p=0.01). In
contrast, small talk phrases did not impair ChatGPT-4's ability on either type of
question (83.6% and 66.2%, respectively). According to these results, ChatGPT-4 seems more
accurate than the earlier 3.5 version, and it appears that small talk does not
impair its capability to provide medical recommendations. Our results are an
important first step in understanding the potential and limitations of
utilizing ChatGPT and other LLMs for physician-patient interactions, which
include casual conversations.
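To make the prompt construction concrete, here is a minimal sketch (not the authors' code) of the interleaving pattern the abstract describes: each sentence of a USMLE-style question is followed by one small-talk sentence before the combined prompt is sent to the model. The sample vignette, the small-talk lines, the helper name, and the commented-out API call are illustrative assumptions.

```python
# Sketch of the prompt pattern described in the abstract: every medical sentence
# is followed by a small-talk sentence. All concrete strings below are placeholders.
from itertools import zip_longest


def interleave_small_talk(question_sentences: list[str], small_talk: list[str]) -> str:
    """Follow each medical sentence with one small-talk sentence."""
    mixed = []
    for med, chat in zip_longest(
        question_sentences, small_talk[: len(question_sentences)], fillvalue=""
    ):
        mixed.append(med)
        if chat:  # skip padding when there are fewer small-talk lines than sentences
            mixed.append(chat)
    return " ".join(mixed)


question_sentences = [
    "A 45-year-old man presents with chest pain radiating to the left arm.",
    "His ECG shows ST-segment elevation.",
    "What is the most appropriate next step in management?",
]
small_talk = [
    "By the way, the weather has been lovely this week.",
    "My neighbor just adopted a puppy.",
]

prompt = interleave_small_talk(question_sentences, small_talk)
print(prompt)

# Querying the model could then look like this with the openai>=1.0 Python client
# (an assumption; the paper does not specify its exact tooling):
# from openai import OpenAI
# client = OpenAI()
# reply = client.chat.completions.create(
#     model="gpt-4",
#     messages=[{"role": "user", "content": prompt}],
# )
# print(reply.choices[0].message.content)
```

Comparing the model's answers on the mixed prompt against the same question without the interleaved sentences is what yields the with/without-distraction accuracy figures reported above.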
Related papers
- Quality of Answers of Generative Large Language Models vs Peer Patients
for Interpreting Lab Test Results for Lay Patients: Evaluation Study [5.823006266363981]
Large language models (LLMs) have opened a promising avenue for patients to get their questions answered.
We generated responses to 53 questions from four LLMs including GPT-4, Meta LLaMA 2, MedAlpaca, and ORCA_mini.
We find that GPT-4's responses are more accurate, helpful, relevant, and safer.
arXiv Detail & Related papers (2024-01-23T22:03:51Z) - Can ChatGPT be Your Personal Medical Assistant? [0.09264362806173355]
This study uses publicly available online question-and-answer datasets in the Arabic language.
There are almost 430K questions and answers for 20 disease-specific categories.
The performance of this fine-tuned model was evaluated through automated and human evaluation.
arXiv Detail & Related papers (2023-12-19T09:54:27Z) - Evaluating ChatGPT as a Question Answering System: A Comprehensive
Analysis and Comparison with Existing Models [0.0]
This article scrutinizes ChatGPT as a Question Answering System (QAS).
The primary focus is on evaluating ChatGPT's proficiency in extracting responses from provided paragraphs.
The evaluation highlights hallucinations, where ChatGPT provides responses to questions without available answers in the provided context.
arXiv Detail & Related papers (2023-12-11T08:49:18Z) - Primacy Effect of ChatGPT [69.49920102917598]
We study the primacy effect of ChatGPT: the tendency to select labels at earlier positions as the answer.
We hope that our experiments and analyses provide additional insights into building more reliable ChatGPT-based solutions.
arXiv Detail & Related papers (2023-10-20T00:37:28Z) - Performance of ChatGPT on USMLE: Unlocking the Potential of Large
Language Models for AI-Assisted Medical Education [0.0]
This study determined how reliable ChatGPT can be for answering complex medical and clinical questions.
The paper evaluated the obtained results using a 2-way ANOVA and post hoc analysis.
ChatGPT-generated answers were found to be more context-oriented than regular Google search results.
arXiv Detail & Related papers (2023-06-30T19:53:23Z) - Chatbots put to the test in math and logic problems: A preliminary
comparison and assessment of ChatGPT-3.5, ChatGPT-4, and Google Bard [68.8204255655161]
We use 30 questions that are clear, without any ambiguities, fully described in plain text only, and have a unique, well-defined correct answer.
The answers are recorded and discussed, highlighting their strengths and weaknesses.
It was found that ChatGPT-4 outperforms ChatGPT-3.5 in both sets of questions.
arXiv Detail & Related papers (2023-05-30T11:18:05Z) - Does ChatGPT have Theory of Mind? [2.3129337924262927]
Theory of Mind (ToM) is the ability to understand human thinking and decision-making.
This paper investigates to what extent recent Large Language Models in the ChatGPT tradition possess ToM.
arXiv Detail & Related papers (2023-05-23T12:55:21Z) - To ChatGPT, or not to ChatGPT: That is the question! [78.407861566006]
This study provides a comprehensive and contemporary assessment of the most recent techniques in ChatGPT detection.
We have curated a benchmark dataset consisting of prompts from ChatGPT and humans, including diverse questions from medical, open Q&A, and finance domains.
Our evaluation results demonstrate that none of the existing methods can effectively detect ChatGPT-generated content.
arXiv Detail & Related papers (2023-04-04T03:04:28Z) - Can ChatGPT Understand Too? A Comparative Study on ChatGPT and
Fine-tuned BERT [103.57103957631067]
ChatGPT has attracted great attention, as it can generate fluent and high-quality responses to human inquiries.
We evaluate ChatGPT's understanding ability on the most popular GLUE benchmark and compare it with 4 representative fine-tuned BERT-style models.
We find that: 1) ChatGPT falls short in handling paraphrase and similarity tasks; 2) ChatGPT outperforms all BERT models on inference tasks by a large margin; 3) ChatGPT achieves performance comparable to BERT on sentiment analysis and question answering tasks.
arXiv Detail & Related papers (2023-02-19T12:29:33Z) - A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on
Reasoning, Hallucination, and Interactivity [79.12003701981092]
We carry out an extensive technical evaluation of ChatGPT using 23 data sets covering 8 different common NLP application tasks.
We evaluate the multitask, multilingual and multi-modal aspects of ChatGPT based on these data sets and a newly designed multimodal dataset.
ChatGPT is 63.41% accurate on average in 10 different reasoning categories under logical reasoning, non-textual reasoning, and commonsense reasoning.
arXiv Detail & Related papers (2023-02-08T12:35:34Z) - A Categorical Archive of ChatGPT Failures [47.64219291655723]
ChatGPT, developed by OpenAI, has been trained using massive amounts of data and simulates human conversation.
It has garnered significant attention due to its ability to effectively answer a broad range of human inquiries.
However, a comprehensive analysis of ChatGPT's failures is lacking, which is the focus of this study.
arXiv Detail & Related papers (2023-02-06T04:21:59Z)