Quality of Answers of Generative Large Language Models vs Peer Patients
for Interpreting Lab Test Results for Lay Patients: Evaluation Study
- URL: http://arxiv.org/abs/2402.01693v1
- Date: Tue, 23 Jan 2024 22:03:51 GMT
- Title: Quality of Answers of Generative Large Language Models vs Peer Patients
for Interpreting Lab Test Results for Lay Patients: Evaluation Study
- Authors: Zhe He, Balu Bhasuran, Qiao Jin, Shubo Tian, Karim Hanna, Cindy
Shavor, Lisbeth Garcia Arguello, Patrick Murray, Zhiyong Lu
- Abstract summary: Large language models (LLMs) have opened a promising avenue for patients to get their questions answered.
We generated responses to 53 questions from four LLMs including GPT-4, Meta LLaMA 2, MedAlpaca, and ORCA_mini.
We find that GPT-4's responses are more accurate, helpful, relevant, and safer.
- Score: 5.823006266363981
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Lab results are often confusing and hard to understand. Large language models
(LLMs) such as ChatGPT have opened a promising avenue for patients to get their
questions answered. We aim to assess the feasibility of using LLMs to generate
relevant, accurate, helpful, and harmless responses to lab test-related
questions asked by patients and to identify potential issues that can be
mitigated with augmentation approaches. We first collected lab test
result-related question-and-answer data from Yahoo! Answers and selected 53 QA
pairs for this study. Using the LangChain framework and the ChatGPT web portal,
we generated responses to the 53 questions from four LLMs: GPT-4, Meta LLaMA 2,
MedAlpaca, and ORCA_mini. We assessed the similarity of their answers using
standard QA similarity-based evaluation metrics, including ROUGE, BLEU, METEOR,
and BERTScore. We also utilized an LLM-based evaluator to judge
whether a target model has higher quality in terms of relevance, correctness,
helpfulness, and safety than the baseline model. Finally, we performed a manual
evaluation with medical experts for all the responses to seven selected
questions on the same four aspects. The results of Win Rate and medical expert
evaluation both showed that GPT-4's responses achieved better scores than all
the other LLM responses and human responses on all four aspects (relevance,
correctness, helpfulness, and safety). However, LLM responses occasionally
suffer from a lack of interpretation within a patient's specific medical
context, incorrect statements, and a lack of references. We find that, compared
to the other three LLMs and the human answers from the Q&A website, GPT-4's
responses are more accurate, helpful, relevant, and safer. However, there are
cases in which GPT-4's responses are inaccurate and not individualized. We
identified a number of ways to improve the quality of LLM responses.
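The abstract above compares answers with n-gram similarity metrics and with pairwise Win Rate judgments. A minimal sketch of both ideas, using an illustrative ROUGE-1 F1 (the simplest of the metrics named) and a tie-counts-half win-rate convention; the example answers and the tie-handling rule are assumptions for illustration, not taken from the paper:

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap ROUGE-1 F1 between a model answer and a reference answer."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def win_rate(outcomes: list[str]) -> float:
    """Share of pairwise comparisons the target model wins; ties count as half
    a win (one common convention -- the paper's exact rule may differ)."""
    score = outcomes.count("win") + 0.5 * outcomes.count("tie")
    return score / len(outcomes)

# Invented lab-result answers, for illustration only.
reference = "your tsh level is slightly elevated which may indicate hypothyroidism"
candidate = "a slightly elevated tsh level may indicate hypothyroidism"
print(round(rouge1_f1(candidate, reference), 3))  # 0.778
print(win_rate(["win", "win", "tie", "loss"]))    # 0.625
```

In practice one would use established implementations (e.g. packaged ROUGE/BLEU/BERTScore scorers) rather than this hand-rolled version; the sketch only shows what the overlap-based scores measure.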
Related papers
- Fine-grained Hallucination Detection and Mitigation in Long-form Question Answering [79.63372684264921]
Long-form question answering (LFQA) aims to provide thorough and in-depth answers to complex questions, enhancing comprehension.
This work introduces HaluQuestQA, the first hallucination dataset with localized error annotations for human-written and model-generated LFQA answers.
arXiv Detail & Related papers (2024-07-16T17:23:16Z)
- Answering real-world clinical questions using large language model based systems [2.2605659089865355]
Large language models (LLMs) could potentially address both challenges by either summarizing published literature or generating new studies based on real-world data (RWD).
We evaluated the ability of five LLM-based systems in answering 50 clinical questions and had nine independent physicians review the responses for relevance, reliability, and actionability.
arXiv Detail & Related papers (2024-06-29T22:39:20Z)
- MACAROON: Training Vision-Language Models To Be Your Engaged Partners [95.32771929749514]
Large vision-language models (LVLMs) generate detailed responses even when questions are ambiguous or unlabeled.
In this study, we aim to shift LVLMs from passive answer providers to proactive engaged partners.
We introduce MACAROON, self-iMaginAtion for ContrAstive pReference OptimizatiON, which instructs LVLMs to autonomously generate contrastive response pairs for unlabeled questions.
arXiv Detail & Related papers (2024-06-20T09:27:33Z)
- LOVA3: Learning to Visual Question Answering, Asking and Assessment [63.41469979867312]
Question answering, asking, and assessment are three innate human traits crucial for understanding the world and acquiring knowledge.
Current Multimodal Large Language Models (MLLMs) primarily focus on question answering, often neglecting the full potential of questioning and assessment skills.
In this study, we introduce LOVA3, an innovative framework named "Learning tO Visual Question Answering, Asking and Assessment".
arXiv Detail & Related papers (2024-05-23T18:21:59Z)
- Perception of Knowledge Boundary for Large Language Models through Semi-open-ended Question Answering [67.94354589215637]
Large Language Models (LLMs) are widely used for knowledge-seeking yet suffer from hallucinations.
In this paper, we perceive the LLMs' knowledge boundary (KB) with semi-open-ended questions (SoeQ).
We find that GPT-4 performs poorly on SoeQ and is often unaware of its KB.
Our auxiliary model, LLaMA-2-13B, is effective in discovering more ambiguous answers.
arXiv Detail & Related papers (2024-05-23T10:00:14Z)
- EHRNoteQA: An LLM Benchmark for Real-World Clinical Practice Using Discharge Summaries [9.031182965159976]
Large Language Models (LLMs) show promise in efficiently analyzing vast and complex data.
We introduce EHRNoteQA, a novel benchmark built on the MIMIC-IV EHR, comprising 962 different QA pairs each linked to distinct patients' discharge summaries.
EHRNoteQA includes questions that require information across multiple discharge summaries and covers eight diverse topics, mirroring the complexity and diversity of real clinical inquiries.
arXiv Detail & Related papers (2024-02-25T09:41:50Z)
- Fine-tuning Large Language Model (LLM) Artificial Intelligence Chatbots in Ophthalmology and LLM-based evaluation using GPT-4 [2.3715885775680925]
400 ophthalmology questions and paired answers were created by ophthalmologists to represent commonly asked patient questions.
We fine-tuned 5 different LLMs, including LLAMA2-7b, LLAMA2-7b-Chat, LLAMA2-13b, and LLAMA2-13b-Chat.
A customized clinical evaluation was used to guide GPT-4 evaluation, grounded on clinical accuracy, relevance, patient safety, and ease of understanding.
arXiv Detail & Related papers (2024-02-15T16:43:41Z)
- GPT-4's assessment of its performance in a USMLE-based case study [3.2372388230841977]
This study investigates GPT-4's assessment of its performance in healthcare applications.
The questionnaire was categorized into two groups: questions with feedback (WF) and questions with no feedback (NF) post-question.
Results indicate that feedback influences relative confidence but doesn't consistently increase or decrease it.
arXiv Detail & Related papers (2024-02-15T01:38:50Z)
- MedAlign: A Clinician-Generated Dataset for Instruction Following with Electronic Medical Records [60.35217378132709]
Large language models (LLMs) can follow natural language instructions with human-level fluency.
Evaluating LLMs on realistic text generation tasks for healthcare remains challenging.
We introduce MedAlign, a benchmark dataset of 983 natural language instructions for EHR data.
arXiv Detail & Related papers (2023-08-27T12:24:39Z)
- Evaluation of GPT-3.5 and GPT-4 for supporting real-world information needs in healthcare delivery [17.47170218010073]
Our objective was to determine whether two large language models (LLMs) can serve information needs submitted by physicians as questions to an informatics consultation service in a safe and concordant manner.
For GPT-3.5, responses to 8 questions were concordant with the informatics consult report, 20 discordant, and 9 were unable to be assessed.
Less than 20% of the responses agreed with an answer from an informatics consultation service, responses contained hallucinated references, and physicians were divided on what constitutes harm.
arXiv Detail & Related papers (2023-04-26T17:54:28Z)
- Can ChatGPT Assess Human Personalities? A General Evaluation Framework [70.90142717649785]
Large Language Models (LLMs) have produced impressive results in various areas, but their potential human-like psychology is still largely unexplored.
This paper presents a generic evaluation framework for LLMs to assess human personalities based on Myers Briggs Type Indicator (MBTI) tests.
arXiv Detail & Related papers (2023-03-01T06:16:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed papers (including all information) and is not responsible for any consequences.