Quality of Answers of Generative Large Language Models vs Peer Patients
for Interpreting Lab Test Results for Lay Patients: Evaluation Study
- URL: http://arxiv.org/abs/2402.01693v1
- Date: Tue, 23 Jan 2024 22:03:51 GMT
- Title: Quality of Answers of Generative Large Language Models vs Peer Patients
for Interpreting Lab Test Results for Lay Patients: Evaluation Study
- Authors: Zhe He, Balu Bhasuran, Qiao Jin, Shubo Tian, Karim Hanna, Cindy
Shavor, Lisbeth Garcia Arguello, Patrick Murray, Zhiyong Lu
- Abstract summary: Large language models (LLMs) have opened a promising avenue for patients to get their questions answered.
We generated responses to 53 questions from four LLMs including GPT-4, Meta LLaMA 2, MedAlpaca, and ORCA_mini.
We find that GPT-4's responses are more accurate, helpful, relevant, and safe.
- Score: 5.823006266363981
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Lab results are often confusing and hard to understand. Large language models
(LLMs) such as ChatGPT have opened a promising avenue for patients to get their
questions answered. We aim to assess the feasibility of using LLMs to generate
relevant, accurate, helpful, and unharmful responses to lab test-related
questions asked by patients and to identify potential issues that can be
mitigated with augmentation approaches. We first collected lab test result-related
question and answer data from Yahoo! Answers and selected 53 QA pairs for this
study. Using the LangChain framework and the ChatGPT web portal, we generated
responses to the 53 questions from four LLMs: GPT-4, Meta LLaMA 2, MedAlpaca,
and ORCA_mini. We first assessed the similarity of their answers using standard
QA similarity-based evaluation metrics, including ROUGE, BLEU, METEOR, and
BERTScore. We also utilized an LLM-based evaluator to judge
whether a target model has higher quality in terms of relevance, correctness,
helpfulness, and safety than the baseline model. Finally, we performed a manual
evaluation with medical experts for all the responses to seven selected
questions on the same four aspects. The results of Win Rate and medical expert
evaluation both showed that GPT-4's responses achieved better scores than all
the other LLM responses and human responses on all four aspects (relevance,
correctness, helpfulness, and safety). However, LLM responses occasionally also
suffer from a lack of interpretation in the patient's medical context, incorrect
statements, and a lack of references. We find that, compared to the other three
LLMs and the human answers from the Q&A website, GPT-4's responses are more
accurate, helpful, relevant, and safe. However, there are cases in which GPT-4's
responses are inaccurate and not individualized. We identified a number of ways
to improve the quality of LLM responses. (Illustrative code sketches of the
generation and automatic evaluation steps described above follow this abstract.)
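
The abstract notes that responses were generated through the LangChain framework (with GPT-4 also queried via the ChatGPT web portal). The snippet below is a minimal sketch of such a generation step using LangChain's OpenAI integration; the model name, system prompt, and example question are illustrative assumptions, not the study's actual configuration.

```python
# Minimal sketch: answering a patient's lab-test question through LangChain.
# The model name, prompt wording, and example question are assumptions for
# illustration; the study also queried LLaMA 2, MedAlpaca, and ORCA_mini
# through other backends.
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_messages([
    ("system", "You answer patients' questions about their lab test results "
               "in plain, non-alarming language."),
    ("human", "{question}"),
])
llm = ChatOpenAI(model="gpt-4", temperature=0)
chain = prompt | llm  # LCEL pipeline: format the prompt, then call the model

answer = chain.invoke({"question": "My TSH came back at 6.2 mIU/L. Should I be worried?"})
print(answer.content)
```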
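For the automatic similarity comparison against the selected Yahoo! Answers responses, the abstract lists ROUGE, BLEU, METEOR, and BERTScore. A minimal sketch of computing these with the Hugging Face `evaluate` library follows; the example strings are placeholders and the library choice is an assumption, not necessarily the authors' tooling.

```python
# Minimal sketch: scoring one model answer against one reference answer with
# standard QA similarity metrics. Requires the `evaluate`, `rouge_score`,
# `nltk`, and `bert_score` packages; the strings are placeholders, not study data.
import evaluate

metrics = {name: evaluate.load(name) for name in ("rouge", "bleu", "meteor", "bertscore")}

predictions = ["A mildly elevated ALT can reflect liver irritation; review it with your doctor."]
references = ["A high ALT level may indicate liver inflammation and should be discussed with a physician."]

results = {}
for name, metric in metrics.items():
    kwargs = {"lang": "en"} if name == "bertscore" else {}  # BERTScore needs a language
    results[name] = metric.compute(predictions=predictions, references=references, **kwargs)

print(results)
```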
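The LLM-based evaluation compares a target model's answer against a baseline answer on relevance, correctness, helpfulness, and safety, and aggregates the verdicts into a win rate. The sketch below shows one common way to implement such a pairwise judge with the OpenAI Python client; the prompt wording, judge model, and output convention are assumptions rather than the paper's exact protocol.

```python
# Minimal sketch of an LLM-as-judge pairwise comparison used to compute a
# "win rate" of a target model over a baseline. Prompt wording, judge model,
# and output format are illustrative assumptions, not the paper's protocol.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are judging two answers to a patient's lab test question.
Question: {question}
Answer A: {answer_a}
Answer B: {answer_b}
Considering relevance, correctness, helpfulness, and safety, decide which
answer is better overall. End with exactly one line: WINNER: A, WINNER: B, or WINNER: TIE."""

def judge_pair(question: str, answer_a: str, answer_b: str) -> str:
    """Return 'A', 'B', or 'TIE' according to the judge model."""
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, answer_a=answer_a, answer_b=answer_b)}],
    )
    verdict = response.choices[0].message.content
    for label in ("TIE", "A", "B"):
        if f"WINNER: {label}" in verdict:
            return label
    return "TIE"  # fall back if the judge ignored the output format

# Win rate of the target (Answer A) over the baseline (Answer B):
# wins = sum(judge_pair(q, a, b) == "A" for q, a, b in qa_triples)
# win_rate = wins / len(qa_triples)
```

In practice, pairwise judges of this kind are often run a second time with the answer order swapped and the verdicts averaged, which reduces position bias.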
Related papers
- LLM Robustness Against Misinformation in Biomedical Question Answering [50.98256373698759]
The retrieval-augmented generation (RAG) approach is used to reduce the confabulation of large language models (LLMs) for question answering.
We evaluate the effectiveness and robustness of four LLMs against misinformation in answering biomedical questions.
arXiv Detail & Related papers (2024-10-27T16:23:26Z)
- AHP-Powered LLM Reasoning for Multi-Criteria Evaluation of Open-Ended Responses [26.850344968677582]
We propose a method that leverages large language models to evaluate answers to open-ended questions.
We conducted experiments on four datasets using both ChatGPT-3.5-turbo and GPT-4.
Our results indicate that our approach more closely aligns with human judgment compared to the four baselines.
arXiv Detail & Related papers (2024-10-02T05:22:07Z)
- LINKAGE: Listwise Ranking among Varied-Quality References for Non-Factoid QA Evaluation via LLMs [61.57691505683534]
Non-Factoid (NF) Question Answering (QA) is challenging to evaluate due to diverse potential answers and no objective criterion.
Large Language Models (LLMs) have been resorted to for NFQA evaluation due to their compelling performance on various NLP tasks.
We propose a novel listwise NFQA evaluation approach, that utilizes LLMs to rank candidate answers in a list of reference answers sorted by descending quality.
arXiv Detail & Related papers (2024-09-23T06:42:21Z)
- GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI [67.09501109871351]
Large Vision-Language Models (LVLMs) are capable of handling diverse data types such as imaging, text, and physiological signals.
GMAI-MMBench is the most comprehensive general medical AI benchmark with well-categorized data structure and multi-perceptual granularity to date.
It is constructed from 284 datasets across 38 medical image modalities, 18 clinical-related tasks, 18 departments, and 4 perceptual granularities in a Visual Question Answering (VQA) format.
arXiv Detail & Related papers (2024-08-06T17:59:21Z)
- Answering real-world clinical questions using large language model based systems [2.2605659089865355]
Large language models (LLMs) could potentially address both challenges by either summarizing published literature or generating new studies based on real-world data (RWD).
We evaluated the ability of five LLM-based systems to answer 50 clinical questions and had nine independent physicians review the responses for relevance, reliability, and actionability.
arXiv Detail & Related papers (2024-06-29T22:39:20Z)
- MACAROON: Training Vision-Language Models To Be Your Engaged Partners [95.32771929749514]
Large vision-language models (LVLMs) generate detailed responses even when questions are ambiguous or unlabeled.
In this study, we aim to shift LVLMs from passive answer providers to proactive engaged partners.
We introduce MACAROON, self-iMaginAtion for ContrAstive pReference OptimizatiON, which instructs LVLMs to autonomously generate contrastive response pairs for unlabeled questions.
arXiv Detail & Related papers (2024-06-20T09:27:33Z)
- Perception of Knowledge Boundary for Large Language Models through Semi-open-ended Question Answering [67.94354589215637]
Large Language Models (LLMs) are widely used for knowledge-seeking yet suffer from hallucinations.
In this paper, we perceive the LLMs' knowledge boundary (KB) with semi-open-ended questions (SoeQ).
We find that GPT-4 performs poorly on SoeQ and is often unaware of its KB.
Our auxiliary model, LLaMA-2-13B, is effective in discovering more ambiguous answers.
arXiv Detail & Related papers (2024-05-23T10:00:14Z)
- Fine-tuning Large Language Model (LLM) Artificial Intelligence Chatbots in Ophthalmology and LLM-based evaluation using GPT-4 [2.3715885775680925]
400 ophthalmology questions and paired answers were created by ophthalmologists to represent commonly asked patient questions.
We fine-tuned 5 different LLMs, including LLAMA2-7b, LLAMA2-7b-Chat, LLAMA2-13b, and LLAMA2-13b-Chat.
A customized clinical evaluation was used to guide GPT-4 evaluation, grounded on clinical accuracy, relevance, patient safety, and ease of understanding.
arXiv Detail & Related papers (2024-02-15T16:43:41Z)
- GPT-4's assessment of its performance in a USMLE-based case study [3.2372388230841977]
This study investigates GPT-4's assessment of its performance in healthcare applications.
The questionnaire was categorized into two groups: questions with feedback (WF) and questions with no feedback (NF) post-question.
Results indicate that feedback influences relative confidence but doesn't consistently increase or decrease it.
arXiv Detail & Related papers (2024-02-15T01:38:50Z)
- Evaluation of GPT-3.5 and GPT-4 for supporting real-world information needs in healthcare delivery [17.47170218010073]
Our objective was to determine whether two large language models (LLMs) can serve information needs submitted by physicians as questions to an informatics consultation service in a safe and concordant manner.
For GPT-3.5, responses to 8 questions were concordant with the informatics consult report, 20 discordant, and 9 were unable to be assessed.
Less than 20% of the responses agreed with an answer from an informatics consultation service, responses contained hallucinated references, and physicians were divided on what constitutes harm.
arXiv Detail & Related papers (2023-04-26T17:54:28Z)
- Can ChatGPT Assess Human Personalities? A General Evaluation Framework [70.90142717649785]
Large Language Models (LLMs) have produced impressive results in various areas, but their potential human-like psychology is still largely unexplored.
This paper presents a generic evaluation framework for LLMs to assess human personalities based on Myers-Briggs Type Indicator (MBTI) tests.
arXiv Detail & Related papers (2023-03-01T06:16:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.