Quality of Answers of Generative Large Language Models vs Peer Patients
for Interpreting Lab Test Results for Lay Patients: Evaluation Study
- URL: http://arxiv.org/abs/2402.01693v1
- Date: Tue, 23 Jan 2024 22:03:51 GMT
- Title: Quality of Answers of Generative Large Language Models vs Peer Patients
for Interpreting Lab Test Results for Lay Patients: Evaluation Study
- Authors: Zhe He, Balu Bhasuran, Qiao Jin, Shubo Tian, Karim Hanna, Cindy
Shavor, Lisbeth Garcia Arguello, Patrick Murray, Zhiyong Lu
- Abstract summary: Large language models (LLMs) have opened a promising avenue for patients to get their questions answered.
We generated responses to 53 questions from four LLMs including GPT-4, Meta LLaMA 2, MedAlpaca, and ORCA_mini.
We find that GPT-4's responses are more accurate, helpful, relevant, and safer.
- Score: 5.823006266363981
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Lab results are often confusing and hard to understand. Large language models
(LLMs) such as ChatGPT have opened a promising avenue for patients to get their
questions answered. We aim to assess the feasibility of using LLMs to generate
relevant, accurate, helpful, and harmless responses to lab test-related
questions asked by patients and to identify potential issues that can be
mitigated with augmentation approaches. We first collected lab test
result-related question-and-answer data from Yahoo! Answers and selected 53 QA
pairs for this study. Using the LangChain framework and the ChatGPT web portal,
we generated responses to the 53 questions from four LLMs: GPT-4, Meta LLaMA 2,
MedAlpaca, and ORCA_mini. We assessed the similarity of their answers using
standard QA similarity-based evaluation metrics, including ROUGE, BLEU, METEOR,
and BERTScore. We also utilized an LLM-based evaluator to judge
whether a target model has higher quality in terms of relevance, correctness,
helpfulness, and safety than the baseline model. Finally, we performed a manual
evaluation with medical experts for all the responses to seven selected
questions on the same four aspects. The results of Win Rate and medical expert
evaluation both showed that GPT-4's responses achieved better scores than all
the other LLM responses and human responses on all four aspects (relevance,
correctness, helpfulness, and safety). However, LLM responses occasionally
suffer from a lack of interpretation within a patient's specific medical
context, incorrect statements, and a lack of references. We find that, compared
to the other three LLMs and the human answers from the Q&A website, GPT-4's
responses are more accurate, helpful, relevant, and safer. However, there are
cases in which GPT-4's responses are inaccurate and not individualized. We
identified a number of ways to improve the quality of LLM responses.
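The abstract above compares answers with n-gram similarity metrics and with pairwise Win Rate judgments. A minimal sketch of both ideas, using an illustrative ROUGE-1 F1 (the simplest of the metrics named) and a tie-counts-half win-rate convention; the example answers and the tie-handling rule are assumptions for illustration, not taken from the paper:

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap ROUGE-1 F1 between a model answer and a reference answer."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def win_rate(outcomes: list[str]) -> float:
    """Share of pairwise comparisons the target model wins; ties count as half
    a win (one common convention -- the paper's exact rule may differ)."""
    score = outcomes.count("win") + 0.5 * outcomes.count("tie")
    return score / len(outcomes)

# Invented lab-result answers, for illustration only.
reference = "your tsh level is slightly elevated which may indicate hypothyroidism"
candidate = "a slightly elevated tsh level may indicate hypothyroidism"
print(round(rouge1_f1(candidate, reference), 3))  # 0.778
print(win_rate(["win", "win", "tie", "loss"]))    # 0.625
```

In practice one would use established implementations (e.g. packaged ROUGE/BLEU/BERTScore scorers) rather than this hand-rolled version; the sketch only shows what the overlap-based scores measure.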
Related papers
- Fine-grained Hallucination Detection and Mitigation in Long-form Question Answering [79.63372684264921]
Long-form question answering (LFQA) aims to provide thorough and in-depth answers to complex questions, enhancing comprehension.
This work introduces HaluQuestQA, the first hallucination dataset with localized error annotations for human-written and model-generated LFQA answers.
arXiv Detail & Related papers (2024-07-16T17:23:16Z)
- Answering real-world clinical questions using large language model based systems [2.2605659089865355]
Large language models (LLMs) could potentially address both challenges by either summarizing published literature or generating new studies based on real-world data (RWD).
We evaluated the ability of five LLM-based systems in answering 50 clinical questions and had nine independent physicians review the responses for relevance, reliability, and actionability.
arXiv Detail & Related papers (2024-06-29T22:39:20Z)
- MACAROON: Training Vision-Language Models To Be Your Engaged Partners [95.32771929749514]
Large vision-language models (LVLMs) generate detailed responses even when questions are ambiguous or unlabeled.
In this study, we aim to shift LVLMs from passive answer providers to proactive engaged partners.
We introduce MACAROON, self-iMaginAtion for ContrAstive pReference OptimizatiON, which instructs LVLMs to autonomously generate contrastive response pairs for unlabeled questions.
arXiv Detail & Related papers (2024-06-20T09:27:33Z)
- LOVA3: Learning to Visual Question Answering, Asking and Assessment [63.41469979867312]
Question answering, asking, and assessment are three innate human traits crucial for understanding the world and acquiring knowledge.
Current Multimodal Large Language Models (MLLMs) primarily focus on question answering, often neglecting the full potential of questioning and assessment skills.
In this study, we introduce LOVA3, an innovative framework named "Learning tO Visual Question Answering, Asking and Assessment".
arXiv Detail & Related papers (2024-05-23T18:21:59Z)
- Perception of Knowledge Boundary for Large Language Models through Semi-open-ended Question Answering [67.94354589215637]
Large Language Models (LLMs) are widely used for knowledge-seeking yet suffer from hallucinations.
In this paper, we perceive the LLMs' knowledge boundary (KB) with semi-open-ended questions (SoeQ).
We find that GPT-4 performs poorly on SoeQ and is often unaware of its KB.
Our auxiliary model, LLaMA-2-13B, is effective in discovering more ambiguous answers.
arXiv Detail & Related papers (2024-05-23T10:00:14Z)
- EHRNoteQA: An LLM Benchmark for Real-World Clinical Practice Using Discharge Summaries [9.031182965159976]
Large Language Models (LLMs) show promise in efficiently analyzing vast and complex data.
We introduce EHRNoteQA, a novel benchmark built on the MIMIC-IV EHR, comprising 962 different QA pairs each linked to distinct patients' discharge summaries.
EHRNoteQA includes questions that require information across multiple discharge summaries and covers eight diverse topics, mirroring the complexity and diversity of real clinical inquiries.
arXiv Detail & Related papers (2024-02-25T09:41:50Z)
- Fine-tuning Large Language Model (LLM) Artificial Intelligence Chatbots in Ophthalmology and LLM-based evaluation using GPT-4 [2.3715885775680925]
400 ophthalmology questions and paired answers were created by ophthalmologists to represent commonly asked patient questions.
We fine-tuned 5 different LLMs, including LLAMA2-7b, LLAMA2-7b-Chat, LLAMA2-13b, and LLAMA2-13b-Chat.
A customized clinical evaluation was used to guide GPT-4 evaluation, grounded on clinical accuracy, relevance, patient safety, and ease of understanding.
arXiv Detail & Related papers (2024-02-15T16:43:41Z)
- GPT-4's assessment of its performance in a USMLE-based case study [3.2372388230841977]
This study investigates GPT-4's assessment of its performance in healthcare applications.
The questionnaire was categorized into two groups: questions with feedback (WF) and questions with no feedback (NF) post-question.
Results indicate that feedback influences relative confidence but doesn't consistently increase or decrease it.
arXiv Detail & Related papers (2024-02-15T01:38:50Z)
- MedAlign: A Clinician-Generated Dataset for Instruction Following with Electronic Medical Records [60.35217378132709]
Large language models (LLMs) can follow natural language instructions with human-level fluency.
Evaluating LLMs on realistic text generation tasks for healthcare remains challenging.
We introduce MedAlign, a benchmark dataset of 983 natural language instructions for EHR data.
arXiv Detail & Related papers (2023-08-27T12:24:39Z)
- Evaluation of GPT-3.5 and GPT-4 for supporting real-world information needs in healthcare delivery [17.47170218010073]
Our objective was to determine whether two large language models (LLMs) can serve information needs submitted by physicians as questions to an informatics consultation service in a safe and concordant manner.
For GPT-3.5, responses to 8 questions were concordant with the informatics consult report, 20 discordant, and 9 were unable to be assessed.
Less than 20% of the responses agreed with an answer from an informatics consultation service, responses contained hallucinated references, and physicians were divided on what constitutes harm.
arXiv Detail & Related papers (2023-04-26T17:54:28Z)
- Can ChatGPT Assess Human Personalities? A General Evaluation Framework [70.90142717649785]
Large Language Models (LLMs) have produced impressive results in various areas, but their potential human-like psychology is still largely unexplored.
This paper presents a generic evaluation framework for LLMs to assess human personalities based on Myers Briggs Type Indicator (MBTI) tests.
arXiv Detail & Related papers (2023-03-01T06:16:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed papers (including all information) and is not responsible for any consequences.