An Empirical Evaluation of Large Language Models on Consumer Health Questions
- URL: http://arxiv.org/abs/2501.00208v1
- Date: Tue, 31 Dec 2024 01:08:15 GMT
- Title: An Empirical Evaluation of Large Language Models on Consumer Health Questions
- Authors: Moaiz Abrar, Yusuf Sermet, Ibrahim Demir,
- Abstract summary: This study evaluates the performance of several Large Language Models (LLMs) on MedRedQA, a dataset of consumer-based medical questions and answers.
GPT-4o mini achieved the highest alignment with expert responses according to four out of the five models' judges, while Mistral-7B scored lowest according to three out of five models' judges.
- Score: 0.30723404270319693
- License:
- Abstract: This study evaluates the performance of several Large Language Models (LLMs) on MedRedQA, a dataset of consumer-based medical questions and answers by verified experts extracted from the AskDocs subreddit. While LLMs have shown proficiency in clinical question answering (QA) benchmarks, their effectiveness on real-world, consumer-based, medical questions remains less understood. MedRedQA presents unique challenges, such as informal language and the need for precise responses suited to non-specialist queries. To assess model performance, responses were generated using five LLMs: GPT-4o mini, Llama 3.1: 70B, Mistral-123B, Mistral-7B, and Gemini-Flash. A cross-evaluation method was used, where each model evaluated its responses as well as those of others to minimize bias. The results indicated that GPT-4o mini achieved the highest alignment with expert responses according to four out of the five models' judges, while Mistral-7B scored lowest according to three out of five models' judges. This study highlights the potential and limitations of current LLMs for consumer health medical question answering, indicating avenues for further development.
Related papers
- LlaMADRS: Prompting Large Language Models for Interview-Based Depression Assessment [75.44934940580112]
This study introduces LlaMADRS, a novel framework leveraging open-source Large Language Models (LLMs) to automate depression severity assessment.
We employ a zero-shot prompting strategy with carefully designed cues to guide the model in interpreting and scoring transcribed clinical interviews.
Our approach, tested on 236 real-world interviews, demonstrates strong correlations with clinician assessments.
arXiv Detail & Related papers (2025-01-07T08:49:04Z) - A Benchmark for Long-Form Medical Question Answering [4.815957808858573]
There is a lack of benchmarks for evaluating large language models (LLMs) in long-form medical question answering (QA)
Most existing medical QA evaluation benchmarks focus on automatic metrics and multiple-choice questions.
In this work, we introduce a new publicly available benchmark featuring real-world consumer medical questions with long-form answer evaluations annotated by medical doctors.
arXiv Detail & Related papers (2024-11-14T22:54:38Z) - Comprehensive and Practical Evaluation of Retrieval-Augmented Generation Systems for Medical Question Answering [70.44269982045415]
Retrieval-augmented generation (RAG) has emerged as a promising approach to enhance the performance of large language models (LLMs)
We introduce Medical Retrieval-Augmented Generation Benchmark (MedRGB) that provides various supplementary elements to four medical QA datasets.
Our experimental results reveals current models' limited ability to handle noise and misinformation in the retrieved documents.
arXiv Detail & Related papers (2024-11-14T06:19:18Z) - Towards Evaluating and Building Versatile Large Language Models for Medicine [57.49547766838095]
We present MedS-Bench, a benchmark designed to evaluate the performance of large language models (LLMs) in clinical contexts.
MedS-Bench spans 11 high-level clinical tasks, including clinical report summarization, treatment recommendations, diagnosis, named entity recognition, and medical concept explanation.
MedS-Ins comprises 58 medically oriented language corpora, totaling 13.5 million samples across 122 tasks.
arXiv Detail & Related papers (2024-08-22T17:01:34Z) - GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI [67.09501109871351]
Large Vision-Language Models (LVLMs) are capable of handling diverse data types such as imaging, text, and physiological signals.
GMAI-MMBench is the most comprehensive general medical AI benchmark with well-categorized data structure and multi-perceptual granularity to date.
It is constructed from 284 datasets across 38 medical image modalities, 18 clinical-related tasks, 18 departments, and 4 perceptual granularities in a Visual Question Answering (VQA) format.
arXiv Detail & Related papers (2024-08-06T17:59:21Z) - Answering real-world clinical questions using large language model based systems [2.2605659089865355]
Large language models (LLMs) could potentially address both challenges by either summarizing published literature or generating new studies based on real-world data (RWD)
We evaluated the ability of five LLM-based systems in answering 50 clinical questions and had nine independent physicians review the responses for relevance, reliability, and actionability.
arXiv Detail & Related papers (2024-06-29T22:39:20Z) - MACAROON: Training Vision-Language Models To Be Your Engaged Partners [95.32771929749514]
Large vision-language models (LVLMs) generate detailed responses even when questions are ambiguous or unlabeled.
In this study, we aim to shift LVLMs from passive answer providers to proactive engaged partners.
We introduce MACAROON, self-iMaginAtion for ContrAstive pReference OptimizatiON, which instructs LVLMs to autonomously generate contrastive response pairs for unlabeled questions.
arXiv Detail & Related papers (2024-06-20T09:27:33Z) - Assessing The Potential Of Mid-Sized Language Models For Clinical QA [24.116649037975762]
Large language models, such as GPT-4 and Med-PaLM, have shown impressive performance on clinical tasks.
Mid-size models such as BioGPT-large, BioMedLM, LLaMA 2, and Mistral 7B avoid these drawbacks.
This study provides the first head-to-head assessment of open source mid-sized models on clinical tasks.
arXiv Detail & Related papers (2024-04-24T14:32:34Z) - Quality of Answers of Generative Large Language Models vs Peer Patients
for Interpreting Lab Test Results for Lay Patients: Evaluation Study [5.823006266363981]
Large language models (LLMs) have opened a promising avenue for patients to get their questions answered.
We generated responses to 53 questions from four LLMs including GPT-4, Meta LLaMA 2, MedAlpaca, and ORCA_mini.
We find that GPT-4's responses are more accurate, helpful, relevant, and safer.
arXiv Detail & Related papers (2024-01-23T22:03:51Z) - MedAlign: A Clinician-Generated Dataset for Instruction Following with
Electronic Medical Records [60.35217378132709]
Large language models (LLMs) can follow natural language instructions with human-level fluency.
evaluating LLMs on realistic text generation tasks for healthcare remains challenging.
We introduce MedAlign, a benchmark dataset of 983 natural language instructions for EHR data.
arXiv Detail & Related papers (2023-08-27T12:24:39Z) - Large Language Models Encode Clinical Knowledge [21.630872464930587]
Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation.
We propose a framework for human evaluation of model answers along multiple axes including factuality, precision, possible harm, and bias.
We show that comprehension, recall of knowledge, and medical reasoning improve with model scale and instruction prompt tuning.
arXiv Detail & Related papers (2022-12-26T14:28:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.