An Empirical Evaluation of Large Language Models on Consumer Health Questions
- URL: http://arxiv.org/abs/2501.00208v1
- Date: Tue, 31 Dec 2024 01:08:15 GMT
- Title: An Empirical Evaluation of Large Language Models on Consumer Health Questions
- Authors: Moaiz Abrar, Yusuf Sermet, Ibrahim Demir,
- Abstract summary: This study evaluates the performance of several Large Language Models (LLMs) on MedRedQA, a dataset of consumer-based medical questions and answers.<n> GPT-4o mini achieved the highest alignment with expert responses according to four out of the five models' judges, while Mistral-7B scored lowest according to three out of five models' judges.
- Score: 0.30723404270319693
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This study evaluates the performance of several Large Language Models (LLMs) on MedRedQA, a dataset of consumer-based medical questions and answers by verified experts extracted from the AskDocs subreddit. While LLMs have shown proficiency in clinical question answering (QA) benchmarks, their effectiveness on real-world, consumer-based, medical questions remains less understood. MedRedQA presents unique challenges, such as informal language and the need for precise responses suited to non-specialist queries. To assess model performance, responses were generated using five LLMs: GPT-4o mini, Llama 3.1: 70B, Mistral-123B, Mistral-7B, and Gemini-Flash. A cross-evaluation method was used, where each model evaluated its responses as well as those of others to minimize bias. The results indicated that GPT-4o mini achieved the highest alignment with expert responses according to four out of the five models' judges, while Mistral-7B scored lowest according to three out of five models' judges. This study highlights the potential and limitations of current LLMs for consumer health medical question answering, indicating avenues for further development.
Related papers
- Cancer-Myth: Evaluating AI Chatbot on Patient Questions with False Presuppositions [16.21971764311474]
We evaluate large language models (LLMs) on cancer-related questions drawn from real patients.
LLMs frequently fail to recognize or address false presuppositions in the questions.
These findings expose a critical gap in the clinical reliability of LLMs.
arXiv Detail & Related papers (2025-04-15T16:37:32Z) - Right Answer, Wrong Score: Uncovering the Inconsistencies of LLM Evaluation in Multiple-Choice Question Answering [78.89231943329885]
One of the most widely used tasks to evaluate Large Language Models (LLMs) is Multiple-Choice Question Answering (MCQA)
In this work, we shed light on the inconsistencies of MCQA evaluation strategies, which can lead to inaccurate and misleading model comparisons.
arXiv Detail & Related papers (2025-03-19T08:45:03Z) - MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research [57.61445960384384]
MicroVQA consists of 1,042 multiple-choice questions (MCQs) curated by biology experts across diverse microscopy modalities.
Benchmarking on state-of-the-art MLLMs reveal a peak performance of 53%.
Expert analysis of chain-of-thought responses shows perception errors are the most frequent, followed by knowledge errors and then overgeneralization errors.
arXiv Detail & Related papers (2025-03-17T17:33:10Z) - Structured Outputs Enable General-Purpose LLMs to be Medical Experts [50.02627258858336]
Large language models (LLMs) often struggle with open-ended medical questions.
We propose a novel approach utilizing structured medical reasoning.
Our approach achieves the highest Factuality Score of 85.8, surpassing fine-tuned models.
arXiv Detail & Related papers (2025-03-05T05:24:55Z) - LlaMADRS: Prompting Large Language Models for Interview-Based Depression Assessment [75.44934940580112]
This study introduces LlaMADRS, a novel framework leveraging open-source Large Language Models (LLMs) to automate depression severity assessment.
We employ a zero-shot prompting strategy with carefully designed cues to guide the model in interpreting and scoring transcribed clinical interviews.
Our approach, tested on 236 real-world interviews, demonstrates strong correlations with clinician assessments.
arXiv Detail & Related papers (2025-01-07T08:49:04Z) - A Benchmark for Long-Form Medical Question Answering [4.815957808858573]
There is a lack of benchmarks for evaluating large language models (LLMs) in long-form medical question answering (QA)
Most existing medical QA evaluation benchmarks focus on automatic metrics and multiple-choice questions.
In this work, we introduce a new publicly available benchmark featuring real-world consumer medical questions with long-form answer evaluations annotated by medical doctors.
arXiv Detail & Related papers (2024-11-14T22:54:38Z) - Towards Evaluating and Building Versatile Large Language Models for Medicine [57.49547766838095]
We present MedS-Bench, a benchmark designed to evaluate the performance of large language models (LLMs) in clinical contexts.
MedS-Bench spans 11 high-level clinical tasks, including clinical report summarization, treatment recommendations, diagnosis, named entity recognition, and medical concept explanation.
MedS-Ins comprises 58 medically oriented language corpora, totaling 13.5 million samples across 122 tasks.
arXiv Detail & Related papers (2024-08-22T17:01:34Z) - GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI [67.09501109871351]
Large Vision-Language Models (LVLMs) are capable of handling diverse data types such as imaging, text, and physiological signals.
GMAI-MMBench is the most comprehensive general medical AI benchmark with well-categorized data structure and multi-perceptual granularity to date.
It is constructed from 284 datasets across 38 medical image modalities, 18 clinical-related tasks, 18 departments, and 4 perceptual granularities in a Visual Question Answering (VQA) format.
arXiv Detail & Related papers (2024-08-06T17:59:21Z) - Answering real-world clinical questions using large language model based systems [2.2605659089865355]
Large language models (LLMs) could potentially address both challenges by either summarizing published literature or generating new studies based on real-world data (RWD)
We evaluated the ability of five LLM-based systems in answering 50 clinical questions and had nine independent physicians review the responses for relevance, reliability, and actionability.
arXiv Detail & Related papers (2024-06-29T22:39:20Z) - MACAROON: Training Vision-Language Models To Be Your Engaged Partners [95.32771929749514]
Large vision-language models (LVLMs) generate detailed responses even when questions are ambiguous or unlabeled.
In this study, we aim to shift LVLMs from passive answer providers to proactive engaged partners.
We introduce MACAROON, self-iMaginAtion for ContrAstive pReference OptimizatiON, which instructs LVLMs to autonomously generate contrastive response pairs for unlabeled questions.
arXiv Detail & Related papers (2024-06-20T09:27:33Z) - A Spectrum Evaluation Benchmark for Medical Multi-Modal Large Language Models [57.88111980149541]
We introduce Asclepius, a novel Med-MLLM benchmark that assesses Med-MLLMs in terms of distinct medical specialties and different diagnostic capacities.<n>Grounded in 3 proposed core principles, Asclepius ensures a comprehensive evaluation by encompassing 15 medical specialties.<n>We also provide an in-depth analysis of 6 Med-MLLMs and compare them with 3 human specialists.
arXiv Detail & Related papers (2024-02-17T08:04:23Z) - Quality of Answers of Generative Large Language Models vs Peer Patients
for Interpreting Lab Test Results for Lay Patients: Evaluation Study [5.823006266363981]
Large language models (LLMs) have opened a promising avenue for patients to get their questions answered.
We generated responses to 53 questions from four LLMs including GPT-4, Meta LLaMA 2, MedAlpaca, and ORCA_mini.
We find that GPT-4's responses are more accurate, helpful, relevant, and safer.
arXiv Detail & Related papers (2024-01-23T22:03:51Z) - Large Language Models Encode Clinical Knowledge [21.630872464930587]
Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation.
We propose a framework for human evaluation of model answers along multiple axes including factuality, precision, possible harm, and bias.
We show that comprehension, recall of knowledge, and medical reasoning improve with model scale and instruction prompt tuning.
arXiv Detail & Related papers (2022-12-26T14:28:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.