RephQA: Evaluating Readability of Large Language Models in Public Health Question Answering
- URL: http://arxiv.org/abs/2509.16360v2
- Date: Fri, 03 Oct 2025 00:51:44 GMT
- Title: RephQA: Evaluating Readability of Large Language Models in Public Health Question Answering
- Authors: Weikang Qiu, Tinglin Huang, Ryan Rullo, Yucheng Kuang, Ali Maatouk, S. Raquel Ramos, Rex Ying
- Abstract summary: Large Language Models (LLMs) hold promise in addressing complex medical problems. A significant bottleneck in developing effective healthcare agents lies in the readability of LLM-generated responses. We introduce RephQA, a benchmark for evaluating the readability of LLMs in public health question answering (QA).
- Score: 22.172697706271535
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) hold promise in addressing complex medical problems. However, while most prior studies focus on improving accuracy and reasoning abilities, a significant bottleneck in developing effective healthcare agents lies in the readability of LLM-generated responses: specifically, their ability to answer public health questions clearly and simply for people without medical backgrounds. In this work, we introduce RephQA, a benchmark for evaluating the readability of LLMs in public health question answering (QA). It contains 533 expert-reviewed QA pairs from 27 sources across 13 topics, and includes a proxy multiple-choice task to assess informativeness, along with two readability metrics: the Flesch-Kincaid grade level and a professional score. Evaluation of 25 LLMs reveals that most fail to meet readability standards, highlighting a gap between reasoning and effective communication. To address this, we explore four readability-enhancing strategies: standard prompting, chain-of-thought prompting, Group Relative Policy Optimization (GRPO), and a token-adapted GRPO variant. Token-adapted GRPO achieves the best results, a step toward building more practical and user-friendly agents for public health.
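The two quantitative pieces of the abstract are easy to make concrete. First, a minimal Python sketch of the standard Flesch-Kincaid grade-level formula; the syllable counter is a crude vowel-run heuristic, not the benchmark's actual implementation.

```python
# Minimal sketch of the Flesch-Kincaid grade level, one of RephQA's
# two readability metrics. Tokenization and syllable counting here
# are rough heuristics, not the benchmark's implementation.
import re

def count_syllables(word: str) -> int:
    # Approximate syllables as runs of vowels (crude but common).
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text: str) -> float:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))
    syllables = sum(count_syllables(w) for w in words)
    # Standard formula: 0.39*(words/sentence) + 11.8*(syllables/word) - 15.59
    return 0.39 * (n_words / sentences) + 11.8 * (syllables / n_words) - 15.59

# Short sentences with simple words yield a low (more readable) grade.
print(flesch_kincaid_grade("Wash your hands often. Soap and water work best."))
```

Second, a hedged sketch of the group-relative advantage at the core of GRPO: sample a group of responses per prompt, score each with a reward, and normalize within the group. The paper's token-adapted variant and its actual reward design are not shown; the readability-shaped reward in the comment is only an assumption.

```python
# Hedged sketch of GRPO's group-relative advantage normalization.
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero std
    return [(r - mean) / std for r in rewards]

# e.g., a reward like -flesch_kincaid_grade(response) would push the
# policy toward simpler answers (an assumption, not the paper's design).
print(group_relative_advantages([2.0, 5.0, 3.0, 6.0]))
```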
Related papers
- From Knowledge to Inference: Scaling Laws of Specialized Reasoning on GlobalHealthAtlas [1.8594711725515678]
We introduce GlobalHealthAtlas, a large-scale multilingual dataset of 280,210 instances spanning 15 public health domains and 17 languages. We propose a large language model (LLM)-assisted construction and quality-control pipeline with retrieval, duplication, and evidence-grounding checks, plus label validation, to improve consistency at scale.
arXiv Detail & Related papers (2026-01-31T03:29:30Z)
- Faithful Summarization of Consumer Health Queries: A Cross-Lingual Framework with LLMs [0.0]
We propose a framework that combines TextRank-based sentence extraction and medical named entity recognition. We fine-tuned the LLaMA-2-7B model on the MeQSum (English) and BanglaCHQ-Summ (Bangla) datasets. Human evaluation shows that over 80% of generated summaries preserve critical medical information.
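A minimal sketch of the TextRank-style sentence extraction step mentioned above, assuming bag-of-words cosine similarity between sentences and networkx's PageRank; the paper's pipeline additionally uses medical named entity recognition, which is omitted here.

```python
# TextRank-style extractive selection: build a sentence-similarity
# graph and keep the top-k sentences by PageRank score.
import re
from collections import Counter
from math import sqrt

import networkx as nx

def cosine(a: Counter, b: Counter) -> float:
    num = sum(a[w] * b[w] for w in set(a) & set(b))
    den = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def textrank_extract(text: str, k: int = 2) -> list[str]:
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    vectors = [Counter(re.findall(r"\w+", s.lower())) for s in sentences]
    graph = nx.Graph()
    graph.add_nodes_from(range(len(sentences)))
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            w = cosine(vectors[i], vectors[j])
            if w > 0:
                graph.add_edge(i, j, weight=w)
    scores = nx.pagerank(graph, weight="weight")
    top = sorted(scores, key=scores.get, reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]  # keep original order
```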
arXiv Detail & Related papers (2025-11-13T19:42:11Z)
- MedAlign: A Synergistic Framework of Multimodal Preference Optimization and Federated Meta-Cognitive Reasoning [52.064286116035134]
We develop MedAlign, a framework to ensure visually accurate LVLM responses for Medical Visual Question Answering (Med-VQA). We first propose a multimodal Direct Preference Optimization (mDPO) objective to align preference learning with visual context. We then design a Retrieval-Aware Mixture-of-Experts (RA-MoE) architecture that utilizes image and text similarity to route queries to a specialized and context-augmented LVLM.
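For context, a hedged sketch of the standard Direct Preference Optimization loss that an mDPO-style objective builds on; folding the conditioning on the medical image into the log-probabilities is an assumption here, not the paper's formulation.

```python
# Standard DPO loss: prefer the chosen response over the rejected one,
# relative to a frozen reference model. Inputs are sequence
# log-probabilities (for Med-VQA, implicitly conditioned on the image).
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))

# Lower loss when the policy prefers the visually grounded answer more
# strongly than the reference model does.
print(dpo_loss(-4.0, -7.0, -5.0, -6.5))
```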
arXiv Detail & Related papers (2025-10-24T02:11:05Z)
- Dr. GPT Will See You Now, but Should It? Exploring the Benefits and Harms of Large Language Models in Medical Diagnosis using Crowdsourced Clinical Cases [7.894865736540358]
Large Language Models (LLMs) are used in high-stakes applications such as medical (self-)diagnosis and preliminary triage. This paper presents findings from a university-level competition that leveraged a novel, crowdsourced approach for evaluating the effectiveness of LLMs.
arXiv Detail & Related papers (2025-06-13T17:12:47Z)
- Structured Outputs Enable General-Purpose LLMs to be Medical Experts [50.02627258858336]
Large language models (LLMs) often struggle with open-ended medical questions. We propose a novel approach utilizing structured medical reasoning. Our approach achieves the highest Factuality Score of 85.8, surpassing fine-tuned models.
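Illustrative only: one way to elicit structured medical reasoning from a general-purpose LLM by embedding a JSON schema in the prompt. The field names are hypothetical, not the paper's actual output format.

```python
# Hypothetical schema for structured medical reasoning; the model is
# asked to fill it rather than answer free-form.
import json

response_schema = {
    "summary": "one-sentence answer for the patient",
    "reasoning_steps": ["ordered clinical reasoning steps"],
    "differential_diagnoses": ["candidate conditions, most likely first"],
    "evidence": ["facts from the question supporting each step"],
    "uncertainty": "what additional information would change the answer",
}

prompt = (
    "Answer the medical question below. Respond ONLY with JSON "
    "matching this schema:\n" + json.dumps(response_schema, indent=2)
)
# The reply can then be parsed with json.loads and checked field by field.
```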
arXiv Detail & Related papers (2025-03-05T05:24:55Z)
- LlaMADRS: Prompting Large Language Models for Interview-Based Depression Assessment [75.44934940580112]
This study introduces LlaMADRS, a novel framework leveraging open-source Large Language Models (LLMs) to automate depression severity assessment. We employ a zero-shot prompting strategy with carefully designed cues to guide the model in interpreting and scoring transcribed clinical interviews. Our approach, tested on 236 real-world interviews, demonstrates strong correlations with clinician assessments.
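A hypothetical zero-shot cue in the spirit of LlaMADRS, scoring a single MADRS item (each item uses a 0-6 scale) from a transcribed interview; the wording is illustrative, not the paper's actual prompt.

```python
# Hypothetical zero-shot prompt template for one MADRS item.
MADRS_ITEM_PROMPT = """You are a clinical rater using the MADRS.
Item: {item_name} (scored 0-6; anchor definitions below)
{anchor_definitions}

Interview excerpt:
{transcript}

Return only an integer score from 0 to 6."""

# Usage: fill the template per item, then send it to the model.
prompt = MADRS_ITEM_PROMPT.format(
    item_name="Apparent Sadness",
    anchor_definitions="0 = no sadness ... 6 = extreme, continuous sadness",
    transcript="...",
)
```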
arXiv Detail & Related papers (2025-01-07T08:49:04Z)
- Comprehensive and Practical Evaluation of Retrieval-Augmented Generation Systems for Medical Question Answering [70.44269982045415]
Retrieval-augmented generation (RAG) has emerged as a promising approach to enhance the performance of large language models (LLMs).
We introduce the Medical Retrieval-Augmented Generation Benchmark (MedRGB), which provides various supplementary elements to four medical QA datasets.
Our experimental results reveal current models' limited ability to handle noise and misinformation in the retrieved documents.
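A minimal retrieval-augmented answering loop, sketched to show the setting MedRGB stresses: retrieved passages may contain noise or misinformation the model must handle. `search` and `generate` are placeholders for a retriever and an LLM call, not MedRGB components.

```python
# Minimal RAG loop: retrieve, assemble context, generate.
def answer_with_rag(question: str, search, generate, k: int = 5) -> str:
    passages = search(question, k=k)  # may include distracting passages
    context = "\n\n".join(f"[{i+1}] {p}" for i, p in enumerate(passages))
    prompt = (
        "Use the passages below to answer the question. Ignore passages "
        "that are irrelevant or contradict established medical knowledge.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )
    return generate(prompt)
```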
arXiv Detail & Related papers (2024-11-14T06:19:18Z)
- HealthQ: Unveiling Questioning Capabilities of LLM Chains in Healthcare Conversations [20.31796453890812]
HealthQ is a framework for evaluating the questioning capabilities of large language models (LLMs) in healthcare conversations. We integrate an LLM judge to evaluate generated questions across metrics such as specificity, relevance, and usefulness. We present the first systematic framework for assessing questioning capabilities in healthcare conversations, establish a model-agnostic evaluation methodology, and provide empirical evidence linking high-quality questions to improved patient information elicitation.
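A sketch of the LLM-as-judge step described above, grading a generated clinical question on the named metrics; the rubric wording and the 1-5 scale are assumptions, not HealthQ's actual prompt.

```python
# Hypothetical judge prompt; fill {statement} and {question}, send to
# the judge model, and parse the integer scores from its JSON reply.
JUDGE_PROMPT = """Rate the follow-up question a clinician-bot asked,
given the patient's statement. Score each metric from 1 to 5:
- specificity: does it target a concrete symptom or detail?
- relevance: does it follow from what the patient said?
- usefulness: would the answer help narrow the assessment?

Patient statement: {statement}
Generated question: {question}

Respond as JSON with integer fields "specificity", "relevance",
and "usefulness"."""
```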
arXiv Detail & Related papers (2024-09-28T23:59:46Z)
- GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI [67.09501109871351]
Large Vision-Language Models (LVLMs) are capable of handling diverse data types such as imaging, text, and physiological signals.
GMAI-MMBench is the most comprehensive general medical AI benchmark with well-categorized data structure and multi-perceptual granularity to date.
It is constructed from 284 datasets across 38 medical image modalities, 18 clinical-related tasks, 18 departments, and 4 perceptual granularities in a Visual Question Answering (VQA) format.
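For illustration, a hypothetical shape for a single VQA-style record in a benchmark of this kind; the field names and values are invented, not GMAI-MMBench's schema.

```python
# Invented example record in a medical VQA format.
record = {
    "image": "path/to/scan.png",       # one of 38 imaging modalities
    "question": "Which lesion type is visible?",
    "options": ["A. ...", "B. ...", "C. ...", "D. ..."],
    "answer": "B",
    "task": "lesion classification",   # one of 18 clinical tasks
    "department": "radiology",         # one of 18 departments
    "granularity": "lesion level",     # one of 4 perceptual granularities
}
```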
arXiv Detail & Related papers (2024-08-06T17:59:21Z)
- Interpretable Multi-Step Reasoning with Knowledge Extraction on Complex Healthcare Question Answering [89.76059961309453]
The HeadQA dataset contains multiple-choice questions drawn from the public healthcare specialization exam.
These questions are among the most challenging for current QA systems.
We present MurKe, a Multi-step reasoning with Knowledge extraction framework that strives to make full use of off-the-shelf pre-trained models.
arXiv Detail & Related papers (2020-08-06T02:47:46Z)