Prompt Sensitivity and Answer Consistency of Small Open-Source Large Language Models on Clinical Question Answering: Implications for Low-Resource Healthcare Deployment
- URL: http://arxiv.org/abs/2603.00917v2
- Date: Wed, 04 Mar 2026 06:10:32 GMT
- Title: Prompt Sensitivity and Answer Consistency of Small Open-Source Large Language Models on Clinical Question Answering: Implications for Low-Resource Healthcare Deployment
- Authors: Shravani Hariprasad
- Abstract summary: Small open-source language models are gaining attention for healthcare applications in low-resource settings. We evaluate five open-source models (Gemma 2 2B, Phi-3 Mini 3.8B, Llama 3.2 3B, Mistral 7B, and Meditron-7B) across three clinical question answering datasets.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Small open-source language models are gaining attention for healthcare applications in low-resource settings where cloud infrastructure and GPU hardware may be unavailable. However, their reliability under different prompt phrasings remains poorly understood. We evaluate five open-source models (Gemma 2 2B, Phi-3 Mini 3.8B, Llama 3.2 3B, Mistral 7B, and Meditron-7B, a domain-pretrained model without instruction tuning) across three clinical question answering datasets (MedQA, MedMCQA, and PubMedQA) using five prompt styles: original, formal, simplified, roleplay, and direct. Model behavior is evaluated using consistency scores, accuracy, and instruction-following failure rates. All experiments were conducted locally on consumer CPU hardware without fine-tuning. Consistency and accuracy were largely independent across models. Gemma 2 achieved the highest consistency (0.845-0.888) but the lowest accuracy (33.0-43.5%), while Llama 3.2 showed moderate consistency (0.774-0.807) alongside the highest accuracy (49.0-65.0%). Roleplay prompts consistently reduced accuracy across all models, with Phi-3 Mini dropping 21.5 percentage points on MedQA. Meditron-7B exhibited near-complete instruction-following failure on PubMedQA (99.0% UNKNOWN rate), indicating that domain pretraining alone is insufficient for structured clinical QA. These findings show that high consistency does not imply correctness: models can be reliably wrong, a dangerous failure mode in clinical AI. Llama 3.2 demonstrated the strongest balance of accuracy and reliability for low-resource deployment. Safe clinical AI requires joint evaluation of consistency, accuracy, and instruction adherence.
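The abstract reports consistency scores, accuracy, and instruction-following failure (UNKNOWN) rates but does not spell out the formulas. Below is a minimal Python sketch, assuming consistency is the mean pairwise answer agreement across the five prompt styles and that unparseable outputs are coded UNKNOWN; the record layout, the `parse_answer` heuristic, and the pooling choices are illustrative assumptions, not the authors' implementation.

```python
from itertools import combinations

PROMPT_STYLES = ["original", "formal", "simplified", "roleplay", "direct"]
VALID_CHOICES = {"A", "B", "C", "D"}  # MedQA/MedMCQA options; PubMedQA would use yes/no/maybe

def parse_answer(raw: str) -> str:
    """Map a raw model response to a choice letter; unparseable output counts as UNKNOWN."""
    token = raw.strip().upper()[:1]
    return token if token in VALID_CHOICES else "UNKNOWN"

def consistency_score(answers: dict[str, str]) -> float:
    """Mean pairwise agreement of one question's answers across the five prompt styles."""
    pairs = list(combinations(PROMPT_STYLES, 2))
    return sum(answers[a] == answers[b] for a, b in pairs) / len(pairs)

def evaluate(records: list[dict]) -> dict[str, float]:
    """records: per-question dicts with a style->parsed-answer map under 'answers' and a 'gold' label."""
    n = len(records)
    consistency = sum(consistency_score(r["answers"]) for r in records) / n
    # Accuracy and instruction-following failure pooled over all prompt styles.
    preds = [(r["answers"][s], r["gold"]) for r in records for s in PROMPT_STYLES]
    accuracy = sum(p == g for p, g in preds) / len(preds)
    unknown_rate = sum(p == "UNKNOWN" for p, _ in preds) / len(preds)
    return {"consistency": consistency, "accuracy": accuracy, "unknown_rate": unknown_rate}
```

On this definition, a model that always returns the same wrong letter scores consistency 1.0 with accuracy 0.0, which is exactly the "reliably wrong" failure mode the abstract warns about.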
Related papers
- PanCanBench: A Comprehensive Benchmark for Evaluating Large Language Models in Pancreatic Oncology [48.732366302949515]
Large language models (LLMs) have achieved expert-level performance on standardized examinations, yet multiple-choice accuracy poorly reflects real-world clinical utility and safety. We developed a human-in-the-loop pipeline to create expert rubrics for de-identified patient questions. We evaluated 22 proprietary and open-source LLMs using an LLM-as-a-judge framework, measuring clinical completeness, factual accuracy, and web-search integration.
arXiv Detail & Related papers (2026-03-02T00:50:39Z) - Small Language Models for Privacy-Preserving Clinical Information Extraction in Low-Resource Languages [0.0]
This study evaluates a two-step pipeline combining Aya-expanse-8B as a Persian-to-English translation model with five open-source small language models (SLMs). Models were assessed on macro-averaged F1-score, Matthews Correlation Coefficient (MCC), sensitivity, and specificity to account for class imbalance (a metric sketch follows this list). A bilingual analysis of Aya-expanse-8B revealed that translating Persian transcripts to English improved sensitivity, reduced missing outputs, and boosted metrics robust to class imbalance.
arXiv Detail & Related papers (2026-02-24T21:10:29Z) - Measuring Stability Beyond Accuracy in Small Open-Source Medical Large Language Models for Pediatric Endocrinology [34.80893325510028]
Small open-source medical large language models (LLMs) offer promising opportunities for low-resource deployment and broader accessibility. Coupled with human evaluation and clinical review, we assess six small open-source medical LLMs.
arXiv Detail & Related papers (2025-12-26T14:30:53Z) - RefineBench: Evaluating Refinement Capability of Language Models via Checklists [71.02281792867531]
We evaluate two refinement modes: guided refinement and self-refinement. In guided refinement, both proprietary LMs and large open-weight LMs can leverage targeted feedback to refine responses to near-perfect levels within five turns. These findings suggest that frontier LMs require breakthroughs to self-refine their incorrect responses.
arXiv Detail & Related papers (2025-11-27T07:20:52Z) - Evaluating Reasoning Faithfulness in Medical Vision-Language Models using Multimodal Perturbations [19.488236277427358]
Vision-language models (VLMs) often produce chain-of-thought (CoT) explanations that sound plausible yet fail to reflect the underlying decision process. We present a clinically grounded framework for chest X-ray visual question answering (VQA) that probes CoT faithfulness via controlled text and image modifications.
arXiv Detail & Related papers (2025-10-13T09:28:22Z) - EchoBench: Benchmarking Sycophancy in Medical Large Vision-Language Models [82.43729208063468]
Recent benchmarks for medical Large Vision-Language Models (LVLMs) emphasize leaderboard accuracy, overlooking reliability and safety. We study sycophancy: models' tendency to uncritically echo user-provided information. We introduce EchoBench, a benchmark to systematically evaluate sycophancy in medical LVLMs.
arXiv Detail & Related papers (2025-09-24T14:09:55Z) - MedOmni-45°: A Safety-Performance Benchmark for Reasoning-Oriented LLMs in Medicine [69.08855631283829]
We introduce MedOmni-45°, a benchmark designed to quantify safety-performance trade-offs under manipulative hint conditions. It contains 1,804 reasoning-focused medical questions across six specialties and three task types, including 500 from MedMCQA. Results show a consistent safety-performance trade-off, with no model surpassing the diagonal.
arXiv Detail & Related papers (2025-08-22T08:38:16Z) - ReasonMed: A 370K Multi-Agent Generated Dataset for Advancing Medical Reasoning [54.30630356786752]
ReasonMed is the largest medical reasoning dataset to date, with 370k high-quality examples. It is built through a multi-agent generation, verification, and refinement process. Using ReasonMed, we find that integrating detailed CoT reasoning with concise answer summaries yields the most robust fine-tuning results.
arXiv Detail & Related papers (2025-06-11T08:36:55Z) - Stabilizing Reasoning in Medical LLMs with Continued Pretraining and Reasoning Preference Optimization [0.06554326244334867]
This paper introduces Preferred-MedLLM-Qwen-72B, a 72B-parameter model optimized for the Japanese medical domain. We employ a two-stage fine-tuning process on the Qwen2.5-72B base model to achieve both high accuracy and stable reasoning.
arXiv Detail & Related papers (2025-04-25T05:15:31Z) - Can Reasoning LLMs Enhance Clinical Document Classification? [7.026393789313748]
Large Language Models (LLMs) offer promising improvements in accuracy and efficiency for this task. This study evaluates the performance and consistency of eight LLMs: four reasoning models (Qwen QwQ, DeepSeek Reasoner, GPT o3 Mini, Gemini 2.0 Flash Thinking) and four non-reasoning models (Llama 3.3, GPT 4o Mini, Gemini 2.0 Flash, DeepSeek Chat). Results showed that reasoning models outperformed non-reasoning models in accuracy (71% vs. 68%) and F1 score (67% vs. 60%).
arXiv Detail & Related papers (2025-04-10T18:00:27Z) - Quantifying the Reasoning Abilities of LLMs on Real-world Clinical Cases [48.87360916431396]
We introduce MedR-Bench, a benchmarking dataset of 1,453 structured patient cases, annotated with reasoning references. We propose a framework encompassing three critical stages (examination recommendation, diagnostic decision-making, and treatment planning), simulating the entire patient care journey. Using this benchmark, we evaluate five state-of-the-art reasoning LLMs, including DeepSeek-R1, OpenAI-o3-mini, and Gemini-2.0-Flash Thinking.
arXiv Detail & Related papers (2025-03-06T18:35:39Z) - Agent-Based Uncertainty Awareness Improves Automated Radiology Report Labeling with an Open-Source Large Language Model [1.7064514726335305]
We analyzed 9,683 Hebrew radiology reports from Crohn's disease patients. We incorporated uncertainty-aware prompt ensembles and an agent-based decision model.
arXiv Detail & Related papers (2025-02-02T16:57:03Z) - LLM Robustness Against Misinformation in Biomedical Question Answering [50.98256373698759]
The retrieval-augmented generation (RAG) approach is used to reduce the confabulation of large language models (LLMs) for question answering.
We evaluate the effectiveness and robustness of four LLMs against misinformation in answering biomedical questions.
arXiv Detail & Related papers (2024-10-27T16:23:26Z)
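The class-imbalance metrics named in the privacy-preserving clinical extraction entry above (macro-averaged F1, MCC, sensitivity, specificity) are standard and easy to reproduce. A minimal scikit-learn sketch on hypothetical binary labels, not the paper's evaluation code:

```python
from sklearn.metrics import f1_score, matthews_corrcoef, recall_score

# Hypothetical gold labels and model predictions for an imbalanced binary task.
y_true = [1, 0, 0, 0, 1, 0, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]

macro_f1 = f1_score(y_true, y_pred, average="macro")      # averages per-class F1 equally
mcc = matthews_corrcoef(y_true, y_pred)                   # robust to class imbalance
sensitivity = recall_score(y_true, y_pred)                # recall on the positive class
specificity = recall_score(y_true, y_pred, pos_label=0)   # recall on the negative class

print(f"macro-F1={macro_f1:.3f}  MCC={mcc:.3f}  sens={sensitivity:.3f}  spec={specificity:.3f}")
```

Unlike plain accuracy, macro-F1 and MCC penalize a classifier that simply predicts the majority class, which is why imbalance-aware studies report them.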