Related papers: HealthQA-BR: A System-Wide Benchmark Reveals Critical Knowledge Gaps in Large Language Models

HealthQA-BR: A System-Wide Benchmark Reveals Critical Knowledge Gaps in Large Language Models

URL: http://arxiv.org/abs/2506.21578v1
Date: Mon, 16 Jun 2025 07:40:25 GMT
Title: HealthQA-BR: A System-Wide Benchmark Reveals Critical Knowledge Gaps in Large Language Models
Authors: Andrew Maranhão Ventura D'addario,
Abstract summary: HealthQA-BR is the first large-scale, system-wide benchmark for Portuguese-speaking healthcare.<n>It uniquely assesses knowledge not only in medicine and its specialties but also in nursing, dentistry, psychology, social work, and other allied health professions.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The evaluation of Large Language Models (LLMs) in healthcare has been dominated by physician-centric, English-language benchmarks, creating a dangerous illusion of competence that ignores the interprofessional nature of patient care. To provide a more holistic and realistic assessment, we introduce HealthQA-BR, the first large-scale, system-wide benchmark for Portuguese-speaking healthcare. Comprising 5,632 questions from Brazil's national licensing and residency exams, it uniquely assesses knowledge not only in medicine and its specialties but also in nursing, dentistry, psychology, social work, and other allied health professions. We conducted a rigorous zero-shot evaluation of over 20 leading LLMs. Our results reveal that while state-of-the-art models like GPT 4.1 achieve high overall accuracy (86.6%), this top-line score masks alarming, previously unmeasured deficiencies. A granular analysis shows performance plummets from near-perfect in specialties like Ophthalmology (98.7%) to barely passing in Neurosurgery (60.0%) and, most notably, Social Work (68.4%). This "spiky" knowledge profile is a systemic issue observed across all models, demonstrating that high-level scores are insufficient for safety validation. By publicly releasing HealthQA-BR and our evaluation suite, we provide a crucial tool to move beyond single-score evaluations and toward a more honest, granular audit of AI readiness for the entire healthcare team.

Related papers

PanCanBench: A Comprehensive Benchmark for Evaluating Large Language Models in Pancreatic Oncology [48.732366302949515]
Large language models (LLMs) have achieved expert-level performance on standardized examinations, yet multiple-choice accuracy poorly reflects real-world clinical utility and safety.<n>We developed a human-in-the-loop pipeline to create expert rubrics for de-identified patient questions.<n>We evaluated 22 proprietary and open-source LLMs using an LLM-as-a-judge framework, measuring clinical completeness, factual accuracy, and web-search integration.
arXiv Detail & Related papers (2026-03-02T00:50:39Z)
MLB: A Scenario-Driven Benchmark for Evaluating Large Language Models in Clinical Applications [27.73095565539546]
We introduce a Medical LLM Benchmark MLB, a benchmark evaluating Large Language Models (LLMs) on both foundational knowledge and scenario-based reasoning.<n>MLB is structured around five core dimensions: Medical Knowledge (MedKQA), Safety and Ethics (MedSE), Medical Record Understanding (MedRU), Smart Services (SmartServ), and Smart Healthcare (SmartCare)<n>Its design features a rigorous curation pipeline involving 300 licensed physicians. Besides, we provide a scalable evaluation methodology, centered on a specialized judge model trained via Supervised Fine-Tuning (SFT) on expert annotations.
arXiv Detail & Related papers (2026-01-08T02:41:42Z)
A Real-World Evaluation of LLM Medication Safety Reviews in NHS Primary Care [5.167350493769989]
This is the first evaluation of an LLM-based medication safety review system on real NHS primary care data.<n>We strategically sampled patients to capture a broad range of clinical complexity and medication safety risk.<n>Our primary LLM system showed strong performance in recognising when a clinical issue is present.
arXiv Detail & Related papers (2025-12-24T11:58:49Z)
Generalist Foundation Models Are Not Clinical Enough for Hospital Operations [29.539795338917983]
We introduce Lang1, a family of models pretrained on a specialized corpus blending 80B clinical tokens from NYU Langone Health's EHRs and 627B tokens from the internet.<n>To rigorously evaluate Lang1 in real-world settings, we developed the REalistic Medical Evaluation (ReMedE), a benchmark derived from 668,331 EHR notes.<n>Lang1-1B outperforms finetuned generalist models up to 70x larger and zero-shot models up to 671x larger, improving AUROC by 3.64%-6.75% and 1.66%-23.66% respectively.
arXiv Detail & Related papers (2025-11-17T18:52:22Z)
EchoBench: Benchmarking Sycophancy in Medical Large Vision-Language Models [82.43729208063468]
Recent benchmarks for medical Large Vision-Language Models (LVLMs) emphasize leaderboard accuracy, overlooking reliability and safety.<n>We study sycophancy -- models' tendency to uncritically echo user-provided information.<n>We introduce EchoBench, a benchmark to systematically evaluate sycophancy in medical LVLMs.
arXiv Detail & Related papers (2025-09-24T14:09:55Z)
Arabic Large Language Models for Medical Text Generation [0.5483130283061118]
This study proposes an approach that fine-tunes large language models (LLMs) for Arabic medical text generation.<n>The system is designed to assist patients by providing accurate medical advice, diagnoses, drug recommendations, and treatment plans based on user input.
arXiv Detail & Related papers (2025-09-12T09:37:26Z)
Mind the Gap: Evaluating the Representativeness of Quantitative Medical Language Reasoning LLM Benchmarks for African Disease Burdens [0.609562679184219]
Existing medical LLM benchmarks largely reflect examination syllabi and disease profiles from high income settings.<n>Alama Health QA was developed using a retrieval augmented generation framework anchored on the Kenyan Clinical Practice Guidelines.<n>Alama scored highest for relevance and guideline alignment; PubMedQA lowest for clinical utility.
arXiv Detail & Related papers (2025-07-22T08:05:30Z)
MedHELM: Holistic Evaluation of Large Language Models for Medical Tasks [47.486705282473984]
Large language models (LLMs) achieve near-perfect scores on medical exams.<n>These evaluations inadequately reflect complexity and diversity of real-world clinical practice.<n>We introduce MedHELM, an evaluation framework for assessing LLM performance for medical tasks.
arXiv Detail & Related papers (2025-05-26T22:55:49Z)
Med-CoDE: Medical Critique based Disagreement Evaluation Framework [72.42301910238861]
The reliability and accuracy of large language models (LLMs) in medical contexts remain critical concerns.<n>Current evaluation methods often lack robustness and fail to provide a comprehensive assessment of LLM performance.<n>We propose Med-CoDE, a specifically designed evaluation framework for medical LLMs to address these challenges.
arXiv Detail & Related papers (2025-04-21T16:51:11Z)
Quantifying the Reasoning Abilities of LLMs on Real-world Clinical Cases [48.87360916431396]
We introduce MedR-Bench, a benchmarking dataset of 1,453 structured patient cases, annotated with reasoning references.<n>We propose a framework encompassing three critical examination recommendation, diagnostic decision-making, and treatment planning, simulating the entire patient care journey.<n>Using this benchmark, we evaluate five state-of-the-art reasoning LLMs, including DeepSeek-R1, OpenAI-o3-mini, and Gemini-2.0-Flash Thinking, etc.
arXiv Detail & Related papers (2025-03-06T18:35:39Z)
Detecting Bias and Enhancing Diagnostic Accuracy in Large Language Models for Healthcare [0.2302001830524133]
Biased AI-generated medical advice and misdiagnoses can jeopardize patient safety. This study introduces new resources designed to promote ethical and precise AI in healthcare.
arXiv Detail & Related papers (2024-10-09T06:00:05Z)
A Preliminary Study of o1 in Medicine: Are We Closer to an AI Doctor? [33.70022886795487]
OpenAI's o1 stands out as the first model with a chain-of-thought technique using reinforcement learning strategies. This report provides a comprehensive exploration of o1 on different medical scenarios, examining 3 key aspects: understanding, reasoning, and multilinguality.
arXiv Detail & Related papers (2024-09-23T17:59:43Z)
SemioLLM: Evaluating Large Language Models for Diagnostic Reasoning from Unstructured Clinical Narratives in Epilepsy [45.2233252981348]
Large Language Models (LLMs) have been shown to encode clinical knowledge.<n>We present SemioLLM, an evaluation framework that benchmarks 6 state-of-the-art models.<n>We show that most LLMs are able to accurately and confidently generate probabilistic predictions of seizure onset zones in the brain.
arXiv Detail & Related papers (2024-07-03T11:02:12Z)
Are Large Language Models True Healthcare Jacks-of-All-Trades? Benchmarking Across Health Professions Beyond Physician Exams [32.77551245372691]
Existing benchmarks for evaluating Large Language Models (LLMs) in healthcare predominantly focus on medical doctors. We introduce the Examinations for Medical Personnel in Chinese (EMPEC), a pioneering large-scale healthcare knowledge benchmark in traditional Chinese. EMPEC consists of 157,803 exam questions across 124 subjects and 20 healthcare professions, including underrepresented occupations like Optometrists and Audiologists.
arXiv Detail & Related papers (2024-06-17T08:40:36Z)
Evaluating Large Language Models for Public Health Classification and Extraction Tasks [0.3545046504280562]
We present evaluations of Large Language Models (LLMs) for public health tasks involving the classification and extraction of free text.<n>We evaluate eleven open-weight LLMs across all tasks using zero-shot in-context learning.<n>We find promising signs that LLMs may be useful tools for public health experts to extract information from a wide variety of free text sources.
arXiv Detail & Related papers (2024-05-23T16:33:18Z)
Large Language Models in the Clinic: A Comprehensive Benchmark [63.21278434331952]
We build a benchmark ClinicBench to better understand large language models (LLMs) in the clinic. We first collect eleven existing datasets covering diverse clinical language generation, understanding, and reasoning tasks. We then construct six novel datasets and clinical tasks that are complex but common in real-world practice. We conduct an extensive evaluation of twenty-two LLMs under both zero-shot and few-shot settings.
arXiv Detail & Related papers (2024-04-25T15:51:06Z)
MedAlign: A Clinician-Generated Dataset for Instruction Following with Electronic Medical Records [60.35217378132709]
Large language models (LLMs) can follow natural language instructions with human-level fluency. evaluating LLMs on realistic text generation tasks for healthcare remains challenging. We introduce MedAlign, a benchmark dataset of 983 natural language instructions for EHR data.
arXiv Detail & Related papers (2023-08-27T12:24:39Z)
Large Language Models Leverage External Knowledge to Extend Clinical Insight Beyond Language Boundaries [48.48630043740588]
Large Language Models (LLMs) such as ChatGPT and Med-PaLM have excelled in various medical question-answering tasks. We develop a novel in-context learning framework to enhance their performance.
arXiv Detail & Related papers (2023-05-17T12:31:26Z)
Few-Shot Cross-lingual Transfer for Coarse-grained De-identification of Code-Mixed Clinical Texts [56.72488923420374]
Pre-trained language models (LMs) have shown great potential for cross-lingual transfer in low-resource settings. We show the few-shot cross-lingual transfer property of LMs for named recognition (NER) and apply it to solve a low-resource and real-world challenge of code-mixed (Spanish-Catalan) clinical notes de-identification in the stroke.
arXiv Detail & Related papers (2022-04-10T21:46:52Z)

This list is automatically generated from the titles and abstracts of the papers in this site.